TL;DR
DeepSeek-OCR is the latest open-weight OCR model from DeepSeek, built to extract structured text, formulas, and tables from complex documents with high accuracy. It combines a vision encoder (based on SAM and CLIP) with a Mixture-of-Experts decoder (DeepSeek-3B-MoE) for efficient text generation.
You can try DeepSeek-OCR directly on Clarifai, with no separate API key or setup required.
- Playground: Test DeepSeek-OCR directly in the Clarifai Playground here.
- API Access: Use Clarifai’s OpenAI-compatible endpoint. Authenticate with your Personal Access Token (PAT) and specify the DeepSeek-OCR model URL.
Introduction
DeepSeek-OCR is a multi-modal model designed to convert complex images such as invoices, scientific papers, and handwritten notes into accurate, structured text.
Unlike traditional OCR systems that rely purely on convolutional networks for text detection and recognition, DeepSeek-OCR uses a transformer-based encoder-decoder architecture. This allows it to handle dense documents, tables, and mixed visual content more effectively while keeping GPU usage low.
Key features:
- Processes images as vision tokens using a hybrid SAM + CLIP encoder.
- Compresses visual data by up to 10× with minimal accuracy loss.
- Uses a 3B-parameter Mixture-of-Experts decoder, activating only 6 of 64 experts during inference for high efficiency.
- Can process up to 200K pages per day on a single A100 GPU thanks to its optimized token compression and activation strategy.
Run DeepSeek-OCR
You can access DeepSeek-OCR in two simple ways: through the Clarifai Playground or via the API.
Playground
The Playground provides a fast, interactive environment to test and explore model behavior. You can select the DeepSeek-OCR model directly from the community, upload an image such as an invoice, scanned document, or handwritten page, and add a prompt describing what you want the model to extract or analyze. The output text is displayed in real time, allowing you to quickly verify accuracy and formatting.

DeepSeek-OCR via API
Clarifai provides an OpenAI-compatible endpoint that lets you call DeepSeek-OCR using the same Python or TypeScript client libraries you already use. Once you set your Personal Access Token (PAT) as an environment variable, you can call the model directly by specifying its URL.
Below are two ways to send an image input: from a local file or via an image URL.
Option 1: Using a Local Image File
This example reads a local file (e.g., doc.jpeg), encodes it in base64, and sends it to the model for OCR extraction.
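The sketch below uses the official OpenAI SDK in TypeScript; the same pattern works with the Python client. The base URL, model URL, filename, and prompt are placeholders for illustration, so copy the exact values from the DeepSeek-OCR model page on Clarifai.

```typescript
// Minimal sketch: OCR a local image via Clarifai's OpenAI-compatible endpoint.
// Base URL and model URL are placeholders; use the values from the model page.
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.clarifai.com/v2/ext/openai/v1", // assumed OpenAI-compatible endpoint
  apiKey: process.env.CLARIFAI_PAT, // your Personal Access Token
});

// Read the local image and encode it as base64 for a data URL.
const imageBase64 = fs.readFileSync("doc.jpeg").toString("base64");

const response = await client.chat.completions.create({
  model: "https://clarifai.com/deepseek-ai/deepseek-ocr/models/DeepSeek-OCR", // placeholder model URL
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Extract all text from this document as markdown." },
        { type: "image_url", image_url: { url: `data:image/jpeg;base64,${imageBase64}` } },
      ],
    },
  ],
});

console.log(response.choices[0].message.content);
```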
Option 2: Using an Image URL
If your image is hosted online, you can pass its URL to the model directly.
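A minimal sketch, again with the OpenAI SDK; the image URL shown is a placeholder.

```typescript
// Minimal sketch: OCR an image hosted at a public URL (placeholder URL below).
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.clarifai.com/v2/ext/openai/v1", // assumed OpenAI-compatible endpoint
  apiKey: process.env.CLARIFAI_PAT,
});

const response = await client.chat.completions.create({
  model: "https://clarifai.com/deepseek-ai/deepseek-ocr/models/DeepSeek-OCR", // placeholder model URL
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Extract all text, tables, and formulas from this image." },
        { type: "image_url", image_url: { url: "https://example.com/invoice.png" } },
      ],
    },
  ],
});

console.log(response.choices[0].message.content);
```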
You can also use Clarifai’s OpenAI-compatible API with other TypeScript or JavaScript SDKs. For example, the snippets below show how to use the Vercel AI SDK to access DeepSeek-OCR.
Option 1: Using a Local Image File
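A minimal sketch assuming the ai and @ai-sdk/openai packages; the Buffer returned by fs.readFileSync is passed directly as the image part, and the endpoint and model URLs remain placeholders.

```typescript
// Minimal sketch with the Vercel AI SDK: local image read from disk.
import fs from "node:fs";
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";

const clarifai = createOpenAI({
  baseURL: "https://api.clarifai.com/v2/ext/openai/v1", // assumed OpenAI-compatible endpoint
  apiKey: process.env.CLARIFAI_PAT,
});

const { text } = await generateText({
  model: clarifai("https://clarifai.com/deepseek-ai/deepseek-ocr/models/DeepSeek-OCR"), // placeholder
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Extract all text from this document." },
        { type: "image", image: fs.readFileSync("doc.jpeg") }, // Buffer accepted as image content
      ],
    },
  ],
});

console.log(text);
```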
Option 2: Using an Image URL
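The same sketch with a hosted image, passing a URL object as the image part (the image URL is a placeholder).

```typescript
// Minimal sketch with the Vercel AI SDK: image passed by URL.
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";

const clarifai = createOpenAI({
  baseURL: "https://api.clarifai.com/v2/ext/openai/v1", // assumed OpenAI-compatible endpoint
  apiKey: process.env.CLARIFAI_PAT,
});

const { text } = await generateText({
  model: clarifai("https://clarifai.com/deepseek-ai/deepseek-ocr/models/DeepSeek-OCR"), // placeholder
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Extract all text, tables, and formulas from this image." },
        { type: "image", image: new URL("https://example.com/invoice.png") },
      ],
    },
  ],
});

console.log(text);
```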
Clarifai’s OpenAI-compatible API lets you access DeepSeek-OCR from any language or SDK that supports the OpenAI format. You can experiment in the Clarifai Playground or integrate the model directly into your applications. Learn more about the OpenAI-compatible API in the documentation here.
How DeepSeek-OCR Works
DeepSeek-OCR is built from the ground up on a two-stage vision-language architecture that combines a powerful vision encoder with a Mixture-of-Experts (MoE) text decoder. This setup enables efficient and accurate text extraction from complex documents.

Image Source: DeepSeek-OCR Research Paper
DeepEncoder (Vision Encoder)
The DeepEncoder is a 380M-parameter vision backbone that transforms raw images into compact visual embeddings.
- Patch Embedding: The input image is divided into 16×16 patches.
- Local Attention (SAM – ViTDet): SAM applies local attention to capture fine-grained features such as font style, handwriting, edges, and texture details within each region of the image. This preserves spatial precision at the local level.
- Downsampling: The patch embeddings are downsampled 16× via convolution to reduce the total number of visual tokens and improve efficiency.
- Global Attention (CLIP – ViT): CLIP introduces global attention, enabling the model to understand document layout, structure, and semantic relationships across sections of the image.
- Compact Visual Embeddings: The encoder produces a sequence of vision tokens roughly 10× smaller than the equivalent text tokens, resulting in high compression and faster decoding (see the quick token-count sketch below).
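To make the compression concrete, here is a rough, illustrative token-count calculation. The 1024×1024 input resolution is an assumption chosen for illustration; the 16×16 patching and 16× downsampling come from the description above.

```typescript
// Back-of-the-envelope vision-token count for the DeepEncoder (illustrative only).
// Assumption: a 1024 x 1024 input image; patch size and downsampling factor
// are taken from the architecture description above.
const imageSize = 1024;
const patchSize = 16;
const downsampleFactor = 16;

const patchTokens = (imageSize / patchSize) ** 2;    // 64 * 64 = 4096 patch embeddings
const visionTokens = patchTokens / downsampleFactor; // 4096 / 16 = 256 vision tokens

console.log({ patchTokens, visionTokens }); // { patchTokens: 4096, visionTokens: 256 }
```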
DeepSeek-3B-MoE Decoder
The encoded vision tokens are passed to a Mixture-of-Experts transformer decoder, which converts them into readable text.
- Expert Activation: 6 out of 64 experts are activated per token, including 2 shared experts (about 570M active parameters).
- Text Generation: Transformer layers decode the visual embeddings into structured text sequences, capturing plain text, formulas, tables, and layout information.
- Efficiency and Scale: Although the total model size is 3B parameters, only a fraction is active during inference, delivering 3B-scale performance at under 600M active parameters (see the rough calculation below).
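As a rough illustration of the activation budget, using only the figures quoted above:

```typescript
// Rough illustration of the MoE activation budget, using the figures above.
const totalExperts = 64;
const activeExperts = 6;            // includes 2 shared experts
const totalParams = 3_000_000_000;  // ~3B total parameters
const activeParams = 570_000_000;   // ~570M active per token

console.log(`experts active per token: ${activeExperts}/${totalExperts}`);
console.log(`active parameter fraction: ${(100 * activeParams / totalParams).toFixed(1)}%`); // ~19%
```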
Conclusion
DeepSeek-OCR is more than a breakthrough in document understanding. It redefines how multimodal models process visual information by combining SAM’s fine-grained visual precision, CLIP’s global layout reasoning, and a Mixture-of-Experts decoder for efficient text generation. Through Clarifai, you can experiment with DeepSeek-OCR in the Playground or integrate it directly via the OpenAI-compatible API.
Learn more:

