Mistral OCR 4: Document Intelligence Developer Guide

Mistral AI released OCR 4 on June 24, 2026, upgrading its document extraction engine into a full document intelligence platform. The headline additions — bounding boxes, typed-block classification, and per-word confidence scores — transform a raw text extractor into a structured data pipeline ready for RAG systems, agentic automation, and regulated-industry compliance workflows.

This guide walks through the API from first call to a production pipeline, with Python and TypeScript examples throughout.

What Changed in OCR 4

Previous Mistral OCR versions extracted text and tables. Version 4 adds three structural features that matter for production systems:

Bounding boxes: Every extracted element now comes with coordinates that locate it on the page. This lets you highlight passages in the original document, map extracted data back to its source location for citation grounding, and build review UIs where humans verify uncertain extractions.

Typed-block classification: Each page is segmented into blocks of 13 types (text, title, list, table, image, equation, caption, code, references, aside_text, header, footer, and signature). Downstream applications can filter by type, route equations to a LaTeX renderer, or extract only tables for spreadsheet export, all without additional parsing.

Inline confidence scores: Available at page level or word level. Any word scoring below your threshold can be routed to a human-in-the-loop review step instead of propagating silent errors into downstream records.

The model now covers 170 languages across 10 language groups, including dedicated support for Arabic and other Middle Eastern scripts, making it a strong fit for multilingual document workflows in the MENA region.

Setup

Install the Python SDK or the Node.js package:

pip install mistralai
npm install @mistralai/mistralai

Set your API key as an environment variable:

export MISTRAL_API_KEY="your_key_here"

Basic OCR Extraction

The simplest call processes a PDF by URL and returns structured markdown:

Python

import os
from mistralai import Mistral
 
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
 
result = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://example.com/invoice.pdf"
    },
    table_format="markdown"
)
 
for page in result.pages:
    print(f"Page {page.index}:\n{page.markdown}\n")

TypeScript

import { Mistral } from "@mistralai/mistralai";
 
const client = new Mistral({ apiKey: process.env.MISTRAL_API_KEY! });
 
const result = await client.ocr.process({
  model: "mistral-ocr-latest",
  document: {
    type: "document_url",
    documentUrl: "https://example.com/invoice.pdf"
  },
  tableFormat: "markdown"
});
 
for (const page of result.pages) {
  console.log(`Page ${page.index}:\n${page.markdown}\n`);
}

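For local files, upload them first via the files API, then request a signed URL for the uploaded file and pass that URL to the OCR call:

with open("contract.pdf", "rb") as f:
    upload = client.files.upload(
        file={"file_name": "contract.pdf", "content": f},
        purpose="ocr"
    )

# The upload response carries a file ID rather than a URL, so fetch a signed URL for it.
signed = client.files.get_signed_url(file_id=upload.id)

result = client.ocr.process(
    model="mistral-ocr-latest",
    document={"type": "document_url", "document_url": signed.url}
)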

Block Extraction

Set include_blocks=True to receive a segmented list of typed elements per page, each with its bounding box and text content:

result = client.ocr.process(
    model="mistral-ocr-latest",
    document={"type": "document_url", "document_url": url},
    include_blocks=True
)
 
for page in result.pages:
    for block in page.blocks:
        if block.type == "equation":
            print(f"Equation at {block.bounding_box}: {block.text}")
        elif block.type == "table":
            print(f"Table at {block.bounding_box}")
        elif block.type == "signature":
            print(f"Signature detected at {block.bounding_box}")

Bounding box coordinates are normalized relative to page dimensions, so they translate cleanly to any rendering resolution. This makes it straightforward to overlay boxes on the original document in a PDF viewer or web UI.
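
To draw an overlay, the normalized values have to be scaled to the size of the rendered page. The following is a minimal sketch, assuming each box is an (x0, y0, x1, y1) tuple with values in the range 0 to 1; the guide does not spell out the exact field layout, so `to_pixel_box` and `PixelBox` are hypothetical helpers, not part of the SDK.

```python
from typing import NamedTuple, Tuple


class PixelBox(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


def _clamp(value: float) -> float:
    """Keep a normalized coordinate inside the [0, 1] range."""
    return min(max(value, 0.0), 1.0)


def to_pixel_box(
    box: Tuple[float, float, float, float],
    page_width_px: int,
    page_height_px: int,
) -> PixelBox:
    """Scale an assumed normalized (x0, y0, x1, y1) box to integer pixels."""
    x0, y0, x1, y1 = box
    return PixelBox(
        left=round(_clamp(x0) * page_width_px),
        top=round(_clamp(y0) * page_height_px),
        right=round(_clamp(x1) * page_width_px),
        bottom=round(_clamp(y1) * page_height_px),
    )


# A box covering the middle of a page rendered at 1000 x 2000 pixels.
print(to_pixel_box((0.1, 0.25, 0.5, 0.75), 1000, 2000))
# PixelBox(left=100, top=500, right=500, bottom=1500)
```

Because the input is resolution-independent, the same box can be rescaled for a thumbnail and a full-size view without calling the API again.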

Confidence Scores

Confidence scores let you build selective human-review pipelines. The page-level example below flags uncertain pages before downstream processing:

result = client.ocr.process(
    model="mistral-ocr-latest",
    document={"type": "document_url", "document_url": url},
    confidence_scores_granularity="page"
)
 
THRESHOLD = 0.80
 
for page in result.pages:
    page_score = page.confidence_scores.get("overall", 1.0)
    if page_score < THRESHOLD:
        print(f"Page {page.index} flagged for review: confidence {page_score:.2f}")

For document types where accuracy is critical — legal contracts, medical records, financial statements — this pattern ensures low-quality scans are caught at intake rather than discovered months later in an audit.
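
The same thresholding idea applies at word granularity. The sketch below works on plain data and makes no API calls: the shape of the word-level response is not shown in this guide, so `WordScore` is a hypothetical record you would populate from whatever fields the response actually returns.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordScore:
    """Hypothetical container for one extracted word and its confidence."""
    text: str
    confidence: float


def words_needing_review(words: List[WordScore], threshold: float = 0.80) -> List[WordScore]:
    """Return the words whose confidence falls below the threshold."""
    return [word for word in words if word.confidence < threshold]


sample = [
    WordScore("Total", 0.99),
    WordScore("1,2O4.50", 0.62),  # letter O misread in an amount
    WordScore("EUR", 0.97),
]

for word in words_needing_review(sample):
    print(f"Review '{word.text}' (confidence {word.confidence:.2f})")
# Review '1,2O4.50' (confidence 0.62)
```

Flagging individual words keeps the review queue small: a reviewer checks one amount instead of re-reading a whole page.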

Document AI Mode: Structured JSON Output

OCR 4's Document AI mode accepts a custom JSON schema and returns structured output directly, eliminating the need for a separate LLM extraction step. This is the fastest path from PDF to database record:

invoice_schema = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": "string"},
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "currency": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unit_price": {"type": "number"}
                }
            }
        }
    }
}
 
result = client.ocr.process(
    model="mistral-ocr-latest",
    document={"type": "document_url", "document_url": invoice_url},
    document_annotation_schema=invoice_schema
)
 
data = result.document_annotation
print(f"Invoice #{data['invoice_number']}: {data['total_amount']} {data['currency']}")

In TypeScript, with invoiceSchema holding the same schema object:

const result = await client.ocr.process({
  model: "mistral-ocr-latest",
  document: { type: "document_url", documentUrl: invoiceUrl },
  documentAnnotationSchema: invoiceSchema
});
 
const data = result.documentAnnotation;
console.log(`Invoice #${data.invoice_number}: ${data.total_amount} ${data.currency}`);

Document AI mode costs $5 per 1,000 pages — one dollar more than Pure OCR — but eliminates an entire LLM inference step in most extraction pipelines, making it cheaper end-to-end for structured-output use cases.
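
Before writing an annotation to a database, it is worth checking that the fields you depend on are actually present, since the schema above does not mark any property as required. The sketch below is a hedged illustration: `load_annotation` is a hypothetical helper, and it accepts either a JSON string or an already-parsed dict because this guide does not state which form the SDK returns.

```python
import json
from typing import Any, Dict, Sequence, Union


def load_annotation(
    raw: Union[str, Dict[str, Any]],
    required: Sequence[str],
) -> Dict[str, Any]:
    """Parse an annotation and fail loudly if a required field is absent."""
    # Accept both a JSON string and a parsed mapping (assumption: either may occur).
    data = json.loads(raw) if isinstance(raw, str) else dict(raw)
    missing = [key for key in required if data.get(key) is None]
    if missing:
        raise ValueError(f"annotation is missing fields: {', '.join(missing)}")
    return data


record = load_annotation(
    '{"invoice_number": "A-17", "total_amount": 120.5, "currency": "EUR"}',
    required=("invoice_number", "total_amount", "currency"),
)
print(f"Invoice #{record['invoice_number']}: {record['total_amount']} {record['currency']}")
# Invoice #A-17: 120.5 EUR
```

Raising on a missing field turns a silent gap in the extracted data into an error you can route to review.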

Self-Hosting for Data Sovereignty

OCR 4 ships as a single container for organizations that cannot route sensitive documents to third-party cloud endpoints. This matters for:

Financial institutions in the MENA region subject to INPDP (Tunisia) or PDPL (Saudi Arabia) data residency rules
Healthcare providers handling patient records under local health data regulations
Legal firms processing privileged documents
Government agencies requiring on-premises document processing

The container handles inference locally, so no document content leaves your network. Contact Mistral's enterprise team for the container image and licensing terms.

For cloud deployments that still need regional control, OCR 4 is available on Amazon SageMaker and Microsoft Azure Foundry, with Snowflake integration coming soon.

Pricing

Mode	Standard API	Batch API
Pure OCR	$4 per 1,000 pages	$2 per 1,000 pages
Document AI	$5 per 1,000 pages	$2.50 per 1,000 pages

For high-volume workflows — digitizing document archives, processing thousands of invoices daily — the Batch API at $2 per 1,000 pages is competitive with traditional OCR providers. Independent users report OCR 4 running roughly 4x faster per page than legacy enterprise solutions at lower cost.
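
The table translates directly into a small cost estimator. This is only an arithmetic sketch using the prices listed above; the mode and API labels are keys chosen for this example, not values the API accepts.

```python
# Prices in USD per 1,000 pages, copied from the pricing table above.
PRICE_PER_1000_PAGES = {
    ("pure_ocr", "standard"): 4.00,
    ("pure_ocr", "batch"): 2.00,
    ("document_ai", "standard"): 5.00,
    ("document_ai", "batch"): 2.50,
}


def estimate_cost(pages: int, mode: str, api: str) -> float:
    """Return the estimated cost in USD for processing the given number of pages."""
    return pages / 1000 * PRICE_PER_1000_PAGES[(mode, api)]


# A 250,000-page archive through Document AI, standard versus batch.
print(estimate_cost(250_000, "document_ai", "standard"))  # 1250.0
print(estimate_cost(250_000, "document_ai", "batch"))     # 625.0
```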

Complete Pipeline Example

A minimal invoice-processing pipeline combining confidence gating with structured extraction:

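import os
from mistralai import Mistral

# queue_for_human_review() is your own review-queue hook, and invoice_schema is
# the schema from the Document AI section; both are assumed to be defined elsewhere.
def process_invoice(pdf_url: str) -> dict | None:
    client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])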
 
    # Pass 1: Quick confidence check
    ocr = client.ocr.process(
        model="mistral-ocr-latest",
        document={"type": "document_url", "document_url": pdf_url},
        confidence_scores_granularity="page"
    )
 
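    scores = [p.confidence_scores.get("overall", 1.0) for p in ocr.pages]
    # Treat a document with no scored pages as low confidence instead of dividing by zero.
    avg_confidence = sum(scores) / len(scores) if scores else 0.0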
 
    if avg_confidence < 0.75:
        queue_for_human_review(pdf_url)
        return None
 
    # Pass 2: Structured extraction on high-confidence documents only
    result = client.ocr.process(
        model="mistral-ocr-latest",
        document={"type": "document_url", "document_url": pdf_url},
        document_annotation_schema=invoice_schema
    )
 
    return result.document_annotation

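The two-pass approach spends Document AI calls only on documents where OCR quality is sufficient. Low-confidence documents go to a human review queue instead of generating unreliable structured output. Each document that passes the gate is billed for both calls ($4 + $5 = $9 per 1,000 pages at the standard rates listed under Pricing, versus $5 for a single Document AI call), so the gate trades extra OCR spend for more reliable structured output.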
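
One caveat with gating on the average: a single unreadable page can hide behind several clean ones. A possible refinement, sketched below, also checks the worst page. The 0.75 average threshold comes from the pipeline above, while `needs_human_review` and the 0.60 per-page floor are assumptions made for illustration.

```python
from typing import List


def needs_human_review(
    page_scores: List[float],
    avg_threshold: float = 0.75,
    min_threshold: float = 0.60,  # hypothetical per-page floor
) -> bool:
    """Decide whether a document should go to the review queue."""
    if not page_scores:
        # No scored pages means there is nothing to trust.
        return True
    average = sum(page_scores) / len(page_scores)
    return average < avg_threshold or min(page_scores) < min_threshold


# The average is about 0.78, but the third page is barely legible.
print(needs_human_review([0.98, 0.97, 0.40]))  # True
print(needs_human_review([0.90, 0.85]))        # False
```

Keeping the decision in a pure function makes the thresholds easy to tune and test without calling the API.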

Benchmarks

OCR 4 achieves a 72% win rate over competing solutions in blind human-preference tests covering more than 600 real-world documents across more than 12 languages. On automated benchmarks:

olmOCR-Bench: 85.20 (top score in category)
OmniDocBench: 93.07
Crawl Multilingual: 0.98 across all 10 language groups

Next Steps

Mistral OCR 4 is available today via the Mistral API console, Amazon SageMaker, and Microsoft Azure Foundry. The official Getting Started Cookbook includes complete bounding box and block classification workflow examples. A production webinar covering high-throughput batch patterns and RAG pipeline integration runs on July 7, 2026.

For teams already using Mistral Small 4 for text generation, OCR 4 adds a document intake layer that works with the same SDK and API key — enabling end-to-end document understanding pipelines without adding a new vendor to your stack.