Text is only part of what people work with. Documents have tables, charts, and diagrams. UIs are screenshots. Engineering artifacts are architecture diagrams. Medical records contain scanned images. Code lives inside images on Stack Overflow. A language model that cannot see is half a model.
Zen Vision closes that gap. Today we are releasing a 72B multimodal model that understands images and text together, trained from the ground up to reason across both modalities rather than treating vision as a bolt-on patch to a text model.
Architecture
Zen Vision uses a native multimodal architecture: a 72B language model backbone with a deep vision encoder coupled at every transformer layer, not just at input embedding time. This architecture decision matters.
Most "multimodal" models process the image once with a vision encoder, convert it to a flat embedding, and hand it to the language model as if it were just more tokens. This works for simple captioning tasks. It fails when the task requires spatial reasoning — understanding that a chart has a y-axis label on the left, a legend at the top right, and a data series that peaks in Q3. Shallow coupling loses that spatial structure.
In Zen Vision, the vision encoder and language backbone run in coupled attention at multiple depths. The language model can query the visual features directly, multiple times, at different levels of abstraction. High-level features ("this is a bar chart") and low-level features ("the bar at position 3 reaches pixel height 287") are available throughout reasoning.
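The coupling described above can be sketched in a few lines of numpy. This is an illustrative toy, not the actual Zen Vision implementation: the shapes, layer count, and random weights are all made up, and the real model uses multi-head attention inside a full transformer stack. The point it demonstrates is that text hidden states re-query the same visual features at every layer, rather than consuming a single flattened embedding at the input:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(text_h, vision_feats, wq, wk, wv):
    # text tokens query the visual features directly at this depth
    q = text_h @ wq
    k = vision_feats @ wk
    v = vision_feats @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return text_h + attn @ v  # residual fusion back into the text stream

rng = np.random.default_rng(0)
d = 64
text_h = rng.normal(size=(10, d))   # 10 text tokens
vision = rng.normal(size=(49, d))   # e.g. a 7x7 grid of visual features
for layer in range(4):              # 4 layers as a stand-in for the full stack
    wq, wk, wv = (rng.normal(size=(d, d)) * 0.05 for _ in range(3))
    text_h = cross_attend(text_h, vision, wq, wk, wv)
print(text_h.shape)  # (10, 64)
```

Because the vision features are re-attended at each depth, later layers can fetch low-level detail even after earlier layers have abstracted it away.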
Vision Encoder
- Parameters: 72B backbone + 2B vision encoder
- Input resolution: Up to 4096×4096 pixels, tiled automatically
- Tile size: 448×448 with 50% overlap for boundary continuity
- Max images per request: 16 images at full resolution, or mixed-resolution batches
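Given the stated tile size and overlap, tile positions fall out of a simple stride computation. The helper below is a hypothetical sketch, not the production tiler: it assumes stride = tile × (1 − overlap), so 448 px tiles with 50% overlap step by 224 px, with a final tile snapped to the image edge so no pixels are dropped:

```python
def tile_origins(size, tile=448, overlap=0.5):
    """Return the starting offsets of tiles along one image dimension."""
    stride = int(tile * (1 - overlap))  # 224 px for 448 px tiles at 50% overlap
    origins = list(range(0, max(size - tile, 0) + 1, stride))
    # snap a final tile to the edge if the stride grid falls short of it
    if origins[-1] + tile < size:
        origins.append(size - tile)
    return origins

xs = tile_origins(4096)
print(len(xs), xs[-1])  # 18 3648
```

Along one axis of a 4096×4096 image this yields 18 tile positions, so a full-resolution image decomposes into an 18×18 grid of overlapping 448×448 tiles.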
Capabilities
OCR and Document Extraction
Zen Vision handles handwritten text, printed text, mixed scripts, and degraded scans. On the DocVQA benchmark (document visual question answering), it scores 91.4 — ahead of the previous Zen generation and competitive with the best specialized OCR models.
It extracts structured data from documents: pull a table from a scanned invoice as JSON, extract line items from a purchase order, parse a handwritten form into key-value pairs. This is not template matching — it generalizes to novel document layouts.
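Since structured extraction comes back as chat text, a thin client-side parsing layer is useful. The helper below is hypothetical — `parse_json_reply` is not part of any SDK — and simply tolerates the common case where a model wraps its JSON answer in a markdown code fence:

```python
import json
import re

def parse_json_reply(reply):
    """Parse a JSON payload from a model reply, stripping a ```json fence if present."""
    m = re.search(r"```(?:json)?\s*(.*?)```", reply, re.S)
    payload = m.group(1) if m else reply
    return json.loads(payload)

reply = '```json\n[{"item": "Widget", "qty": 3, "unit_price": 9.5}]\n```'
items = parse_json_reply(reply)
print(items[0]["qty"])  # 3
```

Prompting the model to answer with JSON only, then parsing defensively like this, keeps extraction pipelines robust to formatting drift.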
Diagram Reasoning
Architecture diagrams, flowcharts, network topologies, ER diagrams, circuit schematics. Zen Vision reads them, describes them, and answers questions about them. Given a system architecture diagram, it can identify single points of failure. Given a flowchart, it can trace execution paths. Given a circuit schematic, it can identify the signal flow.
| Task | Zen Vision | Baseline |
|---|---|---|
| DocVQA | 91.4 | 84.2 |
| ChartQA | 88.7 | 82.1 |
| TextVQA | 82.3 | 76.8 |
| MMBench | 84.1 | 79.3 |
| OCRBench | 79.8 | 71.4 |
Screenshot Analysis
UI screenshots are a first-class use case. Zen Vision can describe what is on screen, identify UI components, extract text from rendered interfaces, answer questions about layout, and explain what a user would need to do to accomplish a task. This makes it directly useful for:
- Automated UI testing with natural language assertions
- Accessibility auditing of screenshots
- Visual regression descriptions
- Documentation generation from screenshots
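A natural-language UI assertion reduces to building one multimodal message. The function below is a hypothetical helper, not part of the hanzo SDK; it only constructs the message payload in the OpenAI-compatible shape shown in the Usage section, leaving the actual API call to the caller:

```python
def build_assertion_messages(screenshot_b64, assertion):
    """Build a chat message asking Zen Vision to verify a UI assertion."""
    return [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
            {"type": "text",
             "text": f'Check this assertion about the screenshot and answer '
                     f'YES or NO, then explain: {assertion}'},
        ],
    }]

msgs = build_assertion_messages("iVBOR...", "the Save button is disabled")
print(msgs[0]["content"][1]["text"])
```

A test harness would pass `msgs` to `client.chat.completions.create(model="zen-vision", messages=msgs)` and treat a leading "YES" in the reply as a passing assertion.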
Code in Images
Stack Overflow, documentation, tutorial screenshots — code lives inside images constantly. Zen Vision reads code from images accurately, including code with unusual formatting, syntax highlighting, or partial visibility.
Limitations
Zen Vision is a reasoning model, not a pixel counter. It does not replace specialized OCR pipelines for extremely high-volume structured extraction where speed and cost dominate. At 72B, it is also not a mobile model — inference requires significant GPU memory.
The 16-image limit per request is a practical constraint of current serving infrastructure, not an architectural limit. We are working on increasing this.
Usage
```python
import hanzo

client = hanzo.Client(api_key="...")

response = client.chat.completions.create(
    model="zen-vision",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
            {"type": "text", "text": "Extract all line items from this invoice as JSON."}
        ]
    }]
)
```

Zen Vision is available via Hanzo Cloud at api.hanzo.ai/v1/chat/completions with model zen-vision. Weights are available at huggingface.co/zenlm/zen-vision.
Zach Kelling is the founder of Hanzo AI, Techstars '17.