Zen Omni is a 30B MoE model that handles text, vision, and audio in a single unified model. The architecture is called Thinker-Talker: a shared reasoning backbone that branches into modality-specific output heads for text and speech generation.
Each forward pass activates 3B parameters, and speech-to-speech latency stays under 300ms.
Thinker-Talker Architecture
Most multimodal systems chain separate models: a speech recognizer feeds into a language model that feeds into a text-to-speech engine. Each handoff adds latency, loses context, and introduces a seam where errors compound.
Zen Omni eliminates the handoffs. The Thinker is a shared MoE backbone that processes all modalities in a unified token space. Text tokens, image tokens, and audio tokens flow through the same attention layers. The Talker is a small output head attached to the Thinker that generates speech directly from the same hidden states that produce text.
This means:
- The model hears your voice and understands your intent directly, without a transcription step
- It speaks its response without converting text to audio after the fact
- Emotional tone and prosody in input speech influence the generated response
- Visual context informs both text and speech outputs simultaneously
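The shared-backbone idea above can be sketched in a few lines of toy Python. This is purely illustrative: the function names, the stand-in "hidden state" arithmetic, and the vocabulary sizes are hypothetical, and the real Thinker is an MoE transformer, not a running sum.

```python
# Toy sketch of Thinker-Talker: one shared backbone whose hidden states
# feed BOTH a text head and a speech ("Talker") head, so speech is not
# derived from text after the fact. All names/shapes are hypothetical.

def thinker(tokens):
    # Unified token space: text, image, and audio token ids share one
    # sequence. The running sum stands in for attention/MoE layers.
    hidden, state = [], 0
    for t in tokens:
        state = (state + t) % 997
        hidden.append(state)
    return hidden

def text_head(hidden):
    # Maps hidden states to text token ids (hypothetical vocab size).
    return [h % 50000 for h in hidden]

def talker_head(hidden):
    # Generates speech codec tokens from the SAME hidden states --
    # no intermediate transcription or TTS handoff.
    return [h % 4096 for h in hidden]

seq = [101, 7, 3502, 88]        # interleaved text/image/audio token ids
h = thinker(seq)
text_out = text_head(h)
speech_out = talker_head(h)
```

The key property is that `text_out` and `speech_out` are computed from the same `h`, which is why prosody and emotion in the input can influence both outputs.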
Specifications
| Property | Value |
|---|---|
| Total parameters | 30B |
| Active parameters | 3B |
| Architecture | MoE (Thinker-Talker) |
| Text context | 128K tokens |
| Audio context | 10 minutes |
| Speech-to-speech latency | <300ms |
| Languages (text) | 100+ |
| Languages (speech) | 58 |
Real-Time Speech-to-Speech
Speech-to-speech latency under 300ms makes Zen Omni suitable for real-time voice applications:
- Voice assistants with natural conversational rhythm
- Real-time voice translation (speech in, speech out, same or different language)
- Interactive voice response (IVR) with genuine language understanding
- Accessibility tools for vision-impaired users
300ms is below the conversational latency threshold where pauses become uncomfortable. It is achievable because the Talker generates speech tokens in parallel with the Thinker's reasoning, rather than sequentially after it.
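The latency win from parallel generation can be sketched with a producer-consumer pattern: the Talker consumes hidden states as the Thinker streams them, so the first speech token is available after the first hidden state rather than after the full response. Everything here (names, a thread-per-head model) is a hypothetical illustration, not the actual implementation.

```python
# Sketch: overlapping speech generation with reasoning via a stream of
# hidden states. Hypothetical structure; illustrates the latency argument.

import queue
import threading

def thinker(out_q, n_steps=5):
    for step in range(n_steps):
        out_q.put(f"hidden_{step}")   # stream each hidden state as produced
    out_q.put(None)                   # end-of-stream sentinel

def talker(in_q, speech):
    # Starts emitting speech tokens as soon as the FIRST hidden state
    # arrives -- it never waits for the full text response.
    while (h := in_q.get()) is not None:
        speech.append(f"speech_token<{h}>")

q = queue.Queue()
speech_tokens = []
t1 = threading.Thread(target=thinker, args=(q,))
t2 = threading.Thread(target=talker, args=(q, speech_tokens))
t1.start(); t2.start()
t1.join(); t2.join()
```

A chained ASR-to-LLM-to-TTS pipeline, by contrast, cannot start synthesis until the language model's output is complete, which is where the extra latency comes from.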
Zen Dub Integration
Zen Omni integrates with Zen Dub, our voice cloning and audio generation model. When the two are paired, speech output can be delivered in a cloned voice rather than the default model voice.
Use cases:
- Branded AI assistants with a consistent voice identity
- Audio content localization that preserves the original speaker's voice
- Personalized voice interfaces
Input and Output Modalities
Input: Text, images, video (up to 5 minutes), audio (up to 10 minutes), interleaved multimodal sequences
Output: Text, speech audio, structured data
The model handles interleaved inputs naturally: a conversation can include text messages, attached images, voice memos, and video clips, all in a single context window, and it reasons about relationships across all of them.
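An interleaved request might look like the payload below. The "content parts" schema shown follows the common OpenAI-style convention; the exact field names Zen Omni accepts are an assumption here, and the URLs and base64 audio are placeholders.

```python
# Hypothetical payload: text, two images, and an audio clip in one
# user turn. Schema assumed (OpenAI-style content parts); verify against
# the Zen LM audio API documentation.

request = {
    "model": "zen-omni",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What changed between these two screenshots?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/before.png"}},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/after.png"}},
            {"type": "input_audio",
             "input_audio": {"data": "<base64>", "format": "wav"}},
        ],
    }],
}
```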
Get Zen Omni
- HuggingFace: huggingface.co/zenlm
- Hanzo Cloud API: api.hanzo.ai/v1/chat/completions -- model `zen-omni`, voice output via the `audio` response format
- Zen LM: zenlm.org -- audio API documentation
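A minimal sketch of calling the endpoint with voice output enabled, using only the Python standard library. The `modalities` and `audio` fields are assumptions based on the description above, and the bearer token is a placeholder; check the Zen LM audio API documentation for the authoritative schema.

```python
# Sketch: building (not sending) a chat-completions request with audio
# output. Field names for voice output are assumed, not confirmed.

import json
import urllib.request

payload = {
    "model": "zen-omni",
    "modalities": ["text", "audio"],                # assumed field
    "audio": {"voice": "default", "format": "wav"},  # assumed options
    "messages": [{"role": "user", "content": "Say hello."}],
}

req = urllib.request.Request(
    "https://api.hanzo.ai/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": "Bearer <HANZO_API_KEY>",  # placeholder
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(req) would send it; omitted here.
```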
Zach Kelling is the founder of Hanzo AI, Techstars '17.