
Zen Omni: Unified Multimodal AI

Zen Omni is a 30B MoE unified multimodal model with Thinker-Talker architecture, handling text, vision, and audio in a single model with real-time speech-to-speech at under 300ms latency.

Under the hood, all three modalities share a single model. The architecture is called Thinker-Talker: a shared MoE reasoning backbone that branches into modality-specific output heads for text and speech generation.

Active parameters: 3B per forward pass. Sub-300ms speech-to-speech latency.

Thinker-Talker Architecture

Most multimodal systems chain separate models: a speech recognizer feeds into a language model that feeds into a text-to-speech engine. Each handoff adds latency, loses context, and introduces a seam where errors compound.

Zen Omni eliminates the handoffs. The Thinker is a shared MoE backbone that processes all modalities in a unified token space. Text tokens, image tokens, and audio tokens flow through the same attention layers. The Talker is a small output head attached to the Thinker that generates speech directly from the same hidden states that produce text.

This means:

  • The model hears your voice and understands your intent directly, without a transcription step
  • It speaks its response without converting text to audio after the fact
  • Emotional tone and prosody in input speech influence the generated response
  • Visual context informs both text and speech outputs simultaneously
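
The shared-backbone idea above can be sketched in a few lines. This is an illustrative toy, not the actual Zen Omni implementation: the dimensions, the single dense layer standing in for the MoE backbone, and the head sizes are all made up for clarity.

```python
import numpy as np

# Toy sketch of the Thinker-Talker split. All values here (dimensions,
# vocab sizes, single dense "backbone") are illustrative assumptions,
# not details from the model itself.

rng = np.random.default_rng(0)
D_MODEL, VOCAB_TEXT, VOCAB_AUDIO = 64, 1000, 500

# Shared Thinker backbone: every modality's tokens pass through it.
W_thinker = rng.standard_normal((D_MODEL, D_MODEL)) / np.sqrt(D_MODEL)

# Two output heads branch off the *same* hidden states.
W_text = rng.standard_normal((D_MODEL, VOCAB_TEXT)) / np.sqrt(D_MODEL)
W_talker = rng.standard_normal((D_MODEL, VOCAB_AUDIO)) / np.sqrt(D_MODEL)

def forward(tokens):
    """tokens: (seq, d_model) embeddings of any modality, interleaved."""
    hidden = np.tanh(tokens @ W_thinker)   # shared reasoning pass
    text_logits = hidden @ W_text          # text head
    speech_logits = hidden @ W_talker      # Talker head, same hidden states
    return text_logits, speech_logits

seq = rng.standard_normal((8, D_MODEL))    # stand-in for mixed-modality tokens
text_logits, speech_logits = forward(seq)
print(text_logits.shape, speech_logits.shape)  # (8, 1000) (8, 500)
```

The point of the sketch is the data flow: there is no handoff between models, only two heads reading the same hidden states.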

Specifications

  • Total parameters: 30B
  • Active parameters: 3B
  • Architecture: MoE (Thinker-Talker)
  • Text context: 128K tokens
  • Audio context: 10 minutes
  • Speech-to-speech latency: under 300ms
  • Languages (text): 100+
  • Languages (speech): 58

Real-Time Speech-to-Speech

Under 300ms latency for speech-to-speech exchange makes Zen Omni suitable for real-time voice applications:

  • Voice assistants with natural conversational rhythm
  • Real-time voice translation (speech in, speech out, same or different language)
  • Interactive voice response (IVR) with genuine language understanding
  • Accessibility tools for vision-impaired users

300ms is below the conversational latency threshold at which pauses start to feel uncomfortable. It is achievable because the Talker generates speech tokens in parallel with the Thinker's reasoning, not sequentially after it.
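
The arithmetic behind that claim can be made concrete. The component latencies below are invented for illustration, not measured numbers, but they show why a chained ASR-to-LLM-to-TTS pipeline tends to blow past the threshold while an overlapped design can stay under it.

```python
# Illustrative latency budget; every number here is an assumption,
# not a measurement from the source.

asr_ms = 200               # speech recognizer finishes transcribing
llm_first_token_ms = 150   # language model emits its first token
tts_first_audio_ms = 120   # speech synthesizer emits its first audio

# Chained pipeline: each stage waits for the previous one to finish.
chained = asr_ms + llm_first_token_ms + tts_first_audio_ms

# Unified model: no transcription stage at all, and the Talker starts
# emitting speech tokens as soon as the Thinker produces hidden states,
# so speech generation overlaps reasoning instead of following it.
unified = llm_first_token_ms + max(0, tts_first_audio_ms - llm_first_token_ms)

print(chained, unified)  # 470 150
```

With these (made-up) numbers the chained pipeline lands at 470ms and the overlapped design at 150ms; dropping the handoffs, not faster components, is what moves the total under 300ms.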

Zen Dub Integration

Zen Omni integrates with Zen Dub, our voice cloning and audio generation model. When Zen Dub is paired with Zen Omni, the speech output can be delivered in a cloned voice rather than the default model voice.

Use cases:

  • Branded AI assistants with a consistent voice identity
  • Audio content localization that preserves the original speaker's voice
  • Personalized voice interfaces

Input and Output Modalities

Input: Text, images, video (up to 5 minutes), audio (up to 10 minutes), interleaved multimodal sequences

Output: Text, speech audio, structured data

The model handles interleaved inputs naturally: a conversation can include text messages, attached images, voice memos, and video clips, all in a single context window. The model understands relationships across all of them.
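
An interleaved request might be assembled like this. The content-parts schema below (part types, field names, base64 audio encoding) is an assumption modeled on common OpenAI-compatible APIs; only the general shape is implied by the source.

```python
# Hypothetical interleaved multimodal message; the part types and field
# names are assumptions, not confirmed by the source.

message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Compare what I said with what's in the photo."},
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        {"type": "input_audio", "input_audio": {"data": "<base64-wav>", "format": "wav"}},
        {"type": "text", "text": "Answer out loud."},
    ],
}

# All parts land in one context window, in order, so the model can
# relate the voice memo to the image directly.
part_types = [p["type"] for p in message["content"]]
print(part_types)
```

Note that nothing is transcribed or pre-processed on the client side; the ordering of the parts is the only structure the model needs.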

Get Zen Omni

  • HuggingFace: huggingface.co/zenlm
  • Hanzo Cloud API: api.hanzo.ai/v1/chat/completions -- model zen-omni, voice output via audio response format
  • Zen LM: zenlm.org -- audio API documentation
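
A request to the Hanzo Cloud endpoint might look like the following. The `model` id and endpoint path come from the list above; the remaining fields (`modalities`, `audio`, message schema) are assumptions modeled on OpenAI-compatible APIs, and an actual call would POST this body with an Authorization header.

```python
import json

# Hypothetical request body for api.hanzo.ai/v1/chat/completions.
# Only the endpoint and model id are from the source; the audio
# response-format fields are assumed.

payload = {
    "model": "zen-omni",
    "modalities": ["text", "audio"],            # ask for spoken output too
    "audio": {"voice": "default", "format": "wav"},
    "messages": [{"role": "user", "content": "Say hello in French."}],
}

body = json.dumps(payload)
print(body[:20])
```

The response would then carry both a text completion and an audio part, rather than requiring a second text-to-speech call.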

Zach Kelling is the founder of Hanzo AI, Techstars '17.