Music is the universal language. Zen Musician speaks it fluently - generating original compositions that capture emotion, style, and musicality.
The Music Generation Problem
Music isn't just organized sound. It's:
- Emotional: Music evokes feelings that are hard to quantify
- Structural: Songs have verses, choruses, bridges, builds
- Stylistic: Genres have conventions that define them
- Cultural: Music carries meaning beyond its notes
Generating convincing music requires understanding all of these dimensions.
How Zen Musician Works
Multi-Track Architecture
┌─────────────────────────────────────────┐
│ Composition Model │
│ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │
│ │Drums│ │Bass │ │Chord│ │Lead │ │
│ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ │
│ │ │ │ │ │
│ └───────┴───────┴───────┘ │
│ ↓ │
│ ┌───────────────┐ │
│ │ Mixer/FX │ │
│ └───────┬───────┘ │
│ ↓ │
│ ┌───────────────┐ │
│ │ Audio Output │ │
│ └───────────────┘ │
└─────────────────────────────────────────┘Each instrument track is generated separately, then combined with mixing and effects.
Generation Modes
Text-to-Music:
Prompt: "Upbeat indie rock song, driving guitar riff,
energetic drums, 120 BPM, 3 minutes"
Output: Full multi-track compositionMelody Continuation:
Input: 8-bar melody
Prompt: "Continue in the style of jazz, add walking bass"
Output: Extended composition building on inputStyle Transfer:
Input: Classical piece
Prompt: "Transform to electronic/synthwave style"
Output: Same composition in new genreAccompaniment:
Input: Vocal track
Prompt: "Generate backing music, acoustic ballad style"
Output: Instrumental arrangement matching the vocalsModel Architecture
Hierarchical Generation: Structure → Harmony → Melody → Performance
Symbolic + Audio: Generate MIDI for control, then synthesize to audio.
Genre Embeddings: 500+ style vectors for fine-grained control.
Music Theory Constraints: Enforce valid progressions and voice leading.
Capabilities
| Feature | Capability |
|---|---|
| Duration | 30 sec to 10+ minutes |
| Tracks | Up to 16 simultaneous |
| Styles | 500+ genres/subgenres |
| Output | MIDI, WAV, MP3, stems |
| Control | BPM, key, mood, energy |
Example Generations
Lo-fi Hip Hop:
track = zen_musician.generate(
style="lofi_hiphop",
mood="relaxed",
bpm=75,
key="Dm",
duration=180,
include=["vinyl_crackle", "piano", "drums"]
)Epic Orchestral:
track = zen_musician.generate(
style="epic_orchestral",
mood="triumphant",
bpm=100,
key="C",
duration=240,
structure=["intro", "build", "climax", "resolution"]
)Ambient Electronic:
track = zen_musician.generate(
style="ambient_electronic",
mood="ethereal",
duration=600, # 10 minutes
evolving=True, # Gradual changes over time
complexity="minimal"
)Stem Generation
Zen Musician outputs individual stems for mixing:
stems = zen_musician.generate_stems(
style="rock",
stems=["drums", "bass", "rhythm_guitar", "lead_guitar", "vocals_synth"]
)
# Access individual tracks
stems.drums.export("drums.wav")
stems.bass.export("bass.wav")Producers can use generated stems as starting points, replacing or modifying individual elements.
Emotional Intelligence
Music expresses emotion. We trained on:
- Mood-labeled datasets: Human-annotated emotional content
- Film scores: Music designed to evoke specific feelings
- Listener feedback: How people describe how music makes them feel
The model understands prompts like "bittersweet", "nostalgic", "triumphant", "anxious".
Use Cases
Content Creators: Royalty-free music for videos.
Game Developers: Dynamic soundtracks that adapt to gameplay.
Film/TV: Temp tracks and final compositions.
Musicians: Starting points for new compositions.
Podcasters: Intro/outro music, background scoring.
Meditation/Wellness: Personalized ambient soundscapes.
Ethical Approach
No Artist Mimicry: We don't clone specific artists' styles.
Attribution: Generated music is clearly AI-generated.
Licensing: All output is royalty-free for commercial use.
Training Data: Properly licensed training material only.
Technical Performance
| Metric | Value |
|---|---|
| Generation Speed | 10x real-time |
| Audio Quality | 44.1kHz, 24-bit |
| Latency (streaming) | <500ms |
| Style Accuracy | 89% human rating |
What's Next
Live Performance: Real-time generation responding to musician input.
Lyrics Integration: Generate music that matches lyrical content.
Collaborative: AI jamming with human musicians.
Personalization: Learn your style, generate music you'll love.
Music has always been about human expression. Zen Musician is a new instrument - one that everyone can play.
This post is part of our retrospective series exploring the technical foundations of Hanzo and Zen.
Read more
Zen Artist: Image Generation Reimagined
Introducing Zen Artist - our multimodal image generation model with unprecedented control and quality.
Zen Omni: Unified Multimodal AI
Zen Omni is a 30B MoE unified multimodal model with Thinker-Talker architecture, handling text, vision, and audio in a single model with real-time speech-to-speech at under 300ms latency.
Zen Video: Generative Video at Scale
Introducing Zen Video - our generative video model for creating high-quality video content from text and images.