Text-only AI misses most of commerce. Products are visual. Reviews are textual. Behavior is structured data. Today we are launching multi-modal AI capabilities that understand all three.
The Modality Gap
Traditional commerce AI operates in silos:
- Text models: Process descriptions, reviews, search queries
- Vision models: Analyze product images
- Tabular models: Predict from structured data
Each modality provides partial understanding. Combining them has required manual feature engineering.
Unified Multi-Modal Models
Our new multi-modal architecture processes all modalities jointly:
Input: Product image + Description + Sales data + Reviews
                     ↓
┌─────────────────────────────────────────┐
│           Multi-Modal Encoder           │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  │
│  │ Vision  │  │Language │  │ Tabular │  │
│  │ Encoder │  │ Encoder │  │ Encoder │  │
│  └────┬────┘  └────┬────┘  └────┬────┘  │
│       └────────────┼────────────┘       │
│                    ↓                    │
│          Cross-Modal Attention          │
│                    ↓                    │
│          Unified Representation         │
└─────────────────────────────────────────┘
                     ↓
Output: Predictions, embeddings, generations

The model learns relationships across modalities: which visual features correlate with positive reviews, how descriptions affect conversion, what behavioral patterns indicate purchase intent.
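To make the unified representation concrete, here is how a caller might request one directly. Note that hanzo.embeddings.multimodal is a hypothetical method shown only for illustration; the documented calls in this post are search, analyze, and generate.

# Hypothetical method, for illustration only; not part of the documented API.
embedding = hanzo.embeddings.multimodal(
    image=product_image,
    text=product_description,
    tabular={"weekly_units_sold": 120, "return_rate": 0.03}
)
# One vector per product, usable for search, clustering, or prediction.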
Applications
Visual Search
"Find products that look like this":
results = hanzo.search.visual(
    image=uploaded_image,
    filters={"category": "furniture", "price_max": 500}
)

Returns products visually similar to the query image, combining visual understanding with structured filtering.
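A minimal sketch of consuming the response; the product_id and score fields are assumptions about the result shape, not documented here:

# Assumed result shape: each hit exposes a product_id and a similarity score.
for hit in results:
    print(hit["product_id"], round(hit["score"], 3))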
Product Understanding
Automatic analysis of product listings:
analysis = hanzo.products.analyze(product_id="prod_123")
# Returns:
{
    "visual_attributes": ["modern", "minimalist", "wood"],
    "sentiment_summary": "Customers praise quality, some note assembly difficulty",
    "competitive_position": "premium segment, unique design",
    "optimization_suggestions": [
        "Add lifestyle images showing scale",
        "Mention included tools in description"
    ]
}

Content Generation
Generate descriptions from images:
description = hanzo.content.generate(
    image=product_image,
    style="professional",
    include=["materials", "dimensions", "use_cases"]
)

Generate images from descriptions (coming soon):
images = hanzo.content.visualize(
description="Modern wooden desk, minimalist design",
style="product_photo",
variations=4
)Review Analysis
Understand reviews in context of product:
insights = hanzo.reviews.analyze(
product_id="prod_123",
reviews=reviews,
product_images=images
)
# Identifies: "Reviews mentioning 'smaller than expected' correlate with
# main image lacking size reference. Recommend adding comparison image."Technical Approach
Architecture
We use a transformer-based architecture with:
- Modality-specific encoders: Pretrained on large datasets
- Cross-modal attention: Learns modality relationships (sketched below)
- Unified decoder: Generates outputs in any modality
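A minimal PyTorch sketch of the fusion step, assuming each modality encoder has already produced token sequences projected to a shared hidden size; the dimensions, pooling, and single attention layer are illustrative, not our production configuration:

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Toy fusion block: concatenate per-modality token sequences, then
    let self-attention mix information across modalities."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, vision_tokens, text_tokens, tabular_tokens):
        # Each input: (batch, seq_len, d_model) from a modality-specific encoder
        tokens = torch.cat([vision_tokens, text_tokens, tabular_tokens], dim=1)
        attended, _ = self.attn(tokens, tokens, tokens)
        fused = self.norm(tokens + attended)
        # Mean-pool into one unified representation per example
        return fused.mean(dim=1)

fusion = CrossModalFusion()
unified = fusion(torch.randn(2, 49, 512), torch.randn(2, 32, 512), torch.randn(2, 8, 512))
print(unified.shape)  # torch.Size([2, 512])

Concatenating tokens before self-attention lets every modality attend to every other, which is where relationships like "this visual feature predicts positive reviews" can be learned.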
Training
Models trained on:
- 100M+ product images
- 500M+ product descriptions
- 1B+ customer interactions
- 50M+ reviews
Contrastive learning aligns modalities in shared embedding space.
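The training objective is not spelled out beyond this, so treat the following symmetric InfoNCE loss, in the spirit of CLIP-style contrastive learning, as an illustrative sketch of how two modalities can be aligned:

import torch
import torch.nn.functional as F

def contrastive_align(image_emb, text_emb, temperature=0.07):
    # Matched pairs sit on the diagonal; everything else is a negative.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric: image-to-text and text-to-image directions both count.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2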
Inference
Optimized for production:
- Quantized models for efficiency
- Batch processing for throughput
- Caching for repeated queries (see the sketch below)
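As a hedged example of the caching point, repeated queries can be memoized by hashing the request payload; the in-process dict below stands in for whatever shared store a production deployment would actually use:

import hashlib
import json

_cache = {}

def cached_analyze(client, product_id, **params):
    # Key the cache on a canonical serialization of the full request.
    key = hashlib.sha256(
        json.dumps({"product_id": product_id, **params}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = client.products.analyze(product_id=product_id, **params)
    return _cache[key]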
Privacy
Multi-modal analysis respects privacy:
- Customer data processed in aggregate
- No individual behavior in model weights
- Review analysis anonymizes authors
API Access
Multi-modal capabilities available through existing APIs:
POST /v1/products/analyze
POST /v1/search/visual
POST /v1/content/generate
POST /v1/reviews/analyze
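As a hedged illustration of calling these endpoints directly, here is a plain-HTTP version of visual search. The endpoint path comes from the list above; the base URL, Bearer auth, and multipart field names are assumptions, not documented behavior:

import json
import requests

# Base URL, auth scheme, and field names are assumed for illustration.
resp = requests.post(
    "https://api.hanzo.example/v1/search/visual",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    files={"image": open("query.jpg", "rb")},
    data={"filters": json.dumps({"category": "furniture", "price_max": 500})},
)
resp.raise_for_status()
results = resp.json()

What's Next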
- Audio modality (voice commerce)
- Video understanding (product demos)
- 3D model generation
- Real-time multi-modal search
Commerce is multi-modal. AI for commerce should be too.
Zach Kelling is the founder of Hanzo Industries.