zoo/ blog

Multi-Modal AI for Commerce

How we are combining vision, language, and structured data for next-generation commerce AI.

Text-only AI misses most of commerce. Products are visual. Reviews are textual. Behavior is structured data. Today we are launching multi-modal AI capabilities that understand all three.

The Modality Gap

Traditional commerce AI operates in silos:

  • Text models: Process descriptions, reviews, search queries
  • Vision models: Analyze product images
  • Tabular models: Predict from structured data

Each modality provides partial understanding. Combining them has required manual feature engineering.
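
The silo approach the unified model replaces can be sketched in a few lines. Here each modality's features are hand-extracted and concatenated for a downstream predictor; the feature names and values are purely illustrative, but the pattern is the point: every cross-modal signal must be wired in by hand.

```python
import numpy as np

# Hand-engineered features extracted separately per modality (illustrative values).
image_features = np.array([0.8, 0.1, 0.4])    # e.g. color histogram summary
text_features = np.array([0.6, 0.9])          # e.g. topic scores from reviews
tabular_features = np.array([120.0, 4.3])     # e.g. weekly sales, average rating

# The silo approach: concatenate per-modality features by hand and feed a
# single downstream model. Any new cross-modal interaction has to be
# engineered explicitly rather than learned.
combined = np.concatenate([image_features, text_features, tabular_features])
print(combined.shape)  # (7,)
```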

Unified Multi-Modal Models

Our new multi-modal architecture processes all modalities jointly:

Input: Product image + Description + Sales data + Reviews

┌─────────────────────────────────────────┐
│         Multi-Modal Encoder             │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐   │
│  │ Vision  │ │Language │ │Tabular  │   │
│  │Encoder  │ │Encoder  │ │Encoder  │   │
│  └────┬────┘ └────┬────┘ └────┬────┘   │
│       └──────────┼──────────┘          │
│                  ↓                      │
│         Cross-Modal Attention           │
│                  ↓                      │
│         Unified Representation          │
└─────────────────────────────────────────┘

Output: Predictions, embeddings, generations

The model learns relationships across modalities: which visual features correlate with positive reviews, how descriptions affect conversion, what behavioral patterns indicate purchase intent.
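
The cross-modal attention step in the diagram can be illustrated with a minimal sketch. This is not our production architecture, just standard scaled dot-product attention over tokens pooled from all three encoders; the token counts and the 16-dimensional shared space are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # shared embedding dimension (assumption for illustration)

# Token embeddings as they might leave each modality-specific encoder.
vision_tokens = rng.normal(size=(4, d))    # image patches
language_tokens = rng.normal(size=(6, d))  # description/review tokens
tabular_tokens = rng.normal(size=(3, d))   # encoded numeric columns

def attention(q, k, v):
    """Scaled dot-product attention with a numerically stable softmax."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Cross-modal attention: every token attends over tokens from all modalities,
# so visual features can pick up signal from text and behavior, and vice versa.
all_tokens = np.concatenate([vision_tokens, language_tokens, tabular_tokens])
unified = attention(all_tokens, all_tokens, all_tokens)
print(unified.shape)  # (13, 16)
```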

Applications

Visual Search

"Find products that look like this":

results = hanzo.search.visual(
    image=uploaded_image,
    filters={"category": "furniture", "price_max": 500}
)

Returns products visually similar to the query image, combining visual understanding with structured filtering.

Product Understanding

Automatic analysis of product listings:

analysis = hanzo.products.analyze(product_id="prod_123")

# Returns:
{
  "visual_attributes": ["modern", "minimalist", "wood"],
  "sentiment_summary": "Customers praise quality, some note assembly difficulty",
  "competitive_position": "premium segment, unique design",
  "optimization_suggestions": [
    "Add lifestyle images showing scale",
    "Mention included tools in description"
  ]
}

Content Generation

Generate descriptions from images:

description = hanzo.content.generate(
    image=product_image,
    style="professional",
    include=["materials", "dimensions", "use_cases"]
)

Generate images from descriptions (coming soon):

images = hanzo.content.visualize(
    description="Modern wooden desk, minimalist design",
    style="product_photo",
    variations=4
)

Review Analysis

Understand reviews in context of product:

insights = hanzo.reviews.analyze(
    product_id="prod_123",
    reviews=reviews,
    product_images=images
)

# Identifies: "Reviews mentioning 'smaller than expected' correlate with
# main image lacking size reference. Recommend adding comparison image."

Technical Approach

Architecture

We use a transformer-based architecture with:

  • Modality-specific encoders: Pretrained on large datasets
  • Cross-modal attention: Learns modality relationships
  • Unified decoder: Generates outputs in any modality

Training

The models are trained on:

  • 100M+ product images
  • 500M+ product descriptions
  • 1B+ customer interactions
  • 50M+ reviews

Contrastive learning aligns modalities in shared embedding space.
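
A minimal sketch of that alignment objective, in the style of CLIP's InfoNCE loss: matching image/text pairs sit on the diagonal of a similarity matrix and are pulled together, while all other pairings in the batch are pushed apart. The batch size, dimension, and temperature are illustrative, not our training configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 32  # batch of 8 paired products, 32-dim embeddings (assumption)

def normalize(x):
    """L2-normalize rows so similarities are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Embeddings of the same products from two modality encoders (e.g. image, text).
image_emb = normalize(rng.normal(size=(n, d)))
text_emb = normalize(rng.normal(size=(n, d)))

# InfoNCE-style contrastive loss: row i's positive is column i.
temperature = 0.07
logits = image_emb @ text_emb.T / temperature
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -np.mean(np.diag(log_probs))
print(float(loss) > 0)  # True: diagonal pairs are not yet aligned
```

Minimizing this loss over real image/text pairs drags matching items together in the shared space, which is what makes "find products that look like this" answerable with a single nearest-neighbor query.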

Inference

Optimized for production:

  • Quantized models for efficiency
  • Batch processing for throughput
  • Caching for repeated queries
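
The caching point is the easiest to make concrete. A sketch of the idea, using Python's standard `lru_cache` keyed on a content hash (the `cached_embedding` function and its stand-in body are hypothetical, not our serving code):

```python
from functools import lru_cache
import hashlib

calls = 0  # counts how many "model forward passes" actually run

@lru_cache(maxsize=10_000)
def cached_embedding(content_hash: str) -> tuple:
    """Stand-in for an expensive embedding call, memoized by content hash."""
    global calls
    calls += 1
    return tuple(int(c, 16) / 15 for c in content_hash[:8])

key = hashlib.sha256(b"product image bytes").hexdigest()
first = cached_embedding(key)
second = cached_embedding(key)  # served from cache; no second model call
print(calls)  # 1
```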

Privacy

Multi-modal analysis respects privacy:

  • Customer data processed in aggregate
  • No individual behavior in model weights
  • Review analysis anonymizes authors

API Access

Multi-modal capabilities are available through the existing APIs:

POST /v1/products/analyze
POST /v1/search/visual
POST /v1/content/generate
POST /v1/reviews/analyze
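
For raw HTTP access, a request against the visual search endpoint might be assembled like this. The base URL, body fields, and auth scheme shown here are assumptions for illustration; check the API reference for the exact parameters.

```python
import json
from urllib import request

# Hypothetical request body; field names are illustrative, not confirmed parameters.
body = json.dumps({
    "image_url": "https://example.com/query.jpg",
    "filters": {"category": "furniture", "price_max": 500},
}).encode()

req = request.Request(
    "https://api.hanzo.ai/v1/search/visual",  # base URL is an assumption
    data=body,
    headers={
        "Authorization": "Bearer <API_KEY>",
        "Content-Type": "application/json",
    },
    method="POST",
)
# request.urlopen(req) would send it; omitted so the sketch stays offline.
print(req.get_method())  # POST
```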

What's Next

  • Audio modality (voice commerce)
  • Video understanding (product demos)
  • 3D model generation
  • Real-time multi-modal search

Commerce is multi-modal. AI for commerce should be too.


Zach Kelling is the founder of Hanzo Industries.