New from Alibaba · Now on Segmind

HappyHorse 1.0.
Cinema. In Every Language.

Alibaba's flagship video model, built to challenge Seedance 2.0. Native 1080p with synced audio, 7-language lip-sync, and multi-shot clips up to 15s. From $0.175 per second.

What Makes It Cinematic

Cinematic. Multilingual.
#1 on three of four arena tracks.

Cinema Challenger

Cinema-grade output,
Alibaba's answer to Seedance 2.0.

15-billion-parameter unified Transformer. Beats Seedance 2.0 on three of four Artificial Analysis Video Arena tracks, while undercutting it on per-second pricing.

#1 ON ARENA
ALIBABA AI
Native Audio-Video

Voice. Ambient. Foley.
One forward pass.

Veo 3.1 and Kling 3.0 Pro charge 50 to 100% more per second to add audio. HappyHorse generates video and synced audio in one Transformer pass, at no extra cost.

Video · Dialogue · Foley

Audio premium / sec vs base price
HappyHorse 1.0 · Included
Kling 3.0 Pro · +50%
Veo 3.1 · +100%
7-Language Lip-Sync

Phoneme-level. Out of the box.

Lowest reported word-error rate of any production video model, at ~14.6% across all seven supported languages.

Mandarin · Cantonese · English · Japanese · Korean · German · French
Real Pricing

From $0.175/sec

Run 100 variants for the cost of one studio reshoot. No subscriptions, no minimums.

720p
$0.175/sec
1080p
$0.30/sec
Multi-Shot · 1080p · 15s

One prompt.
Multiple cuts. Native 1080p.

Up to 15 seconds with character continuity across shot changes. The model decides coverage automatically.

Wide · Close-up · Reverse · Medium

How It Works

Brief to cinematic spot.
For pocket change per second.

01

Direct & Reference

Prompt + up to 9 references.

Start from a text prompt, an image, or a multi-asset brief: faces, environments, palettes, voice samples. One unified endpoint, /v1/happyhorse, accepts every combination of inputs.

Reference images lock identity and style. Voice references guide the dialogue. The model figures out the role of each asset.

Drop references
9 images · 3 videos · voice sample
@face.png · @style.jpg · @voice.mp3 · @motion.mp4
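The single-endpoint, multi-reference flow above can be sketched as a request payload. The `/v1/happyhorse` path comes from this page; the field names (`prompt`, `references`, `resolution`, `duration_seconds`, `aspect_ratio`) are illustrative assumptions, not confirmed API parameters.

```python
import json

# Hypothetical request body for Segmind's /v1/happyhorse endpoint.
# Field names are assumptions for illustration; check the live API reference.
payload = {
    "prompt": "A barista slides a latte across the counter, warm morning light",
    "references": [
        {"type": "image", "url": "https://example.com/face.png"},    # identity lock
        {"type": "image", "url": "https://example.com/style.jpg"},   # style / palette
        {"type": "audio", "url": "https://example.com/voice.mp3"},   # voice guide
        {"type": "video", "url": "https://example.com/motion.mp4"},  # motion guide
    ],
    "resolution": "1080p",
    "duration_seconds": 5,
    "aspect_ratio": "16:9",
}

# The model infers each reference's role, so nothing beyond the media type is
# attached per asset. Send with any HTTP client, e.g.:
#   requests.post("https://api.segmind.com/v1/happyhorse",
#                 headers={"x-api-key": "<YOUR_KEY>"}, json=payload)
print(json.dumps(payload, indent=2))
```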
02

Generate Sound + Vision

One forward pass. ~80–160 seconds on Segmind.

A unified single-stream Transformer denoises text, image, video, and audio tokens together. Eight DMD-2 steps, classifier-free guidance disabled.

1080p, 24 fps, multi-shot, dialogue and Foley aligned to the frame. No separate audio model, no post-production stitch.

Inference timeline
Token sequence
Video stream
Audio stream
8 denoising steps · no CFG
~80–160s @ 1080p
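The control flow of that single-stream pass can be sketched in a toy form. Everything below is a stand-in (the real denoiser, token counts, and dimensions are not public); only the structure mirrors what the page describes: one concatenated text/video/audio token sequence, 8 denoising steps, and one forward call per step because classifier-free guidance is disabled.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the unified token sequence: text, video, and audio tokens
# concatenated into one stream and denoised together.
n_text, n_video, n_audio, d = 16, 64, 32, 8
tokens = rng.normal(size=(n_text + n_video + n_audio, d))  # start from noise

def denoise_step(x, t):
    # Placeholder for one Transformer forward pass at timestep t. A single
    # call per step -- no second unconditional pass, since classifier-free
    # guidance is off under the DMD-2 distilled schedule.
    return x * 0.8  # toy contraction toward the data manifold

NUM_STEPS = 8  # DMD-2 distillation, vs the 25-50 steps typical for diffusion video
for t in range(NUM_STEPS, 0, -1):
    tokens = denoise_step(tokens, t)

# Modality streams are slices of the same sequence, decoded downstream.
video = tokens[n_text:n_text + n_video]
audio = tokens[n_text + n_video:]
print(video.shape, audio.shape)
```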
03

Edit, Localize, Iterate

20+ variants for the cost of one studio shot.

Pass a source video plus reference images to swap a product, wardrobe, or background while preserving motion and timing. Re-lipsync existing footage to any of 7 languages without reshooting.

At $0.30 per second, you can run a dozen creative variants for a single ad spot's traditional production budget.

Variant cost ledger
EN
Hook A · English
$1.50
ZH
Hook A · Mandarin
$1.50
JA
Hook B · Japanese
$1.50
3 variants · $4.50 total · 5-second 1080p
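The ledger arithmetic is simple enough to script. This minimal calculator uses only the per-second list prices stated on this page; the helper name `clip_cost` is ours, not part of any API.

```python
# Segmind list prices for HappyHorse 1.0, per second of generated video
# (audio included at both tiers, per this page).
PRICE_PER_SEC = {"720p": 0.175, "1080p": 0.30}

def clip_cost(seconds, resolution="1080p"):
    """USD cost of one generated clip at the given resolution."""
    return round(seconds * PRICE_PER_SEC[resolution], 2)

# Three 5-second 1080p language variants, as in the ledger above.
variants = ["Hook A - English", "Hook A - Mandarin", "Hook B - Japanese"]
per_variant = clip_cost(5, "1080p")
total = sum(clip_cost(5, "1080p") for _ in variants)
print(per_variant)  # 1.5
print(total)        # 4.5
```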

Use Cases

Built for commerce.
Cinematic by default.

E-Commerce
Product video at catalogue scale

Image-to-video PDP showcases with push-ins, soft light sweeps, matched Foley. Batch-generate 20 variants from a single hero shot.

Multilingual Ads
One creative. Seven languages.

Native phoneme-level lip-sync in Mandarin, Cantonese, English, Japanese, Korean, German, and French. No dub-and-relipsync passes.

AI Hosts
Persistent live-commerce presenters

Subject-to-video with up to 9 reference images. Identity preserved across cuts and languages, perfect for branded virtual hosts and mascots.

Short-Form Social
Vertical video with native audio

9:16 generations for TikTok, Reels, Shorts, Douyin, and Xiaohongshu. Dialogue and ambient sound in the same pass, no separate dubbing pipeline.

Brand Films
Cinematic spots, dialogue and all

Multi-shot brand films with dialogue, ambient sound, and Foley produced together. Production quality formerly locked behind closed APIs.

Edit & Replace
Swap product, wardrobe, or background

Video-to-video and subject-and-video-to-video edits change on-screen elements while preserving camera moves, motion, and timing. No reshoot.

Under the Hood

Built different.
Architecturally speaking.

Unified Token Sequence

Text, image, video, and audio tokens are jointly denoised in one Transformer. No separate audio model, no cross-attention. The model learns alignment, not stitching.

DMD-2 Distillation

Distribution Matching Distillation v2 cuts inference to 8 denoising steps and runs without classifier-free guidance. Far below the 25 to 50 steps typical for diffusion video.

Per-Head Modality Gates

Learned scalar gates dampen destructive cross-modal gradients during joint training, keeping the audio and video branches from competing for capacity. Reported in technical notes, not yet peer-reviewed.
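A toy sketch of the per-head gating idea, heavily hedged: the mechanism is described only in technical notes and not peer-reviewed, and every shape and value below is an illustrative assumption. The idea shown is one learned scalar per (head, modality) pair that scales that head's contribution, so training can mute a head for one modality rather than letting audio and video fight for shared capacity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative dimensions only -- not the real model's.
n_heads, n_tokens, d_head = 4, 10, 8
head_outputs = rng.normal(size=(n_heads, n_tokens, d_head))

# Hypothetical learned scalars: gates[h, m] for head h and modality m
# (0 = video, 1 = audio). Near-zero entries mute a head for that modality.
gates = np.array([
    [1.0, 0.1],   # head 0: mostly video
    [0.9, 0.8],   # head 1: shared
    [0.05, 1.0],  # head 2: mostly audio
    [0.7, 0.6],   # head 3: shared
])

modality = np.array([0] * 6 + [1] * 4)  # first 6 tokens video, last 4 audio

# Scale each head's output per token by that token's modality gate, then sum
# heads (standing in for the usual concat-and-project).
scaled = head_outputs * gates[:, modality][:, :, None]
combined = scaled.sum(axis=0)
print(combined.shape)  # (10, 8)
```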

Technical Specifications

Architecture
Unified Single-Stream Transformer
15B parameters, 40 layers, no cross-attention modules
Tighter cross-modal alignment
Resolution
Native 1080p
720p and 1080p, optional super-resolution module
Broadcast-grade out of the box
Duration
3–15 seconds
Multi-shot composition with character continuity in a single clip
Multiple cuts, one generation
Aspect Ratios
16:9 · 9:16 · 1:1 · 4:3 · 3:4
24 fps across all formats
Every channel, one pipeline
Latency
~80–160s per 5s clip
Measured end-to-end on the Segmind API, 720p to 1080p · DMD-2 distillation, 8 denoising steps, no CFG
8 steps vs the 25 to 50 typical for diffusion video
Audio
Joint generation
Dialogue, ambient, Foley in the same forward pass · 7-language lip-sync
~14.6% WER, lowest reported for any production video model

Head to Head

The model that beats Seedance 2.0.

Alibaba's flagship leads three of four Artificial Analysis Video Arena tracks against the current state of the art, with audio always included and 7-language lip-sync built in.

Arena Standing

3 of 4

tracks at #1 on the Artificial Analysis Video Arena

Audio Included

$0.30/sec

at 1080p with synced audio. Veo 3.1 and Kling 3.0 Pro charge 50 to 100% extra to add it.

Generation Time

~80–160s

end-to-end for a 5-second clip on the Segmind API, 720p to 1080p

Track | HappyHorse 1.0 (Alibaba) | Seedance 2.0 (ByteDance · current SOTA) | Kling 3.0 Pro (Kuaishou)
T2V, no audio (Text-to-Video) | #1 · Elo ~1389 | Elo ~1273 | Elo ~1247
I2V, no audio (Image-to-Video) | #1 · Elo ~1416 | Elo ~1355 | n/a
T2V, with audio (Audio-on Text-to-Video) | #1 · +11 Elo | very close 2nd | n/a
I2V, with audio (Audio-on Image-to-Video) | #2 · −1 Elo | #1 · +1 Elo | n/a
1080p price / sec (with audio, on Segmind) | $0.30 · audio included | token-based · ~$0.24 typical | $0.336 · 5s audio mode
Max clip length (single-call duration) | 15s · multi-shot | 10s | 10s
Lip-sync languages (native phoneme-level) | 7 · ~14.6% WER | n/a | n/a

Artificial Analysis Video Arena, blind preference Elo, late-April 2026 snapshot. Arena prompts skew toward portrait and dialogue content; long-action and reference-edit categories remain competitive across models. All per-second pricing reflects current Segmind list pricing for the named models with audio enabled where applicable. Seedance 2.0 uses token-based pricing; the listed per-second figure is a typical 5-second 1080p call ($1.21 average).

Flexible plans for everyone

Whether you're just starting out or need enterprise-grade power, we have a plan that fits your needs.

Business

$99/mo

For production environments and professional use cases.

  • $99 monthly credits
  • 100 GB Storage
  • 500 RPM
  • 2 business day support
  • Pixelflow Premium Templates
Get Started

Scale

$599/mo

For large companies that require custom solutions and private deployments.

  • $599 monthly credits
  • 1 TB Storage
  • 1000 RPM pooled
  • 1 business day support
  • Detailed usage analytics
Get Started

Enterprise

Custom solutions with enterprise-grade security and support

99.99% SLA · Dedicated Slack support · SOC 2 compliance
Contact Sales

FAQ

Frequently Asked Questions

What is HappyHorse 1.0, and how do I access it?

HappyHorse 1.0 is Alibaba's hosted flagship video model, available today via Segmind's production API. Weights are not publicly released; the production path is the Segmind API. Commercial usage rights are included with API access.

How much does HappyHorse 1.0 cost?

HappyHorse 1.0 on Segmind is priced at $0.175 per second at 720p and $0.30 per second at 1080p, with audio always included. That is 25% below Veo 3.1 with audio ($0.40 per second) and on par with Kling 3.0 Pro ($0.336 per second), while Veo 3.1 and Kling 3.0 Pro charge 50 to 100% extra to add audio. Seedance 2.0, the current closed SOTA, uses token-based pricing that averages roughly $1.21 per typical 5-second call. A 5-second 1080p HappyHorse clip with full audio costs about $1.50. There are no minimums or subscriptions: pay per generation.

Which languages does lip-sync support?

Seven languages: Mandarin, Cantonese, English, Japanese, Korean, German, and French. Reported word-error rate is approximately 14.6%, the lowest of any production video model. The same prompt can be regenerated across all seven languages with phoneme-level lip-sync, eliminating dub-and-relipsync passes for cross-border ad campaigns.

How does HappyHorse compare to Seedance 2.0 and Kling 3.0 Pro?

HappyHorse leads the Artificial Analysis Video Arena's blind-preference Elo on three of four tracks: text-to-video (no audio), image-to-video (no audio), and audio-on text-to-video. Seedance 2.0 still leads audio-on image-to-video by ~1 Elo and remains competitive in long-action and reference-heavy editing scenarios. Kling 3.0 Pro is comparable on price ($0.336 per second with audio vs $0.30 for HappyHorse) but caps at 10 seconds per clip; HappyHorse extends to 15 seconds. The differentiators for HappyHorse are bundled audio, native 7-language lip-sync, and longer single-call clips at a lower per-second price.

How long can a single clip be?

Three to fifteen seconds, with multi-shot composition inside the clip. Quality is highest in the 3 to 8 second band; community testing reports visible drift in character consistency past about 10 seconds. For longer narratives, generate multiple clips and compose externally.

How fast is generation?

End-to-end on the Segmind API, a 5-second clip returns in approximately 80 to 160 seconds depending on resolution and queue depth. The model itself runs in only 8 denoising steps thanks to DMD-2 distillation with no classifier-free guidance; total wall-clock time includes upstream queueing on Alibaba's inference cluster.

Can I use the outputs commercially?

Yes. Commercial usage rights are included with API access through Segmind. The model is available worldwide via the Segmind API.

Now on Segmind

Cinema-grade output.
Cents per second.

Generate sound and sight together. Text, image, reference, and edit inputs in one endpoint. Roughly 80 to 160 seconds per clip on the Segmind API.