HappyHorse 1.0.
Cinema. In Every Language.
Alibaba's flagship video model. Built to challenge Seedance 2.0.
1080p with native audio, 7-language lip-sync, multi-shot up to 15s. From $0.175 per second.
What Makes It Cinematic
Cinematic. Multilingual.
#1 on three of four arena tracks.
Cinema-grade output.
Alibaba's answer to Seedance 2.0.
15-billion-parameter unified Transformer. Beats Seedance 2.0 on three of four Artificial Analysis Video Arena tracks, while undercutting it on per-second pricing.
Voice. Ambient. Foley.
One forward pass.
Veo 3.1 and Kling 3.0 Pro charge 50 to 100% more per second to add audio. HappyHorse generates video and synced audio in one Transformer pass, at no extra cost.
Phoneme-level. Out of the box.
Lowest reported word-error rate of any production video model, at ~14.6% across all seven supported languages.
From $0.175/sec
Run 100 variants for the cost of one studio reshoot. No subscriptions, no minimums.
One prompt.
Multiple cuts. Native 1080p.
Up to 15 seconds with character continuity across shot changes. The model decides coverage automatically.
How It Works
Brief to cinematic spot.
For pocket change per second.
Direct & Reference
Prompt + up to 9 references.
Start from a text prompt, an image, or a multi-asset brief: faces, environments, palettes, voice samples. One unified endpoint, /v1/happyhorse, accepts every combination of inputs.
Reference images lock identity and style. Voice references guide the dialogue. The model figures out the role of each asset.
9 images · 3 videos · voice sample
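As a sketch, a reference-driven request might look like the following. The /v1/happyhorse path is documented above, but the base URL, auth header, and field names here are illustrative assumptions; check the Segmind API reference for the exact schema.

```python
# Sketch of a reference-driven generation request. The /v1/happyhorse
# path is documented above; the base URL, auth header, and field names
# below are illustrative assumptions, not the confirmed schema.
import requests

API_KEY = "YOUR_SEGMIND_API_KEY"

payload = {
    "prompt": "Slow push-in on a ceramic mug in warm morning light, "
              "presenter delivers the tagline",
    "reference_images": ["https://example.com/mug_hero.jpg"],  # up to 9
    "voice_reference": "https://example.com/brand_voice.wav",  # guides dialogue
    "resolution": "1080p",
    "duration_seconds": 5,
}

resp = requests.post(
    "https://api.segmind.com/v1/happyhorse",  # assumed base URL
    json=payload,
    headers={"x-api-key": API_KEY},
)
resp.raise_for_status()
print(resp.json())
```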
Generate Sound + Vision
One forward pass. ~80–160 seconds on Segmind.
A unified single-stream Transformer denoises text, image, video, and audio tokens together. Eight DMD-2 steps, classifier-free guidance disabled.
1080p, 24 fps, multi-shot, dialogue and Foley aligned to the frame. No separate audio model, no post-production stitch.
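If the API exposes generations as async jobs (an assumption; Segmind's actual response schema may differ), retrieving the finished clip could look like this:

```python
# Hypothetical polling loop, assuming the API returns an async job id
# and a status field; the route and field names are not documented here.
import time
import requests

def wait_for_clip(job_id: str, api_key: str, poll_every: float = 5.0) -> str:
    url = f"https://api.segmind.com/v1/happyhorse/{job_id}"  # assumed route
    while True:
        r = requests.get(url, headers={"x-api-key": api_key})
        r.raise_for_status()
        job = r.json()
        if job.get("status") == "succeeded":
            return job["video_url"]  # 1080p MP4 with synced audio
        if job.get("status") == "failed":
            raise RuntimeError(job.get("error", "generation failed"))
        time.sleep(poll_every)  # ~80-160 s end-to-end is typical per above
```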
Edit, Localize, Iterate
20+ variants for the cost of one studio shot.
Pass a source video plus reference images to swap a product, wardrobe, or background while preserving motion and timing. Re-lipsync existing footage to any of the seven supported languages without reshooting.
At $0.30 per second, you can run a dozen creative variants for a single ad spot's traditional production budget.
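A hypothetical edit-and-localize call against the same endpoint; the "video", "reference_images", and "lipsync_language" field names are assumptions for illustration:

```python
# Hypothetical video-to-video edit plus re-lipsync on the same endpoint;
# field names are illustrative assumptions, not the confirmed schema.
import requests

payload = {
    "video": "https://example.com/spot_v1.mp4",  # source footage to edit
    "reference_images": ["https://example.com/new_bottle.png"],
    "prompt": "Swap the hero product for the matte-black bottle",
    "lipsync_language": "ja",  # re-lipsync to any of the 7 languages
}

resp = requests.post(
    "https://api.segmind.com/v1/happyhorse",
    json=payload,
    headers={"x-api-key": "YOUR_SEGMIND_API_KEY"},
)
resp.raise_for_status()
```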
Use Cases
Built for commerce.
Cinematic by default.

Product video at catalogue scale
Image-to-video PDP showcases with push-ins, soft light sweeps, matched Foley. Batch-generate 20 variants from a single hero shot.

One creative. Seven languages.
Native phoneme-level lip-sync in Mandarin, Cantonese, English, Japanese, Korean, German, and French. No dub-and-relipsync passes.

Persistent live-commerce presenters
Subject-to-video with up to 9 reference images. Identity preserved across cuts and languages, perfect for branded virtual hosts and mascots.

Vertical video with native audio
9:16 generations for TikTok, Reels, Shorts, Douyin, and Xiaohongshu. Dialogue and ambient sound in the same pass, no separate dubbing pipeline.

Cinematic spots, dialogue and all
Multi-shot brand films with dialogue, ambient sound, and Foley produced together. Production quality formerly locked behind closed APIs.

Swap product, wardrobe, or background
Video-to-video and subject-and-video-to-video edits change on-screen elements while preserving camera moves, motion, and timing. No reshoot.
Under the Hood
Built different.
Architecturally speaking.
Unified Token Sequence
Text, image, video, and audio tokens are jointly denoised in one Transformer. No separate audio model, no cross-attention. The model learns alignment, not stitching.
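As a toy illustration of what a unified token sequence means in practice, here is a minimal PyTorch sketch: per-modality projections into a shared width, a learned modality embedding, and one concatenated sequence for the Transformer to denoise. All dimensions are illustrative, not HappyHorse's actual configuration.

```python
# Toy sketch of a unified token sequence: project each modality into a
# shared width, add a learned modality embedding, concatenate, and hand
# one sequence to the Transformer. Dimensions are illustrative only.
import torch
import torch.nn as nn

d_model = 512
proj = nn.ModuleDict({
    "text":  nn.Linear(768, d_model),
    "image": nn.Linear(1024, d_model),
    "video": nn.Linear(1024, d_model),
    "audio": nn.Linear(256, d_model),
})
modality_emb = nn.Embedding(4, d_model)  # 0=text 1=image 2=video 3=audio

def unify(tokens: dict) -> torch.Tensor:
    parts = []
    for idx, name in enumerate(["text", "image", "video", "audio"]):
        parts.append(proj[name](tokens[name]) + modality_emb.weight[idx])
    return torch.cat(parts, dim=1)  # (batch, total_tokens, d_model)

seq = unify({
    "text":  torch.randn(1, 32, 768),
    "image": torch.randn(1, 64, 1024),
    "video": torch.randn(1, 256, 1024),
    "audio": torch.randn(1, 128, 256),
})
print(seq.shape)  # torch.Size([1, 480, 512])
```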
DMD-2 Distillation
Distribution Matching Distillation v2 cuts inference to 8 denoising steps and runs without classifier-free guidance. Far below the 25 to 50 steps typical for diffusion video.
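For intuition, a generic 8-step sampler with guidance disabled looks like this: one model call per step, versus the paired conditional and unconditional calls that classifier-free guidance requires. This is a standard Euler-style flow-matching loop, not HappyHorse's actual schedule.

```python
# Generic 8-step Euler sampler with classifier-free guidance disabled:
# one model call per step instead of the cond/uncond pair CFG needs.
# A flow-matching-style sketch, not HappyHorse's actual schedule.
import torch

@torch.no_grad()
def sample(model, cond, shape, steps: int = 8, device: str = "cpu"):
    x = torch.randn(shape, device=device)  # start from pure noise
    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)
    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        v = model(x, t, cond)       # single forward pass, no CFG pair
        x = x + (t_next - t) * v    # Euler step toward the data
    return x
```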
Per-Head Modality Gates
Learned scalar gates dampen destructive cross-modal gradients during joint training, keeping the audio and video branches from competing for capacity. Reported in technical notes, not yet peer-reviewed.
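Since the mechanism is only described in brief technical notes, the following PyTorch sketch is a guess at one plausible implementation: a learned scalar per (modality, attention head) that rescales each head's output tokens.

```python
# Speculative sketch of per-head modality gates: one learned scalar per
# (modality, head), initialized to 1.0 so gating starts as a no-op,
# rescaling each head's output tokens according to the token's modality.
import torch
import torch.nn as nn

class PerHeadModalityGate(nn.Module):
    def __init__(self, n_heads: int, n_modalities: int = 4):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(n_modalities, n_heads))

    def forward(self, attn_out: torch.Tensor, modality_ids: torch.Tensor):
        # attn_out: (batch, heads, tokens, head_dim)
        # modality_ids: (tokens,) integer modality id per token
        g = self.gate[modality_ids]   # (tokens, heads)
        g = g.t()[None, :, :, None]   # (1, heads, tokens, 1)
        return attn_out * g
```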
Technical Specifications
Head to Head
The model that beats Seedance 2.0.
Alibaba's flagship leads three of four Artificial Analysis Video Arena tracks against the current state of the art, with audio always included and 7-language lip-sync built in.
3 of 4
tracks at #1 on the Artificial Analysis Video Arena
$0.30/sec
at 1080p with synced audio. Veo 3.1 and Kling 3.0 Pro charge 50 to 100% extra to add it.
~80–160s
end-to-end for a 5-second clip on the Segmind API, 720p to 1080p
Track
HappyHorse 1.0 · Alibaba
Seedance 2.0 · ByteDance · current SOTA
Kling 3.0 Pro · Kuaishou
Artificial Analysis Video Arena, blind-preference Elo, late-April 2026 snapshot. Arena prompts skew toward portrait and dialogue content; long-action and reference-edit categories remain competitive across models. All per-second pricing reflects current Segmind list pricing for the named models with audio enabled where applicable. Seedance 2.0 uses token-based pricing; its listed per-second figure is derived from a typical 5-second 1080p call, which averages $1.21.
Flexible plans for everyone
Whether you're just starting out or need enterprise-grade power, we have a plan that fits your needs.
Business
For production environments and professional use cases.
- $99 monthly credits
- 100 GB Storage
- 500 RPM
- 2 business day support
- Pixelflow Premium Templates
Scale
For large companies that require custom solutions and private deployments.
- $599 monthly credits
- 1 TB Storage
- 1000 RPM pooled
- 1 business day support
- Detailed usage analytics
Enterprise
Custom solutions with enterprise-grade security and support.
FAQ
Frequently Asked Questions
What is HappyHorse 1.0, and how do I access it?
HappyHorse 1.0 is Alibaba's hosted flagship video model, available today via Segmind's production API. Weights are not publicly released, so the Segmind API is the production path. Commercial usage rights are included with API access.
How much does it cost?
HappyHorse 1.0 on Segmind is priced at $0.175 per second at 720p and $0.30 per second at 1080p, with audio always included. That is 25% below Veo 3.1 with audio ($0.40 per second) and on par with Kling 3.0 Pro ($0.336 per second); both of those models charge 50 to 100% extra to add audio. Seedance 2.0, the current closed SOTA, uses token-based pricing that averages roughly $1.21 per typical 5-second call. A 5-second 1080p HappyHorse clip with full audio costs $1.50. There are no minimums or subscriptions: pay per generation.
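The arithmetic, using the list prices above:

```python
# Cost arithmetic from the list prices above (audio always included).
PER_SECOND = {"720p": 0.175, "1080p": 0.30}

def clip_cost(seconds: float, resolution: str = "1080p") -> float:
    return seconds * PER_SECOND[resolution]

print(clip_cost(5, "1080p"))  # 1.5   (the 5-second example above)
print(clip_cost(5, "720p"))   # 0.875
print(20 * clip_cost(5))      # 30.0 for twenty 5-second 1080p variants
```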
Which languages does lip-sync support?
Seven languages: Mandarin, Cantonese, English, Japanese, Korean, German, and French. Reported word-error rate is approximately 14.6%, the lowest of any production video model. The same prompt can be regenerated across all seven languages with phoneme-level lip-sync, eliminating dub-and-relipsync passes for cross-border ad campaigns.
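A sketch of that localization loop; the language codes and "lipsync_language" parameter name are assumptions, while the endpoint is the one documented above:

```python
# Sketch of regenerating one spot across all seven supported languages.
# Language codes and the "lipsync_language" parameter are assumptions.
import requests

LANGUAGES = ["zh", "yue", "en", "ja", "ko", "de", "fr"]

for lang in LANGUAGES:
    r = requests.post(
        "https://api.segmind.com/v1/happyhorse",
        json={
            "prompt": "Presenter holds the bottle to camera and "
                      "delivers the tagline",
            "lipsync_language": lang,
            "resolution": "1080p",
            "duration_seconds": 5,
        },
        headers={"x-api-key": "YOUR_SEGMIND_API_KEY"},
    )
    r.raise_for_status()
```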
How does it compare to Seedance 2.0 and Kling 3.0 Pro?
HappyHorse leads the Artificial Analysis Video Arena's blind-preference Elo on three of four tracks: text-to-video (no audio), image-to-video (no audio), and audio-on text-to-video. Seedance 2.0 still leads audio-on image-to-video by ~1 Elo and remains competitive in long-action and reference-heavy editing scenarios. Kling 3.0 Pro is comparable on price ($0.336 per second with audio vs $0.30 for HappyHorse) but caps at 10 seconds per clip; HappyHorse extends to 15 seconds. The differentiator for HappyHorse is bundled audio, native 7-language lip-sync, and longer single-call clips at a lower per-second price.
How long can a single clip be?
Three to fifteen seconds, with multi-shot composition inside the clip. Quality is highest in the 3 to 8 second band; community testing reports visible drift in character consistency past about 10 seconds. For longer narratives, generate multiple clips and compose externally.
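One way to compose clips externally, as suggested above, is ffmpeg's concat demuxer, which joins clips without re-encoding when they share codec and resolution. The file names here are placeholders.

```python
# Composing a longer narrative from several generated clips with ffmpeg's
# concat demuxer (a real ffmpeg feature; file names are placeholders).
# "-c copy" joins without re-encoding when clips share codec/resolution.
import pathlib
import subprocess

clips = ["shot_01.mp4", "shot_02.mp4", "shot_03.mp4"]
pathlib.Path("clips.txt").write_text("".join(f"file '{c}'\n" for c in clips))

subprocess.run(
    ["ffmpeg", "-f", "concat", "-safe", "0",
     "-i", "clips.txt", "-c", "copy", "final_spot.mp4"],
    check=True,
)
```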
How fast is generation?
End-to-end on the Segmind API, a 5-second clip returns in approximately 80 to 160 seconds depending on resolution and queue depth. The model itself runs in only 8 denoising steps thanks to DMD-2 distillation with no classifier-free guidance; total wall-clock time includes upstream queueing on Alibaba's inference cluster.
Can I use the output commercially?
Yes. Commercial usage rights are included with API access through Segmind. The model is available worldwide via the Segmind API.
Cinema-grade output.
Cents per second.
Generate sound and sight together. Text, image, reference, and edit inputs in one endpoint. Roughly 80 to 160 seconds per clip on the Segmind API.
