New from Alibaba · Now on Segmind

HappyHorse 1.0.
Cinema. In Every Language.

Alibaba's flagship video model, built to challenge Seedance 2.0. Native 1080p with synced audio, 7-language lip-sync, and multi-shot clips up to 15s. From $0.175 per second.

What Makes It Cinematic

Cinematic. Multilingual.
#1 on three of four arena tracks.

Cinema Challenger

Cinema-grade output,
Alibaba's answer to Seedance 2.0.

15-billion-parameter unified Transformer. Beats Seedance 2.0 on three of four Artificial Analysis Video Arena tracks, while undercutting it on per-second pricing.

#1 ON ARENA
ALIBABA AI
Native Audio-Video

Voice. Ambient. Foley.
One forward pass.

Veo 3.1 and Kling 3.0 Pro charge 50 to 100% more per second to add audio. HappyHorse generates video and synced audio in one Transformer pass, at no extra cost.

Video · Dialogue · Foley

Audio premium / sec vs base price
HappyHorse 1.0 · Included
Kling 3.0 Pro · +50%
Veo 3.1 · +100%
7-Language Lip-Sync

Phoneme-level. Out of the box.

Lowest reported word-error rate of any production video model, at ~14.6% across all seven supported languages.

Mandarin · Cantonese · English · Japanese · Korean · German · French
Real Pricing

From $0.175/sec

Run 100 variants for the cost of one studio reshoot. No subscriptions, no minimums.

720p
$0.175/sec
1080p
$0.30/sec
Multi-Shot · 1080p · 15s

One prompt.
Multiple cuts. Native 1080p.

Up to 15 seconds with character continuity across shot changes. The model decides coverage automatically.

Wide · Close-up · Reverse · Medium

How It Works

Brief to cinematic spot.
For pocket change per second.

01

Direct & Reference

Prompt + up to 9 references.

Start from a text prompt, an image, or a multi-asset brief: faces, environments, palettes, voice samples. One unified endpoint, /v1/happyhorse, accepts every combination of inputs.

Reference images lock identity and style. Voice references guide the dialogue. The model figures out the role of each asset.

Drop references
9 images · 3 videos · voice sample
@face.png · @style.jpg · @voice.mp3 · @motion.mp4
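The single-endpoint, multi-reference flow above can be sketched as a request payload. The `/v1/happyhorse` path comes from this page; the field names (`prompt`, `references`, `resolution`, `duration_seconds`, `aspect_ratio`) are illustrative assumptions, not confirmed API parameters.

```python
import json

# Hypothetical request body for Segmind's /v1/happyhorse endpoint.
# Field names are assumptions for illustration; check the live API reference.
payload = {
    "prompt": "A barista slides a latte across the counter, warm morning light",
    "references": [
        {"type": "image", "url": "https://example.com/face.png"},    # identity lock
        {"type": "image", "url": "https://example.com/style.jpg"},   # style / palette
        {"type": "audio", "url": "https://example.com/voice.mp3"},   # voice guide
        {"type": "video", "url": "https://example.com/motion.mp4"},  # motion guide
    ],
    "resolution": "1080p",
    "duration_seconds": 5,
    "aspect_ratio": "16:9",
}

# The model infers each reference's role, so nothing beyond the media type is
# attached per asset. Send with any HTTP client, e.g.:
#   requests.post("https://api.segmind.com/v1/happyhorse",
#                 headers={"x-api-key": "<YOUR_KEY>"}, json=payload)
print(json.dumps(payload, indent=2))
```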
02

Generate Sound + Vision

One forward pass. ~80–160 seconds on Segmind.

A unified single-stream Transformer denoises text, image, video, and audio tokens together. Eight DMD-2 steps, classifier-free guidance disabled.

1080p, 24 fps, multi-shot, dialogue and Foley aligned to the frame. No separate audio model, no post-production stitch.

Inference timeline
Token sequence
Video stream
Audio stream
8 denoising steps · no CFG
~80–160s @ 1080p
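The control flow of that single-stream pass can be sketched in a toy form. Everything below is a stand-in (the real denoiser, token counts, and dimensions are not public); only the structure mirrors what the page describes: one concatenated text/video/audio token sequence, 8 denoising steps, and one forward call per step because classifier-free guidance is disabled.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the unified token sequence: text, video, and audio tokens
# concatenated into one stream and denoised together.
n_text, n_video, n_audio, d = 16, 64, 32, 8
tokens = rng.normal(size=(n_text + n_video + n_audio, d))  # start from noise

def denoise_step(x, t):
    # Placeholder for one Transformer forward pass at timestep t. A single
    # call per step -- no second unconditional pass, since classifier-free
    # guidance is off under the DMD-2 distilled schedule.
    return x * 0.8  # toy contraction toward the data manifold

NUM_STEPS = 8  # DMD-2 distillation, vs the 25-50 steps typical for diffusion video
for t in range(NUM_STEPS, 0, -1):
    tokens = denoise_step(tokens, t)

# Modality streams are slices of the same sequence, decoded downstream.
video = tokens[n_text:n_text + n_video]
audio = tokens[n_text + n_video:]
print(video.shape, audio.shape)
```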
03

Edit, Localize, Iterate

20+ variants for the cost of one studio shot.

Pass a source video plus reference images to swap a product, wardrobe, or background while preserving motion and timing. Re-lipsync existing footage to any of 7 languages without reshooting.

At $0.30 per second, you can run a dozen creative variants for a single ad spot's traditional production budget.

Variant cost ledger
EN
Hook A · English
$1.50
ZH
Hook A · Mandarin
$1.50
JA
Hook B · Japanese
$1.50
3 variants · $4.50 total · 5-second 1080p
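The ledger arithmetic is simple enough to script. This minimal calculator uses only the per-second list prices stated on this page; the helper name `clip_cost` is ours, not part of any API.

```python
# Segmind list prices for HappyHorse 1.0, per second of generated video
# (audio included at both tiers, per this page).
PRICE_PER_SEC = {"720p": 0.175, "1080p": 0.30}

def clip_cost(seconds, resolution="1080p"):
    """USD cost of one generated clip at the given resolution."""
    return round(seconds * PRICE_PER_SEC[resolution], 2)

# Three 5-second 1080p language variants, as in the ledger above.
variants = ["Hook A - English", "Hook A - Mandarin", "Hook B - Japanese"]
per_variant = clip_cost(5, "1080p")
total = sum(clip_cost(5, "1080p") for _ in variants)
print(per_variant)  # 1.5
print(total)        # 4.5
```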

Use Cases

Built for commerce.
Cinematic by default.

E-Commerce
Product video at catalogue scale

Image-to-video PDP showcases with push-ins, soft light sweeps, matched Foley. Batch-generate 20 variants from a single hero shot.

Multilingual Ads
One creative. Seven languages.

Native phoneme-level lip-sync in Mandarin, Cantonese, English, Japanese, Korean, German, and French. No dub-and-relipsync passes.

AI Hosts
Persistent live-commerce presenters

Subject-to-video with up to 9 reference images. Identity preserved across cuts and languages, perfect for branded virtual hosts and mascots.

Short-Form Social
Vertical video with native audio

9:16 generations for TikTok, Reels, Shorts, Douyin, and Xiaohongshu. Dialogue and ambient sound in the same pass, no separate dubbing pipeline.

Brand Films
Cinematic spots, dialogue and all

Multi-shot brand films with dialogue, ambient sound, and Foley produced together. Production quality formerly locked behind closed APIs.

Edit & Replace
Swap product, wardrobe, or background

Video-to-video and subject-and-video-to-video edits change on-screen elements while preserving camera moves, motion, and timing. No reshoot.

Under the Hood

Built different.
Architecturally speaking.

Unified Token Sequence

Text, image, video, and audio tokens are jointly denoised in one Transformer. No separate audio model, no cross-attention. The model learns alignment, not stitching.

DMD-2 Distillation

Distribution Matching Distillation v2 cuts inference to 8 denoising steps and runs without classifier-free guidance. Far below the 25 to 50 steps typical for diffusion video.

Per-Head Modality Gates

Learned scalar gates dampen destructive cross-modal gradients during joint training, keeping the audio and video branches from competing for capacity. Reported in technical notes, not yet peer-reviewed.
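A toy sketch of the per-head gating idea, heavily hedged: the mechanism is described only in technical notes and not peer-reviewed, and every shape and value below is an illustrative assumption. The idea shown is one learned scalar per (head, modality) pair that scales that head's contribution, so training can mute a head for one modality rather than letting audio and video fight for shared capacity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative dimensions only -- not the real model's.
n_heads, n_tokens, d_head = 4, 10, 8
head_outputs = rng.normal(size=(n_heads, n_tokens, d_head))

# Hypothetical learned scalars: gates[h, m] for head h and modality m
# (0 = video, 1 = audio). Near-zero entries mute a head for that modality.
gates = np.array([
    [1.0, 0.1],   # head 0: mostly video
    [0.9, 0.8],   # head 1: shared
    [0.05, 1.0],  # head 2: mostly audio
    [0.7, 0.6],   # head 3: shared
])

modality = np.array([0] * 6 + [1] * 4)  # first 6 tokens video, last 4 audio

# Scale each head's output per token by that token's modality gate, then sum
# heads (standing in for the usual concat-and-project).
scaled = head_outputs * gates[:, modality][:, :, None]
combined = scaled.sum(axis=0)
print(combined.shape)  # (10, 8)
```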

Technical Specifications

Architecture
Unified Single-Stream Transformer
15B parameters, 40 layers, no cross-attention modules
Tighter cross-modal alignment
Resolution
Native 1080p
720p and 1080p, optional super-resolution module
Broadcast-grade out of the box
Duration
3–15 seconds
Multi-shot composition with character continuity in a single clip
Multiple cuts, one generation
Aspect Ratios
16:9 · 9:16 · 1:1 · 4:3 · 3:4
24 fps across all formats
Every channel, one pipeline
Latency
~80–160s per 5s clip
Measured end-to-end on the Segmind API, 720p to 1080p · DMD-2 distillation, 8 denoising steps, no CFG
8 steps vs the 25 to 50 typical for diffusion video
Audio
Joint generation
Dialogue, ambient, Foley in the same forward pass · 7-language lip-sync
~14.6% WER, lowest reported for any production video model

Head to Head

The model that beats Seedance 2.0.

Alibaba's flagship leads three of four Artificial Analysis Video Arena tracks against the current state of the art, with audio always included and 7-language lip-sync built in.

Arena Standing

3 of 4

tracks at #1 on the Artificial Analysis Video Arena

Audio Included

$0.30/sec

at 1080p with synced audio. Veo 3.1 and Kling 3.0 Pro charge 50 to 100% extra to add it.

Generation Time

~80–160s

end-to-end for a 5-second clip on the Segmind API, 720p to 1080p

Track | HappyHorse 1.0 (Alibaba) | Seedance 2.0 (ByteDance · current SOTA) | Kling 3.0 Pro (Kuaishou)
T2V, no audio (Text-to-Video) | #1 · Elo ~1389 | Elo ~1273 | Elo ~1247
I2V, no audio (Image-to-Video) | #1 · Elo ~1416 | Elo ~1355 | n/a
T2V, with audio (Audio-on Text-to-Video) | #1 · +11 Elo | very close 2nd | n/a
I2V, with audio (Audio-on Image-to-Video) | #2 · −1 Elo | #1 · +1 Elo | n/a
1080p price / sec (with audio, on Segmind) | $0.30 · audio included | token-based · ~$0.24 typical | $0.336 · 5s audio mode
Max clip length (single-call duration) | 15s · multi-shot | 10s | 10s
Lip-sync languages (native phoneme-level) | 7 · ~14.6% WER | n/a | n/a

Artificial Analysis Video Arena, blind preference Elo, late-April 2026 snapshot. Arena prompts skew toward portrait and dialogue content; long-action and reference-edit categories remain competitive across models. All per-second pricing reflects current Segmind list pricing for the named models with audio enabled where applicable. Seedance 2.0 uses token-based pricing; the listed per-second figure is a typical 5-second 1080p call ($1.21 average).

Flexible plans for everyone

Whether you're just starting out or need enterprise-grade power, we have a plan that fits your needs.

Business

$99/mo

For production environments and professional use cases.

  • $99 monthly credits
  • 100 GB Storage
  • 500 RPM
  • 2 business day support
  • Pixelflow Premium Templates
Get Started

Scale

$599/mo

For large companies that require custom solutions and private deployments.

  • $599 monthly credits
  • 1 TB Storage
  • 1000 RPM pooled
  • 1 business day support
  • Detailed usage analytics
Get Started

Enterprise

Custom solutions with enterprise-grade security and support

99.99% SLA · Dedicated Slack support · SOC 2 compliance
Contact Sales

FAQ

Frequently Asked Questions

What is HappyHorse 1.0, and how do I access it?

HappyHorse 1.0 is Alibaba's hosted flagship video model, available today via Segmind's production API. Weights are not publicly released; the production path is the Segmind API. Commercial usage rights are included with API access.

How much does HappyHorse 1.0 cost?

HappyHorse 1.0 on Segmind is priced at $0.175 per second at 720p and $0.30 per second at 1080p, with audio always included. That is 25% below Veo 3.1 with audio ($0.40 per second) and on par with Kling 3.0 Pro ($0.336 per second), while Veo 3.1 and Kling 3.0 Pro charge 50 to 100% extra to add audio. Seedance 2.0, the current closed SOTA, uses token-based pricing that averages roughly $1.21 per typical 5-second call. A 5-second 1080p HappyHorse clip with full audio costs about $1.50. There are no minimums or subscriptions: pay per generation.

Which languages does lip-sync support?

Seven languages: Mandarin, Cantonese, English, Japanese, Korean, German, and French. Reported word-error rate is approximately 14.6%, the lowest of any production video model. The same prompt can be regenerated across all seven languages with phoneme-level lip-sync, eliminating dub-and-relipsync passes for cross-border ad campaigns.

How does HappyHorse compare to Seedance 2.0 and Kling 3.0 Pro?

HappyHorse leads the Artificial Analysis Video Arena's blind-preference Elo on three of four tracks: text-to-video (no audio), image-to-video (no audio), and audio-on text-to-video. Seedance 2.0 still leads audio-on image-to-video by ~1 Elo and remains competitive in long-action and reference-heavy editing scenarios. Kling 3.0 Pro is comparable on price ($0.336 per second with audio vs $0.30 for HappyHorse) but caps at 10 seconds per clip; HappyHorse extends to 15 seconds. The differentiators for HappyHorse are bundled audio, native 7-language lip-sync, and longer single-call clips at a lower per-second price.

How long can a single clip be?

Three to fifteen seconds, with multi-shot composition inside the clip. Quality is highest in the 3 to 8 second band; community testing reports visible drift in character consistency past about 10 seconds. For longer narratives, generate multiple clips and compose externally.

How fast is generation?

End-to-end on the Segmind API, a 5-second clip returns in approximately 80 to 160 seconds depending on resolution and queue depth. The model itself runs in only 8 denoising steps thanks to DMD-2 distillation with no classifier-free guidance; total wall-clock time includes upstream queueing on Alibaba's inference cluster.

Can I use the outputs commercially?

Yes. Commercial usage rights are included with API access through Segmind. The model is available worldwide via the Segmind API.

Now on Segmind

Cinema-grade output.
Cents per second.

Generate sound and sight together. Text, image, reference, and edit inputs in one endpoint. Roughly 80 to 160 seconds per clip on the Segmind API.