Wan 2.7 Image to Video

Animate any image into a cinematic video at up to 1080P and up to 15 seconds, with audio sync, first/last frame control, and multi-modal input.

~415.22s typical generation time
$0.625–$0.938 per generation

Inputs

Prompt

Describe the motion, action, and scene you want generated. Use specific verbs and cinematic cues for best results, e.g., "camera slowly pans left as wind moves through the trees".

First Frame

The first frame of the video; the model animates outward from this image. Works with still photos, illustrations, or AI-generated images.

Last Frame

The ending frame of the video. When used alongside First Frame, the model generates a smooth transition between both images; ideal for morph sequences or scene transitions.

Audio URL

A publicly accessible audio file URL. The model syncs character motion and lip movement to the audio; useful for voiceover, dialogue, or music-driven animations.

Duration

Length of the generated video in seconds (2–15). Use shorter durations for quick previews; longer durations for cinematic clips or full scenes.

Wan 2.7 Image to Video — AI Video Generation Model

What is Wan 2.7 Image to Video?

Wan 2.7 Image to Video is Alibaba's latest-generation AI video model, purpose-built for turning static images into high-quality cinematic video clips. Released in early 2026, it extends the Wan series with four distinct generation modes: text-prompt-only animation, first-frame conditioning, first-and-last-frame scene transitions, and audio-driven video synthesis. Output reaches up to 1080P resolution and up to 15 seconds per clip, delivered synchronously via API without polling.

Built on a Diffusion Transformer architecture with Full Attention, Wan 2.7 maintains character identity and scene consistency across the full clip duration — a key differentiator for productions requiring stable subjects over time.
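
To make the synchronous flow concrete, here is a minimal request sketch in Python. The endpoint URL, auth scheme, and response shape are illustrative assumptions, and so are most field names (audio_url is the only parameter named elsewhere on this page); check the actual API reference for the real contract.

```python
import os

import requests

# Hypothetical endpoint; the real URL and payload schema may differ.
API_URL = "https://api.example.com/wan-2.7/image-to-video"

payload = {
    "prompt": "camera slowly pans left as wind moves through the trees",
    "image_url": "https://example.com/first-frame.jpg",      # first frame
    "last_image_url": "https://example.com/last-frame.jpg",  # optional: last frame
    "audio_url": "https://example.com/voiceover.mp3",        # optional: audio sync
    "duration": 8,          # seconds, valid range 2-15
    "resolution": "1080p",  # or "720p" for faster iteration
}

# One blocking call: the response carries the finished video, no polling loop.
resp = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {os.environ['WAN_API_KEY']}"},
    timeout=900,  # generations average ~415s, so leave generous headroom
)
resp.raise_for_status()
print(resp.json()["video"]["url"])  # assumed response shape
```

Omit last_image_url and audio_url for plain first-frame animation, or omit image_url entirely for text-prompt-only generation.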

Key Features

  • First & Last Frame Control: Provide a start image and an end image; the model generates a smooth, consistent transition between them. Ideal for morphs, product reveals, and scene cuts.
  • Audio-Driven Generation: Attach an audio URL and the model synchronizes character motion and lip movement to speech, music, or sound effects.
  • 720P / 1080P Output: Choose resolution based on use case — 720P for rapid prototyping, 1080P for final deliverables.
  • Up to 15 Seconds: Generate clips from 2 to 15 seconds, enabling short social content through to longer scene segments.
  • Negative Prompt Support: Suppress unwanted artifacts like blur, distortion, or watermarks for cleaner output.
  • Seed Control: Fix a seed to reproduce identical outputs, useful for prompt iteration and A/B testing (see the payload sketch after this list).
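
The last two features above would map to extra request fields in the payload sketch from the previous section; the names below are assumptions in the same spirit.

```python
# Assumed field names, extending the earlier payload sketch.
payload.update({
    "negative_prompt": "blurry, low quality, distorted faces, watermark",
    "seed": 42,  # fixed seed: the same inputs should reproduce the same clip
})
```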

Best Use Cases

Social & Creator Content: Animate portrait photos, product images, or AI-generated art into short-form clips for Instagram, TikTok, and YouTube Shorts.

Marketing & Advertising: Create talking-head spokespersons from a single reference photo synced to a voiceover track. Generate product transition videos with first/last frame control.

Film & Post Production: Use as a rapid pre-vis tool — animate storyboard frames or concept art into moving sequences to communicate shot intent before production.

E-Commerce: Bring product photography to life with subtle motion, camera pulls, or scene transitions to increase engagement on listing pages.

Game & Interactive Media: Animate character art or environment concept pieces for trailers, demos, or pitch materials.

Prompt Tips and Output Quality

Write prompts as motion directions rather than scene descriptions. Wan 2.7 responds well to verbs and camera language: "camera slowly zooms in," "subject turns toward camera," "leaves flutter in the wind." Avoid overloading with too many simultaneous actions — pick one or two key motions per prompt for the most coherent result.

For audio-driven clips, ensure your audio file is publicly accessible (direct S3/CDN URL, not a streaming link). Lip-sync accuracy is strongest on clear single-speaker dialogue at moderate speech pace; fast-paced dialogue may show minor timing drift.
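
Because a dead or indirect link wastes an entire generation, a quick preflight check is cheap insurance. This sketch uses Python's requests to confirm the URL is publicly reachable and serves a direct audio file:

```python
import requests

def check_audio_url(url: str) -> None:
    """Raise if an audio URL is not publicly reachable as a direct file."""
    # Some CDNs reject HEAD; fall back to a ranged GET in that case.
    resp = requests.head(url, allow_redirects=True, timeout=10)
    if resp.status_code == 405:
        resp = requests.get(url, headers={"Range": "bytes=0-0"}, timeout=10)
    resp.raise_for_status()
    ctype = resp.headers.get("Content-Type", "")
    if not ctype.startswith("audio/"):
        raise ValueError(f"Expected an audio/* content type, got {ctype!r}")

check_audio_url("https://cdn.example.com/voiceover.mp3")
```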

When using first+last frame mode, ensure both images share a consistent perspective and scale. Large perspective shifts between frames can produce unrealistic morphs.
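
Aspect ratio is the easiest part of that advice to automate. A small Pillow check (a local sanity test, not part of the API) flags mismatched frame pairs before you submit:

```python
from PIL import Image

def assert_matching_aspect(first_path: str, last_path: str) -> None:
    """Raise if first/last frames differ in aspect ratio, which invites bad morphs."""
    with Image.open(first_path) as a, Image.open(last_path) as b:
        ratio_a, ratio_b = a.width / a.height, b.width / b.height
    if abs(ratio_a - ratio_b) > 0.01:
        raise ValueError(
            f"Aspect ratios differ ({ratio_a:.3f} vs {ratio_b:.3f}); "
            "crop or resize so both frames match."
        )

assert_matching_aspect("first.png", "last.png")
```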

Use 720P during prompt exploration, then switch to 1080P for final outputs — quality gains are most visible on detail-rich scenes and close-up facial content.

FAQs

What is the maximum video length Wan 2.7 can generate? The maximum duration is 15 seconds per request. For longer content, generate multiple clips and stitch them in post.
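
For the stitching step, ffmpeg's concat demuxer joins same-codec clips without re-encoding; the sketch below drives it from Python and assumes all clips share resolution, frame rate, and codec:

```python
import subprocess
import tempfile

def stitch(clips: list[str], output: str) -> None:
    """Losslessly concatenate same-codec MP4 clips via ffmpeg's concat demuxer."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.writelines(f"file '{path}'\n" for path in clips)
    subprocess.run(
        ["ffmpeg", "-f", "concat", "-safe", "0", "-i", f.name, "-c", "copy", output],
        check=True,
    )

stitch(["clip_01.mp4", "clip_02.mp4"], "full_scene.mp4")
```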

Can I animate a photo of a real person? Yes. Provide the photo as the first frame and a motion prompt or audio URL. The model animates the subject while preserving their likeness throughout the clip.

How does audio-driven generation work? Pass a public audio file URL via the audio_url parameter. The model analyzes the audio and synchronizes character facial movement and body motion to the rhythm, speech, or sound cues in the track.

What image formats are supported for first/last frames? The model accepts standard image URLs (JPEG, PNG, WebP) or base64-encoded image data.
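
If you take the base64 route, the standard library covers the encoding; in this sketch the data-URI framing and the reuse of the image_url field are assumptions about what the endpoint accepts:

```python
import base64
from pathlib import Path

def image_to_data_uri(path: str) -> str:
    """Encode a local JPEG/PNG/WebP image as a base64 data URI."""
    mime = {".jpg": "jpeg", ".jpeg": "jpeg", ".png": "png", ".webp": "webp"}[
        Path(path).suffix.lower()
    ]
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:image/{mime};base64,{encoded}"

payload["image_url"] = image_to_data_uri("first-frame.png")  # extends the earlier sketch
```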

How does Wan 2.7 compare to Wan 2.6? Wan 2.7 adds integrated first+last frame control, audio-driven synthesis, and improved character consistency in a single unified model — features that previously required switching between checkpoints in the 2.6 family.

What should I set for the negative prompt? Common effective values: "blurry, low quality, distorted faces, watermark, flickering, artifacts". For portrait or character content, also append "extra limbs, bad anatomy" to suppress common generation artifacts.