Grok Imagine Video 1.5 (Preview) — Image-to-Video Generation Model
Grok Imagine Video 1.5 Preview is xAI's latest image-to-video AI model. It turns a single still image into a fluid, cinematic video clip — with natively synchronized audio — guided by a natural-language prompt.
What is Grok Imagine Video 1.5?
Released in preview on May 30, 2026, Grok Imagine Video 1.5 animates a starting frame into up to 15 seconds of 24fps video at 480p or 720p. Give it an image and a prompt describing the motion, and it renders camera moves, atmosphere, and physics while staying faithful to the detail and lighting of your source image. It debuted at #1 on the Artificial Analysis Image-to-Video Arena leaderboard, ahead of Runway, Kling, and Veo.
Key Features
- Native synchronized audio: dialogue, sound effects, ambient sound, and music are generated in the same inference pass, not added afterwards.
- Source-image fidelity: the output continues your image rather than reinterpreting it, preserving subject, lighting, and composition.
- Promptable camera direction: describe push-ins, pans, pacing, and sound design in plain language.
- Flexible output: 1-15 second clips, 480p or 720p, and seven aspect ratios (16:9, 9:16, 1:1, 4:3, 3:4, 3:2, 2:3).
- Fast generation: a 6-second clip typically renders in about 30 seconds.
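The output limits listed above can be checked before a request is sent. The following is a minimal sketch only: the `ClipSettings` class and its field names are hypothetical and are not part of any xAI or Segmind SDK, and it assumes fractional durations are acceptable, which the limits above do not confirm.

```python
from dataclasses import dataclass

# Limits taken from the feature list above.
SUPPORTED_RESOLUTIONS = {"480p", "720p"}
SUPPORTED_ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4", "3:2", "2:3"}
MIN_DURATION_S = 1
MAX_DURATION_S = 15
FPS = 24


@dataclass(frozen=True)
class ClipSettings:
    """Hypothetical container for one clip's output settings."""

    duration_s: float  # assumption: fractional seconds are allowed
    resolution: str
    aspect_ratio: str

    def __post_init__(self) -> None:
        # Reject anything outside the documented limits.
        if not MIN_DURATION_S <= self.duration_s <= MAX_DURATION_S:
            raise ValueError(
                f"duration must be {MIN_DURATION_S}-{MAX_DURATION_S} s, "
                f"got {self.duration_s}"
            )
        if self.resolution not in SUPPORTED_RESOLUTIONS:
            raise ValueError(f"unsupported resolution: {self.resolution}")
        if self.aspect_ratio not in SUPPORTED_ASPECT_RATIOS:
            raise ValueError(f"unsupported aspect ratio: {self.aspect_ratio}")

    @property
    def frame_count(self) -> int:
        """Number of frames at the documented 24 fps."""
        return round(self.duration_s * FPS)


if __name__ == "__main__":
    settings = ClipSettings(duration_s=6, resolution="720p", aspect_ratio="9:16")
    print(settings.frame_count)  # 144
```

Validating locally like this avoids spending a generation on a request that would be rejected for an out-of-range duration or an unsupported ratio.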
Best Use Cases
Animate product shots into lifestyle video ads, turn concept art or storyboards into moving sequences, and produce vertical 9:16 social clips with sound. For longer scenes with a consistent look, chain shots together: stage each frame as an image, animate it, and cut the resulting clips into one sequence.
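When chaining shots, a longer scene has to be divided into clips that each fit the 15-second limit. The sketch below shows one way to plan that split, assuming whole-second durations and roughly even clip lengths; the `plan_clips` helper is hypothetical and only does the arithmetic, it does not call any generation API.

```python
import math

MAX_CLIP_S = 15  # per-clip maximum stated for the model


def plan_clips(total_seconds: int, max_clip_s: int = MAX_CLIP_S) -> list[int]:
    """Split a target scene length into near-equal clip durations.

    Uses the fewest clips that fit under max_clip_s, then spreads any
    remainder one second at a time across the first clips.
    """
    if total_seconds < 1:
        raise ValueError("total_seconds must be at least 1")
    clip_count = math.ceil(total_seconds / max_clip_s)
    base, extra = divmod(total_seconds, clip_count)
    return [base + 1] * extra + [base] * (clip_count - extra)


if __name__ == "__main__":
    print(plan_clips(40))  # [14, 13, 13]
    print(plan_clips(15))  # [15]
```

Even clip lengths are a design choice, not a requirement: in practice you may prefer to cut on story beats and give each shot whatever duration it needs, as long as no clip exceeds 15 seconds.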
Prompt Tips and Output Quality
The input image anchors the content, so keep prompts short and motion-focused: describe the camera move, the subject's action, and the soundscape. In our testing, outputs tracked the source image closely, with coherent, dynamic motion and a clean synchronized audio track. For best framing, choose the aspect ratio that matches the orientation of your input image.
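Matching the aspect ratio to the input image can be automated from the image's pixel dimensions. This is a rough sketch under the assumption that "closest" means the smallest difference between log ratios, so that a ratio and its inverse are treated symmetrically; the `closest_aspect_ratio` function is hypothetical, and you would read the width and height with whatever image library you already use.

```python
import math

# The seven aspect ratios the model supports.
SUPPORTED_ASPECT_RATIOS = ["16:9", "9:16", "1:1", "4:3", "3:4", "3:2", "2:3"]


def closest_aspect_ratio(width: int, height: int) -> str:
    """Return the supported ratio nearest to width:height."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    target = math.log(width / height)

    def distance(label: str) -> float:
        # Compare in log space so 2:1 and 1:2 are equally far from 1:1.
        w, h = label.split(":")
        return abs(math.log(int(w) / int(h)) - target)

    return min(SUPPORTED_ASPECT_RATIOS, key=distance)


if __name__ == "__main__":
    print(closest_aspect_ratio(1920, 1080))  # 16:9
    print(closest_aspect_ratio(1080, 1920))  # 9:16
    print(closest_aspect_ratio(1200, 800))   # 3:2
```

If the source image does not match any supported ratio exactly, cropping it to the chosen ratio yourself before upload keeps control of the framing in your hands.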
FAQs
Does Grok Imagine Video 1.5 support text-to-video? No. An input image is required. Generate a frame with a text-to-image model first, then animate it.
Does it generate sound? Yes. Audio is generated natively and synchronized with the video, which sets it apart from most image-to-video models.
How long can the videos be? 1 to 15 seconds per clip. Chain multiple shots for longer sequences.
What resolutions are supported? 480p and 720p at 24fps, across seven aspect ratios.
Can I control the camera? Yes. Describe camera moves like slow push-ins, pans, or tracking shots directly in the prompt.
How fast is it? A 6-second 480p clip generates in roughly 30 seconds via the Segmind API.