Grok Imagine Video 1.5 (Preview) — Image-to-Video Generation Model

Grok Imagine Video 1.5 (Preview) is xAI's latest image-to-video AI model. Guided by a natural-language prompt, it turns a single still image into a fluid, cinematic video clip with natively synchronized audio.

What is Grok Imagine Video 1.5?

Released in preview on May 30, 2026, Grok Imagine Video 1.5 animates a starting frame into up to 15 seconds of 24fps video at 480p or 720p. Give it an image and a prompt describing the motion, and it renders camera moves, atmosphere, and physics while staying faithful to the detail and lighting of your source image. It debuted at #1 on the Artificial Analysis Image-to-Video Arena leaderboard, ahead of Runway, Kling, and Veo.

Key Features

• Native synchronized audio: dialogue, sound effects, ambient sound, and music are generated in the same inference pass, not added afterwards.
• Source-image fidelity: the output continues your image rather than reinterpreting it, preserving subject, lighting, and composition.
• Promptable camera direction: describe push-ins, pans, pacing, and sound design in plain language.
• Flexible output: clips of 1 to 15 seconds, at 480p or 720p, in seven aspect ratios (16:9, 9:16, 1:1, 4:3, 3:4, 3:2, 2:3).
• Fast generation: a 6-second 480p clip typically renders in about 30 seconds.
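
The output limits listed above can be checked on the client before a request is sent. The following is a minimal sketch, not the actual API schema: the `ClipRequest` class and the payload keys (`image`, `prompt`, `duration`, `resolution`, `aspect_ratio`) are hypothetical names chosen for illustration. Only the limits themselves come from the feature list.

```python
from dataclasses import dataclass

# Limits taken from the feature list above.
MIN_SECONDS, MAX_SECONDS = 1, 15
RESOLUTIONS = ("480p", "720p")
ASPECT_RATIOS = ("16:9", "9:16", "1:1", "4:3", "3:4", "3:2", "2:3")


@dataclass(frozen=True)
class ClipRequest:
    """Hypothetical request container; field names are illustrative only."""

    image_url: str
    prompt: str
    duration_seconds: int
    resolution: str
    aspect_ratio: str

    def __post_init__(self) -> None:
        # The model is image-to-video only, so a source image is mandatory.
        if not self.image_url:
            raise ValueError("an input image is required")
        if not MIN_SECONDS <= self.duration_seconds <= MAX_SECONDS:
            raise ValueError(
                f"duration must be {MIN_SECONDS}-{MAX_SECONDS} seconds, "
                f"got {self.duration_seconds}"
            )
        if self.resolution not in RESOLUTIONS:
            raise ValueError(f"resolution must be one of {RESOLUTIONS}")
        if self.aspect_ratio not in ASPECT_RATIOS:
            raise ValueError(f"aspect ratio must be one of {ASPECT_RATIOS}")

    def to_payload(self) -> dict:
        # Assumed key names; check the provider's documentation for the real ones.
        return {
            "image": self.image_url,
            "prompt": self.prompt,
            "duration": self.duration_seconds,
            "resolution": self.resolution,
            "aspect_ratio": self.aspect_ratio,
        }
```

Constructing `ClipRequest("https://example.com/frame.png", "slow push-in", 6, "480p", "9:16")` succeeds, while a 16-second duration or a "1080p" resolution raises `ValueError` before any network call is made.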

Best Use Cases

Animate product shots into lifestyle video ads, turn concept art or storyboards into moving sequences, and produce vertical 9:16 social clips with sound. To build longer scenes with a consistent look, chain shots together: stage each frame as an image, animate it, and cut the resulting clips into a sequence.

Prompt Tips and Output Quality

The input image anchors the content, so keep prompts short and motion-focused: describe the camera move, the subject's action, and the soundscape. In our testing, outputs tracked the source image closely with coherent, dynamic motion and a clean synchronized audio track. Match the aspect ratio to your input image orientation for best framing.
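
One way to follow the aspect-ratio tip is to pick the supported ratio closest to the source image's dimensions. The sketch below is an illustration, not part of any official tooling: `closest_aspect_ratio` is a hypothetical helper, and comparing ratios on a log scale is a design choice so that landscape and portrait images are treated symmetrically.

```python
import math

# The seven aspect ratios the model supports.
ASPECT_RATIOS = ("16:9", "9:16", "1:1", "4:3", "3:4", "3:2", "2:3")


def closest_aspect_ratio(width: int, height: int) -> str:
    """Return the supported aspect ratio nearest to width:height."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    target = math.log(width / height)

    def distance(label: str) -> float:
        w, h = label.split(":")
        # Log-scale distance: 2:1 and 1:2 are equally far from 1:1.
        return abs(math.log(int(w) / int(h)) - target)

    return min(ASPECT_RATIOS, key=distance)


print(closest_aspect_ratio(1920, 1080))  # 16:9
print(closest_aspect_ratio(1080, 1920))  # 9:16
print(closest_aspect_ratio(1200, 800))   # 3:2
```

Images whose proportions fall between two supported ratios will still be cropped or padded to some degree, so it can be worth cropping the source to an exact supported ratio beforehand.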

FAQs

Does Grok Imagine Video 1.5 support text-to-video? No. An input image is required. Generate a frame with a text-to-image model first, then animate it.

Does it generate sound? Yes. Audio is generated natively in the same inference pass and synchronized with the video, which sets it apart from most image-to-video models.

How long can the videos be? 1 to 15 seconds per clip. Chain multiple shots for longer sequences.
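
Planning a longer sequence comes down to splitting the target runtime into clips that each fit the 15-second limit. The sketch below is one possible approach, assuming whole-second clip lengths: `plan_clips` is a hypothetical helper, and spreading the time evenly across clips is a design choice, not a requirement of the model.

```python
from __future__ import annotations

MAX_CLIP_SECONDS = 15  # per-clip limit stated above


def plan_clips(total_seconds: int, max_clip: int = MAX_CLIP_SECONDS) -> list[int]:
    """Split a target runtime into near-equal clip lengths of at most max_clip."""
    if total_seconds < 1:
        raise ValueError("total_seconds must be at least 1")
    if max_clip < 1:
        raise ValueError("max_clip must be at least 1")
    # Ceiling division: the fewest clips that can cover the runtime.
    count = -(-total_seconds // max_clip)
    base, extra = divmod(total_seconds, count)
    # The first `extra` clips get one additional second each.
    return [base + 1] * extra + [base] * (count - extra)


print(plan_clips(40))  # [14, 13, 13]
print(plan_clips(16))  # [8, 8]
```

Each planned clip still needs its own starting image, since the model animates one input frame per generation.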

What resolutions are supported? 480p and 720p at 24fps, across seven aspect ratios.

Can I control the camera? Yes. Describe camera moves like slow push-ins, pans, or tracking shots directly in the prompt.

How fast is it? A 6-second 480p clip generates in roughly 30 seconds via the Segmind API.
