Grok Imagine Video 1.5 (Preview)

Image-to-video with native synchronized audio, up to 720p.

Inputs

Describe motion, camera moves and sound. Keep it short and motion-focused.

Video length, 1-15 seconds, billed per output second. Use 6-10s for most clips.

Image to animate, URL or base64. Match aspect ratio to image orientation.

Output resolution, 480p or 720p; higher costs more. Use 720p for hero assets.

Output aspect ratio, for example 16:9 for landscape, 9:16 for vertical social clips, or 1:1 for square.

Grok Imagine Video 1.5 (Preview) — Image-to-Video Generation Model

Grok Imagine Video 1.5 Preview is xAI's latest image-to-video AI model. It turns a single still image into a fluid, cinematic video clip — with natively synchronized audio — guided by a natural-language prompt.

What is Grok Imagine Video 1.5?

Released in preview on May 30, 2026, Grok Imagine Video 1.5 animates a starting frame into up to 15 seconds of 24fps video at 480p or 720p. Give it an image and a prompt describing the motion, and it renders camera moves, atmosphere, and physics while staying faithful to the detail and lighting of your source image. It debuted at #1 on the Artificial Analysis Image-to-Video Arena leaderboard, ahead of Runway, Kling, and Veo.

Key Features

  • Native synchronized audio — dialogue, sound effects, ambient sound, and music are generated in the same inference pass, not added afterwards.
  • Source-image fidelity — the output continues your image rather than reinterpreting it, preserving subject, lighting, and composition.
  • Promptable camera direction — describe push-ins, pans, pacing, and sound design in plain language.
  • Flexible output — 1-15 second clips, 480p or 720p, and seven aspect ratios (16:9, 9:16, 1:1, 4:3, 3:4, 3:2, 2:3).
  • Fast generation — a 6-second clip typically renders in about 30 seconds.
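The limits in the list above can be checked before a request is sent. The following is a minimal sketch in Python: the class name `VideoRequest` and every field name are assumptions made for illustration, not the documented request schema, and the default of 6 seconds simply follows the 6-10s guidance for most clips.

```python
from dataclasses import dataclass, asdict

# Limits taken from the feature list above.
MIN_SECONDS, MAX_SECONDS = 1, 15
RESOLUTIONS = ("480p", "720p")
ASPECT_RATIOS = ("16:9", "9:16", "1:1", "4:3", "3:4", "3:2", "2:3")


@dataclass(frozen=True)
class VideoRequest:
    # Hypothetical field names; check the API reference for the real ones.
    prompt: str
    image: str  # URL or base64 string
    resolution: str
    aspect_ratio: str
    duration: int = 6  # seconds

    def __post_init__(self):
        # Reject values outside the documented ranges before any API call.
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        if not self.image:
            raise ValueError("an input image is required")
        if not MIN_SECONDS <= self.duration <= MAX_SECONDS:
            raise ValueError(f"duration must be {MIN_SECONDS}-{MAX_SECONDS} seconds")
        if self.resolution not in RESOLUTIONS:
            raise ValueError(f"resolution must be one of {RESOLUTIONS}")
        if self.aspect_ratio not in ASPECT_RATIOS:
            raise ValueError(f"aspect_ratio must be one of {ASPECT_RATIOS}")

    def to_payload(self) -> dict:
        # Plain dict, ready to be serialized as JSON.
        return asdict(self)
```

Validating locally means an out-of-range duration or an unsupported aspect ratio fails immediately instead of after a round trip.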

Best Use Cases

Animate product shots into lifestyle video ads, turn concept art or storyboards into moving sequences, or produce vertical 9:16 social clips with sound. For longer scenes, chain shots together: stage each frame as an image, animate it, and cut the resulting clips into a sequence with a consistent look.
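One way to cut chained shots into a longer scene is ffmpeg's concat demuxer. The sketch below only builds the list file contents and the command line; the helper names are illustrative, and it assumes every clip was generated with the same resolution and aspect ratio so the streams can be copied without re-encoding.

```python
def concat_list(clip_paths):
    """Return the text of an ffmpeg concat list file for the given clips."""
    lines = []
    for path in clip_paths:
        # Inside single quotes, a literal quote is written as '\''.
        escaped = path.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def concat_command(list_path, output_path):
    """Return the ffmpeg argument list that joins the clips in list_path."""
    return [
        "ffmpeg",
        "-f", "concat",   # read the input as a concat list file
        "-safe", "0",     # allow arbitrary paths in the list
        "-i", list_path,
        "-c", "copy",     # copy streams instead of re-encoding
        output_path,
    ]
```

Write the output of `concat_list` to a file such as `shots.txt`, then pass the result of `concat_command("shots.txt", "scene.mp4")` to `subprocess.run`. If the clips differ in resolution or codec settings, stream copy is likely to fail or glitch, and re-encoding is the safer choice.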

Prompt Tips and Output Quality

The input image anchors the content, so keep prompts short and motion-focused: describe the camera move, the subject's action, and the soundscape. In our testing, outputs tracked the source image closely with coherent, dynamic motion and a clean synchronized audio track. Match the aspect ratio to your input image orientation for best framing.
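To match the aspect ratio to the input image, one option is to pick the supported ratio closest to the image's own width-to-height ratio. This is a small sketch of that idea, not part of the model's API; the function name is illustrative, and the image dimensions are assumed to be known already.

```python
import math

# The seven supported output ratios.
SUPPORTED_RATIOS = ("16:9", "9:16", "1:1", "4:3", "3:4", "3:2", "2:3")


def closest_aspect_ratio(width: int, height: int) -> str:
    """Return the supported ratio nearest to width:height."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    target = math.log(width / height)

    def distance(label: str) -> float:
        w, h = map(int, label.split(":"))
        # Compare in log space so landscape and portrait are treated symmetrically.
        return abs(math.log(w / h) - target)

    return min(SUPPORTED_RATIOS, key=distance)
```

For example, a 1920x1080 source maps to 16:9 and an 800x1000 portrait maps to 3:4.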

FAQs

Does Grok Imagine Video 1.5 support text-to-video? No. An input image is required. Generate a frame with a text-to-image model first, then animate it.

Does it generate sound? Yes. Audio is generated natively and synchronized with the video, which sets it apart from most image-to-video models.

How long can the videos be? 1 to 15 seconds per clip. Chain multiple shots for longer sequences.
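When planning a longer sequence, the total runtime has to be divided into clips of at most 15 seconds each. A minimal sketch of one way to do that follows; the function name is illustrative, and the even split is a design choice rather than anything the model requires.

```python
from typing import List


def plan_clip_lengths(total_seconds: int, max_clip: int = 15) -> List[int]:
    """Split a target runtime into the fewest clips, as even in length as possible."""
    if total_seconds < 1:
        raise ValueError("total_seconds must be at least 1")
    count = -(-total_seconds // max_clip)  # ceiling division
    base, extra = divmod(total_seconds, count)
    # The first `extra` clips get one additional second each.
    return [base + 1] * extra + [base] * (count - extra)
```

A 40-second scene becomes three clips of 14, 13, and 13 seconds. Splitting evenly avoids ending the sequence on one very short shot, which a greedy 15 + 15 + 10 style split can produce for other totals.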

What resolutions are supported? 480p and 720p at 24fps, across seven aspect ratios.

Can I control the camera? Yes. Describe camera moves like slow push-ins, pans, or tracking shots directly in the prompt.

How fast is it? A 6-second 480p clip generates in roughly 30 seconds via the Segmind API.
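A request to a hosted endpoint might be assembled roughly as follows. Everything specific here is an assumption for illustration: the URL is a placeholder, the `x-api-key` header name and the JSON field names are hypothetical, and the response format is not covered. The sketch only builds the request object and does not send it.

```python
import base64
import json
import urllib.request

# Placeholder endpoint; replace with the URL from the API reference.
API_URL = "https://api.example.com/v1/grok-imagine-video-1.5"


def encode_image(image_bytes: bytes) -> str:
    """Base64-encode raw image bytes for use as the image input."""
    return base64.b64encode(image_bytes).decode("ascii")


def build_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying the payload."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "x-api-key": api_key,  # assumed header name
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the request to `urllib.request.urlopen` with a generous timeout, since a 6-second clip takes roughly 30 seconds to generate.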