Bytedance HuMo: Human-Centric Video Generation

HuMo generates human-centric videos from text, image, and audio inputs, with adjustable guidance strength for both the text prompt and the audio track.


API

To call the model through the API, use the sample for your preferred programming language.

POST
```python
import requests
import base64

# Use this function to convert an image file from the filesystem to base64
def image_file_to_base64(image_path):
    with open(image_path, 'rb') as f:
        image_data = f.read()
    return base64.b64encode(image_data).decode('utf-8')

# Use this function to fetch an image from a URL and convert it to base64
def image_url_to_base64(image_url):
    response = requests.get(image_url)
    image_data = response.content
    return base64.b64encode(image_data).decode('utf-8')

# Use this function to convert a list of image URLs to base64
def image_urls_to_base64(image_urls):
    return [image_url_to_base64(url) for url in image_urls]

api_key = "YOUR_API_KEY"
url = "https://api.segmind.com/v1/bytedance-humo"

# Request payload
data = {
    "frames": 30,
    "scale_a": 5,
    "scale_t": 5,
    "mode": "TA",
    "height": 720,
    "width": 1280,
    "steps": 30
}

headers = {'x-api-key': api_key}

response = requests.post(url, json=data, headers=headers)
print(response.content)  # The response body is the generated video (video/mp4)
```
RESPONSE
video/mp4
HTTP Response Codes
200 - OK: Video generated
401 - Unauthorized: User authentication failed
404 - Not Found: The requested URL does not exist
405 - Method Not Allowed: The requested HTTP method is not allowed
406 - Not Acceptable: Not enough credits
500 - Server Error: The server had an issue processing the request
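
The status codes above can be turned into a small guard around the binary response. This is a minimal sketch rather than an official client: the function name `save_video_or_raise` and the output path are hypothetical, and the messages simply mirror the table above.

```python
from pathlib import Path

# Status codes and meanings as listed in the table above.
STATUS_MESSAGES = {
    401: "Unauthorized: user authentication failed",
    404: "Not Found: the requested URL does not exist",
    405: "Method Not Allowed: the requested HTTP method is not allowed",
    406: "Not Acceptable: not enough credits",
    500: "Server Error: the server had an issue processing the request",
}


def save_video_or_raise(status_code: int, body: bytes, out_path: str) -> Path:
    """Write the MP4 body to out_path on 200; raise RuntimeError otherwise."""
    if status_code != 200:
        reason = STATUS_MESSAGES.get(status_code, "Unexpected status")
        raise RuntimeError(f"HuMo request failed ({status_code}): {reason}")
    path = Path(out_path)
    path.write_bytes(body)  # the body is binary video/mp4 data
    return path
```

With a `requests` response object, this would be called as `save_video_or_raise(response.status_code, response.content, "humo.mp4")`.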

Attributes


frames int (default: 30)

Number of frames for the generated video

min: 10, max: 100


scale_a float (default: 5)

Strength of audio guidance. Higher = better audio-motion sync

min: 1, max: 10


scale_t float (default: 5)

Strength of text guidance. Higher = better adherence to text prompts

min: 1, max: 10


mode enum:str (default: TA)

Input mode: TA for text+audio; TIA for text+image+audio.


height enum:int (default: 720)

Video height (e.g., 720 or 480).


width enum:int (default: 1280)

Video width (e.g., 1280 or 832).


steps int (default: 30)

min: 1, max: 100
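
The defaults and numeric ranges listed above can be checked on the client before a request is sent. The helper below is a sketch: `build_payload` is a hypothetical name, the defaults and ranges are copied from the attribute list, and the enum attributes (`mode`, `height`, `width`) are passed through unchecked because their allowed values are not listed here.

```python
# Defaults and numeric ranges copied from the attribute list above.
DEFAULTS = {
    "frames": 30,
    "scale_a": 5,
    "scale_t": 5,
    "mode": "TA",
    "height": 720,
    "width": 1280,
    "steps": 30,
}
NUMERIC_RANGES = {
    "frames": (10, 100),
    "scale_a": (1, 10),
    "scale_t": (1, 10),
    "steps": (1, 100),
}


def build_payload(**overrides):
    """Merge overrides into the defaults and range-check the numeric fields."""
    payload = {**DEFAULTS, **overrides}
    for name, (low, high) in NUMERIC_RANGES.items():
        value = payload[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
    return payload
```

For example, `build_payload(frames=60, mode="TIA")` returns a dictionary that can be passed as the `json` argument of `requests.post`, while `build_payload(frames=5)` raises `ValueError`.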

To keep track of your credit usage, inspect the response headers of each API call. The x-remaining-credits header reports the number of credits left in your account. Monitor this value to avoid disruptions in your API usage.
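
Reading that header can look like the sketch below. The function name `remaining_credits` is hypothetical, and the code assumes the header value is a plain number; if the header is missing or not numeric, it returns `None` instead of guessing.

```python
from typing import Mapping, Optional


def remaining_credits(headers: Mapping[str, str]) -> Optional[float]:
    """Return the x-remaining-credits value as a float, or None if unavailable."""
    for name, value in headers.items():
        # Header names are case-insensitive, so compare in lowercase.
        if name.lower() == "x-remaining-credits":
            try:
                return float(value)
            except ValueError:
                return None
    return None
```

With a `requests` response, call it as `remaining_credits(response.headers)` after each request and warn or stop when the value gets low.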

Resources to get you started

Everything you need to know to get the most out of Bytedance HuMo: Human-Centric Video Generation

HuMo: Multimodal Human Video Generation Model

HuMo is a state-of-the-art video synthesis engine that transforms text, images, and audio into high-quality, human-centric videos. Whether you’re producing character animations, music-driven performances, or instructional clips, this guide will help you unlock HuMo’s full potential.

Getting Started

  1. Choose your generation mode:
    • Text + image: Define appearance and scene.
    • Text + audio (mode TA): Drive motion with sound.
    • Full multimodal (mode TIA): Combine text, reference images, and audio.
  2. Install dependencies and load the HuMo model via your preferred SDK.
  3. Prepare input assets: clear text prompts, high-quality reference images, and well-edited audio tracks.

Core Parameters

  • video_resolution (string, required)
    • 1080p for final delivery (slightly longer render times)
    • 720p for rapid iteration
  • frame_rate (integer, required)
    • 30 fps for standard motion
    • 60 fps for fluid, dynamic scenes
  • duration (integer, optional)
    • 10–60 seconds (short clips vs. extended sequences)
  • background_music (string, optional)
    • “upbeat.mp3” for energetic scenes
    • “nature.wav” for calm, ambient moods
  • human_pose (string, required)
    • “walking”, “standing”, “running” or custom labels
  • scene (string, optional)
    • “city”, “beach”, “studio” etc.
  • emotion (string, optional)
    • “happy”, “neutral”, “sad” to influence facial/body language
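
The clip-length guidance above is given in seconds and frames per second, while the Attributes section of this page exposes length as `frames` (10 to 100). The sketch below converts one to the other. It assumes the output plays back at a fixed frame rate that you supply, which this page does not state, and `frames_for_duration` is a hypothetical helper name.

```python
def frames_for_duration(seconds: float, fps: int,
                        min_frames: int = 10, max_frames: int = 100) -> int:
    """Convert a target duration to a frame count, clamped to the API range."""
    if seconds <= 0 or fps <= 0:
        raise ValueError("seconds and fps must be positive")
    wanted = round(seconds * fps)
    # Clamp to the documented range for the frames attribute.
    return max(min_frames, min(max_frames, wanted))
```

Under that assumption, a 2-second clip at 30 fps needs 60 frames, and anything longer than about 3.3 seconds at 30 fps is capped at the 100-frame maximum.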

Parameter Recommendations by Use Case

  1. Digital Content Creation
    • Resolution: 1080p
    • Frame Rate: 30
    • Duration: 15–30s
    • Pose: “standing” or “walking”
    • Scene: specify context (e.g., “modern loft”)
    • Emotion: “neutral” for product showcases
  2. Music-Synced Performances
    • Resolution: 720p (faster presets)
    • Frame Rate: 60
    • Duration: match track length (30–60s)
    • Background Music: “upbeat.mp3”
    • Pose: “dancing” or “running”
    • Emotion: “happy”
  3. Educational Demonstrations
    • Resolution: 1080p
    • Frame Rate: 30
    • Duration: 20–60s
    • Pose: “standing”, “pointing”
    • Scene: “classroom” or “laboratory”
    • Emotion: “neutral” or “focused”
  4. Marketing & Ads
    • Resolution: 1080p
    • Frame Rate: 30
    • Duration: 10–20s
    • Pose: “walking” toward camera
    • Scene: branded environment
    • Emotion: “happy”
  5. Virtual Pre-viz
    • Resolution: 720p
    • Frame Rate: 30
    • Duration: 10–30s
    • Pose: “running” or “fighting”
    • Scene: “forest” or “urban street”
    • Emotion: “intense”

Prompting Tips

  • Start with a vivid scene descriptor: “sunlit seaside boardwalk.”
  • Embed human_pose and emotion early: “(human_pose: dancing, emotion: joyful).”
  • Reference image quality matters: crisp, well-lit, front-facing shots.
  • Sync audio beats: align key frames with musical peaks.
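
The first two tips can be combined into a small prompt builder. This is an illustrative sketch only: `build_prompt` is a hypothetical helper, and the parenthesized tag syntax simply copies the example in the tip above; this page does not confirm that the model parses it in any special way.

```python
def build_prompt(scene: str, action: str = "",
                 human_pose: str = "", emotion: str = "") -> str:
    """Put the scene first, then pose/emotion tags, then the action text."""
    parts = [scene.strip().rstrip(".") + "."]
    tags = ", ".join(
        f"{key}: {value}"
        for key, value in (("human_pose", human_pose), ("emotion", emotion))
        if value
    )
    if tags:
        parts.append(f"({tags})")
    if action.strip():
        parts.append(action.strip())
    return " ".join(parts)


prompt = build_prompt(
    "sunlit seaside boardwalk",
    action="A woman dances to the music",
    human_pose="dancing",
    emotion="joyful",
)
```

Keeping the scene descriptor first and the tags directly after it follows the order suggested in the tips.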

By following these guidelines and fine-tuning HuMo’s parameters, you can generate polished, expressive human videos suited to any project. Enjoy creating!
