SAM-Audio: Audio Source Separation Model
What is SAM-Audio?
SAM-Audio is a foundation AI model from Meta designed for audio source separation: isolating a target sound (like “drums”, “speech”, or “siren”) from a mixed recording. Instead of separating only fixed stems, SAM-Audio aims to segment any sound you describe, making it useful for modern audio editing pipelines, sound event detection, and multimedia analysis.
On Segmind, you provide an audio input and a sound description prompt. The model returns an isolated track containing the requested source, enabling workflows like “extract vocals from a song”, “remove background noise”, or “pull out footsteps from a scene”.
Key Features
- Prompted sound isolation using natural language text (e.g., “keyboard typing”, “female narration”).
- Fine-grained separation for complex mixes (music, ambience, dialogue + effects).
- Developer-friendly inputs: audio via URL or Base64, with selectable output format.
- Quality tuning with reranking to improve the best-candidate separation.
- Strong fit for automated pipelines (moderation, indexing, annotation, post-production).
Best Use Cases
- Audio editing & post-production: isolate dialogue, ambience, SFX, instruments.
- Content creation: remixing, stem-like extraction, cleaner voiceovers.
- Sound event detection: extract target events before classification or labeling.
- Multimedia & video analysis: separate scene sounds for search and retrieval.
- Accessibility: enhance speech tracks for transcription and captioning.
Prompt Tips and Output Quality
- Be specific: “snare drum hits” often separates better than “drums”.
- Include context: “crowd cheering in a stadium” vs. “cheering”.
- If multiple similar sources exist, add qualifiers: “lead vocal”, “background chatter”.
- Use `output_format: wav` for highest fidelity; `mp3` for smaller files.
- Increase `reranking_candidates` (1–8) when the separation is close but imperfect; higher values typically improve selection at the cost of more computation.
Core parameters
- `audio` (required): URL or Base64 string for the input audio.
- `description` (required): the sound to isolate.
- `output_format` (optional): `wav` or `mp3` (default `wav`).
- `reranking_candidates` (optional, advanced): candidate count (default `4`).
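As a minimal sketch of how these parameters fit together, the helper below encodes a local file as Base64 and assembles a request payload with the fields listed above. The endpoint URL, auth header name, and `requests` usage in the trailing comment are assumptions for illustration only; check Segmind's API documentation for the exact call.

```python
import base64

def build_sam_audio_payload(audio_path: str, description: str,
                            output_format: str = "wav",
                            reranking_candidates: int = 4) -> dict:
    """Assemble a SAM-Audio request payload from a local audio file.

    Field names mirror the core parameters above; the wrapper itself is a
    hypothetical convenience helper, not part of any official SDK.
    """
    if output_format not in ("wav", "mp3"):
        raise ValueError("output_format must be 'wav' or 'mp3'")
    if not 1 <= reranking_candidates <= 8:
        raise ValueError("reranking_candidates must be between 1 and 8")
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "audio": audio_b64,                          # Base64 input (a URL string also works)
        "description": description,                  # the sound to isolate
        "output_format": output_format,              # wav for fidelity, mp3 for size
        "reranking_candidates": reranking_candidates,
    }

# Sending the payload (hypothetical endpoint shown; requires the `requests`
# package and your API key):
# import requests
# resp = requests.post("https://api.segmind.com/v1/sam-audio",
#                      headers={"x-api-key": "YOUR_API_KEY"},
#                      json=build_sam_audio_payload("mix.wav", "lead vocal"))
```

The validation mirrors the documented ranges, so a bad `output_format` or an out-of-range `reranking_candidates` fails fast on the client rather than in the API call.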
FAQs
Is SAM-Audio open-source?
Meta publishes research assets for SAM-Audio, but licensing and usage terms may vary by distribution. Check the upstream repository/terms for your deployment scenario.
How is SAM-Audio different from stem splitters (vocals/drums/bass)?
It’s prompt-driven: you can target any described sound, not only fixed music stems.
What should I put in `description` for best results?
Use a concise noun phrase plus qualifiers (instrument, source, environment), e.g., “male speech in a car”, “dog barking”, “hi-hat pattern”.
Should I choose WAV or MP3 output?
Choose WAV for editing and evaluation; choose MP3 for lightweight previews and distribution.
What does `reranking_candidates` do?
It controls how many separation candidates are generated and reranked; increasing it can improve the final isolated track when prompts are ambiguous.