# separate-audio

> Text-guided audio source separation using SAM-Audio via mlx-audio

- Author: Jiao Dong
- Repository: Nuva-Lab/vibecut
- Version: 20260131095536
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/Nuva-Lab/vibecut
- Web: https://mule.run/skillshub/@@Nuva-Lab/vibecut~separate-audio:20260131095536

---

---
name: separate-audio
description: Text-guided audio source separation using SAM-Audio via mlx-audio
---

# separate-audio

Isolate specific sounds from an audio file using natural-language text prompts. The skill runs Meta's SAM-Audio model through mlx-audio for native inference on Apple silicon Macs (M2/M3).

## Capabilities

- **Text prompts**: Describe what to extract ("man speaking", "piano", "applause")
- **Time span hints**: Specify when the target sound occurs for better isolation
- **Source separation**: Get both the target sound and the residual (everything else)

## Usage

```bash
# Extract speaker by description
python skills/separate-audio/separate.py panel.wav --prompt "man speaking" --output speaker.wav

# Extract with time hint
python skills/separate-audio/separate.py video.mp4 --prompt "applause" --span 10.5-12.0

# Save both target and residual
python skills/separate-audio/separate.py audio.wav --prompt "woman singing" --save-residual
```
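The `--span` value above is written as `START-END` in seconds. How `separate.py` actually parses it is not shown here, so the following is only a minimal sketch: `parse_span` is a hypothetical helper that turns such a string into a `(start, end)` tuple of floats and rejects obviously invalid input.

```python
def parse_span(text: str) -> tuple[float, float]:
    """Parse a 'START-END' string in seconds, e.g. '10.5-12.0'.

    Hypothetical helper for illustration; not taken from separate.py.
    """
    # Split on the first hyphen only; both sides must be plain numbers.
    start_text, sep, end_text = text.partition("-")
    if not sep:
        raise ValueError(f"expected START-END, got {text!r}")
    start, end = float(start_text), float(end_text)
    if start < 0 or end <= start:
        raise ValueError(f"span must satisfy 0 <= start < end, got {text!r}")
    return start, end
```

Validating the span before loading the model keeps a typo such as a reversed range from costing a full inference run.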

## Use Cases for Video Production

| Use Case | Prompt Example |
|----------|----------------|
| Extract single speaker | "man speaking about investments" |
| Remove background music | "background music" (keep the residual, discard the target) |
| Isolate applause | "audience applause" |
| Clean panel discussion | Run multiple times with different prompts |

## Programmatic Usage

```python
from separate import separate_audio

result = separate_audio(
    audio_path="panel.wav",
    prompt="man speaking about space",
    output_path="speaker.wav",
    span=(10.5, 12.0),  # Optional time hint
)
print(result["target_path"])
```
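For cases such as a panel discussion, where the same file is separated several times with different prompts, the call above can be wrapped in a loop. This is a rough sketch, not part of the skill: `separate_many` and its file-naming scheme are hypothetical, and it assumes the function passed in behaves like `separate_audio` (accepts `audio_path`, `prompt`, and `output_path`, and returns a dict with a `"target_path"` key).

```python
from pathlib import Path
from typing import Callable


def separate_many(
    audio_path: str,
    prompts: list[str],
    separate_fn: Callable[..., dict],
    output_dir: str = "stems",
) -> dict[str, str]:
    """Run one separation pass per prompt and map each prompt to its output file.

    Hypothetical helper: `separate_fn` is assumed to behave like
    `separate_audio`, returning a dict with a "target_path" key.
    """
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for index, prompt in enumerate(prompts):
        # Derive a simple file name from the prompt, e.g. "00_man_speaking.wav".
        slug = "_".join(prompt.lower().split())
        output_path = out_dir / f"{index:02d}_{slug}.wav"
        result = separate_fn(
            audio_path=audio_path,
            prompt=prompt,
            output_path=str(output_path),
        )
        results[prompt] = result["target_path"]
    return results
```

It could be called as `separate_many("panel.wav", ["man speaking", "woman speaking"], separate_audio)`. Passing the separation function as an argument keeps the loop easy to exercise with a stub, without loading the model.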

## Notes

- Requires mlx-audio: `pip install mlx-audio`
- Runs natively on Apple silicon Macs (M2/M3) via the MLX framework
- The first run downloads the SAM-Audio model (~2 GB)
- Works best with clear, specific descriptions
- Time spans help isolate sounds at specific moments
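
The first two requirements above can be checked before the model download starts. The sketch below is illustrative only: `check_environment` is a hypothetical helper, and it assumes the `mlx-audio` distribution is importable under the module name `mlx_audio`.

```python
import importlib.util
import platform


def check_environment() -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    # The notes target Apple silicon Macs, which report as Darwin / arm64.
    if platform.system() != "Darwin" or platform.machine() != "arm64":
        problems.append("expected macOS on Apple silicon (Darwin/arm64)")
    # Assumption: the mlx-audio package is importable as `mlx_audio`.
    if importlib.util.find_spec("mlx_audio") is None:
        problems.append("mlx-audio not found; try `pip install mlx-audio`")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(f"warning: {problem}")
```

Note that a Python interpreter running under Rosetta may report `x86_64` even on an Apple silicon machine, so a warning from this check is a hint rather than a verdict.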

## Status

This skill is implemented but not extensively tested in the main video pipeline. The primary audio workflow uses Qwen3-ForcedAligner for caption alignment. SAM-Audio is available for advanced use cases like:
- Cleaning up panel discussion audio
- Extracting speaker voices for analysis
- Separating background noise from speech