# audio-transcribe

> Transcribe local audio files to SRT subtitles and TXT transcripts with timestamps using faster-whisper. Use when Claude needs to convert local audio files (MP3, M4A, WAV, etc.) into text transcripts with or without timestamps, generate subtitle files from audio, or perform speech-to-text transcription. Supports multiple Whisper model sizes for accuracy/speed tradeoffs, language specification, VAD filtering, and custom output directories.

- Author: yechw
- Repository: yechw/skills
- Version: 20260128235131
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-08
- Source: https://github.com/yechw/skills
- Web: https://mule.run/skillshub/@@yechw/skills~audio-transcribe:20260128235131

---

---
name: audio-transcribe
description: Transcribe local audio files to SRT subtitles and TXT transcripts with timestamps using faster-whisper. Use when Claude needs to convert local audio files (MP3, M4A, WAV, etc.) into text transcripts with or without timestamps, generate subtitle files from audio, or perform speech-to-text transcription. Supports multiple Whisper model sizes for accuracy/speed tradeoffs, language specification, VAD filtering, and custom output directories.
---

# Local Audio Transcription

Transcribe local audio files to SRT subtitles and TXT transcripts using faster-whisper speech-to-text.

## Quick Start

Transcribe an audio file:

```bash
scripts/transcribe_audio.py /path/to/audio.mp3
```

This generates `audio.srt` and `audio.txt` in the same directory as the audio file.

**Note**: If you get an onnxruntime error or are using Python 3.14+, use:

```bash
scripts/transcribe_audio.py /path/to/audio.mp3 --no-vad
```

## Command-Line Options

### Positional Arguments

- `audio_path` - Path to local audio file

### Optional Arguments

- `--output-dir PATH` - Output directory (default: same as audio file)
- `--model SIZE` - Whisper model size: `tiny` (default), `base`, `small`, `medium`, `large`, `large-v1`, `large-v2`, `large-v3`
- `--language CODE` - Force a language code (e.g., `zh`, `en`); omit to auto-detect the language
- `--no-vad` - Disable VAD (Voice Activity Detection) filtering
- `--device {auto,cpu,cuda}` - Device for transcription (default: `auto`)
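
The options above map naturally onto an `argparse` parser. The following is a minimal sketch, assuming the script uses `argparse`; the actual parser in `scripts/transcribe_audio.py` may differ, and `build_parser` and `MODEL_SIZES` are hypothetical names used only for illustration.

```python
import argparse

# Model sizes as listed in the documentation above.
MODEL_SIZES = ["tiny", "base", "small", "medium", "large",
               "large-v1", "large-v2", "large-v3"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Transcribe a local audio file to SRT and TXT.")
    parser.add_argument("audio_path", help="Path to local audio file")
    parser.add_argument("--output-dir", default=None,
                        help="Output directory (default: same as audio file)")
    parser.add_argument("--model", default="tiny", choices=MODEL_SIZES,
                        help="Whisper model size")
    parser.add_argument("--language", default=None,
                        help="Language code such as zh or en; omit to auto-detect")
    parser.add_argument("--no-vad", action="store_true",
                        help="Disable VAD filtering")
    parser.add_argument("--device", default="auto",
                        choices=["auto", "cpu", "cuda"],
                        help="Device for transcription")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(["interview.mp3", "--language", "zh", "--no-vad"])
print(args.model, args.language, args.no_vad, args.device)  # tiny zh True auto
```

Unspecified options fall back to the documented defaults: `tiny` for the model and `auto` for the device.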

## Common Workflows

### Higher accuracy transcription

```bash
scripts/transcribe_audio.py episode.m4a --model small
```

### Force Chinese language

```bash
scripts/transcribe_audio.py interview.mp3 --language zh
```

### Custom output directory

```bash
scripts/transcribe_audio.py recording.wav --output-dir ~/Documents/transcripts
```

### Disable VAD for quiet audio

```bash
scripts/transcribe_audio.py quiet_podcast.mp3 --no-vad
```

### Use GPU for faster processing

```bash
scripts/transcribe_audio.py long_audio.mp3 --device cuda --model base
```

## Output Files

Generates two files:

1. **`{basename}.srt`** - SubRip subtitle file
   - Format: `sequence_number`, `start_time --> end_time`, `subtitle_text`
   - Compatible with video players and subtitle editors

2. **`{basename}.txt`** - Plain text transcript with timestamps
   - Format: `[HH:MM:SS,mmm] transcript_text`
   - Human-readable with embedded timestamps
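
To make the two layouts concrete, here is a minimal sketch of how a single segment could be rendered in each format. The helper names (`format_timestamp`, `srt_block`, `txt_line`) are hypothetical and not taken from the script; only the `HH:MM:SS,mmm` layout and the field order come from the description above.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_block(index: int, start: float, end: float, text: str) -> str:
    """Render one SRT entry: sequence number, time range, subtitle text."""
    return (f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text.strip()}\n")


def txt_line(start: float, text: str) -> str:
    """Render one transcript line with a leading timestamp."""
    return f"[{format_timestamp(start)}] {text.strip()}"


print(srt_block(1, 0.0, 2.5, " Hello and welcome."))
print(txt_line(3661.5, " Back to the topic."))  # [01:01:01,500] Back to the topic.
```

In a complete SRT file, consecutive entries are separated by a blank line.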

## Model Selection

Trade off between speed and accuracy:

| Model | Speed | Accuracy | Use Case |
|-------|-------|----------|----------|
| tiny | ⚡️ Fastest | ✗ Lowest | Quick drafts, testing |
| base | 🚀 Fast | ✓ Good | Everyday use |
| small | ⏱️ Medium | ✓✓ Better | Production quality |
| medium | 🐢 Slow | ✓✓✓ High | Important content |
| large | 🐌 Slowest | ✓✓✓✓ Best | Critical accuracy needed |

## Dependencies

Packages:
- `faster-whisper` (required) - Whisper speech-to-text implementation
- `torch` (optional) - Device detection for GPU/CUDA
- `onnxruntime` (optional) - VAD filtering; Python 3.13 and earlier only

### Setup

**Option 1: Virtual environment (recommended)**

```bash
# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install faster-whisper torch

# For VAD support (Python 3.13 and earlier only)
pip install 'onnxruntime<2,>=1.14'
```

**Option 2: User-level installation (no virtual environment)**

```bash
pip install --user faster-whisper torch
pip install --user 'onnxruntime<2,>=1.14'  # Python 3.13 and earlier only
```

**For detailed setup instructions and troubleshooting**, see [setup.md](references/setup.md).

### Important: Python Version Compatibility

- **Python 3.14+**: VAD filtering is not yet supported (onnxruntime incompatible). Use `--no-vad` flag.
- **Python 3.13 and earlier**: Full VAD support available after installing onnxruntime.
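
One way a script could decide whether VAD is usable is to check whether `onnxruntime` is installed before enabling the filter. This is a sketch under that assumption, not the script's actual logic; `vad_available` and `resolve_vad` are hypothetical helpers. Note that `find_spec` only confirms the package can be located, not that it imports cleanly.

```python
import importlib.util


def vad_available() -> bool:
    """Return True if the onnxruntime package can be located."""
    return importlib.util.find_spec("onnxruntime") is not None


def resolve_vad(no_vad_flag: bool) -> bool:
    """Decide whether to enable VAD filtering.

    An explicit --no-vad always wins; otherwise VAD is enabled only
    when onnxruntime is installed.
    """
    if no_vad_flag:
        return False
    return vad_available()


print("VAD enabled:", resolve_vad(no_vad_flag=False))
```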

## How It Works

1. Load Whisper model with specified size and device
2. Transcribe audio with language detection and VAD filtering
3. Generate SRT file with subtitle timestamps
4. Generate TXT file with readable transcript
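
The first two steps can be sketched with the public faster-whisper API (`WhisperModel` and its `transcribe` method). This is an illustrative outline, not the script's actual code: `output_paths` and `run_transcription` are hypothetical names, and the import is deferred so the path helper works even where faster-whisper is not installed.

```python
from pathlib import Path


def output_paths(audio_path, output_dir=None):
    """Derive the .srt and .txt paths; default to the audio file's directory."""
    audio = Path(audio_path)
    out_dir = Path(output_dir).expanduser() if output_dir else audio.parent
    return out_dir / f"{audio.stem}.srt", out_dir / f"{audio.stem}.txt"


def run_transcription(audio_path, model_size="tiny", device="auto",
                      language=None, vad=True):
    """Load the model and transcribe (steps 1 and 2)."""
    from faster_whisper import WhisperModel  # deferred: optional at import time

    model = WhisperModel(model_size, device=device)
    # language=None lets the model detect the language itself.
    segments, info = model.transcribe(str(audio_path), language=language,
                                      vad_filter=vad)
    # `segments` is a generator; iterating it is what runs the transcription.
    rows = [(seg.start, seg.end, seg.text.strip()) for seg in segments]
    return rows, info.language


srt_path, txt_path = output_paths("/path/to/audio.mp3")
print(srt_path, txt_path)
```

Steps 3 and 4 then amount to iterating over the returned `(start, end, text)` rows and writing each one in SRT and TXT form.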

## Important Notes

- **Audio formats**: Supports all formats accepted by ffmpeg (MP3, M4A, WAV, FLAC, etc.)
- **Language detection**: Auto-detection works well in most cases; pass `--language` when the language is known, to skip detection and avoid misidentification
- **VAD filtering**: Removes silence segments by default when onnxruntime is installed
  - Automatically disabled if onnxruntime is not available
  - Use `--no-vad` to explicitly disable (recommended for Python 3.14+)
- **GPU support**: Automatically uses CUDA if available; install PyTorch with CUDA support for GPU acceleration
- **Python 3.14+ compatibility**: VAD filtering not supported; use `--no-vad` flag or downgrade to Python 3.13 for VAD support
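
The GPU note can be illustrated with a small device-resolution helper. This is a sketch of one plausible approach, assuming `torch` is used only to probe for CUDA as the dependency list suggests; `resolve_device` is a hypothetical name and the script may resolve the device differently.

```python
def resolve_device(requested: str = "auto") -> str:
    """Map a --device value to a concrete device name.

    Explicit choices are returned unchanged. For "auto", CUDA is used
    only when torch is installed and reports an available GPU.
    """
    if requested != "auto":
        return requested
    try:
        import torch  # optional dependency, used only for detection
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(resolve_device("auto"))
```

Falling back to `cpu` when `torch` is missing keeps the GPU dependency optional.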