# audio-transcribe

> Transcribe audio files with speaker diarization using MLX-Whisper + pyannote-audio. Generates speaker-attributed transcripts and meeting analysis.

- Author: Baz Hand
- Repository: bazhand/ai-agent-skills
- Version: 20260126213349
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/bazhand/ai-agent-skills
- Web: https://mule.run/skillshub/@@bazhand/ai-agent-skills~audio-transcribe:20260126213349

---

---
name: audio-transcribe
description: Transcribe audio files with speaker diarization using MLX-Whisper + pyannote-audio. Generates speaker-attributed transcripts and meeting analysis.
license: MIT
context: main
compatibility: Requires Python 3.11+, MLX-Whisper, pyannote-audio, HuggingFace account
allowed-tools:
  - Read
  - Write
  - Bash
  - AskUserQuestion
---

# Audio Transcribe

End-to-end audio → transcript → meeting analysis pipeline.

## When to Use

Trigger when user asks to:
- "Transcribe this audio"
- "Transcribe the Hyprnote session"
- "Generate transcript with speakers"
- "Diarize this audio file"
- "Analyze this meeting"
- "Extract action items from this call"
- "Process this transcript"

---

## Setup & Requirements

### Python Environment

```bash
# Create environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies (quote the version specifier so the shell does not treat ">" as a redirect)
pip install mlx-whisper "pyannote-audio>=3.0" torch soundfile
```

### HuggingFace Setup (Required for pyannote)

pyannote models require HuggingFace authentication and model license acceptance:

```bash
# 1. Log in to HuggingFace
huggingface-cli login

# 2. Accept model licenses at:
# https://huggingface.co/pyannote/speaker-diarization-3.1
# https://huggingface.co/pyannote/segmentation-3.0

# 3. First run will download models (~1GB)
```

### Hardware Acceleration

- **MPS (Metal)**: pyannote uses MPS for GPU acceleration on Apple Silicon
- **MLX**: Whisper uses MLX native acceleration
- **Memory**: The large-v3 model needs roughly 8 GB or more of unified memory

---

## Architecture

| Component | Purpose |
|-----------|---------|
| MLX-Whisper | Transcription (3-5x faster on Apple Silicon) |
| pyannote-audio 3.1 | Speaker diarization (voice embedding based) |
| large-v3 model | Default Whisper model for best accuracy |
| 8-section framework | Structured meeting analysis |
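
The table does not say how the Whisper segments and the diarization turns are joined. One common approach, shown here purely as an assumption about how such a pipeline can work, is to give each transcribed segment the speaker whose diarization turn overlaps it the most. The `Turn` type and `assign_speaker` function are hypothetical names, not part of `transcribe.py`.

```python
from typing import NamedTuple


class Turn(NamedTuple):
    """One diarization turn (hypothetical shape, times in seconds)."""
    start: float
    end: float
    speaker: str


def assign_speaker(seg_start: float, seg_end: float, turns: list[Turn]) -> str:
    """Return the speaker whose turn overlaps [seg_start, seg_end] the most."""
    best_speaker, best_overlap = "Unknown", 0.0
    for turn in turns:
        # Overlap is negative or zero when the intervals do not intersect.
        overlap = min(seg_end, turn.end) - max(seg_start, turn.start)
        if overlap > best_overlap:
            best_speaker, best_overlap = turn.speaker, overlap
    return best_speaker


turns = [Turn(0.0, 4.0, "Speaker 1"), Turn(4.0, 9.0, "Speaker 2")]
print(assign_speaker(3.0, 8.0, turns))  # Speaker 2 (4.0 s of overlap vs 1.0 s)
```

Segments that overlap no turn come back as `"Unknown"`, which makes gaps in the diarization easy to spot during cleanup.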

## Input Formats

**Audio files:** WAV, MP3, M4A, FLAC

**Existing transcripts:** Speaker-attributed markdown, VTT, JSON, JSONL
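
A minimal sketch of routing an input by file extension, assuming the formats above map to the suffixes shown. `classify_input` and the two extension sets are hypothetical and only illustrate the check.

```python
from pathlib import Path

# Suffixes assumed from the format lists above.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac"}
TRANSCRIPT_EXTENSIONS = {".md", ".vtt", ".json", ".jsonl"}


def classify_input(path: str) -> str:
    """Return "audio" or "transcript"; raise ValueError for anything else."""
    suffix = Path(path).suffix.lower()
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    if suffix in TRANSCRIPT_EXTENSIONS:
        return "transcript"
    raise ValueError(f"Unsupported input format: {suffix or '(no extension)'}")


print(classify_input("/path/to/audio.wav"))  # audio
```

An `"audio"` result would start at Phase 1 of the workflow, while a `"transcript"` result could skip straight to cleanup and analysis.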

## Output Files

| File | Purpose |
|------|---------|
| `*_transcript.md` | Speaker-attributed dialogue |
| `*_summary.md` | 8-section meeting analysis |

**Filename pattern**: `{date}_{first_name}-{company}_{topic}_{suffix}.md`
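
The pattern can be filled in as sketched below. The lowercase, hyphen-separated slugs and the ISO date are assumptions (the pattern itself does not specify them), and `slug`, `build_filename`, and the sample values are hypothetical.

```python
import re
from datetime import date


def slug(text: str) -> str:
    """Lowercase and collapse non-alphanumeric runs into single hyphens (assumed convention)."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_filename(day: date, first_name: str, company: str, topic: str, suffix: str) -> str:
    """Fill the {date}_{first_name}-{company}_{topic}_{suffix}.md pattern."""
    return f"{day.isoformat()}_{slug(first_name)}-{slug(company)}_{slug(topic)}_{suffix}.md"


print(build_filename(date(2026, 1, 26), "Elena", "Acme Corp", "Pricing Review", "transcript"))
# 2026-01-26_elena-acme-corp_pricing-review_transcript.md
```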

---

## Workflow

### Phase 1: Transcription (if audio input)

1. **Identify audio file** - Get the path from the user and verify that the format is supported

2. **Check for speaker hints** - Ask user about speaker names if known

3. **Run diarization + transcription**
   ```bash
   source .venv/bin/activate
   python transcribe.py [--speakers "1=Name,2=Name"] /path/to/audio.wav
   ```

4. **Verify speaker attribution** - Show first few utterances to user

### Phase 2: Manual Cleanup (IMPORTANT)

5. **Check for Whisper hallucinations** - Look for repeated phrases like "yeah yeah yeah"

6. **Clean the transcript manually**
   - Replace "Speaker N" with actual names
   - Remove hallucinated repetitions
   - Smooth transitions

### Phase 3: Analysis

7. **Generate 8-section analysis** - See [references/analysis-framework.md](references/analysis-framework.md)

8. **Return paths to both files**

---

## Transcription Commands

### Basic
```bash
source .venv/bin/activate
python transcribe.py /path/to/audio.wav
```

### With Speaker Names
```bash
python transcribe.py --speakers "1=Elena,2=Baz" /path/to/audio.wav
```
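
For illustration, the `--speakers` value could be parsed into a label-to-name mapping as sketched below. This is an assumption about the format, not the actual parsing in `transcribe.py`; `parse_speakers` is a hypothetical helper.

```python
def parse_speakers(spec: str) -> dict[str, str]:
    """Parse "1=Elena,2=Baz" into {"Speaker 1": "Elena", "Speaker 2": "Baz"}."""
    mapping: dict[str, str] = {}
    for pair in spec.split(","):
        if not pair.strip():
            continue  # tolerate trailing commas
        number, sep, name = pair.partition("=")
        if not sep or not number.strip().isdecimal() or not name.strip():
            raise ValueError(f"Expected N=Name, got {pair!r}")
        mapping[f"Speaker {int(number)}"] = name.strip()
    return mapping


print(parse_speakers("1=Elena,2=Baz"))
# {'Speaker 1': 'Elena', 'Speaker 2': 'Baz'}
```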

### Different Model
```bash
python transcribe.py --model medium /path/to/audio.wav
```

## Model Options

| Model | Speed | Quality | When to Use |
|-------|-------|---------|-------------|
| tiny | Fastest | Basic | Quick previews |
| base | Fast | OK | Testing |
| small | Medium | Good | Balance |
| medium | Slower | Better | Most uses |
| large-v3 | Slowest | Best | Final output (default) |

---

## Known Issues & Solutions

### Whisper Hallucination (CRITICAL)

**Problem**: Whisper hallucinates repetitive text during silence:
- "yeah yeah yeah yeah yeah"
- "absolutely absolutely absolutely"

**Solution**: Manual cleanup required. This is a fundamental Whisper limitation.

**Cleanup pattern**:
```markdown
# Before (raw output)
**[04:54] Speaker 1**: ...meeting all top managers. channel channel channel

# After (cleaned)
**[04:54] Elena**: ...meeting all top managers and distribution channels.
```
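
Cleanup stays manual, but the suspicious lines can be flagged automatically. The sketch below assumes that a word repeated three or more times in a row is worth reviewing; the threshold and the `find_repetitions` helper are hypothetical, and the function only reports runs without rewriting anything.

```python
import re

# A word repeated three or more times in a row, e.g. "yeah yeah yeah".
REPEAT_RUN = re.compile(r"\b(\w+)(?:\s+\1\b){2,}", re.IGNORECASE)


def find_repetitions(text: str) -> list[str]:
    """Return each run of an immediately repeated word found in text."""
    return [match.group(0) for match in REPEAT_RUN.finditer(text)]


line = "**[04:54] Speaker 1**: ...meeting all top managers. channel channel channel"
print(find_repetitions(line))  # ['channel channel channel']
```

A threshold of three avoids flagging legitimate doubles such as "that that". Deciding what the speaker actually said still requires reading the surrounding context.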

### Speaker Count Mismatch

**Problem**: Diarization detects more speakers than expected

**Solution**: Use the content to identify which speaker labels belong to the same person, then merge them during cleanup
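
Once the duplicate labels are known, the merge can be applied by mapping several labels to one name. A sketch, assuming transcript lines follow the `**[MM:SS] Speaker N**:` shape used elsewhere in this document (the optional hours field is an assumption); `relabel` and the sample lines are hypothetical.

```python
import re

# Matches the label in lines such as "**[04:54] Speaker 3**: text".
LABEL = re.compile(r"(\*\*\[\d{2}:\d{2}(?::\d{2})?\] )(Speaker \d+)(\*\*:)")


def relabel(transcript: str, names: dict[str, str]) -> str:
    """Replace speaker labels using names; unmapped labels stay unchanged."""
    def swap(match: re.Match) -> str:
        label = match.group(2)
        return match.group(1) + names.get(label, label) + match.group(3)
    return LABEL.sub(swap, transcript)


raw = "**[04:54] Speaker 1**: hello\n**[05:10] Speaker 3**: hello again"
# Speaker 1 and Speaker 3 were judged to be the same person.
print(relabel(raw, {"Speaker 1": "Elena", "Speaker 3": "Elena"}))
```

Only the label inside the bold prefix is rewritten, so a literal "Speaker 1" that appears in the dialogue text is left alone.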

### torchcodec Warning

**Message**: "torchcodec is not installed correctly..."

**Impact**: None. The script falls back to soundfile, so this warning can be ignored.

---

## Identifying Speakers from Content

When speaker numbers don't match names, analyze content for clues:
- Who introduces themselves or others?
- Who asks vs answers questions?
- Native vs non-native English patterns
- Who mentions "we" referring to a specific company?

---

## Quality Checklist

**Before finalizing:**
- [ ] Speaker attribution verified with user
- [ ] All participants identified by name
- [ ] Every action item has an owner
- [ ] Decisions distinguished from discussions
- [ ] Dates use ISO format (YYYY-MM-DD)
- [ ] Executive summary reflects true priorities
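
The ISO date item is the one check here that is easy to automate. A small sketch, assuming dates have already been extracted as strings; `is_iso_date` is a hypothetical helper.

```python
import re
from datetime import date

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(text: str) -> bool:
    """True only for real calendar dates written as YYYY-MM-DD."""
    if not ISO_DATE.fullmatch(text):
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False  # right shape, impossible date (e.g. February 30)
    return True


print(is_iso_date("2026-02-06"), is_iso_date("06/02/2026"))  # True False
```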

---

## Additional Resources

- **[references/analysis-framework.md](references/analysis-framework.md)** - 8-section meeting analysis framework