# audio-transcribe

> Transcribe audio files with speaker diarization using MLX-Whisper + pyannote-audio. Generates speaker-attributed transcripts and meeting analysis.

- Author: Baz Hand
- Repository: bazhand/ai-agent-skills
- Version: 20260126213349
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/bazhand/ai-agent-skills
- Web: https://mule.run/skillshub/@@bazhand/ai-agent-skills~audio-transcribe:20260126213349

---

---
name: audio-transcribe
description: Transcribe audio files with speaker diarization using MLX-Whisper + pyannote-audio. Generates speaker-attributed transcripts and meeting analysis.
license: MIT
context: main
compatibility: Requires Python 3.11+, MLX-Whisper, pyannote-audio, HuggingFace account
allowed-tools:
  - Read
  - Write
  - Bash
  - AskUserQuestion
---

# Audio Transcribe

End-to-end audio → transcript → meeting analysis pipeline.

## When to Use

Trigger when the user asks to:

- "Transcribe this audio"
- "Transcribe the Hyprnote session"
- "Generate transcript with speakers"
- "Diarize this audio file"
- "Analyze this meeting"
- "Extract action items from this call"
- "Process this transcript"

---

## Setup & Requirements

### Python Environment

```bash
# Create environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
# (the version specifier is quoted so the shell does not treat ">" as a redirect)
pip install mlx-whisper "pyannote-audio>=3.0" torch soundfile
```

### HuggingFace Setup (Required for pyannote)

pyannote models require HuggingFace authentication and model license acceptance:

```bash
# 1. Login to HuggingFace
huggingface-cli login

# 2. Accept model licenses at:
#    https://huggingface.co/pyannote/speaker-diarization-3.1
#    https://huggingface.co/pyannote/segmentation-3.0

# 3. First run will download models (~1GB)
```

### Hardware Acceleration

- **MPS (Metal)**: pyannote uses MPS for GPU acceleration on Apple Silicon
- **MLX**: Whisper uses MLX native acceleration
- **Memory**: Requires ~8GB+ unified memory for the large-v3 model

---

## Architecture

| Component | Purpose |
|-----------|---------|
| MLX-Whisper | Transcription (3-5x faster on Apple Silicon) |
| pyannote-audio 3.1 | Speaker diarization (voice embedding based) |
| large-v3 model | Default Whisper model for best accuracy |
| 8-section framework | Structured meeting analysis |

## Input Formats

**Audio files:** WAV, MP3, M4A, FLAC

**Existing transcripts:** Speaker-attributed markdown, VTT, JSON, JSONL

## Output Files

| File | Purpose |
|------|---------|
| `*_transcript.md` | Speaker-attributed dialogue |
| `*_summary.md` | 8-section meeting analysis |

**Filename pattern**: `{date}_{first_name}-{company}_{topic}_{suffix}.md`

---

## Workflow

### Phase 1: Transcription (if audio input)

1. **Identify audio file** - User provides the path; verify the format is supported
2. **Check for speaker hints** - Ask the user for speaker names if known
3. **Run diarization + transcription**

   ```bash
   source .venv/bin/activate
   python transcribe.py [--speakers "1=Name,2=Name"] /path/to/audio.wav
   ```

4. **Verify speaker attribution** - Show the first few utterances to the user

### Phase 2: Manual Cleanup (IMPORTANT)

5. **Check for Whisper hallucinations** - Look for repeated phrases like "yeah yeah yeah"
6. **Clean the transcript manually**
   - Replace "Speaker N" with actual names
   - Remove hallucinated repetitions
   - Smooth transitions

### Phase 3: Analysis

7. **Generate 8-section analysis** - See [references/analysis-framework.md](references/analysis-framework.md)
8. **Return paths to both files**

---

## Transcription Commands

### Basic

```bash
source .venv/bin/activate
python transcribe.py /path/to/audio.wav
```

### With Speaker Names

```bash
python transcribe.py --speakers "1=Elena,2=Baz" /path/to/audio.wav
```

### Different Model

```bash
python transcribe.py --model medium /path/to/audio.wav
```

## Model Options

| Model | Speed | Quality | When to Use |
|-------|-------|---------|-------------|
| tiny | Fastest | Basic | Quick previews |
| base | Fast | OK | Testing |
| small | Medium | Good | Balance |
| medium | Slower | Better | Most uses |
| large-v3 | Slowest | Best | Final output (default) |

---

## Known Issues & Solutions

### Whisper Hallucination (CRITICAL)

**Problem**: Whisper hallucinates repetitive text during silence:

- "yeah yeah yeah yeah yeah"
- "absolutely absolutely absolutely"

**Solution**: Manual cleanup required. This is a fundamental Whisper limitation.

**Cleanup pattern**:

```markdown
# Before (raw output)
**[04:54] Speaker 1**: ...meeting all top managers. channel channel channel

# After (cleaned)
**[04:54] Elena**: ...meeting all top managers and distribution channels.
```

### Speaker Count Mismatch

**Problem**: Diarization detects more speakers than expected

**Solution**: Work out from the content which speaker labels belong to the same person, then merge them during cleanup

### torchcodec Warning

**Message**: "torchcodec is not installed correctly..."

**Impact**: None - the script uses soundfile as a fallback. Ignore this warning.

---

## Identifying Speakers from Content

When speaker numbers don't match names, analyze the content for clues:

- Who introduces themselves or others?
- Who asks vs answers questions?
- Native vs non-native English patterns
- Who mentions "we" referring to a specific company?

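
Once the clues above have settled who is who, applying the names is mechanical. The following is a minimal sketch, not part of `transcribe.py`: the helper names `parse_speaker_map` and `relabel` are hypothetical, and the line shape `**[MM:SS] Speaker N**: text` is assumed from the cleanup example in this document. Mapping two numbers to the same name also covers the speaker count mismatch case, since both labels collapse into one person.

```python
import re

# Assumed line shape, taken from the cleanup example: **[04:54] Speaker 1**: text
SPEAKER_LINE = re.compile(r"^\*\*\[(?P<ts>[\d:]+)\] Speaker (?P<num>\d+)\*\*:")


def parse_speaker_map(spec: str) -> dict[str, str]:
    """Parse a "1=Elena,2=Baz" style string into {"1": "Elena", "2": "Baz"}."""
    mapping = {}
    for pair in spec.split(","):
        if not pair.strip():
            continue  # tolerate a trailing comma
        num, sep, name = pair.partition("=")
        if not sep or not num.strip() or not name.strip():
            raise ValueError(f"expected N=Name, got {pair!r}")
        mapping[num.strip()] = name.strip()
    return mapping


def relabel(transcript: str, mapping: dict[str, str]) -> str:
    """Replace "Speaker N" labels with names; unmapped speakers are left as-is."""
    out = []
    for line in transcript.splitlines():
        m = SPEAKER_LINE.match(line)
        if m and m["num"] in mapping:
            # Keep the timestamp and the utterance, swap only the label.
            line = f"**[{m['ts']}] {mapping[m['num']]}**:" + line[m.end():]
        out.append(line)
    return "\n".join(out)


# Speakers 1 and 3 turned out to be the same person, so both map to "Elena".
raw = "**[04:54] Speaker 1**: Hello.\n**[05:10] Speaker 3**: Hi again."
print(relabel(raw, parse_speaker_map("1=Elena,2=Baz,3=Elena")))
```

Unmapped labels are deliberately left untouched so that any speaker still unidentified stays visible for the attribution check with the user.
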

---

## Quality Checklist

**Before finalizing:**

- [ ] Speaker attribution verified with user
- [ ] All participants identified by name
- [ ] Every action item has an owner
- [ ] Decisions distinguished from discussions
- [ ] Dates use ISO format (YYYY-MM-DD)
- [ ] Executive summary reflects true priorities

---

## Additional Resources

- **[references/analysis-framework.md](references/analysis-framework.md)** - 8-section meeting analysis framework
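
The Phase 2 hallucination check can be partly supported by a script that points out where to look. The sketch below is an illustration only, not part of `transcribe.py`: `find_repetitions` and `flag_lines` are hypothetical helpers, and the threshold of three consecutive identical words is an assumption chosen to match examples such as "channel channel channel". It only flags candidate lines; the rewrite itself stays manual, as described under Known Issues.

```python
import re


def find_repetitions(text: str, min_repeats: int = 3) -> list[tuple[str, int]]:
    """Return (word, run_length) for each word repeated back-to-back min_repeats+ times."""
    # Letters only, so timestamps and other digits are never counted as words.
    words = re.findall(r"[^\W\d_]+", text.lower())
    runs = []
    i = 0
    while i < len(words):
        j = i
        while j + 1 < len(words) and words[j + 1] == words[i]:
            j += 1
        length = j - i + 1
        if length >= min_repeats:
            runs.append((words[i], length))
        i = j + 1
    return runs


def flag_lines(transcript: str, min_repeats: int = 3) -> list[tuple[int, list[tuple[str, int]]]]:
    """Return (1-based line number, runs) for every line with a suspicious repetition."""
    flagged = []
    for lineno, line in enumerate(transcript.splitlines(), start=1):
        runs = find_repetitions(line, min_repeats)
        if runs:
            flagged.append((lineno, runs))
    return flagged


sample = (
    "**[04:54] Speaker 1**: ...meeting all top managers. channel channel channel\n"
    "**[05:02] Speaker 2**: Yes, yes, that makes sense."
)
for lineno, runs in flag_lines(sample):
    print(lineno, runs)
```

A genuine double such as "yes, yes" stays below the default threshold; lowering `min_repeats` to 2 catches more hallucinations at the cost of more false positives to review.
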