# transcribe-audio

> Convert audio files to text using faster-whisper. Supports MP3, MP4, WAV, M4A and other formats. Optimized for speed with multiple model options.

- Author: ethan
- Repository: cgt1994/lesson-summary
- Version: 20260128110435
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/cgt1994/lesson-summary
- Web: https://mule.run/skillshub/@@cgt1994/lesson-summary~transcribe-audio:20260128110435

---

---
name: transcribe-audio
description: Convert audio files to text using faster-whisper. Supports MP3, MP4, WAV, M4A and other formats. Optimized for speed with multiple model options.
argument-hint: [input-file] [--model MODEL] [--language LANG] [--output-dir DIR]
disable-model-invocation: true
allowed-tools: Bash(python:*)
---

# Transcribe Audio to Text

Convert audio files (MP3, MP4, WAV, M4A, etc.) to text using faster-whisper, an optimized Whisper implementation.

## Prerequisites

faster-whisper must be installed:

```bash
pip install -U faster-whisper
```

**Note:** FFmpeg is also required and should already be installed if you're using the convert-to-mp3 skill.

## How to use this skill

Basic usage (auto-detect language):

```bash
python scripts/faster_whisper_test.py "$1" --model tiny --output-dir "${2:-.}"
```

With language specified:

```bash
python scripts/faster_whisper_test.py "$1" --model tiny --language "${2:-auto}" --output-dir "${3:-.}"
```

## Parameters

- `$1` - Input audio file path (required)
- `--model` - Model size: tiny, base, small, medium, large, large-v3 (default: tiny)
- `--language` - Source language: en, zh, ja, etc. (auto-detected if not specified)
- `--output-dir` - Output directory (defaults to same directory as input)

## Available Models

| Model | Size | Speed (× real-time) | Accuracy | Recommended For |
|-------|------|---------------------|----------|-----------------|
| **tiny** | 75MB | **Fastest** (40-60x) | Basic | Quick drafts, previews |
| **base** | 142MB | Fast (11-18x) | Good | General use, balanced |
| **small** | 466MB | Medium (8-12x) | Better | Important content |
| **medium** | 1.5GB | Slow (4-6x) | Very Good | High accuracy needed |
| **large-v3** | 3GB | Slowest (2-4x) | **Best** | Professional transcription |

## Examples

**Quick transcription (tiny model):**

```
/transcribe-audio lesson.mp3
```

Processing time: 53 min audio → ~1-2 minutes

**Better accuracy (base model):**

```
/transcribe-audio lesson.mp3 --model base
```

Processing time: 53 min audio → ~5 minutes

**Chinese content:**

```
/transcribe-audio 课程.mp3 --language zh --model base
```

**High accuracy (medium model):**

```
/transcribe-audio interview.mp3 --model medium
```

Processing time: 53 min audio → ~8 minutes

**Best quality (large-v3):**

```
/transcribe-audio important.mp3 --model large-v3
```

Note: The first use downloads the 3GB model.

## Language Codes

Common languages:

- `en` - English
- `zh` - Chinese (Mandarin)
- `ja` - Japanese
- `ko` - Korean
- `es` - Spanish
- `fr` - French
- Leave blank for auto-detection

## Speed Comparison

For a 53-minute audio file:

- **tiny**: ~80 seconds (40x real-time)
- **base**: ~280 seconds (11x real-time)
- **medium**: ~480 seconds (7x real-time)
- **large-v3**: ~800 seconds (4x real-time)

## Accuracy Tips

1. **For mixed Chinese-English**: Use auto-detection or specify the primary language
2. **For noisy audio**: Use the base model or larger
3. **For lectures/lessons**: The base model is usually sufficient
4. **For professional work**: Use medium or large-v3

## Output Format

The generated output includes:

- Plain text transcript (.txt)
- Timestamps for each segment
- Auto-detected language info
- Processing speed metrics

Example output:

```
✓ Transcription completed!
Input: lesson.mp3 (53:28)
Output: lesson.txt (32.2 KB)
Model: tiny
Language: Chinese (auto-detected)
Time: 80 seconds
Speed: 40x real-time
Location: /Users/ethan/Downloads/lesson.txt
```

## Workflow Integration

**Complete workflow: Video → Text → Email**

```bash
# Step 1: Convert video to audio
/convert-to-mp3 lesson.mp4

# Step 2: Transcribe audio
/transcribe-audio lesson.mp3 --model base

# Step 3: Generate summary email
/generate-email lesson.txt --type summary --to students
```

## Advanced Options

The script uses parameters tuned for speed:

- `beam_size=1` - Faster decoding
- `best_of=1` - Single pass
- `temperature=0` - Deterministic output
- `vad_filter=True` - Remove silence
- `condition_on_previous_text=False` - Faster processing

## Model Download

Models are auto-downloaded on first use:

- Cached at: `~/.cache/huggingface/hub/`
- Downloaded only once
- The cache can be deleted to free space

## Error Handling

Common issues:

- **"faster-whisper not installed"**: Run `pip install faster-whisper`
- **"FFmpeg not found"**: Install FFmpeg first
- **"Out of memory"**: Use a smaller model (tiny or base)
- **"File not found"**: Check the file path

## Performance Tips

1. **Start with tiny**: Test with the tiny model first
2. **Upgrade if needed**: If accuracy is poor, try base or medium
3. **Batch processing**: Process multiple files in sequence
4. **Use appropriate model**: Don't use large-v3 unless necessary

## Use Cases

- **Teachers**: Transcribe lesson recordings
- **Students**: Convert lecture audio to text
- **Professionals**: Meeting transcriptions
- **Content Creators**: Video/podcast transcripts
- **Researchers**: Interview transcriptions
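
The real-time factors quoted under Speed Comparison are the audio duration divided by the processing time. A minimal sketch of that arithmetic, using the 53:28 example file from Output Format; the function names are illustrative and are not part of the skill's script:

```python
def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Return how many seconds of audio are transcribed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def estimated_processing_seconds(audio_seconds: float, factor: float) -> float:
    """Estimate wall-clock time for a file, given an assumed real-time factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return audio_seconds / factor


audio = 53 * 60 + 28  # the 53:28 example file, in seconds
print(round(realtime_factor(audio, 80)))   # tiny: 80 s of processing -> 40
print(round(realtime_factor(audio, 800)))  # large-v3: 800 s of processing -> 4
```

The same division run the other way gives a rough time budget before committing to a larger model, though actual throughput depends on hardware.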
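
The settings listed under Advanced Options correspond to keyword arguments of faster-whisper's `WhisperModel.transcribe`. The sketch below shows roughly how a script could pass them. It is not the contents of `scripts/faster_whisper_test.py`; the helper names `format_timestamp` and `transcribe_file`, the output line format, and the `device="cpu"` / `compute_type="int8"` settings are assumptions made for illustration.

```python
from pathlib import Path


def format_timestamp(seconds: float) -> str:
    """Format a segment offset in seconds as HH:MM:SS."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def transcribe_file(audio_path, model_size="tiny", language=None, output_dir="."):
    """Transcribe one file and write a timestamped .txt into output_dir."""
    # Imported lazily so format_timestamp works without faster-whisper installed.
    from faster_whisper import WhisperModel

    # Assumed device settings; a GPU setup would differ.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, info = model.transcribe(
        str(audio_path),
        language=language,  # None lets the model auto-detect the language
        beam_size=1,
        best_of=1,
        temperature=0,
        vad_filter=True,
        condition_on_previous_text=False,
    )
    out_path = Path(output_dir) / (Path(audio_path).stem + ".txt")
    with out_path.open("w", encoding="utf-8") as fh:
        # segments is a generator; transcription runs as it is consumed.
        for seg in segments:
            fh.write(f"[{format_timestamp(seg.start)}] {seg.text.strip()}\n")
    return out_path, info.language
```

Calling `transcribe_file("lesson.mp3", model_size="base")` would download the model on first use and return the transcript path together with the detected language code.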
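
The batch-processing tip under Performance Tips can be sketched as a small wrapper that runs the script once per file, in order. The flags are the ones listed under Parameters; `build_command` and `transcribe_all` are hypothetical helpers, and the sketch assumes the script exits with a non-zero status on failure, which the source does not confirm.

```python
import subprocess
import sys

SCRIPT = "scripts/faster_whisper_test.py"  # path as given in the usage section


def build_command(audio_path, model="tiny", language=None, output_dir="."):
    """Build the argument list for one transcription run."""
    cmd = [sys.executable, SCRIPT, audio_path,
           "--model", model, "--output-dir", output_dir]
    if language:
        # Omit the flag entirely to leave language auto-detection on.
        cmd += ["--language", language]
    return cmd


def transcribe_all(paths, **options):
    """Run the script once per file, in sequence; return the files that failed."""
    failed = []
    for path in paths:
        result = subprocess.run(build_command(path, **options))
        if result.returncode != 0:  # assumes non-zero exit means failure
            failed.append(path)
    return failed
```

Running files one after another rather than in parallel keeps only one model in memory at a time, which matters for the larger models.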