# mlx-audio

> Generate speech from text (TTS), transcribe audio to text (STT), and run speech-to-speech pipelines using MLX on Apple Silicon. Use when working with audio generation, voice synthesis, speech transcription, voice cloning, or audio processing on a Mac with the MLX framework.

- Author: Michael A
- Repository: IntelliSoins/mlx-audio-skill
- Version: 20260126221724
- Stars: 1
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/IntelliSoins/mlx-audio-skill
- Web: https://mule.run/skillshub/@@IntelliSoins/mlx-audio-skill~mlx-audio:20260126221724

---

---
name: mlx-audio
description: Generate speech from text (TTS), transcribe audio to text (STT), and run speech-to-speech pipelines using MLX on Apple Silicon. Use when working with audio generation, voice synthesis, speech transcription, voice cloning, or audio processing on a Mac with the MLX framework.
---

# MLX-Audio

Audio processing library optimized for Apple Silicon using Apple's MLX framework. Provides TTS (text-to-speech), STT (speech-to-text), and STS (speech-to-speech) capabilities with streaming support, voice cloning, and an OpenAI-compatible REST API.

**Requires**: Python 3.10+, Apple Silicon Mac, ffmpeg (`brew install ffmpeg`)

## Quick Start

### Text-to-Speech (TTS)

```python
from mlx_audio.tts.utils import load_model
from mlx_audio.tts.generate import generate_audio

model = load_model("mlx-community/Kokoro-82M-bf16")
generate_audio(
    model=model,
    text="Hello, world!",
    voice="af_heart",
    speed=1.0,
    lang_code="a",  # Kokoro language code ("a" = American English)
    play=True,
    output_path="./output"
)
```

CLI equivalent:

```bash
python -m mlx_audio.tts.generate \
  --model mlx-community/Kokoro-82M-bf16 \
  --text "Hello, world!" \
  --voice af_heart \
  --play
```

### Speech-to-Text (STT)

```python
from mlx_audio.stt.generate import generate_transcription

result = generate_transcription(
    model="mlx-community/whisper-large-v3-turbo-asr-fp16",
    audio="audio.wav",
    output_path="transcript",
    format="json",
    verbose=True
)
print(result.text)
```

CLI equivalent:

```bash
python -m mlx_audio.stt.generate \
  --model mlx-community/whisper-large-v3-turbo-asr-fp16 \
  --audio audio.wav \
  --output-path transcript
```

### Speech-to-Speech Pipeline

```python
import asyncio

from mlx_audio.sts.voice_pipeline import VoicePipeline

pipeline = VoicePipeline(
    stt_model="mlx-community/whisper-large-v3-turbo-asr-fp16",
    llm_model="Qwen/Qwen2.5-0.5B-Instruct-4bit",
    tts_model="mlx-community/csm-1b-fp16"
)

# start() is awaited in the original snippet, so run it inside an event loop
asyncio.run(pipeline.start())
```

## Task Workflows

### Generate Speech from Text

1. Choose a TTS model based on requirements (see [references/models_and_voices.md](references/models_and_voices.md))
   - **Fast/lightweight**: Kokoro (`mlx-community/Kokoro-82M-bf16`)
   - **Voice cloning**: CSM (`mlx-community/csm-1b-fp16`) or Qwen3-TTS
   - **Multilingual**: Chatterbox (16 languages) or Bark
2. Load the model with `load_model()`
3. Call `generate_audio()` with text, voice, and output parameters
4. Key parameters: `text`, `voice`, `speed`, `lang_code`, `play`, `output_path`, `streaming_interval`

### Transcribe Audio

1. Choose an STT model:
   - **General purpose**: Whisper (`mlx-community/whisper-large-v3-turbo-asr-fp16`)
   - **English-only high accuracy**: Parakeet
   - **With diarization**: VibeVoice-ASR
2. Call `generate_transcription()` with audio path and model
3. Output formats: `txt`, `srt`, `vtt`, `json`

### Convert and Quantize Models

```bash
python -m mlx_audio.convert \
  --hf-path \
  --mlx-path ./output \
  --quantize \
  --q-bits 4
```

Supported quantization: 3-bit, 4-bit, 6-bit, 8-bit. Data types: float16, bfloat16, float32.

### Run the REST API Server

```bash
python -m mlx_audio.server --host 0.0.0.0 --port 8000
```

The server exposes OpenAI-compatible endpoints.
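As a rough illustration of what "OpenAI-compatible" could mean for a client, the sketch below builds (but does not send) a speech request against a locally running server. The `/v1/audio/speech` path and the `model`, `input`, and `voice` fields are assumptions borrowed from OpenAI's speech API convention, and `build_speech_request` is a hypothetical helper, so check both against the server API reference before relying on them.

```python
import json
import urllib.request


def build_speech_request(base_url, model, text, voice):
    """Build (but do not send) a POST request for an OpenAI-style speech endpoint."""
    # Assumption: the path and field names mirror OpenAI's /v1/audio/speech
    # convention; they are not confirmed by this document.
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_speech_request(
    "http://localhost:8000",
    "mlx-community/Kokoro-82M-bf16",
    "Hello, world!",
    "af_heart",
)
print(request.get_method(), request.full_url)
```

If the assumption holds, passing `request` to `urllib.request.urlopen()` while the server is running would return the generated audio in the response body.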
See [references/server_api.md](references/server_api.md) for full API reference.

## Key Modules

| Module | Import Path | Purpose |
|--------|-------------|---------|
| TTS generation | `mlx_audio.tts.generate` | `generate_audio()` |
| TTS model loading | `mlx_audio.tts.utils` | `load_model()` |
| STT generation | `mlx_audio.stt.generate` | `generate_transcription()` |
| STS pipeline | `mlx_audio.sts.voice_pipeline` | `VoicePipeline` |
| Audio I/O | `mlx_audio.audio_io` | Read/write WAV, MP3, FLAC, OGG, M4A |
| DSP utilities | `mlx_audio.dsp` | STFT, mel filterbanks, window functions |
| Server | `mlx_audio.server` | FastAPI OpenAI-compatible API |
| Conversion | `mlx_audio.convert` | Model quantization and format conversion |

## Installation

```bash
pip install mlx-audio             # Core
pip install "mlx-audio[tts]"      # + TTS dependencies
pip install "mlx-audio[stt]"      # + STT dependencies
pip install "mlx-audio[sts]"      # + Speech-to-speech
pip install "mlx-audio[server]"   # + FastAPI server
pip install "mlx-audio[all]"      # Everything
```

For development from source:

```bash
git clone https://github.com/Blaizzy/mlx-audio.git
cd mlx-audio
pip install -e ".[dev]"
```

## Audio Format Support

- **Read**: WAV, MP3, FLAC, OGG/Vorbis, M4A/AAC (auto-detected via magic bytes)
- **Write**: WAV, MP3, FLAC (MP3/FLAC require ffmpeg)

## Resources

- **[references/models_and_voices.md](references/models_and_voices.md)**: Complete list of supported TTS, STT, STS models and Kokoro voice presets
- **[references/server_api.md](references/server_api.md)**: REST API endpoint documentation
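
The magic-byte auto-detection mentioned under Audio Format Support can be illustrated with a small standalone sketch. This is not mlx-audio's implementation: `sniff_audio_format` is a hypothetical name, and the checks use only the widely documented container signatures for the formats listed there.

```python
def sniff_audio_format(header: bytes) -> str:
    """Guess an audio container from the first bytes of a file (illustrative only)."""
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    if header[:4] == b"fLaC":
        return "flac"
    if header[:4] == b"OggS":
        return "ogg"
    if header[4:8] == b"ftyp":
        return "m4a"  # ISO base media file (MP4/M4A family)
    if header[:3] == b"ID3":
        return "mp3"  # ID3v2 tag placed before the audio frames
    # Bare MPEG audio frame: 11 sync bits set and a non-zero layer field.
    # The layer check keeps raw ADTS AAC frames from being reported as MP3.
    if len(header) >= 2 and header[0] == 0xFF \
            and (header[1] & 0xE0) == 0xE0 and (header[1] & 0x06) != 0:
        return "mp3"
    return "unknown"


samples = {
    "wav": b"RIFF\x24\x00\x00\x00WAVEfmt ",
    "flac": b"fLaC\x00\x00\x00\x22",
    "ogg": b"OggS\x00\x02",
    "m4a": b"\x00\x00\x00\x20ftypM4A ",
    "mp3": b"ID3\x04\x00",
}
for expected, data in samples.items():
    print(expected, sniff_audio_format(data))
```

Sniffing the header rather than trusting the file extension is what lets a reader accept a mislabeled file, at the cost of needing one rule per container.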