# doubao-asr

> Transcribes audio to text using Doubao (ByteDance) ASR API. Supports streaming recognition, WAV/MP3 formats, ITN, punctuation, and Chinese/English speech. Use when converting speech to text, transcribing audio files, or ASR tasks. Doubao speech recognition, speech-to-text, audio transcription.

- Author: geekjourneyx
- Repository: geekjourneyx/doubao-skills
- Version: 20260207155250
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/geekjourneyx/doubao-skills
- Web: https://mule.run/skillshub/@@geekjourneyx/doubao-skills~doubao-asr:20260207155250

---

---
name: doubao-asr
description: Transcribes audio to text using Doubao (ByteDance) ASR API. Supports streaming recognition, WAV/MP3 formats, ITN, punctuation, and Chinese/English speech. Use when converting speech to text, transcribing audio files, or ASR tasks. Doubao speech recognition, speech-to-text, audio transcription.
metadata: {"openclaw": {"emoji": "🎧", "homepage": "https://www.volcengine.com/docs/6561/80816", "requires": {"bins": ["python3"], "env": ["DOUBAO_APPID", "DOUBAO_TOKEN", "DOUBAO_CLUSTER"]}, "primaryEnv": "DOUBAO_TOKEN"}}
---

# Doubao ASR (Automatic Speech Recognition)

Transcribes audio files to text using the ByteDance Volcano Engine ASR WebSocket API.

## Prerequisites

Set environment variables:

```bash
export DOUBAO_APPID="your-appid"
export DOUBAO_TOKEN="your-access-token"
export DOUBAO_CLUSTER="your-cluster"
```

Install dependencies:

```bash
pip install websockets
```

## Quick Start

```bash
python {baseDir}/scripts/asr.py --audio_path recording.wav
```

## Parameters

| Parameter | Required | Default | Description |
|-----------|----------|---------|-------------|
| `--audio_path` | Yes | - | Path to audio file |
| `--format` | No | `wav` | Audio format: wav, mp3, raw, ogg |
| `--language` | No | `zh-CN` | Language code |
| `--workflow` | No | `full` | Processing workflow |
| `--show_utterances` | No | `false` | Show detailed utterance info |

## Usage Examples

**Basic transcription:**

```bash
python {baseDir}/scripts/asr.py --audio_path speech.wav
```

**MP3 file:**

```bash
python {baseDir}/scripts/asr.py --audio_path speech.mp3 --format mp3
```

**With all post-processing (ITN, punctuation, smoothing):**

```bash
python {baseDir}/scripts/asr.py \
  --audio_path speech.wav \
  --workflow full
```

**Show detailed utterance information:**

```bash
python {baseDir}/scripts/asr.py \
  --audio_path speech.wav \
  --show_utterances
```

## Workflow Options

| Workflow | Description |
|----------|-------------|
| `default` | Basic recognition only |
| `itn` | Enable ITN (Inverse Text Normalization) |
| `punctuate` | Enable punctuation |
| `smooth` | Enable smoothing |
| `full` | Enable all: ITN + punctuation + smoothing |

For detailed workflow configuration, see [WORKFLOW.md]({baseDir}/WORKFLOW.md).
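
When the script is driven from other Python code, the documented flags can be assembled and validated before anything is sent to the service. The following is a minimal sketch, not part of the skill: `build_asr_command` is a hypothetical helper, the `script` argument stands in for `{baseDir}/scripts/asr.py`, and inferring `--format` from the file extension is an assumption made for illustration. The allowed formats and workflow names are copied from the tables above.

```python
import sys
from pathlib import Path

# Allowed values copied from the Parameters and Workflow Options tables.
FORMATS = {"wav", "mp3", "raw", "ogg"}
WORKFLOWS = {"default", "itn", "punctuate", "smooth", "full"}


def build_asr_command(script, audio_path, workflow="full",
                      language="zh-CN", show_utterances=False):
    """Return an argv list for asr.py (hypothetical helper, not part of the skill)."""
    # Assumption: the file extension names the audio format.
    fmt = Path(audio_path).suffix.lstrip(".").lower()
    if fmt not in FORMATS:
        raise ValueError(f"unsupported audio format: {fmt!r}")
    if workflow not in WORKFLOWS:
        raise ValueError(f"unknown workflow: {workflow!r}")

    cmd = [
        sys.executable, str(script),
        "--audio_path", str(audio_path),
        "--format", fmt,
        "--language", language,
        "--workflow", workflow,
    ]
    if show_utterances:
        # Documented as a bare flag in the usage examples.
        cmd.append("--show_utterances")
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)`. Rejecting an unknown format or workflow locally avoids a round trip that would presumably end in an "Invalid parameters" response.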

## Output Format

```json
{
  "text": "Transcribed text here",
  "utterances": [
    {
      "text": "Sentence text",
      "start_time": 0,
      "end_time": 2500
    }
  ]
}
```

## Error Handling

| Code | Description | Solution |
|------|-------------|----------|
| 1000 | Success | - |
| 1001 | Invalid parameters | Check request format |
| 1002 | Authentication failed | Verify token |
| 1003 | Rate limit exceeded | Reduce request frequency |
| 1010 | Audio too long | Use shorter audio clips |
| 1012 | Invalid audio format | Check audio file format |
| 1013 | Silent audio | No speech detected in audio |
| 1020 | Recognition timeout | Retry the request |

## Audio Requirements

| Property | Requirement |
|----------|-------------|
| Sample Rate | 16000 Hz (default), 8000 Hz supported |
| Channels | Mono (1 channel) |
| Bit Depth | 16 bits |
| Format | WAV, MP3, OGG, RAW PCM |
| Duration | Recommended < 60 seconds |

## API Reference

- Endpoint: `wss://openspeech.bytedance.com/api/v2/asr`
- Protocol: WebSocket with binary frames
- Authentication: `Authorization: Bearer; {token}`
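
The limits in the Audio Requirements table can be checked locally for WAV input before a file is sent to the endpoint. The sketch below uses only the standard-library `wave` module. `check_wav` is a hypothetical helper that is not part of the skill, and treating 60 seconds as a hard cut-off is an assumption, since the table only gives it as a recommendation.

```python
import wave

# Limits taken from the Audio Requirements table.
ALLOWED_RATES = (16000, 8000)
MAX_SECONDS = 60.0  # assumption: the recommended limit is treated as a hard one


def check_wav(path):
    """Return a list of problems found in a WAV file; an empty list means none."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
        seconds = wav.getnframes() / rate if rate else 0.0

    problems = []
    if rate not in ALLOWED_RATES:
        problems.append(f"sample rate {rate} Hz (expected 16000 or 8000)")
    if channels != 1:
        problems.append(f"{channels} channels (expected mono)")
    if bits != 16:
        problems.append(f"{bits}-bit samples (expected 16-bit)")
    if seconds >= MAX_SECONDS:
        problems.append(f"duration {seconds:.1f} s (recommended under 60 s)")
    return problems
```

A file that returns a non-empty list is a candidate for resampling, downmixing, or splitting before transcription. MP3, OGG, and raw PCM inputs are not covered by this check, because `wave` reads only WAV containers.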