# qwen-voice

> Use Qwen (DashScope/百炼) for speech tasks: (1) ASR speech-to-text transcription of user audio/voice messages (Telegram .ogg opus, wav, mp3) using qwen3-asr-flash, optionally with coarse timestamps via chunking; (2) TTS text-to-speech voice reply using qwen3-tts-flash with selectable voice (default Cherry) and output as .ogg voice note for Telegram.

- Author: Elliot
- Repository: ada20204/qwen-voice
- Version: 20260131140029
- Stars: 2
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/ada20204/qwen-voice
- Web: https://mule.run/skillshub/@@ada20204/qwen-voice~qwen-voice:20260131140029

---

---
name: qwen-voice
description: "Use Qwen (DashScope/百炼) for speech tasks: (1) ASR speech-to-text transcription of user audio/voice messages (Telegram .ogg opus, wav, mp3) using qwen3-asr-flash, optionally with coarse timestamps via chunking; (2) TTS text-to-speech voice reply using qwen3-tts-flash with selectable voice (default Cherry) and output as .ogg voice note for Telegram."
---

# Qwen Voice (ASR + TTS)

Use the bundled scripts. Set `DASHSCOPE_API_KEY` in one of the following files:
- `~/.config/qwen-voice/.env` (recommended)
- `<repo>/.qwen-voice/.env` (dev/testing)
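
As a rough illustration of how a script might resolve the key from these locations, here is a minimal sketch. The function names, the search order (exported variable first, then the user config, then the repo file), and the assumption that `<repo>` is the current working directory are all hypothetical; the bundled scripts may behave differently.

```python
import os
from pathlib import Path
from typing import Optional

# Candidate locations named in this README; the search order is an assumption.
ENV_FILES = [
    Path.home() / ".config" / "qwen-voice" / ".env",
    Path.cwd() / ".qwen-voice" / ".env",  # assumes the repo root is the current directory
]


def read_env_value(path: Path, key: str) -> Optional[str]:
    """Return the value for `key` from a KEY=VALUE file, or None if absent."""
    if not path.is_file():
        return None
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            # Drop surrounding whitespace and optional quotes.
            return value.strip().strip('"').strip("'")
    return None


def find_api_key(files=ENV_FILES, key: str = "DASHSCOPE_API_KEY") -> Optional[str]:
    """Prefer an already-exported variable, then fall back to the .env files."""
    if os.environ.get(key):
        return os.environ[key]
    for path in files:
        value = read_env_value(path, key)
        if value:
            return value
    return None
```

A `.env` file for this sketch holds a single line such as `DASHSCOPE_API_KEY=sk-...`; `find_api_key()` returns `None` when no location provides a value.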

## ASR (speech → text)

### Without timestamps (default)

```bash
python3 skills/qwen-voice/scripts/qwen_asr.py --in /path/to/audio.ogg
```

### With timestamps (chunk-based)

```bash
python3 skills/qwen-voice/scripts/qwen_asr.py --in /path/to/audio.ogg --timestamps --chunk-sec 3
```

Notes:
- Timestamps come from fixed-length chunking, not word-level alignment, so their resolution is limited to the `--chunk-sec` window length.
- Input audio is converted to mono 16kHz WAV before sending.
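
The two notes above can be sketched as follows. This is an illustration only: it assumes `ffmpeg` does the conversion (the README does not name the tool), and `to_wav_argv`, `chunk_ranges`, and `fmt` are hypothetical helpers, not functions from `qwen_asr.py`.

```python
import math
from typing import List, Tuple


def to_wav_argv(src: str, dst: str) -> List[str]:
    """ffmpeg arguments for mono (-ac 1), 16 kHz (-ar 16000) output; -y overwrites dst."""
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000", dst]


def chunk_ranges(duration_sec: float, chunk_sec: float) -> List[Tuple[float, float]]:
    """Split [0, duration_sec] into consecutive windows of at most chunk_sec."""
    if chunk_sec <= 0:
        raise ValueError("chunk_sec must be positive")
    count = math.ceil(duration_sec / chunk_sec)
    # The last window is clipped to the end of the audio.
    return [(i * chunk_sec, min((i + 1) * chunk_sec, duration_sec)) for i in range(count)]


def fmt(sec: float) -> str:
    """Format seconds as MM:SS.mmm using integer milliseconds to avoid rounding to 60 s."""
    total_ms = round(sec * 1000)
    minutes, rem_ms = divmod(total_ms, 60_000)
    seconds, millis = divmod(rem_ms, 1000)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


if __name__ == "__main__":
    # A 7.5 s clip with --chunk-sec 3 yields three windows.
    for start, end in chunk_ranges(7.5, 3):
        print(f"[{fmt(start)} - {fmt(end)}]")
    # prints:
    # [00:00.000 - 00:03.000]
    # [00:03.000 - 00:06.000]
    # [00:06.000 - 00:07.500]
```

Each window would be transcribed separately and labeled with its start and end, which is why a smaller `--chunk-sec` gives finer timestamps at the cost of more requests and less context per chunk.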

## TTS (text → speech)

### Preset voice (default: Cherry)

```bash
python3 skills/qwen-voice/scripts/qwen_tts.py --text '你好，我是 Pi。' --voice Cherry --out /tmp/out.ogg
```

### Clone voice (create once, reuse)

1) Create a voice profile from an audio sample:

```bash
python3 skills/qwen-voice/scripts/qwen_voice_clone.py --in ./voice_sample.ogg --name george --out work/qwen-voice/george.voice.json
```

2) Use the cloned voice to synthesize:

```bash
python3 skills/qwen-voice/scripts/qwen_tts.py --text '你好，我是 George。' --voice-profile work/qwen-voice/george.voice.json --out /tmp/out.ogg
```

Notes:
- `.ogg` output is Opus, suitable for Telegram voice messages.
- Voice cloning uses the DashScope customization endpoint together with a Qwen realtime TTS model.
- Scripts use a local venv at `work/venv-dashscope` (auto-created on first run).
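
For readers curious how an auto-created venv can work, here is a minimal sketch using the standard-library `venv` module. `ensure_venv` is a hypothetical helper; the README does not describe how the bundled scripts actually bootstrap `work/venv-dashscope`.

```python
import os
import venv
from pathlib import Path


def ensure_venv(env_dir: Path, with_pip: bool = True) -> Path:
    """Create the virtual environment on first use and return its interpreter path."""
    # pyvenv.cfg is written by venv.create, so its presence marks an existing env.
    if not (env_dir / "pyvenv.cfg").is_file():
        venv.create(env_dir, with_pip=with_pip)
    bin_dir = "Scripts" if os.name == "nt" else "bin"
    exe = "python.exe" if os.name == "nt" else "python"
    return env_dir / bin_dir / exe
```

Under these assumptions, a script would call `ensure_venv(Path("work/venv-dashscope"))`, install its dependencies with the returned interpreter (for example `python -m pip install dashscope`), and then re-run itself inside the environment.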

## Typical chat workflow

- When the user sends a voice message or audio file: run ASR and reply with the transcribed text.
- When the user explicitly asks for a voice reply: run TTS and send the generated `.ogg` as a voice note.
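
The workflow above can be sketched as a small dispatcher that picks which script to run. `plan_command` and its parameters are hypothetical; only the script paths and the `--in`, `--text`, `--voice`, and `--out` flags come from this README, and deciding whether the user "explicitly asks" for voice is left to the caller.

```python
from typing import List, Optional

ASR_SCRIPT = "skills/qwen-voice/scripts/qwen_asr.py"
TTS_SCRIPT = "skills/qwen-voice/scripts/qwen_tts.py"


def plan_command(audio_path: Optional[str] = None,
                 reply_text: Optional[str] = None,
                 wants_voice: bool = False,
                 voice: str = "Cherry",
                 out_path: str = "/tmp/out.ogg") -> Optional[List[str]]:
    """Return the argv to run for this turn, or None for a plain text reply."""
    if audio_path:
        # Incoming voice message or audio file: transcribe it.
        return ["python3", ASR_SCRIPT, "--in", audio_path]
    if wants_voice and reply_text:
        # Explicit request for a voice reply: synthesize an .ogg voice note.
        return ["python3", TTS_SCRIPT, "--text", reply_text,
                "--voice", voice, "--out", out_path]
    return None
```

The returned list can be passed to `subprocess.run(argv, check=True)`; passing arguments as a list avoids shell quoting problems with user-provided text.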