---
name: tts-broadcast-injection-issues
description: |
  Avoid broadcast speaker injection (injecting embeddings to ALL codec positions)
  in multi-speaker TTS training. Use when: (1) training codec-based TTS models with
  multiple speakers, (2) the model generates to max_new_tokens instead of stopping
  at EOS, (3) audio duration is excessively long (e.g., 163 seconds for short text).
  Fix: use single-position injection (position-6), which preserves EOS detection.
  More speaker conditioning is NOT always better.
author: Claude Code
version: 1.0.0
date: 2025-01-24
---

# TTS Broadcast Speaker Injection Issues

## Problem

Broadcast speaker injection (injecting speaker embeddings at ALL codec positions) seems intuitive for multi-speaker TTS training, but it breaks the model's ability to detect the end-of-sequence (EOS) token, causing it to generate until hitting `max_new_tokens` instead of stopping naturally.

## Context / Trigger Conditions

**When this issue occurs:**

- Training multi-speaker TTS models with codec-based architectures
- Using broadcast injection: `input_codec_embedding = input_codec_embedding + speaker_embedding.unsqueeze(1) * mask`
- During inference, audio generates for exactly `max_new_tokens / codec_frame_rate` seconds
- For a 12.5 Hz codec with `max_new_tokens=2048`: always generates ~163.6 seconds regardless of text length

**Symptoms:**

- Audio files are consistently 10-25x larger than expected
- All samples from a speaker have identical duration (e.g., exactly 163.60s)
- Training loss is higher than baseline
- Model ignores the EOS token and generates to the max limit

**Root Cause:**

Broadcasting speaker information across ALL codec positions disrupts the model's learned relationship between codec embeddings and EOS detection. The model can no longer learn when to stop generating.

## Solution

**Use single-position injection (position-6) instead:**

```python
# ❌ BROKEN: Broadcast injection breaks EOS detection
codec_mask_expanded = codec_mask.unsqueeze(-1).expand_as(input_codec_embedding)
input_codec_embedding = input_codec_embedding + speaker_embedding.unsqueeze(1) * codec_mask_expanded

# ✅ CORRECT: Single-position injection preserves EOS detection
input_codec_embedding[:, 6, :] = speaker_embedding
```

**Why position-6 works:**

- Preserves the codec embedding structure
- Allows the model to learn proper EOS behavior
- Provides sufficient speaker conditioning without disrupting generation
- Lower training loss plus correct generation behavior
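To make the placement concrete, here is a minimal, self-contained sketch of single-position injection inside a forward pass. The module names, dimensions, and the 192-dim speaker vector are illustrative assumptions, not this repository's actual code; only the indexing pattern (`x[:, 6, :] = spk`) is the point:

```python
import torch
import torch.nn as nn

# Illustrative sizes only; real values come from your model/codec config.
BATCH, SEQ_LEN, DIM = 2, 64, 512
CODEC_VOCAB = 1024
SPEAKER_DIM = 192          # e.g., an x-vector / ECAPA-style speaker embedding
INJECT_POS = 6             # the single codec position that carries speaker identity

codec_embed = nn.Embedding(CODEC_VOCAB, DIM)   # stand-in codec embedding table
speaker_proj = nn.Linear(SPEAKER_DIM, DIM)     # project speaker vector to model dim

codec_tokens = torch.randint(0, CODEC_VOCAB, (BATCH, SEQ_LEN))
speaker_vec = torch.randn(BATCH, SPEAKER_DIM)  # precomputed per-utterance speaker embedding

x = codec_embed(codec_tokens)                  # (BATCH, SEQ_LEN, DIM)
spk = speaker_proj(speaker_vec)                # (BATCH, DIM)

# Single-position injection: overwrite exactly one position, leaving every
# other codec embedding (and the structure the model uses to learn EOS) intact.
x[:, INJECT_POS, :] = spk

# The broadcast variant perturbs every position and is what breaks EOS:
#   x = x + spk.unsqueeze(1)                   # ❌ avoid

print(x.shape)  # torch.Size([2, 64, 512])
```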
## Verification

After switching to position-6 injection:

1. Training loss should decrease (e.g., 11.40 vs 12.74 for broadcast)
2. Generated audio duration should match the text length (6-20s, not 163s)
3. The model should stop at the EOS token, not at `max_new_tokens`
4. Audio files should vary in size, not be identical

**Test command:**

```python
import soundfile as sf

audio, sr = sf.read("output.wav")
duration = len(audio) / sr
print(f"{duration:.1f}s")  # should be 5-30 seconds for normal speech, NOT 163+ seconds
```

## Example

**Training multi-speaker Malay TTS with 11,552 samples:**

| Strategy   | Training Loss | Generation      | Verdict     |
|------------|---------------|-----------------|-------------|
| Position-6 | 11.40         | 6-19s (correct) | ✅ Use this |
| Broadcast  | 12.74         | 163.6s (broken) | ❌ Avoid    |

**Sample results:**

```
norzaihan_1_position6.wav: 15.3s, 736 KB ✅
norzaihan_1_broadcast.wav: 163.6s, 7.8 MB ❌ (10x too long)
```

## Notes

**Counterintuitive result:** more speaker conditioning (broadcast) is WORSE than minimal conditioning (position-6). This is because:

1. Codec embeddings encode BOTH audio features AND structural information
2. Overwriting all positions destroys the structural cues needed for EOS detection
3. Single-position injection provides speaker identity without breaking that structure

**Related approaches that work:**

- Cross-attention conditioning (speaker embeddings via attention layers)
- Encoder-based injection (concatenating with encoder outputs)
- Single-token injection (position-6 or a similar fixed position)

**When to use each:**

- **Codec-based models**: use position-6 or single-token injection
- **Attention-based models**: cross-attention conditioning may work better
- **Flow-based models**: decoder input conditioning is standard

**Detection pattern:** if you see `Setting pad_token_id to eos_token_id` warnings combined with every audio file having an identical, maximum duration, you are likely hitting this issue (a batch scanning sketch follows the references below).

## References

- [FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect TTS](https://arxiv.org/html/2505.14351v1) - Multi-speaker training framework
- [Koel-TTS: LLM-based Speech Generation](https://aclanthology.org/2025.emnlp-main.1076.pdf) - Cross-attention speaker conditioning
- [YourTTS: Zero-Shot Multi-Speaker TTS](https://proceedings.mlr.press/v162/casanova22a/casanova22a.pdf) - Flow-based decoder conditioning
- [Deep Voice 2: Multi-Speaker Neural TTS](http://papers.neurips.cc/paper/6889-deep-voice-2-multi-speaker-neural-text-to-speech.pdf) - Low-dimensional speaker embeddings
- [Neural Codec Language Models are Zero-Shot TTS](https://www.researchgate.net/publication/388058656_Neural_Codec_Language_Models_are_Zero-Shot_Text_to_Speech_Synthesizers) - Codec-based TTS architecture
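For the detection pattern above, a hedged sketch that scans a directory of generated files and flags any pinned at the generation ceiling. The `outputs/` path, codec frame rate, and token budget are assumptions to adapt to your setup:

```python
import glob
import soundfile as sf

# Assumed values; substitute your codec frame rate and generation budget.
CODEC_HZ = 12.5
MAX_NEW_TOKENS = 2048
CEILING_S = MAX_NEW_TOKENS / CODEC_HZ   # ~163.8 s hard ceiling

for path in sorted(glob.glob("outputs/*.wav")):
    info = sf.info(path)
    duration = info.frames / info.samplerate
    # Durations within ~1 s of the ceiling mean generation hit max_new_tokens
    # instead of an EOS token -- the broadcast-injection signature.
    flag = "PINNED (EOS likely broken)" if duration >= CEILING_S - 1.0 else "ok"
    print(f"{path}: {duration:.1f}s [{flag}]")
```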