# align-captions > Align script text to audio with karaoke-style word timestamps using Qwen3-ForcedAligner + jieba - Author: Jiao Dong - Repository: Nuva-Lab/vibecut - Version: 20260131095536 - Stars: 0 - Forks: 0 - Last Updated: 2026-02-06 - Source: https://github.com/Nuva-Lab/vibecut - Web: https://mule.run/skillshub/@@Nuva-Lab/vibecut~align-captions:20260131095536 --- --- name: align-captions description: Align script text to audio with karaoke-style word timestamps using Qwen3-ForcedAligner + jieba --- # align-captions Align existing script text to audio for karaoke-style captions. Uses Qwen3-ForcedAligner-0.6B for ~30ms timestamp precision and jieba for proper Chinese word segmentation. ## Pipeline ``` Script + Audio ↓ Qwen3-ForcedAligner (character-level timestamps) ↓ Jieba word segmentation (characters → Chinese words) ↓ Position-based phrase matching (words → phrases) ↓ Output: phrases with embedded word timestamps ``` ## Usage ```bash # Align script to audio (phrase-level output with word timestamps) python skills/align-captions/align.py voiceover.wav --script "当全世界都在追AI的时候..." # Save to file python skills/align-captions/align.py voiceover.wav --script "..." --output captions.json # Word-level only (no phrase grouping) python skills/align-captions/align.py voiceover.wav --script "..." --word-level ``` ## Output Format Designed for Remotion karaoke rendering: ```json { "segments": [ { "text": "当全世界都在追AI的时候,", "startMs": 240, "endMs": 2080, "words": [ {"text": "当", "startMs": 240, "endMs": 400}, {"text": "全世界", "startMs": 400, "endMs": 880}, {"text": "都", "startMs": 880, "endMs": 960}, {"text": "在", "startMs": 960, "endMs": 1120}, {"text": "追", "startMs": 1120, "endMs": 1280}, {"text": "AI", "startMs": 1280, "endMs": 1600}, {"text": "的", "startMs": 1600, "endMs": 1680}, {"text": "时候", "startMs": 1680, "endMs": 2080} ] } ], "word_segments": [...], // All words flat "language": "Chinese", "model": "Qwen3-ForcedAligner-0.6B" } ``` ## Key Feature: Chinese Word Segmentation Uses jieba to group characters into proper Chinese words: | Without jieba | With jieba | |---------------|------------| | 当-全-世-界 | 当-全世界 | | (too fast) | (natural pace) | This is critical for karaoke-style captions to feel natural. ## Programmatic Usage ```python from align import align_captions # Get phrases with embedded word timestamps result = align_captions( "voiceover.wav", script="当全世界都在追AI的时候...", language="Chinese" ) # Each phrase has a 'words' array for karaoke highlighting for phrase in result["segments"]: print(f"{phrase['text']}: {len(phrase['words'])} words") ``` ## Integration with make-video The `make_video.py` script automatically uses align-captions when: 1. A voiceover file exists 2. A script is provided in project.json 3. `caption_mode` is "auto" or "asr" The output is passed to Remotion's `RollingCaption` component for karaoke rendering. ## Notes - First run downloads Qwen3-ForcedAligner (~1GB) - Runs on CPU by default (quality first) - ~30ms timestamp precision (SOTA) - Supports 11 languages for alignment: Chinese, English, Cantonese, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish - Chinese word segmentation uses jieba (must be installed)