# video-to-markdown

> Converts video files (mp4, mkv, webm, avi) to Markdown documents using speech recognition.
Extracts audio with ffmpeg, transcribes with OpenAI Whisper, and generates structured Markdown
with timestamps. Use when user wants to transcribe video, convert video to text, generate
video transcript, or create documentation from video content.

- Author: nblog
- Repository: nblog/video2doc
- Version: 20260129171015
- Last Updated: 2026-02-06
- Source: https://github.com/nblog/video2doc
- Web: https://mule.run/skillshub/@@nblog/video2doc~video-to-markdown:20260129171015

---

---
name: video-to-markdown
description: |
  Converts video files (mp4, mkv, webm, avi) to Markdown documents using speech recognition.
  Extracts audio with ffmpeg, transcribes with OpenAI Whisper, and generates structured Markdown
  with timestamps. Use when user wants to transcribe video, convert video to text, generate
  video transcript, or create documentation from video content.
compatibility: Requires ffmpeg, Python 3.12+, CUDA GPU recommended for faster processing
metadata:
  author: video2doc
  version: "1.0"
---

# Video to Markdown Conversion

Convert video files to structured Markdown documents with timestamps using speech recognition.

## When to Use

- User wants to transcribe a video file
- User needs to convert video content to text documentation
- User wants to create meeting notes from recorded video
- User asks to extract speech/dialogue from video

## Workflow

```
Video (mp4/mkv/webm/avi)
    ↓ [ffmpeg - audio extraction]
Audio (16kHz mono WAV)
    ↓ [Whisper - speech recognition]
Transcription with timestamps
    ↓ [formatting]
Markdown document
```
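The pipeline above can be sketched in Python as follows. This is a minimal sketch, not the actual `main.py`: the function names `build_ffmpeg_command` and `transcribe` are hypothetical, and the default model `small` is an assumption. The ffmpeg flags produce the 16 kHz mono WAV shown in the diagram, and `whisper.load_model` / `model.transcribe` are the standard openai-whisper calls.

```python
from pathlib import Path


def build_ffmpeg_command(video: Path, wav: Path) -> list[str]:
    """Build an ffmpeg command that extracts 16 kHz mono PCM audio."""
    return [
        "ffmpeg",
        "-y",                  # overwrite the output file if it exists
        "-i", str(video),      # input video
        "-vn",                 # drop the video stream
        "-ac", "1",            # one audio channel (mono)
        "-ar", "16000",        # 16 kHz sample rate
        "-c:a", "pcm_s16le",   # 16-bit PCM, i.e. plain WAV
        str(wav),
    ]


def transcribe(video: Path, model_name: str = "small") -> list[dict]:
    """Extract audio, run Whisper, and return the timestamped segments."""
    import subprocess
    import whisper  # imported lazily: only needed when actually transcribing

    wav = video.with_suffix(".wav")
    subprocess.run(build_ffmpeg_command(video, wav), check=True)
    model = whisper.load_model(model_name)
    result = model.transcribe(str(wav))
    # Each segment is a dict with "start", "end" (seconds) and "text".
    return result["segments"]
```

Keeping the command builder separate from the subprocess call makes the ffmpeg arguments easy to inspect without running anything.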

## Prerequisites

> **Tip: Environment Setup**
> - **uv**: See [uv installation](https://docs.astral.sh/uv/getting-started/installation/#standalone-installer) for quickstart

1. **ffmpeg** - For audio extraction
2. **Python 3.12+** with the uv package manager
3. **CUDA GPU** (recommended) - For faster transcription
4. **openai-whisper** package

## Step-by-Step Instructions

### 1. Check Environment

```bash
# Verify ffmpeg
ffmpeg -version

# Verify GPU (if available)
nvidia-smi --query-gpu=name,memory.total --format=csv
```
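The same checks can be scripted from Python. The sketch below only tests whether the two executables are on `PATH`; it does not confirm that the GPU is usable. The helper name `check_environment` is hypothetical.

```python
import shutil


def check_environment() -> dict[str, bool]:
    """Report whether the required command-line tools are on PATH."""
    return {
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "nvidia-smi": shutil.which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    for tool, found in check_environment().items():
        print(f"{tool}: {'found' if found else 'missing'}")
```

A missing `nvidia-smi` is not fatal, since the GPU is only recommended; a missing `ffmpeg` is.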

### 2. Setup Project

```bash
# Initialize uv project
uv init --python 3.12

# Install Whisper and PyTorch
# For a CUDA build, add the pytorch-cu126 index to pyproject.toml first
uv add openai-whisper torch
```

See [pyproject.toml template](references/pyproject-template.toml) for CUDA configuration.

### 3. Run Conversion

```bash
uv run python main.py "video.mp4" -l zh -m large-v3
```

### 4. Output Format

The generated Markdown includes:
- Document header with metadata (generation time, duration, language)
- Transcribed content with timestamps `[HH:MM:SS → HH:MM:SS]`
- Timeline table appendix
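As an illustration of the timestamp format, the sketch below converts Whisper's segment times (seconds as floats) into `[HH:MM:SS → HH:MM:SS]` lines. The helper names and the exact line layout are assumptions; the real output of `main.py` may differ in detail.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS (fractions are truncated)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_segment(start: float, end: float, text: str) -> str:
    """Render one transcript segment as a timestamped Markdown line."""
    return f"[{format_timestamp(start)} → {format_timestamp(end)}] {text.strip()}"


print(render_segment(0.0, 4.2, " Hello and welcome. "))
# prints: [00:00:00 → 00:00:04] Hello and welcome.
```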

## Model Selection Guide

| Model | VRAM | Speed | Accuracy | Recommended For |
|-------|------|-------|----------|-----------------|
| tiny | ~1GB | ★★★★★ | ★ | Quick previews |
| base | ~1GB | ★★★★ | ★★ | Draft transcripts |
| small | ~2GB | ★★★ | ★★★ | General use |
| medium | ~5GB | ★★ | ★★★★ | Good quality |
| large-v3 | ~10GB | ★ | ★★★★★ | Best accuracy |
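The table can also be applied programmatically. The sketch below picks the most accurate model whose approximate VRAM requirement fits the available memory; the function name `pick_model` is hypothetical, and the thresholds are the rough figures from the table, with no headroom added.

```python
# Approximate VRAM requirements in GB, ordered from most to least accurate.
MODEL_VRAM_GB = [
    ("large-v3", 10),
    ("medium", 5),
    ("small", 2),
    ("base", 1),
    ("tiny", 1),
]


def pick_model(available_vram_gb: float) -> str:
    """Return the most accurate model that fits in the given VRAM."""
    for name, required_gb in MODEL_VRAM_GB:
        if available_vram_gb >= required_gb:
            return name
    return "tiny"  # smallest model as a last resort (may run on CPU)
```

For example, `pick_model(6)` returns `"medium"`, since `large-v3` needs roughly 10 GB.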

## Language Codes

Common codes for the `-l` parameter:
- `zh` - Chinese
- `en` - English
- `ja` - Japanese
- `ko` - Korean
- `auto` - Auto-detect (default)
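Whisper's `transcribe` detects the language itself when its `language` argument is `None`. One plausible way to handle the `auto` value is therefore to map it to `None` before calling Whisper. This is a sketch under that assumption, with a hypothetical helper name; it is not confirmed to be how `main.py` does it.

```python
from typing import Optional


def whisper_language(code: str) -> Optional[str]:
    """Map a CLI language code to the value passed to Whisper.

    "auto" becomes None, which asks Whisper to detect the language.
    """
    normalized = code.strip().lower()
    return None if normalized == "auto" else normalized
```

The result would then be passed as `model.transcribe(path, language=whisper_language(args.language))`.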

## Troubleshooting

### CUDA Not Available

If PyTorch shows `CUDA available: False`:

1. Check the CUDA installation: `echo $env:CUDA_PATH` (PowerShell syntax)
2. Reinstall torch with the CUDA index configured in `pyproject.toml`
3. Delete `uv.lock` and run `uv sync`
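To tell a CPU-only PyTorch build apart from a driver problem, a short diagnostic like the one below can help. It is a sketch with a hypothetical function name; `torch.version.cuda` is `None` when the installed wheel was built without CUDA, which points to step 2 above.

```python
def cuda_report() -> str:
    """Summarize whether PyTorch can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"

    if torch.cuda.is_available():
        return f"CUDA available: True ({torch.cuda.get_device_name(0)})"

    # torch.version.cuda is None for CPU-only builds of PyTorch.
    return (
        f"CUDA available: False "
        f"(torch {torch.__version__}, built for CUDA {torch.version.cuda})"
    )


if __name__ == "__main__":
    print(cuda_report())
```

Run it inside the project environment, for example with `uv run python`, so that it inspects the same torch build the converter uses.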

### Triton Warning on Windows

```
UserWarning: Failed to launch Triton kernels...
```

This is expected on Windows, because Triton officially supports only Linux. The warning does not affect transcription quality.

### Model Download Fails

If the SHA256 checksum verification fails:
1. Delete the corrupted model file: `~/.cache/whisper/<model>.pt`
2. Retry on a stable network connection
3. Consider using a smaller model first
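Step 1 can be automated when the expected hash is known. The sketch below hashes a file in chunks and deletes it on mismatch; both helper names are hypothetical, and the expected SHA256 value has to be supplied by the caller.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def remove_if_corrupt(path: Path, expected_sha256: str) -> bool:
    """Delete the file if its hash differs; return True if it was removed."""
    if path.exists() and sha256_of(path) != expected_sha256:
        path.unlink()
        return True
    return False
```

Reading in chunks keeps memory use flat, which matters for the multi-gigabyte large models.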

## Example Output

See [example output](references/example-output.md) for sample Markdown structure.

## CLI Reference

```
usage: main.py [-h] [-o OUTPUT] [-m MODEL] [-l LANGUAGE] video

positional arguments:
  video                 Input video file path

options:
  -o, --output          Output Markdown file path (default: same as video)
  -m, --model           Whisper model (tiny/base/small/medium/large-v3)
  -l, --language        Language code (zh/en/ja/ko or auto)
```
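An `argparse` definition consistent with the usage text above might look like the following. It is a sketch, not the actual `main.py`: the default model `small` is an assumption, since the reference does not state one, while the `auto` language default comes from the Language Codes section.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented command-line interface."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("video", help="Input video file path")
    parser.add_argument("-o", "--output", help="Output Markdown file path")
    parser.add_argument(
        "-m", "--model",
        default="small",  # assumed default; not stated in the reference
        choices=["tiny", "base", "small", "medium", "large-v3"],
        help="Whisper model",
    )
    parser.add_argument(
        "-l", "--language",
        default="auto",
        help="Language code (zh/en/ja/ko or auto)",
    )
    return parser


args = build_parser().parse_args(["video.mp4", "-l", "zh", "-m", "large-v3"])
print(args.video, args.language, args.model)
# prints: video.mp4 zh large-v3
```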