# convert-to-markdown > Convert documents and files to Markdown using markitdown with Windows/WSL path handling. Supports PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx, .xls), HTML, CSV, JSON, XML, images (with EXIF/OCR), audio (with transcription), ZIP archives, YouTube URLs, or EPubs. Use when converting files to markdown, processing Confluence exports, handling Windows/WSL path conversions, extracting images from PDFs, or working with markitdown utility. - Author: Your Name - Repository: nguyendinhquocx/code-ai - Version: 20260127194234 - Stars: 2 - Forks: 0 - Last Updated: 2026-02-06 - Source: https://github.com/nguyendinhquocx/code-ai - Web: https://mule.run/skillshub/@@nguyendinhquocx/code-ai~convert-to-markdown:20260127194234 --- --- name: convert-to-markdown description: Convert documents and files to Markdown using markitdown with Windows/WSL path handling. Supports PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx, .xls), HTML, CSV, JSON, XML, images (with EXIF/OCR), audio (with transcription), ZIP archives, YouTube URLs, or EPubs. Use when converting files to markdown, processing Confluence exports, handling Windows/WSL path conversions, extracting images from PDFs, or working with markitdown utility. description_vi: Chuyển đổi tài liệu và file sang Markdown bằng markitdown với hỗ trợ đường dẫn Windows/WSL. Hỗ trợ PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx, .xls), HTML, CSV, JSON, XML, images (EXIF/OCR), audio (transcription), ZIP archives, YouTube URLs, hoặc EPubs. Dùng khi chuyển file sang markdown, xử lý Confluence exports, chuyển đổi path Windows/WSL, trích xuất images từ PDFs, hoặc làm việc với markitdown. keywords_vi: [markdown, convert, markitdown, pdf to markdown, word to markdown, chuyển đổi tài liệu, exif, ocr, transcription, windows, wsl] --- # Markdown Tools Convert documents to markdown using `markitdown` with support for multiple formats, image extraction, and Windows/WSL path handling. ## Quick Start ### Installation Options **Option 1: uvx (no installation required)** ```bash # Run directly without installing uvx markitdown input.pdf -o output.md ``` **Option 2: uv tool install (recommended for PDF support)** ```bash # Install with PDF support uv tool install "markitdown[pdf]" # Or via pip pip install "markitdown[pdf]" # Then use directly markitdown "document.pdf" -o output.md ``` ## Supported Formats - **Documents**: PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx, .xls) - **Web/Data**: HTML, CSV, JSON, XML - **Media**: Images (EXIF + OCR), Audio (EXIF + transcription) - **Other**: ZIP (iterates contents), YouTube URLs, EPub ## Basic Usage ### Using uvx (no install) ```bash # Convert to stdout uvx markitdown input.pdf # Save to file uvx markitdown input.pdf -o output.md uvx markitdown input.docx > output.md # From stdin cat input.pdf | uvx markitdown ``` ### Using installed markitdown ```bash # Basic conversion markitdown "document.pdf" -o output.md # Redirect output markitdown "document.pdf" > output.md ``` ## Command Options ```bash -o OUTPUT # Output file -x EXTENSION # Hint file extension (for stdin) -m MIME_TYPE # Hint MIME type -c CHARSET # Hint charset (e.g., UTF-8) -d # Use Azure Document Intelligence -e ENDPOINT # Document Intelligence endpoint --use-plugins # Enable 3rd-party plugins --list-plugins # Show installed plugins ``` ## PDF Conversion with Images markitdown extracts text only. For PDFs with images, use this workflow: ### Step 1: Convert Text ```bash markitdown "document.pdf" -o output.md ``` ### Step 2: Extract Images ```bash # Create assets directory alongside the markdown mkdir -p assets # Extract images using PyMuPDF uv run --with pymupdf python scripts/extract_pdf_images.py "document.pdf" ./assets ``` ### Step 3: Add Image References Insert image references in the markdown where needed: ```markdown ![Description](assets/img_page1_1.png) ``` ### Step 4: Format Cleanup markitdown output often needs manual fixes: - Add proper heading levels (`#`, `##`, `###`) - Reconstruct tables in markdown format - Fix broken line breaks - Restore indentation structure ## Path Conversion (Windows/WSL) ```bash # Windows → WSL conversion C:\Users\name\file.pdf → /mnt/c/Users/name/file.pdf # Use helper script python scripts/convert_path.py "C:\Users\name\Documents\file.pdf" ``` ## Advanced Examples ### Convert Word document ```bash uvx markitdown report.docx -o report.md ``` ### Convert Excel spreadsheet ```bash uvx markitdown data.xlsx > data.md ``` ### Convert PowerPoint presentation ```bash uvx markitdown slides.pptx -o slides.md ``` ### Convert with file type hint (for stdin) ```bash cat document | uvx markitdown -x .pdf > output.md ``` ### Use Azure Document Intelligence for better PDF extraction ```bash uvx markitdown scan.pdf -d -e "https://your-resource.cognitiveservices.azure.com/" ``` ## Common Issues **"dependencies needed to read .pdf files"** ```bash # Install with PDF support uv tool install "markitdown[pdf]" --force ``` **FontBBox warnings during PDF conversion** - These are harmless font parsing warnings, output is still correct **Images missing from output** - Use `scripts/extract_pdf_images.py` to extract images separately ## Notes - Output preserves document structure: headings, tables, lists, links - First run caches dependencies; subsequent runs are faster - For complex PDFs with poor extraction, use `-d` with Azure Document Intelligence - Works on Windows, WSL, macOS, and Linux ## Resources - `scripts/extract_pdf_images.py` - Extract images from PDF using PyMuPDF - `scripts/convert_path.py` - Windows to WSL path converter - `references/conversion-examples.md` - Detailed examples for batch operations