# pdf-to-markdown > Convert PDF files to high-fidelity Markdown documents with a fetch score indicating extraction quality. Use when user wants to parse, extract, or convert a PDF to markdown/md format, or when user needs to extract all information (text, tables, images) from a PDF systematically. Produces markdown that mimics the original PDF structure, plus a quality score. Combines local Python extraction with optional LLM enhancement. - Author: Himanshu Sinha - Repository: himanshusin/superpowers - Version: 20260201201514 - Stars: 0 - Forks: 0 - Last Updated: 2026-02-06 - Source: https://github.com/himanshusin/superpowers - Web: https://mule.run/skillshub/@@himanshusin/superpowers~pdf-to-markdown:20260201201514 --- --- name: pdf-to-markdown description: Convert PDF files to high-fidelity Markdown documents with a fetch score indicating extraction quality. Use when user wants to parse, extract, or convert a PDF to markdown/md format, or when user needs to extract all information (text, tables, images) from a PDF systematically. Produces markdown that mimics the original PDF structure, plus a quality score. Combines local Python extraction with optional LLM enhancement. --- # PDF to Markdown Extraction One-shot PDF to Markdown converter with quality scoring. Extracts text, tables, and images while preserving document structure. ## Quick Start ```bash # Install dependencies pip install pdfplumber pdf2image pytesseract Pillow pandas anthropic --break-system-packages # Basic extraction python scripts/extract_pdf.py input.pdf output.md # With LLM enhancement (recommended for complex documents) python scripts/extract_pdf.py input.pdf output.md --llm-enhance --verbose ``` ## Workflow ### Step 1: Install Dependencies ```bash pip install pdfplumber pdf2image pytesseract Pillow pandas --break-system-packages ``` For LLM enhancement (optional but recommended): ```bash pip install anthropic --break-system-packages ``` For OCR support on scanned PDFs: ```bash apt-get update && apt-get install -y poppler-utils tesseract-ocr ``` ### Step 2: Run Extraction Copy the PDF to working directory and run: ```bash python /path/to/skill/scripts/extract_pdf.py input.pdf output.md --llm-enhance -v ``` ### Step 3: Review Fetch Score The output markdown includes an **Extraction Report** at the end with: | Metric | Description | |--------|-------------| | **Overall Score** | 0-100 composite score with letter grade | | **Text Extraction** | How well text was captured | | **Structure Preservation** | Tables, images, formatting quality | | **Completeness** | Missing elements, warnings | **Score Interpretation:** - **A (90-100)**: Excellent - minimal manual review needed - **B (80-89)**: Good - minor formatting touch-ups may help - **C (70-79)**: Acceptable - some manual correction expected - **D (60-69)**: Fair - significant manual work likely - **F (<60)**: Poor - consider alternative extraction methods ## Output Format The generated markdown mimics PDF structure: ```markdown ## Document Title Body text preserved with paragraph structure... | Column 1 | Column 2 | Column 3 | |----------|----------|----------| | Data | Data | Data | ![Image 1.1](image_p1_1.png) --- ### Section Header More content... --- ## 📊 Extraction Report | Metric | Value | |--------|-------| | **Overall Score** | 85/100 (B - Good) | | **Text Extraction** | 90/100 | | **Structure Preservation** | 80/100 | | **Completeness** | 85/100 | ``` ## Features ### Local Python Extraction - **pdfplumber**: Primary text and table extraction - **pytesseract**: OCR fallback for scanned pages - **pdf2image**: Page-to-image conversion for OCR ### LLM Enhancement (--llm-enhance) - Fixes broken formatting and merged lines - Improves header hierarchy detection - Repairs malformed tables - Adds semantic markdown formatting (bold, italic) - Preserves all content without summarization ### Fetch Score Calculation The score evaluates extraction quality across three dimensions: 1. **Text Extraction (40%)**: Pages with extractable text, OCR success rate 2. **Structure Preservation (30%)**: Tables found, images detected, LLM enhancement 3. **Completeness (30%)**: Warnings count, OCR-only penalty ## Command Options ``` python extract_pdf.py [output.md] [options] Arguments: input_pdf Path to input PDF file output_md Output markdown path (default: input_name.md) Options: --llm-enhance Use Claude API to enhance extraction quality --verbose, -v Show detailed progress information --score-only Output only fetch score as JSON (no markdown) ``` ## Handling Edge Cases ### Scanned PDFs (Image-only) The script automatically attempts OCR when no text is extracted. For best results: ```bash # Ensure OCR tools installed apt-get install -y poppler-utils tesseract-ocr ``` ### Large PDFs For PDFs >50 pages, process in batches or use without LLM enhancement: ```bash python extract_pdf.py large.pdf output.md -v # No --llm-enhance ``` ### Complex Layouts Multi-column layouts may require LLM enhancement for best results: ```bash python extract_pdf.py complex.pdf output.md --llm-enhance -v ``` ## Troubleshooting | Issue | Solution | |-------|----------| | Low text score | Check if PDF is scanned (needs OCR) | | Tables not detected | Use --llm-enhance for better table parsing | | Missing images | Images are referenced, not embedded in MD | | OCR fails | Install tesseract-ocr and poppler-utils | | LLM errors | Check ANTHROPIC_API_KEY environment variable | ## Integration Example ```python from pathlib import Path import subprocess import json def extract_pdf_to_md(pdf_path: str, enhance: bool = True) -> tuple[str, dict]: """Extract PDF and return (markdown_content, fetch_score).""" output_path = Path(pdf_path).with_suffix('.md') cmd = ['python', 'scripts/extract_pdf.py', pdf_path, str(output_path)] if enhance: cmd.append('--llm-enhance') subprocess.run(cmd, check=True) # Get score separately score_result = subprocess.run( ['python', 'scripts/extract_pdf.py', pdf_path, '--score-only'], capture_output=True, text=True ) score = json.loads(score_result.stdout) return output_path.read_text(), score ```