# pdf-reader > Extract text, tables, and images from PDF files using pdfplumber and PyMuPDF. Use when analyzing PDF documents, brand materials, reports, or any content that requires structured extraction from PDF format. Supports table detection, layout preservation, and high-quality image extraction. - Author: ledagame - Repository: ledagame/persona-theater - Version: 20251213014344 - Stars: 0 - Forks: 0 - Last Updated: 2026-02-06 - Source: https://github.com/ledagame/persona-theater - Web: https://mule.run/skillshub/@@ledagame/persona-theater~pdf-reader:20251213014344 --- --- name: pdf-reader description: Extract text, tables, and images from PDF files using pdfplumber and PyMuPDF. Use when analyzing PDF documents, brand materials, reports, or any content that requires structured extraction from PDF format. Supports table detection, layout preservation, and high-quality image extraction. --- # PDF Reader Extract comprehensive content from PDF files including text, tables, images, and metadata using industry-standard Python libraries. ## When to Use This Skill Use this skill when: - Analyzing brand documents, design briefs, or company materials in PDF format - Extracting structured data from reports, invoices, or forms with tables - Retrieving images, logos, or graphics embedded in PDFs - Converting PDF content into markdown or structured formats for further processing - Investigating PDF metadata, page layouts, or document structure ## How It Works This skill combines two powerful libraries: 1. **pdfplumber** - Excels at text extraction with layout preservation and table detection 2. **PyMuPDF** - Provides fast image extraction and comprehensive metadata access ## Quick Start ### Install Dependencies ```bash pip install pdfplumber pymupdf pillow ``` ### Extract from PDF ```bash python scripts/extract_pdf.py --output-dir ./output ``` ## Output Structure The script creates the following output: ``` output/ ├── extracted_text.md # All text content with page numbers ├── extracted_tables.json # Structured table data ├── metadata.json # PDF metadata and page info └── images/ # Extracted images ├── page_1_image_1.png ├── page_2_image_1.jpeg └── ... ``` ## Usage in Claude Code When a user provides a PDF file for analysis: 1. **Check dependencies**: Verify pdfplumber and PyMuPDF are installed 2. **Run extraction script**: Execute `scripts/extract_pdf.py` with the PDF path 3. **Analyze output**: Read the generated markdown and JSON files 4. **Present findings**: Summarize key content, tables, and images found Example workflow: ```bash # Step 1: Extract PDF content python scripts/extract_pdf.py "brand_document.pdf" --output-dir ./brand_analysis # Step 2: Read extracted text cat ./brand_analysis/extracted_text.md # Step 3: Analyze tables cat ./brand_analysis/extracted_tables.json # Step 4: Check images ls ./brand_analysis/images/ ``` ## Advanced Features ### Text Extraction Options The script uses `layout=True` to preserve formatting and spacing, which is crucial for: - Maintaining column layouts - Preserving visual hierarchy - Keeping address and contact information structured ### Table Detection pdfplumber automatically detects tables using: - Line-based strategies for ruled tables - Text alignment strategies for borderless tables - Configurable tolerance settings for complex layouts ### Image Quality PyMuPDF extracts images in their original format (JPEG, PNG) without re-encoding, preserving: - Original resolution and quality - Color profiles and metadata - Transparency and alpha channels ## Troubleshooting ### Missing Dependencies If the script fails with import errors: ```bash pip install --upgrade pdfplumber pymupdf pillow ``` ### Encrypted PDFs For password-protected PDFs, use PyMuPDF's authentication: ```python import pymupdf doc = pymupdf.open("encrypted.pdf") doc.authenticate("password") ``` ### Large Files For very large PDFs (>100MB), process pages in batches to manage memory usage. ## Technical Details ### Library Comparison | Feature | pdfplumber | PyMuPDF | |---------|-----------|---------| | Text extraction | ⭐⭐⭐ (Best layout preservation) | ⭐⭐ (Fast, good quality) | | Table detection | ⭐⭐⭐ (Excellent) | ⭐ (Basic) | | Image extraction | ⭐ (Basic) | ⭐⭐⭐ (Best performance) | | Speed | ⭐⭐ (Moderate) | ⭐⭐⭐ (Very fast) | | Memory usage | ⭐⭐ (Higher) | ⭐⭐⭐ (Lower) | ### Why Both Libraries? Using both libraries provides optimal results: - **pdfplumber** for text and tables (superior accuracy) - **PyMuPDF** for images and metadata (superior performance) This combination ensures comprehensive extraction without compromise. ## Examples ### Brand Document Analysis ```bash python scripts/extract_pdf.py "company_brand_guidelines.pdf" --output-dir ./brand_analysis ``` Expected output: - Brand values and mission statements (text) - Color palette tables (tables) - Logo variations (images) - Document metadata (metadata) ### Design Brief Extraction ```bash python scripts/extract_pdf.py "design_brief.pdf" --output-dir ./brief_analysis ``` Expected output: - Project requirements (text) - Timeline and milestones (tables) - Reference images (images) - Client information (metadata) ## Integration Tips ### With Logo Design Workflows After extracting brand guidelines: 1. Review extracted text for brand keywords and values 2. Analyze color palette tables for official brand colors 3. Examine logo images for existing design elements 4. Use findings to inform new logo design strategy ### With Research Workflows After extracting research documents: 1. Convert extracted tables to structured data 2. Process images for visual analysis 3. Generate summary reports from text content 4. Cross-reference metadata for document versioning