# skill-ocr

> Converts complex PDF documents into structured Markdown with semantic image extraction and layout analysis using PaddleOCR (PP-StructureV3). Use when you need to digitize PDFs while preserving:
> 1. Document hierarchy (headings, numbering, and sections).
> 2. Tables (automatically converted to clean Markdown tables).
> 3. Images (extracted, semantically renamed based on nearby titles/text, and referenced).
> 4. Reading order recovery (fixing multi-column or complex layouts).
>
> **CRITICAL**: This skill MUST be executed using its own internal virtual environment.

- Author: lumen183
- Repository: lumen183/skill-ocr
- Version: 20260123145402
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/lumen183/skill-ocr
- Web: https://mule.run/skillshub/@@lumen183/skill-ocr~skill-ocr:20260123145402

---

---
name: skill-ocr
description: >
  Converts complex PDF documents into structured Markdown with semantic image extraction and layout analysis using PaddleOCR (PP-StructureV3).
  Use when you need to digitize PDFs while preserving:
  1. Document hierarchy (headings, numbering, and sections).
  2. Tables (automatically converted to clean Markdown tables).
  3. Images (extracted, semantically renamed based on nearby titles/text, and referenced).
  4. Reading order recovery (fixing multi-column or complex layouts).
  **CRITICAL**: This skill MUST be executed using its own internal virtual environment.
---

# PDF Structure OCR (PP-StructureV3)

This skill utilizes PaddleOCR's latest PP-StructureV3 engine to transform PDF files into high-quality Markdown. It performs layout analysis, OCR, table recognition, and smart image processing.

## Usage

This skill is self-contained. You **must** use the specific Python interpreter within the skill's directory to access pre-installed dependencies like `paddlepaddle-gpu` and `paddleocr`.

[!WARNING]
**High GPU Resource Usage**: This task is extremely GPU-intensive.
- You **must only initiate one task at a time**.
- Ensure sufficient VRAM is available before execution.

### Execution Command

The primary script is `scripts/process_pdf.py`. Execute it using the internal environment:

```bash
@path/env/bin/python @path/scripts/process_pdf.py [--output_md ]
```

**Parameters:**
- First positional argument: Path to the source PDF file.
- Second positional argument: Directory where the Markdown file and the images folder will be created.
- `--output_md`: (Optional) Custom name for the generated Markdown file. Defaults to `final_structured_result.md`.

### Key Features from the Script

- **Semantic Image Renaming**: Automatically searches for the nearest heading or paragraph title to name extracted images (e.g., `P1_FinancialChart_0_898_71.jpg`), making the assets human-readable.
- **Hierarchy & Layout Cleanup**: Fixes common OCR issues such as broken heading levels and redundant empty lines.
- **Coordinate Tracking**: Retains the original image coordinates in the filename for traceability.
- **Page Identification**: Injects `# Page N` markers at the start of each page's content for easier navigation.

## Output Structure

- `/`: The finalized Markdown file with corrected paths for images.
- `/imgs/`: A sub-folder containing all extracted figures, charts, and tables, named with semantic context.

## Example

To process `report.pdf` and save the results to an `out` folder under a custom file name:

```bash
@path/env/bin/python @path/scripts/process_pdf.py report.pdf ./out --output_md digitized_report.md
```

## Constraints

- **DO NOT** use the system `python3` or global `pip`.
- **DO NOT** attempt to install additional packages.
- **ALWAYS** reference the interpreter as `@path/env/bin/python`.
- If the `env/` directory is missing, the skill is improperly installed and will fail.
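
When the skill is driven from another script, the constraints above can be checked before anything is launched. The following is a minimal sketch, not part of the skill itself: `skill_root` is a hypothetical name for the directory that `@path` stands for, and `build_command` and `run_once` are illustrative helpers. It assumes only the command-line shape shown in the Example section.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_command(skill_root: Path, pdf: Path, out_dir: Path,
                  output_md: Optional[str] = None) -> List[str]:
    """Build the argument list for scripts/process_pdf.py.

    skill_root is a hypothetical stand-in for the `@path` placeholder.
    """
    interpreter = skill_root / "env" / "bin" / "python"
    script = skill_root / "scripts" / "process_pdf.py"
    # A missing env/ means the skill is improperly installed. Fail early
    # instead of falling back to the system python3.
    if not interpreter.exists():
        raise FileNotFoundError(f"internal interpreter not found: {interpreter}")
    cmd = [str(interpreter), str(script), str(pdf), str(out_dir)]
    if output_md is not None:
        cmd += ["--output_md", output_md]
    return cmd


def run_once(skill_root: Path, pdf: Path, out_dir: Path,
             output_md: Optional[str] = None) -> int:
    """Run one conversion and wait for it to finish."""
    cmd = build_command(skill_root, pdf, out_dir, output_md)
    # subprocess.run blocks until the process exits, so sequential calls
    # from one caller never overlap on the GPU.
    return subprocess.run(cmd, check=False).returncode
```

A call such as `run_once(Path("/path/to/skill-ocr"), Path("report.pdf"), Path("./out"), "digitized_report.md")` would mirror the example command, where `/path/to/skill-ocr` is a placeholder for the actual install location. Note that this only serializes calls made by a single caller; it does not guard against other processes starting a task at the same time.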