# skill-ocr

> Converts complex PDF documents into structured Markdown with semantic image extraction and layout analysis using PaddleOCR (PP-StructureV3). Use when you need to digitize PDFs while preserving:
>
> 1. Document hierarchy (headings, numbering, and sections).
> 2. Tables (automatically converted to clean Markdown tables).
> 3. Images (extracted, semantically renamed based on nearby titles/text, and referenced).
> 4. Reading order recovery (fixing multi-column or complex layouts).
>
> **CRITICAL**: This skill MUST be executed using its own internal virtual environment.

- Author: lumen183
- Repository: lumen183/skill-ocr
- Version: 20260123145402
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/lumen183/skill-ocr
- Web: https://mule.run/skillshub/@@lumen183/skill-ocr~skill-ocr:20260123145402

---

---
name: skill-ocr
description: >
  Converts complex PDF documents into structured Markdown with semantic image extraction and layout analysis using PaddleOCR (PP-StructureV3). 
  Use when you need to digitize PDFs while preserving:
  1. Document hierarchy (headings, numbering, and sections).
  2. Tables (automatically converted to clean Markdown tables).
  3. Images (extracted, semantically renamed based on nearby titles/text, and referenced).
  4. Reading order recovery (fixing multi-column or complex layouts).
  
  **CRITICAL**: This skill MUST be executed using its own internal virtual environment.
---

# PDF Structure OCR (PP-StructureV3)

This skill uses PaddleOCR's PP-StructureV3 engine to convert PDF files into structured Markdown. It performs layout analysis, OCR, table recognition, and semantic image extraction.

## Usage

This skill is self-contained. You **must** run it with the Python interpreter bundled in the skill's directory, because that environment holds the pre-installed dependencies such as `paddlepaddle-gpu` and `paddleocr`.

> [!WARNING]
> **High GPU Resource Usage**: This task is extremely GPU-intensive.
> - Run **only one task at a time**.
> - Ensure sufficient VRAM is available before execution.
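
One way to enforce the one-task-at-a-time rule from a calling program is a simple lock file. The sketch below is only an illustration and is not part of the skill: `single_task_lock` and the lock path are hypothetical names, and a crashed process can leave a stale lock behind that has to be removed by hand.

```python
import os
from contextlib import contextmanager


@contextmanager
def single_task_lock(lock_path):
    """Hold an exclusive lock file while one OCR task runs (hypothetical helper)."""
    try:
        # O_EXCL makes creation fail if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError(f"another OCR task appears to be running ({lock_path})")
    try:
        # Record the owning process ID to help diagnose stale locks.
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)
```

Wrapping the call to `process_pdf.py` in `with single_task_lock("/tmp/skill_ocr.lock"):` would make a second concurrent attempt fail fast instead of competing for VRAM.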

### Execution Command

The primary script is `scripts/process_pdf.py`. Execute it using the internal environment:

```bash
@path/env/bin/python @path/scripts/process_pdf.py <input_pdf_path> <output_directory> [--output_md <filename.md>]
```

**Parameters:**
- `<input_pdf_path>`: Path to the source PDF file.
- `<output_directory>`: Directory where the Markdown and images folder will be created.
- `--output_md`: (Optional) Custom name for the generated Markdown file. Defaults to `final_structured_result.md`.
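
When the command is launched from another Python program, the argument list can be assembled as in the following minimal sketch. It assumes `skill_root` is whatever directory `@path` resolves to in your installation; `build_command` and `run_ocr` are hypothetical helper names, not functions provided by the skill.

```python
import subprocess
from pathlib import Path


def build_command(skill_root, input_pdf, output_dir, output_md=None):
    """Return the argv list for process_pdf.py using the skill's own interpreter."""
    root = Path(skill_root)  # assumption: the directory that @path points to
    cmd = [
        str(root / "env" / "bin" / "python"),
        str(root / "scripts" / "process_pdf.py"),
        str(input_pdf),
        str(output_dir),
    ]
    if output_md is not None:
        cmd += ["--output_md", output_md]
    return cmd


def run_ocr(skill_root, input_pdf, output_dir, output_md=None):
    """Run the script, failing early if the bundled interpreter is missing."""
    cmd = build_command(skill_root, input_pdf, output_dir, output_md)
    if not Path(cmd[0]).is_file():
        raise FileNotFoundError(f"bundled interpreter not found: {cmd[0]}")
    # check=True raises CalledProcessError on a non-zero exit status.
    return subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the PDF path contains spaces.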

### Key Features of the Script
- **Semantic Image Renaming**: Automatically searches for the nearest heading or paragraph title to name extracted images (e.g., `P1_FinancialChart_0_898_71.jpg`), making the assets human-readable.
- **Hierarchy & Layout Cleanup**: Fixes common OCR issues such as broken heading levels and redundant empty lines.
- **Coordinate Tracking**: Retains the original image coordinates in the filename for traceability.
- **Page Identification**: Injects `# Page N` markers at the start of each page's content for easier navigation.
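
The `# Page N` markers make it straightforward to split the generated Markdown back into per-page chunks. The sketch below assumes each marker sits alone on its own line exactly as `# Page N`; `split_pages` is a hypothetical helper, and a heading in the source document that happens to read `# Page 3` would be mistaken for a marker.

```python
import re

# Matches a line consisting only of "# Page N" (trailing spaces or tabs allowed).
PAGE_MARKER = re.compile(r"^# Page (\d+)[ \t]*$", re.MULTILINE)


def split_pages(markdown_text):
    """Map page number -> Markdown content that follows its '# Page N' marker."""
    pages = {}
    matches = list(PAGE_MARKER.finditer(markdown_text))
    for i, match in enumerate(matches):
        # A page's content runs until the next marker or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown_text)
        pages[int(match.group(1))] = markdown_text[match.end():end].strip()
    return pages
```

For example, `split_pages(text)[2]` would return the content recognized on the second page.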

## Output Structure

- `<output_directory>/<output_md>`: The finalized Markdown file with corrected paths for images.
- `<output_directory>/imgs/`: A sub-folder containing all extracted figures, charts, and tables, named with semantic context.
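
After a run, a quick consistency check can confirm that the layout above was produced and that every extracted image is actually mentioned in the Markdown. This is a sketch under the assumption that image file names appear verbatim in the Markdown text (whether as Markdown links or HTML `<img>` tags); `unreferenced_images` is a hypothetical helper.

```python
from pathlib import Path


def unreferenced_images(output_dir, output_md="final_structured_result.md"):
    """Return names of files in imgs/ that the Markdown file never mentions."""
    out = Path(output_dir)
    text = (out / output_md).read_text(encoding="utf-8")
    return sorted(
        p.name
        for p in (out / "imgs").iterdir()
        if p.is_file() and p.name not in text
    )
```

An empty list means every file in `imgs/` is referenced at least once; a missing Markdown file or `imgs/` folder raises `FileNotFoundError`.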

## Example

To process `report.pdf` and save results to an `out` folder with a custom name:

```bash
@path/env/bin/python @path/scripts/process_pdf.py report.pdf ./out --output_md digitized_report.md
```

## Constraints

- **DO NOT** use the system `python3` or global `pip`.
- **DO NOT** attempt to install additional packages.
- **ALWAYS** reference the interpreter as `@path/env/bin/python`.
- If the `env/` directory is missing, the skill is improperly installed and will fail.