# pdf-to-data

> Extract structured data from PDFs to JSON, CSV, or Excel

- Author: Apoorv Garg
- Repository: apoorvgarg31/claude-code-skills
- Version: 20260202181546
- Stars: 8
- Forks: 1
- Last Updated: 2026-02-06
- Source: https://github.com/apoorvgarg31/claude-code-skills
- Web: https://mule.run/skillshub/@@apoorvgarg31/claude-code-skills~pdf-to-data:20260202181546

---

---
name: pdf-to-data
description: Extract structured data from PDFs to JSON, CSV, or Excel
argument-hint: <pdf-file> [--format json|csv|xlsx] [--output <file>]
user-invocable: true
allowed-tools: Read, Write, Bash
---

# PDF to Data Extractor

Extract tables, text, and form fields from PDFs into structured formats.

## Usage

```
/pdf-to-data invoice.pdf
/pdf-to-data report.pdf --format xlsx
/pdf-to-data forms.pdf --format csv --output data.csv
```

## Arguments

- `$0` - Path to PDF file (required)
- `--format` - Output format: `json` (default), `csv`, `xlsx`
- `--output` - Output file path (optional, defaults to stdout for json/csv)
- `--tables-only` - Extract only tables, skip general text
- `--page` - Extract specific page number (1-indexed)
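
The extraction script itself is not reproduced here, so the following is only a sketch of how the flags above might be declared with `argparse`. The function name `build_parser` and the positional name `pdf` are assumptions made for illustration, not taken from `extract.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments."""
    parser = argparse.ArgumentParser(prog="extract.py")
    parser.add_argument("pdf", help="Path to PDF file (required)")
    parser.add_argument("--format", choices=["json", "csv", "xlsx"], default="json")
    parser.add_argument("--output", default=None, help="Output file path")
    parser.add_argument("--tables-only", action="store_true")
    parser.add_argument("--page", type=int, default=None, help="1-indexed page number")
    return parser


args = build_parser().parse_args(
    ["large-doc.pdf", "--page", "3", "--tables-only", "--format", "csv"]
)
print(args.format, args.page, args.tables_only, args.output)
# prints: csv 3 True None
```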

## How It Works

1. **Text Extraction:** Extracts all text content with layout preservation
2. **Table Detection:** Uses PyMuPDF's table detection to find and extract tabular data
3. **Form Fields:** Extracts form field names and values from fillable PDFs
4. **Metadata:** Includes document metadata (title, author, creation date)
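
The four steps above could be sketched with PyMuPDF roughly as follows. This is an illustration rather than the skill's actual script: the helper names `page_to_record` and `extract_pdf` and the `creation_date` key are assumptions, while `get_text`, `find_tables`, `widgets`, `metadata`, and `page_count` are PyMuPDF calls.

```python
def page_to_record(page, number):
    """Build one per-page record from a PyMuPDF Page (or any object with the same methods)."""
    return {
        "number": number,
        "text": page.get_text("text"),
        # find_tables() returns a TableFinder; each table extracts to a list of rows
        "tables": [table.extract() for table in page.find_tables().tables],
        "form_fields": [
            {"name": widget.field_name, "value": widget.field_value}
            for widget in page.widgets()
        ],
    }


def extract_pdf(path):
    """Open a PDF and collect document metadata plus one record per page."""
    try:
        import pymupdf  # import name in recent PyMuPDF releases
    except ImportError:
        import fitz as pymupdf  # import name in older releases

    with pymupdf.open(path) as doc:
        meta = doc.metadata or {}
        return {
            "metadata": {
                "title": meta.get("title"),
                "author": meta.get("author"),
                "creation_date": meta.get("creationDate"),
                "pages": doc.page_count,
            },
            # Page numbers are reported 1-indexed, matching the --page flag
            "pages": [page_to_record(page, i + 1) for i, page in enumerate(doc)],
        }
```

Keeping the per-page logic in its own function makes it possible to exercise the record shape with stand-in objects, without needing a real PDF.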

## Requirements

Install PyMuPDF, plus openpyxl for XLSX output:

```bash
pip install pymupdf openpyxl
```

## Running the Extraction

```bash
python ~/.claude/skills/pdf-to-data/scripts/extract.py $ARGUMENTS
```

Or if running from the project directory:

```bash
python ./scripts/extract.py $ARGUMENTS
```

## Output Formats

### JSON (default)
```json
{
  "metadata": { "title": "...", "pages": 5 },
  "pages": [
    {
      "number": 1,
      "text": "...",
      "tables": [[["Header1", "Header2"], ["row1col1", "row1col2"]]],
      "form_fields": [{"name": "field1", "value": "..."}]
    }
  ]
}
```

### CSV
Exports all tables as CSV. Multiple tables are separated by blank lines.
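
A minimal sketch of that layout using the standard `csv` module. The function name `tables_to_csv` is hypothetical, and the input is assumed to be a list of tables in the same nested-list shape as the JSON `tables` field.

```python
import csv
import io


def tables_to_csv(tables):
    """Serialize tables (each a list of rows) with a blank line between tables."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for index, table in enumerate(tables):
        if index > 0:
            buffer.write("\n")  # blank separator line between tables
        writer.writerows(table)
    return buffer.getvalue()


tables = [
    [["Header1", "Header2"], ["row1col1", "row1col2"]],
    [["Item", "Total"], ["Widget, large", "12.50"]],
]
print(tables_to_csv(tables))
```

Cells containing commas are quoted by the `csv` writer, so a consumer can split the output on blank lines and parse each chunk as an independent CSV table.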

### XLSX
Creates an Excel workbook with:
- Sheet "Text" - Full text content per page
- Sheet "Table_1", "Table_2", etc. - Each detected table
- Sheet "Form_Fields" - All form field data
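
One way the workbook layout could be produced with openpyxl is sketched below. The helper names `plan_sheets` and `write_workbook` and the header rows on the "Text" and "Form_Fields" sheets are assumptions; the source only specifies the sheet names.

```python
def plan_sheets(pages):
    """Map sheet names to rows, following the sheet layout listed above."""
    sheets = {"Text": [["Page", "Text"]]}  # header row is an assumption
    fields = [["Page", "Name", "Value"]]  # header row is an assumption
    table_count = 0
    for page in pages:
        sheets["Text"].append([page["number"], page.get("text", "")])
        for table in page.get("tables", []):
            table_count += 1  # tables are numbered across the whole document
            sheets[f"Table_{table_count}"] = table
        for field in page.get("form_fields", []):
            fields.append([page["number"], field["name"], field["value"]])
    sheets["Form_Fields"] = fields
    return sheets


def write_workbook(sheets, path):
    """Write the planned sheets to an .xlsx file with openpyxl."""
    from openpyxl import Workbook

    workbook = Workbook()
    workbook.remove(workbook.active)  # drop the default empty sheet
    for name, rows in sheets.items():
        worksheet = workbook.create_sheet(title=name)
        for row in rows:
            worksheet.append(row)
    workbook.save(path)
```

Separating the planning step from the openpyxl calls keeps the sheet naming logic easy to check on plain dictionaries.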

## Examples

**Basic extraction:**
```
/pdf-to-data quarterly-report.pdf
```

**Export tables to Excel:**
```
/pdf-to-data financial-data.pdf --format xlsx --output financials.xlsx
```

**Extract specific page as CSV:**
```
/pdf-to-data large-doc.pdf --page 3 --tables-only --format csv
```

## Error Handling

- If the PDF is encrypted, the script reports the error
- If no tables are found, the `tables` array is returned empty
- Invalid page numbers are handled gracefully
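
How the script performs these checks is not shown, so the following is only a sketch of what such validation might look like. The helper names `resolve_page_index` and `check_readable` are hypothetical; `needs_pass` is the PyMuPDF `Document` property indicating that a password is still required.

```python
def resolve_page_index(page_number, page_count):
    """Convert a 1-indexed --page value to a 0-indexed page index, or raise ValueError."""
    if page_count < 1:
        raise ValueError("document has no pages")
    if not 1 <= page_number <= page_count:
        raise ValueError(f"page {page_number} is out of range (valid: 1-{page_count})")
    return page_number - 1


def check_readable(doc):
    """Raise if a PyMuPDF Document still needs a password before it can be read."""
    if doc.needs_pass:
        raise PermissionError("PDF is encrypted and requires a password")


try:
    resolve_page_index(9, 5)
except ValueError as error:
    print(f"error: {error}")
# prints: error: page 9 is out of range (valid: 1-5)
```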