# readonly-doc-extraction

> Read-only extraction of regulation-style PDFs into a structured JSON schema without modifying the source files. Use when asked to extract policy/regulation content from PDFs and save results as a new JSON file.

- Author: KeithCCC
- Repository: KChen-FMI/workspace-docs
- Version: 20260206105543
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/KChen-FMI/workspace-docs
- Web: https://mule.run/skillshub/@@KChen-FMI/workspace-docs~readonly-doc-extraction:20260206105543

---

---
name: readonly-doc-extraction
description: Read-only extraction of regulation-style PDFs into a structured JSON schema without modifying the source files. Use when asked to extract policy/regulation content from PDFs and save results as a new JSON file.
---

# Readonly Doc Extraction

## Overview
Extract regulation-style PDF content into a consistent JSON schema **without editing the source PDF**.  
This skill is **rule-based only** and does **not** use LLMs or external APIs.

## Workflow
1. Identify the input PDF path and a target output JSON path.
2. Use the extraction script to produce JSON in the recommended schema.
3. Verify required fields are present; do not edit the source file.
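
Step 3 can be automated with a small check like the one below. This is a minimal sketch, not part of the skill: the field names `doc_id` and `source_path` are hypothetical placeholders, since the actual required fields are defined in `references/schema.md`. The helper names `missing_fields` and `check_output` are also illustrative.

```python
import json
from pathlib import Path

# Hypothetical field names for illustration only; the real list of
# required fields is defined in references/schema.md.
REQUIRED_FIELDS = ("doc_id", "source_path")


def missing_fields(record: dict, required=REQUIRED_FIELDS) -> list:
    """Return the required keys that are absent or empty in `record`."""
    empty_values = (None, "", [], {})
    return [key for key in required if record.get(key) in empty_values]


def check_output(json_path) -> list:
    """Load an extracted JSON file (read-only) and list its missing fields."""
    record = json.loads(Path(json_path).read_text(encoding="utf-8"))
    return missing_fields(record)
```

An empty list from `check_output("PATH/TO/REGULATION.extracted.json")` would mean every field in the assumed list is present and non-empty. The check only reads the JSON output and never opens the source PDF.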

## Quick Start
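Run the script at `scripts/extract_regulation_pdf.py`. The absolute path below is the author's install location; substitute the directory where the skill is installed on your machine: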

```bash
python C:/Users/kchen/.codex/skills/readonly-doc-extraction/scripts/extract_regulation_pdf.py \
  --input "PATH/TO/REGULATION.pdf" \
  --output "PATH/TO/REGULATION.extracted.json" \
  --doc-id "07-0120"
```

## Output Schema
See `references/schema.md` for the canonical JSON format.

## Guardrails
- **Read-only**: do not modify the input document.
- **No LLM / No API**: keep extraction local and deterministic.
- Always write to a **new** `.extracted.json` file.
- Preserve an evidence link to the original PDF path in the output JSON.
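
The guardrails above could be enforced in code roughly as follows. This is a sketch under stated assumptions, not the behavior of `extract_regulation_pdf.py`: the function names `sha256_of` and `write_extraction` and the `source_path` key are hypothetical, and the hash comparison is one possible way to confirm the PDF was left untouched.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path) -> str:
    """Hash a file, opening it in binary read mode only."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_extraction(pdf_path, output_path, record: dict) -> Path:
    """Write `record` to a new .extracted.json file without touching the PDF."""
    pdf_path, output_path = Path(pdf_path), Path(output_path)
    if not output_path.name.endswith(".extracted.json"):
        raise ValueError("output must be a .extracted.json file")

    before = sha256_of(pdf_path)

    # "source_path" is a hypothetical key name for the evidence link.
    payload = dict(record, source_path=str(pdf_path))

    # Mode "x" raises FileExistsError instead of overwriting an existing file.
    with open(output_path, "x", encoding="utf-8") as fh:
        json.dump(payload, fh, ensure_ascii=False, indent=2)

    # Confirm the source PDF is byte-for-byte unchanged.
    if sha256_of(pdf_path) != before:
        raise RuntimeError("source PDF changed during extraction")
    return output_path
```

Opening the output with mode `"x"` makes "always write to a new file" a hard failure rather than a convention, and the before/after hash turns the read-only rule into something that can be checked.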