# pdf-reading

> Read local PDFs to extract and verify exact numbers (counts, percentages, tables, figure captions) for papers/questions in this repository. Use this when asked to “read a PDF”, “extract results from the paper”, “verify a statistic”, or “find the exact wording in the paper”.

- Author: Zeger Knops
- Repository: pondevelopment/llm-training
- Version: 20260206165052
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/pondevelopment/llm-training
- Web: https://mule.run/skillshub/@@pondevelopment/llm-training~pdf-reading:20260206165052

---

---
name: pdf-reading
description: Read local PDFs to extract and verify exact numbers (counts, percentages, tables, figure captions) for papers/questions in this repository. Use this when asked to “read a PDF”, “extract results from the paper”, “verify a statistic”, or “find the exact wording in the paper”.
---

## Goal

When you need facts from a paper PDF (counts, percentages, benchmark numbers, claims, limitations), extract *verbatim* evidence from the PDF and compute derived values yourself.

This repository’s content often depends on exact values from tables/figures (not abstracts). Always bias toward **precision and traceability**.

## Process

1. **Locate the PDF**
   - Search the repo for `.pdf` files.
   - If a paper directory contains a source PDF, prefer that.
   - If the only PDF is in `tmp/` or the repo root, confirm it corresponds to the paper in question before using it.

2. **Extract text locally (no network fetches)**
   - Prefer a local text extraction flow:
     - Use `.github/skills/pdf-reading/extract_pdf_text.py` to create a plain-text copy in `tmp/`.
     - If extraction fails, try a different backend (`pypdf` vs `pdftotext`) or fall back to manual inspection.

3. **Search within the extracted text**
   - Use targeted queries first (unique phrases, table titles, “Table 2”, “Appendix”, metric names).
   - For numbers, search patterns like `n=`, `N=`, `(`, `%)`, `Table`, `Figure`.

4. **Verify statistics (repo requirement)**
   - Prefer raw counts (e.g., “31/50”) over percentages when available.
   - If the paper gives counts, compute percentages yourself: $\text{pct} = 100 \times \frac{\text{numerator}}{\text{denominator}}$.
   - If a value is ambiguous (multiple similar tables/ablations), capture the surrounding label/context.

5. **Handle common PDF pitfalls**
   - **Hyphenation and line breaks:** words may be split across lines; search both with and without hyphens.
   - **Tables:** extracted text may be messy; search by row/column headers and unique tokens.
   - **Scanned PDFs:** text extraction may fail; use manual reading if needed.

## Output expectations

- When updating a question/paper, report the exact extracted phrase/value and where it came from (section/table/figure name).
- If you cannot reliably extract the needed value, explicitly say so and propose next steps (e.g., manual verification).

## Commands

- Extract text:
   - `python3 .github/skills/pdf-reading/extract_pdf_text.py path/to/paper.pdf`

- Extract to a specific file:
   - `python3 .github/skills/pdf-reading/extract_pdf_text.py path/to/paper.pdf --out tmp/paper.txt`

## Repository conventions to respect

- Keep diffs minimal and consistent with existing patterns.
- Park derived artifacts under `tmp/` (gitignored).
- Don’t add new dependencies unless explicitly requested; prefer optional tooling or clear fallbacks.