# document-reading

> Read and extract text from PDF files, Word documents, Excel spreadsheets, PowerPoint presentations, and images. Use when user asks to read, analyze, or extract content from .pdf, .docx, .xlsx, .pptx files or image files. Use markitdown to convert documents. Never use Read tool directly on binary files.

- Author: Jonathan Glasmeyer
- Repository: jonathanglasmeyer/dotfiles-2025
- Version: 20251217220623
- Stars: 1
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/jonathanglasmeyer/dotfiles-2025
- Web: https://mule.run/skillshub/@@jonathanglasmeyer/dotfiles-2025~document-reading:20251217220623

---

---
name: document-reading
description: Read and extract text from PDF files, Word documents, Excel spreadsheets, PowerPoint presentations, and images. Use when user asks to read, analyze, or extract content from .pdf, .docx, .xlsx, .pptx files or image files. Use markitdown to convert documents. Never use Read tool directly on binary files.
---

# Document Reading with markitdown

**NEVER use Read tool on binary files (PDF, DOCX, PPTX, XLSX, images) - always use markitdown first.**

## Standard Workflow

```bash
# 1. Convert to markdown in /tmp (pick a descriptive output name)
markitdown "/path/to/file.pdf" > /tmp/document.md

# 2. Check line count to decide strategy
#    (not named LINES: bash can reset that variable to the terminal height)
line_count=$(wc -l < /tmp/document.md)

# 3a. If small (< 300 lines): read directly
if [ "$line_count" -lt 300 ]; then
  cat /tmp/document.md
# 3b. If large (≥ 300 lines): preview first
else
  head -n 200 /tmp/document.md
  # Ask user if they want more, then: cat /tmp/document.md
fi
```
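
Step 1 calls for a descriptive name in `/tmp`. One way to derive it from the input path is sketched below; `md_path_for` is a hypothetical helper name, not part of markitdown.

```bash
# Hypothetical helper: map an input path to /tmp/<basename-without-extension>.md
md_path_for() {
  local base="${1##*/}"               # strip the directory part
  printf '/tmp/%s.md\n' "${base%.*}"  # strip the last extension, add .md
}
```

Used as `markitdown "$input" > "$(md_path_for "$input")"`, this keeps conversions of different documents from overwriting each other.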

## OCR Fallback for Scanned PDFs

If markitdown output is minimal (< 50 characters) or looks like a scan:

```bash
# 1. Check if PDF has text
pdftotext input.pdf - | wc -c
# If < 50 characters, it's likely a scan

# 2. Run OCR (deu = German; use the Tesseract code for the document's language, e.g. eng)
ocrmypdf --language deu --output-type pdf input.pdf /tmp/input_ocr.pdf

# 3. Try markitdown again on OCR'd version
markitdown /tmp/input_ocr.pdf > /tmp/document.md
cat /tmp/document.md
```
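
The 50-character check can be wrapped in a small predicate so the fallback decision is explicit. This is a minimal sketch: `looks_scanned` is a hypothetical helper name, and it expects a text file such as the markitdown or `pdftotext` output.

```bash
# Hypothetical helper: exit 0 when the extracted text is below the threshold
# (default 50, matching the check above). Like the wc -c call above, it counts bytes.
looks_scanned() {
  local text_file="$1" threshold="${2:-50}"
  local chars
  chars=$(wc -c < "$text_file" | tr -d ' ')  # tr drops the padding some wc builds add
  [ "$chars" -lt "$threshold" ]
}
```

For example, `looks_scanned /tmp/document.md && echo "run ocrmypdf"` only suggests OCR when the conversion produced almost no text.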

## Progressive Reading for Very Large Docs

For documents > 1000 lines:
- Start: `head -n 200 /tmp/document.md` (lines 1-200)
- Next: `tail -n +201 /tmp/document.md | head -n 200` (lines 201-400)
- Full: `cat /tmp/document.md` (if the user requests it)
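
The same `tail | head` pattern extends to any later chunk. A minimal sketch, assuming a hypothetical helper named `read_chunk` and the 200-line chunk size used above:

```bash
# Hypothetical helper: print chunk N (1-based) of a file, 200 lines per chunk by default.
read_chunk() {
  local file="$1" chunk="$2" size="${3:-200}"
  local start=$(( (chunk - 1) * size + 1 ))
  # tail -n +START begins at the chunk's first line; head caps the chunk length.
  tail -n "+$start" "$file" | head -n "$size"
}
```

`read_chunk /tmp/document.md 3` would print lines 401-600, so the reader can keep paging without recomputing offsets by hand.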

## Supported Formats

PDF, DOCX, PPTX, XLSX, images (OCR), HTML, text formats