# document-reading

> Read and extract text from PDF files, Word documents, Excel spreadsheets, PowerPoint presentations, and images. Use when user asks to read, analyze, or extract content from .pdf, .docx, .xlsx, .pptx files or image files. Use markitdown to convert documents. Never use Read tool directly on binary files.

- Author: Jonathan Glasmeyer
- Repository: jonathanglasmeyer/dotfiles-2025
- Version: 20251217220623
- Stars: 1
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/jonathanglasmeyer/dotfiles-2025
- Web: https://mule.run/skillshub/@@jonathanglasmeyer/dotfiles-2025~document-reading:20251217220623

---

---
name: document-reading
description: Read and extract text from PDF files, Word documents, Excel spreadsheets, PowerPoint presentations, and images. Use when user asks to read, analyze, or extract content from .pdf, .docx, .xlsx, .pptx files or image files. Use markitdown to convert documents. Never use Read tool directly on binary files.
---

# Document Reading with markitdown

**NEVER use the Read tool on binary files (PDF, DOCX, PPTX, XLSX, images) - always convert them with markitdown first.**

## Standard Workflow

```bash
# 1. Convert to markdown in /tmp with descriptive name
markitdown "/path/to/file.pdf" > /tmp/document.md

# 2. Check line count to decide strategy
#    (named LINE_COUNT because bash can overwrite LINES with the terminal height)
LINE_COUNT=$(wc -l < /tmp/document.md)

if [ "$LINE_COUNT" -lt 300 ]; then
  # 3a. If small (< 300 lines): Read directly
  cat /tmp/document.md
else
  # 3b. If large (≥ 300 lines): Preview first
  head -n 200 /tmp/document.md
  # Ask user if they want more; only then read the rest:
  #   cat /tmp/document.md
fi
```

## OCR Fallback for Scanned PDFs

If markitdown output is minimal (< 50 characters) or looks like a scan:

```bash
# 1. Check if PDF has text (wc -c reports the byte count)
pdftotext input.pdf - | wc -c
# If < 50, it's likely a scan

# 2. Run OCR (German language; change "deu" to match the document)
ocrmypdf --language deu --output-type pdf input.pdf /tmp/input_ocr.pdf

# 3. Try markitdown again on OCR'd version
markitdown /tmp/input_ocr.pdf > /tmp/document.md
cat /tmp/document.md
```

## Progressive Reading for Very Large Docs

For documents > 1000 lines:

- Start: `head -n 200 /tmp/document.md` (first 200 lines)
- Next: `tail -n +201 /tmp/document.md | head -n 200` (lines 201-400)
- Full: `cat /tmp/document.md` (if user requests)

## Supported Formats

PDF, DOCX, PPTX, XLSX, images (OCR), HTML, text formats
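
The chunked reading described under "Progressive Reading for Very Large Docs" can be wrapped in a small helper so that any 200-line window is reachable without retyping offsets. This is only a sketch: `read_chunk` and its arguments are hypothetical names, not part of markitdown or of the original skill.

```bash
# Hypothetical helper (not part of markitdown): print chunk N of a file.
# Usage: read_chunk FILE CHUNK_NUMBER [CHUNK_SIZE]
# Chunks are 1-based; CHUNK_SIZE defaults to 200 lines.
read_chunk() {
  local file="$1" chunk="$2" size="${3:-200}"
  local start=$(( (chunk - 1) * size + 1 ))
  local end=$(( chunk * size ))
  # Print only lines start..end, then quit at the last requested line
  # so the remainder of a large file is not scanned.
  sed -n "${start},${end}p;${end}q" "$file"
}
```

For example, `read_chunk /tmp/document.md 1` covers the same range as `head -n 200`, and `read_chunk /tmp/document.md 2` prints lines 201-400. A chunk number past the end of the file prints nothing, which can serve as the signal to stop.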