# pdf-processing-pro

> Advanced PDF manipulation (OCR, forms, tables). Use this skill any time a user wants to: read, extract, merge, split, rotate, watermark, encrypt, decrypt, fill forms, extract tables, or perform OCR on PDF files

- Author: EnzoGiglioEB
- Repository: EnzoGiglioEB/ai-resources
- Version: 20260129120711
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/EnzoGiglioEB/ai-resources
- Web: https://mule.run/skillshub/@@EnzoGiglioEB/ai-resources~pdf-processing-pro:20260129120711

---

---
name: pdf-processing-pro
description: "Advanced PDF manipulation (OCR, forms, tables). Use this skill any time a user wants to: read, extract, merge, split, rotate, watermark, encrypt, decrypt, fill forms, extract tables, or perform OCR on PDF files"
license: Apache 2.0
---

# PDF Processing Pro

## Overview

Comprehensive PDF processing capabilities including OCR, form filling, table extraction, and document manipulation.

## Capabilities

### Document Manipulation

- **Merge PDFs**: Combine multiple PDFs into one
- **Split PDFs**: Extract pages or split into multiple files
- **Rotate pages**: Change page orientation
- **Add watermarks**: Text or image watermarks
- **Encrypt/Decrypt**: Password protection

### Content Extraction

- **OCR**: Extract text from scanned PDFs
- **Tables**: Extract tables to structured data
- **Images**: Extract embedded images
- **Text**: Extract searchable text content

### Form Processing

- **Fill forms**: Populate PDF form fields programmatically
- **Extract form data**: Read form field values
- **Analyze forms**: Detect form fields and structure

## Key Tools

### PyPDF2

Basic PDF manipulation (merge, split, rotate)

```python
from PyPDF2 import PdfReader, PdfWriter

# Merge PDFs: copy every page of each input into a single writer
writer = PdfWriter()
for pdf in ['file1.pdf', 'file2.pdf']:
    reader = PdfReader(pdf)
    for page in reader.pages:
        writer.add_page(page)

writer.write('merged.pdf')
```

### Tesseract OCR

Extract text from scanned PDFs. Tesseract reads image files, not PDFs, so each page must be rasterized first.

```bash
# Rasterize page 1 at 300 DPI (pdftoppm ships with poppler-utils); writes page.png
pdftoppm -r 300 -png -f 1 -singlefile input.pdf page

# OCR the page image into a searchable PDF (writes output.pdf)
tesseract page.png output -l eng pdf

# Multi-language OCR
tesseract page.png output -l eng+por pdf
```

### Tabula/Camelot

Extract tables from PDFs

```python
import tabula

# Extract all tables (returns a list of DataFrames)
tables = tabula.read_pdf('document.pdf', pages='all')

# Extract a specific area, given as [top, left, bottom, right] in points
tables = tabula.read_pdf('document.pdf', area=[0, 0, 100, 100])
```

### pdftk

Advanced PDF toolkit

```bash
# Fill PDF form
pdftk form.pdf fill_form data.fdf output filled.pdf

# Rotate all pages 90 degrees clockwise
pdftk input.pdf cat 1-endeast output rotated.pdf

# Add password
pdftk input.pdf output secured.pdf user_pw PASSWORD
```

## Workflows

### OCR Workflow

1. Check if PDF is searchable
2. If not, convert to images
3. Run OCR with Tesseract
4. Create searchable PDF

**Reference**: See `OCR.md` for detailed OCR guide

### Form Processing Workflow

1. Analyze form structure
2. Extract field names and types
3. Prepare form data
4. Fill form programmatically

**Reference**: See `FORMS.md` for form processing guide

### Table Extraction Workflow

1. Identify table pages
2. Extract with Tabula or Camelot
3. Clean and structure data
4. Export to CSV/Excel

**Reference**: See `TABLES.md` for table extraction guide

## Best Practices

1. **Test with sample pages** - run on a few pages before processing the entire document
2. **Preserve original files** - always work on copies
3. **OCR language selection** - specify the correct language(s)
4. **Table extraction** - try both Tabula and Camelot and keep the better result
5. **Form validation** - verify field names before filling

## Common Use Cases

### Merge Multiple PDFs

```bash
# Simple merge
pdftk file1.pdf file2.pdf cat output merged.pdf
```

### Extract Specific Pages

```bash
# Pages 1-5 and 10
pdftk input.pdf cat 1-5 10 output extracted.pdf
```

### OCR Scanned Document

```bash
# Rasterize page 1, then OCR with Portuguese + English
pdftoppm -r 300 -png -f 1 -singlefile scan.pdf scan
tesseract scan.png output -l por+eng pdf
```

### Extract Tables

```python
import tabula

# Write each extracted table to its own CSV file
df_list = tabula.read_pdf('report.pdf', pages='all')
for i, df in enumerate(df_list):
    df.to_csv(f'table_{i}.csv', index=False)
```

## Bundled Resources

### Documentation

- `OCR.md` - Complete OCR processing guide
- `FORMS.md` - PDF form handling guide
- `TABLES.md` - Table extraction guide

### Scripts

- `scripts/analyze_form.py` - Analyze PDF form structure

---

**Note**: This skill supports both Python (PyPDF2, tabula, camelot) and command-line tools (tesseract, pdftk). Ensure required tools are installed for full functionality.
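
One way to act on the note above is a quick pre-flight check before running any workflow. The following is a minimal sketch, not part of the bundled scripts: the helper name `find_missing` is hypothetical, and the module names assume the usual import names (for example, that the `tabula-py` package is imported as `tabula`).

```python
import importlib.util
import shutil

# Names taken from the note above. The import names are assumptions
# (e.g. the tabula-py package is imported as "tabula").
PYTHON_MODULES = ["PyPDF2", "tabula", "camelot"]
CLI_TOOLS = ["tesseract", "pdftk"]


def find_missing(modules=PYTHON_MODULES, tools=CLI_TOOLS):
    """Return (missing_modules, missing_tools) for the given names.

    Hypothetical helper: a module counts as missing when Python cannot
    locate it, a tool when no executable of that name is on PATH.
    """
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_tools = [t for t in tools if shutil.which(t) is None]
    return missing_modules, missing_tools


if __name__ == "__main__":
    modules, tools = find_missing()
    print("Missing Python modules:", modules or "none")
    print("Missing command-line tools:", tools or "none")
```

The check only confirms that each dependency can be located. It does not verify versions or that a tool actually runs, so a passing result is a necessary condition rather than a guarantee.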