# pdf-parity-checker

> Verify visual and structural parity between XHTML chapters and POD PDF files. Use to ensure print edition matches digital EPUB layout.

- Author: miketui
- Repository: miketui/Fm
- Version: 20251217110951
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/miketui/Fm
- Web: https://mule.run/skillshub/@@miketui/Fm~pdf-parity-checker:20251217110951

---

---
name: pdf-parity-checker
description: Verify visual and structural parity between XHTML chapters and POD PDF files. Use to ensure print edition matches digital EPUB layout.
---

# PDF Parity Checker Skill

## Purpose

Compare the 44 XHTML chapter files against their corresponding POD (print-on-demand) PDF files to ensure visual and structural consistency. This is critical for maintaining brand quality across digital and print editions.

## When to Invoke

- User asks "do the PDFs match the EPUB chapters?"
- Before sending POD files to IngramSpark or print vendor
- After making changes to XHTML or CSS
- User mentions "print edition" or "PDF consistency"
- User asks "verify the PDFs are up to date"

## Workflow

### Run PDF Parity Verification

```bash
python3 scripts/pdf_verify.py \
  --root REBRANDED_OUTPUT \
  --targets docs/REBRANDED_VISUAL_AUDIT.json \
  --update-json
```

**What it does:**

1. For each of the 44 XHTML files:
   - Locates corresponding PDF in `REBRANDED_OUTPUT/pdf-pod/`
   - Compares:
     - Page count (XHTML rendered vs PDF pages)
     - Media box dimensions (PDF page size)
     - First-page visual hash (downscaled grayscale comparison)
     - Text extraction and paragraph continuity
2. If PDF is missing:
   - Generates temporary reference PDF via headless browser print-to-PDF
   - Uses this for comparison (but does NOT commit to repo)
   - Flags as "MISSING" in report
3. Updates `docs/REBRANDED_VISUAL_AUDIT.json` with:
   - `pdf_check` object for each chapter
   - Fields: `page_count_match`, `bbox_match`, `image_hash_delta`, `pdf_status`

## Comparison Metrics

### 1. Page Count Match

Compares rendered XHTML page count vs PDF page count.

**Example:**

```
Chapter I: "Unveiling Your Creative Odyssey"
- XHTML rendered: 8 pages (at 6×9" print size)
- PDF actual: 8 pages
- Status: ✅ MATCH
```

**Acceptable variance:**

- Exact match: ✅ PASS
- ±1 page: ⚠️ WARN (minor reflow difference)
- ±2 or more pages: ❌ FAIL (significant layout mismatch)

### 2. Media Box (Page Size)

Verifies PDF pages are correct physical dimensions.

**Expected for 6×9" POD:**

- Width: 432 points (6 inches × 72 points per inch)
- Height: 648 points (9 inches × 72 points per inch)

**Example:**

```
Chapter XV: Media box check
- Expected: 432×648 pt
- Actual: 432×648 pt
- Status: ✅ MATCH
```

### 3. Visual Hash Comparison

Computes perceptual hash of first page to detect visual differences.

**Process:**

1. Render XHTML first page as PNG (grayscale, downscaled to 200×300)
2. Convert PDF first page to PNG (same size)
3. Compute average hash for both
4. Calculate Hamming distance

**Scoring:**

- Hash delta 0-5: ✅ IDENTICAL (perfect match)
- Hash delta 6-15: ✅ SIMILAR (acceptable variance)
- Hash delta 16-30: ⚠️ DIFFERENT (minor layout shift)
- Hash delta >30: ❌ MISMATCH (significant visual difference)

**Example:**

```
Chapter IV: Visual hash comparison
- XHTML hash: d4a3f2c1...
- PDF hash: d4a3f2c1...
- Hamming distance: 3
- Status: ✅ IDENTICAL
```

### 4. Text Extraction

Extracts text from PDF and verifies key content is present.

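As a rough illustration of the title and paragraph-count checks in this step, the sketch here works on text that has already been extracted from the PDF (for example with `pypdf`). It is not the logic of `scripts/pdf_verify.py`: the function names, the whitespace normalization, and the sample text are assumptions made for the example.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace runs and lowercase, so line wraps in extracted text do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def title_found(pdf_text: str, title: str, window: int = 500) -> bool:
    """Return True if the chapter title occurs within the first `window` characters."""
    return normalize(title) in normalize(pdf_text[:window])


def paragraph_variance_pct(xhtml_count: int, pdf_count: int) -> float:
    """Percentage difference in paragraph count, relative to the XHTML count."""
    if xhtml_count == 0:
        return 0.0 if pdf_count == 0 else 100.0
    return abs(pdf_count - xhtml_count) / xhtml_count * 100


# Hypothetical extracted text, for illustration only.
sample = "Chapter XII\nFinancial\nWisdom\n\nMoney is a tool..."
print(title_found(sample, "Financial Wisdom"))   # True
print(round(paragraph_variance_pct(84, 83), 1))  # 1.2
```

Normalizing whitespace before matching is a design choice in this sketch: PDF text extraction often breaks a title across lines, and a literal substring test would then report a false negative.
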
**Checks:**

- Chapter title appears in first 500 characters
- Heading order matches XHTML heading structure
- Paragraph count is similar (±10%)

**Example:**

```
Chapter XII: Text extraction
- Title found: ✅ "Financial Wisdom"
- Headings: 12 in XHTML, 12 in PDF ✅
- Paragraphs: 84 in XHTML, 83 in PDF ✅ (within 10%)
- Status: ✅ PASS
```

## Interpreting Results

### JSON Output Structure

```json
{
  "file": "REBRANDED_OUTPUT/xhtml/9-chapter-i-unveiling-your-creative-odyssey.xhtml",
  "basename": "9-chapter-i-unveiling-your-creative-odyssey",
  "pdf_check": {
    "pdf_path": "REBRANDED_OUTPUT/pdf-pod/chapters/9-chapter-i-unveiling-your-creative-odyssey.pdf",
    "pdf_status": "ok",
    "page_count_match": true,
    "page_count_xhtml": 8,
    "page_count_pdf": 8,
    "bbox_match": true,
    "bbox_expected": [432, 648],
    "bbox_actual": [432, 648],
    "image_hash_delta": 3,
    "image_hash_verdict": "identical",
    "text_checks": {
      "title_found": true,
      "heading_count_match": true,
      "paragraph_variance_pct": 1.2
    }
  }
}
```

### Markdown Summary

Generated in `docs/REBRANDED_VISUAL_AUDIT.md`:

| File | PDF Status | Page Match | Visual Match | Issues |
|------|------------|------------|--------------|--------|
| 9-chapter-i-... | ✅ OK | ✅ 8 pages | ✅ Identical | None |
| 15-chapter-vi-... | ⚠️ OK | ⚠️ 10 vs 11 | ✅ Similar | +1 page variance |
| 22-chapter-xii-... | ❌ MISSING | N/A | N/A | PDF not found |

## Common Issues and Fixes

### Issue: Page Count Mismatch

**Symptom:** XHTML renders as 8 pages, PDF has 9 pages

**Possible causes:**

1. Extra blank page in PDF (page break issue)
2. Different margin settings between XHTML and PDF export
3. Widow/orphan control differences

**How to fix:**

1. Open the PDF in Acrobat to confirm the blank page
2. Adjust `print-pod.css` orphans/widows settings:
   ```css
   p {
     orphans: 2;
     widows: 2;
   }
   ```
3. Re-export PDF from InDesign or print-to-PDF workflow
4. Re-run parity check to verify

### Issue: Visual Hash Mismatch

**Symptom:** Hash delta >30 (significant visual difference)

**Possible causes:**

1. Font substitution in PDF vs XHTML
2. Image resolution difference
3. Different CSS applied (print vs digital styles)

**How to fix:**

1. Open both side-by-side:
   - XHTML screenshot: `docs/screenshots/{basename}/title_1080x1440.png`
   - PDF page 1: Open in Acrobat, export as PNG
2. Identify visual difference (typography, spacing, images)
3. Adjust `print-pod.css` to match digital rendering
4. Regenerate PDF
5. Re-run parity check

### Issue: Missing PDF

**Symptom:** `pdf_status: "missing"`

**Action:**

1. Check whether the PDF has been generated for this chapter yet
2. Verify PDF naming matches XHTML basename:
   - XHTML: `9-chapter-i-unveiling-your-creative-odyssey.xhtml`
   - PDF should be: `9-chapter-i-unveiling-your-creative-odyssey.pdf`
3. If the PDF exists under a different name, rename it to match
4. If the PDF doesn't exist:
   - Generate from XHTML using print-to-PDF workflow
   - Place in appropriate `pdf-pod/` subdirectory

### Issue: Text Extraction Failure

**Symptom:** `text_checks.title_found: false`

**Possible causes:**

1. PDF is image-only (scanned, not text-based)
2. PDF has font embedding issues (text not extractable)
3. PDF was generated with text as outlines/curves

**How to fix:**

1. Verify PDF is text-based (select text in Acrobat)
2. If image-only, regenerate PDF from source XHTML
3. Ensure fonts are embedded as subsets (not outlines)

## Example Invocation

User: "Are the print PDFs in sync with the latest EPUB chapters?"

Agent response:

1. Run PDF parity checker
2. Read results from updated JSON
3. Summarize:

```
PDF Parity Check Complete
=========================
Total chapters: 44
PDFs verified: 42
Missing PDFs: 2

Results:
✅ PASS: 38 chapters (perfect parity)
⚠️ WARN: 4 chapters (minor variance)
❌ FAIL: 0 chapters (significant mismatch)
🔍 MISSING: 2 chapters (PDFs not found)

Warnings (4 chapters):
- Chapter VI: +1 page variance (11 vs 10)
  - Extra blank page in PDF
- Chapter XI: Visual hash delta 18
  - Minor font rendering difference
- Chapter XV: Text variance 12%
  - Some paragraphs reflowed
- Chapter XXII: Media box 432×660
  - Height should be 648

Missing PDFs (2 chapters):
- 30-SelfAssessment.xhtml (no matching PDF found)
- 43-DoodlePage.xhtml (no matching PDF found)

Recommended actions:
1. Fix page break in Chapter VI
2. Review font settings for Chapter XI
3. Generate missing PDFs for Self-Assessment and Doodle pages
4. Verify media box for Chapter XXII

Full report: docs/REBRANDED_VISUAL_AUDIT.md (PDF Parity column)
Detailed JSON: docs/REBRANDED_VISUAL_AUDIT.json (pdf_check objects)
```

## Integration with Other Skills

**Run after:**

- `epub-visual-auditor` - Ensure XHTML rendering is correct first

**Run before:**

- Sending POD files to print vendor
- Uploading to IngramSpark or KDP Print
- Final publication package

**Pair with:**

- `epub-publication-validator` - Comprehensive pre-publication check

## Notes

- PDF comparison requires the `pypdf` and `Pillow` Python libraries
- First run may be slower (generates temporary PDFs for missing files)
- Temporary reference PDFs are stored in `/tmp/` and not committed to repo
- Visual hash comparison is perceptual (small rendering differences are OK)
- Re-run after any CSS or XHTML changes to verify that parity is maintained
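
The verdict thresholds from the Comparison Metrics section can be written as plain functions. This is a minimal sketch, not the code in `scripts/pdf_verify.py`: the function names and the lowercase verdict labels other than `"identical"` are hypothetical, and the hashes are assumed to be equal-length hex strings.

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count differing bits between two equal-length hex hash strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def hash_verdict(delta: int) -> str:
    """Map a Hamming distance to the scoring bands 0-5, 6-15, 16-30, >30."""
    if delta <= 5:
        return "identical"
    if delta <= 15:
        return "similar"
    if delta <= 30:
        return "different"
    return "mismatch"


def page_count_verdict(xhtml_pages: int, pdf_pages: int) -> str:
    """Exact match passes, a one-page difference warns, anything larger fails."""
    diff = abs(xhtml_pages - pdf_pages)
    if diff == 0:
        return "pass"
    if diff == 1:
        return "warn"
    return "fail"


def expected_media_box(width_in: float, height_in: float) -> tuple[int, int]:
    """Convert a trim size in inches to PDF points (72 points per inch)."""
    return (round(width_in * 72), round(height_in * 72))


# Hypothetical hash values, for illustration only.
delta = hamming_distance("ffff0000", "ffff0007")
print(delta, hash_verdict(delta))   # 3 identical
print(page_count_verdict(10, 11))   # warn
print(expected_media_box(6, 9))     # (432, 648)
```
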