# converting-html-to-pdf

> Converts multiple HTML files into a single PDF using pandoc. Use when asked to merge HTML pages into PDF, create a PDF book from HTML, or fix PDF generation issues.

- Author: Emmanuel Oh
- Repository: emmaneugene/agents
- Version: 20260208001823
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-09
- Source: https://github.com/emmaneugene/agents
- Web: https://mule.run/skillshub/@@emmaneugene/agents~converting-html-to-pdf:20260208001823

---

---
name: converting-html-to-pdf
description: Converts multiple HTML files into a single PDF using pandoc. Use when asked to merge HTML pages into PDF, create a PDF book from HTML, or fix PDF generation issues.
---

# Converting HTML to PDF

Converts a collection of downloaded HTML files into a single PDF document using pandoc and xelatex.

## Prerequisites

- `pandoc` installed
- A TeX engine (xelatex recommended)
- Node.js for preprocessing scripts

## Workflow

### 1. Merge HTML Files

Create a merge script that combines all HTML files in the correct order:

```javascript
// merge_html.js
const fs = require('fs');
const path = require('path');

// List HTML files in order (customize based on naming convention).
// Skip any previously generated merged file so it is not merged into itself.
const files = fs.readdirSync('.')
  .filter(f => f.endsWith('.html') && !f.includes('merged'))
  .sort();

// Concatenate the files in sorted order, separated by newlines.
let merged = '';
for (const file of files) {
  merged += fs.readFileSync(file, 'utf8') + '\n';
}

fs.writeFileSync('merged.html', merged);
```

### 2. Preprocess HTML for Pandoc Compatibility

Many HTML patterns don't convert well to PDF. Create a preprocessing script to fix common issues:

```javascript
// preprocess_html.js
const fs = require('fs');

let html = fs.readFileSync('merged.html', 'utf8');

// Example: Convert table-based admonition blocks to blockquotes.
// The markup matched here is an assumption (Asciidoctor-style
// <div class="admonitionblock TYPE"> wrapping a table with a
// <td class="content"> cell); adjust it to the actual source HTML.
html = html.replace(
  /<div class="admonitionblock (\w+)">\s*<table>[\s\S]*?<td class="content">([\s\S]*?)<\/td>[\s\S]*?<\/table>\s*<\/div>/g,
  (match, type, content) => {
    // Capitalize the admonition type ("note" -> "Note") to use as a label.
    const label = type.charAt(0).toUpperCase() + type.slice(1);
    return `<blockquote>
<p><strong>${label}:</strong> ${content.trim()}</p>
</blockquote>`;
  }
);

fs.writeFileSync('preprocessed.html', html);
```

### 3. Generate PDF with Pandoc

```bash
pandoc "preprocessed.html" -o "output.pdf" \
  --pdf-engine=xelatex \
  -V geometry:margin=1in \
  --toc --toc-depth=3 \
  -V colorlinks=true \
  -V linkcolor=blue \
  -V urlcolor=blue
```

### 4. Validate Output (Iterative)

**This step is critical.** After generating the PDF, visually validate it:

1. Use the `look_at` tool to examine the PDF and check for:
   - Missing content
   - Broken formatting
   - Garbled text from unconverted HTML elements
   - Layout issues
2. If issues are found:
   - Identify the problematic HTML patterns in the source
   - Add preprocessing rules to handle them
   - Regenerate the PDF
   - Validate again
3. Repeat until the PDF renders correctly.

## Common HTML Issues and Fixes

| Issue | HTML Pattern | Fix |
|-------|--------------|-----|
| Admonition blocks | `<div>` with tables | Convert to `<blockquote>` |
| Complex tables | Nested tables for layout | Flatten or convert to divs |
| Icon fonts | Empty icon elements (e.g. `<i>`) | Remove or replace with text |
| SVG images | Inline SVG | Extract to files or remove |
| Custom components | Framework-specific elements | Convert to standard HTML |

## Makefile Template

```makefile
PDF_HTML := preprocessed.html
PDF_OUTPUT := output.pdf

.PHONY: pdf preprocess merge

merge:
	node merge_html.js

preprocess: merge
	node preprocess_html.js

pdf: preprocess
	pandoc "$(PDF_HTML)" -o "$(PDF_OUTPUT)" --pdf-engine=xelatex -V geometry:margin=1in --toc --toc-depth=3 -V colorlinks=true -V linkcolor=blue -V urlcolor=blue
```

## Validation Checklist

When examining the PDF output, check:

- [ ] Table of contents generated correctly
- [ ] All chapters/sections present
- [ ] Code blocks render with proper formatting
- [ ] Images display (or are acceptably absent)
- [ ] No raw HTML visible in text
- [ ] Blockquotes and callouts formatted cleanly
- [ ] Links are clickable (blue text)
- [ ] Page breaks at reasonable locations