# academic-pdf-redaction > Redact text from PDF documents for blind review anonymization - Author: JierunChen - Repository: benchflow-ai/skillsbench - Version: 20260122170733 - Stars: 501 - Forks: 0 - Last Updated: 2026-02-06 - Source: https://github.com/benchflow-ai/skillsbench - Web: https://mule.run/skillshub/@@benchflow-ai/skillsbench~academic-pdf-redaction:20260122170733 --- --- name: academic-pdf-redaction description: Redact text from PDF documents for blind review anonymization --- # PDF Redaction for Blind Review Redact identifying information from academic papers for blind review. ## CRITICAL RULES 1. **PRESERVE References section** - Self-citations MUST remain intact 2. **ONLY redact specific text matches** - Never redact entire pages/regions 3. **VERIFY output** - Check that 80%+ of original text remains ## Common Pitfalls to AVOID ```python # ❌ WRONG - This removes ALL text from the page: for block in page.get_text("blocks"): page.add_redact_annot(fitz.Rect(block[:4])) # ❌ WRONG - Drawing rectangles over text: page.draw_rect(fitz.Rect(0, 0, 600, 100), fill=(0,0,0)) # ✅ CORRECT - Only redact specific search matches: for rect in page.search_for("John Smith"): page.add_redact_annot(rect) ``` ## Patterns to Redact (Before References Only) **IMPORTANT: Use FULL names/phrases, not partial matches!** - ✅ "John Smith" (full name) - ❌ "Smith" (partial - would incorrectly match "Smith et al." citations in References) 1. **Author names** - FULL names only (e.g., "John Smith", not just "Smith") 2. **Affiliations** - Universities, companies (e.g., "Duke University") 3. **Email addresses** - Pattern: `*@*.edu`, `*@*.com` 4. **Venue names** - Conference/workshop names (e.g., "ICML 2024", "ICML Workshop") 5. **arXiv identifiers** - Pattern: `arXiv:XXXX.XXXXX` 6. **DOIs** - Pattern: `10.XXXX/...` 7. **Acknowledgement names** - Names in "Acknowledgements" section 8. **Equal contribution footnotes** - e.g., "Equal contribution", "* Equal contribution" ## PyMuPDF (fitz) - Recommended Approach ```python import fitz import os def redact_with_pymupdf(input_path: str, output_path: str, patterns: list[str]): """Redact specific patterns from PDF using PyMuPDF.""" doc = fitz.open(input_path) original_len = sum(len(p.get_text()) for p in doc) # Find References page - stop redacting there references_page = None for i, page in enumerate(doc): if "references" in page.get_text().lower(): references_page = i break for page_num, page in enumerate(doc): if references_page is not None and page_num >= references_page: continue # Skip References section for pattern in patterns: # ONLY redact exact search matches for rect in page.search_for(pattern): page.add_redact_annot(rect, fill=(0, 0, 0)) page.apply_redactions() os.makedirs(os.path.dirname(output_path), exist_ok=True) doc.save(output_path) doc.close() # MUST verify after saving verify_redaction(input_path, output_path) ``` ## REQUIRED: Verification Function **Always run this after ANY redaction to catch errors early:** ```python import fitz def verify_redaction(original_path, output_path): """Verify redaction didn't corrupt the PDF.""" orig = fitz.open(original_path) redc = fitz.open(output_path) orig_len = sum(len(p.get_text()) for p in orig) redc_len = sum(len(p.get_text()) for p in redc) print(f"Original: {len(orig)} pages, {orig_len} chars") print(f"Redacted: {len(redc)} pages, {redc_len} chars") print(f"Retained: {redc_len/orig_len:.1%}") # DEFENSIVE CHECKS - fail fast if something went wrong if len(redc) != len(orig): raise ValueError(f"Page count changed: {len(orig)} -> {len(redc)}") if redc_len < 1000: raise ValueError(f"PDF corrupted: only {redc_len} chars remain!") if redc_len < orig_len * 0.7: raise ValueError(f"Too much removed: kept only {redc_len/orig_len:.0%}") orig.close() redc.close() print("✓ Verification passed") ```