# english-corpus-prep

> Build corpus-ready English TXT data from mixed file formats. Use when Codex needs to ingest raw text from PDF, TXT/Markdown, HTML/XML, DOCX, JSON/JSONL, CSV/TSV, or unknown text-like files; detect input formats at the start; clean and normalize extracted text; and produce presentable, analysis-ready corpus outputs.

- Author: Merlin
- Repository: merlinxdyang/corpus_skill
- Version: 20260208000142
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/merlinxdyang/corpus_skill
- Web: https://mule.run/skillshub/@@merlinxdyang/corpus_skill~english-corpus-prep:20260208000142

---

---
name: english-corpus-prep
description: Build corpus-ready English TXT data from mixed file formats. Use when Codex needs to ingest raw text from PDF, TXT/Markdown, HTML/XML, DOCX, JSON/JSONL, CSV/TSV, or unknown text-like files; detect input formats at the start; clean and normalize extracted text; and produce presentable, analysis-ready corpus outputs.
---

# English Corpus Prep

Prepare standardized UTF-8 TXT corpus output with deterministic format detection, extraction, cleaning, error logging, and Penn Treebank POS template export.

Use the bundled script first; patch it only if a new format or project-specific rule is required.

## Workflow

1. Gather all input files or directories.
2. Run preflight checks for oversized files/data volumes; request confirmation when thresholds are exceeded.
3. Detect format before extraction for each file.
4. Extract raw text with the format-specific handler.
5. Decode non-UTF-8 text to UTF-8 when possible; log and skip on conversion failure.
6. Extract and save per-file metadata separately; remove metadata blocks from corpus text.
7. For PDF, keep narrative body text only (remove cover/title metadata, copyright blocks, TOC, references, and footnote-like noise).
8. Clean and normalize text using the default profile.
9. Log parse/encoding/readability failures and skip bad files.
10. Export presentable outputs:
    - per-file cleaned TXT files
    - combined clean corpus text
    - per-file PTB POS template files
    - combined POS template corpus
    - metadata files/manifests, error logs, and corpus report stats
11. Spot-check 3-5 outputs and adjust cleaning rules if needed.

## Run The Pipeline

Use `scripts/build_corpus.py`:

```bash
python3 scripts/build_corpus.py [ ...] --output-dir [--recursive] [--skip-empty] [--assume-yes]
```

Examples:

```bash
python3 scripts/build_corpus.py ./raw --output-dir ./corpus_out --recursive
python3 scripts/build_corpus.py ./raw/a.pdf ./raw/b.html --output-dir ./corpus_out
python3 scripts/build_corpus.py ./raw --output-dir ./corpus_out --recursive --assume-yes
```

## Format Detection And Extraction Behavior

Detection priority:

1. File signature and extension (`.pdf`, `.html`, `.xml`, `.docx`, `.json`, `.jsonl`, `.csv`, `.tsv`, text extensions)
2. Lightweight content sniffing (`