Build corpus-ready English TXT data from mixed file formats. Use when Codex needs to ingest raw text from PDF, TXT/Markdown, HTML/XML, DOCX, JSON/JSONL, CSV/TSV, or unknown text-like files; detect input formats at the start; clean and normalize extracted text; and produce presentable, analysis-ready corpus outputs.
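
The "detect first, then clean and normalize" flow described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the skill's actual implementation: the names `EXTENSION_FORMATS`, `detect_format`, and `normalize_text`, the format labels, and the specific normalization rules (NFKC, whitespace collapsing) are all assumptions chosen for the example. Real extraction from PDF or DOCX would need a dedicated parser, which is not shown.

```python
import re
import unicodedata
from pathlib import Path

# Hypothetical mapping from file extension to a coarse format label.
EXTENSION_FORMATS = {
    ".pdf": "pdf",
    ".txt": "text",
    ".md": "text",
    ".html": "markup",
    ".htm": "markup",
    ".xml": "markup",
    ".docx": "docx",
    ".json": "json",
    ".jsonl": "jsonl",
    ".csv": "delimited",
    ".tsv": "delimited",
}


def detect_format(path, head=b""):
    """Guess the input format from the extension, then from leading bytes.

    `head` is the first few bytes of the file, used only when the
    extension is missing or unrecognized.
    """
    suffix = Path(path).suffix.lower()
    if suffix in EXTENSION_FORMATS:
        return EXTENSION_FORMATS[suffix]
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"PK\x03\x04"):
        # Assumption: a ZIP container is treated as DOCX here; other
        # ZIP-based formats would need a closer look at the archive.
        return "docx"
    stripped = head.lstrip()
    if stripped.startswith(b"<"):
        return "markup"
    if stripped.startswith((b"{", b"[")):
        return "json"
    return "unknown-text"


def normalize_text(text):
    """Apply a small, assumed set of cleanup rules to extracted text."""
    text = unicodedata.normalize("NFKC", text)             # unify compatibility forms
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # normalize line endings
    text = re.sub(r"[ \t]+", " ", text)                    # collapse spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)                   # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)                 # at most one blank line
    return text.strip()
```

For example, `detect_format("report.PDF")` resolves by extension, while `detect_format("blob", b"%PDF-1.7")` falls back to the leading bytes. Detecting the format before any cleaning keeps format-specific extraction separate from the shared normalization step, so every input type ends up going through the same text cleanup.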