english-corpus-prep

Name: english-corpus-prep
Brand: MuleRun
Author: Merlin

Feb 7, 2026

Build corpus-ready English TXT data from mixed file formats. Use when Codex needs to ingest raw text from PDF, TXT/Markdown, HTML/XML, DOCX, JSON/JSONL, CSV/TSV, or unknown text-like files; detect input formats at the start; clean and normalize extracted text; and produce presentable, analysis-ready corpus outputs.
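
As a rough illustration of the "detect input formats at the start" and "clean and normalize" steps, here is a minimal Python sketch. It is not the skill's actual implementation: the names `EXTENSION_FORMATS`, `detect_format`, and `normalize_text`, the format labels, and the specific cleanup rules (NFKC normalization, control-character removal, whitespace collapsing) are all assumptions chosen for the example.

```python
import re
import unicodedata
from pathlib import Path

# Hypothetical extension-to-format map; the labels are illustrative only.
EXTENSION_FORMATS = {
    ".pdf": "pdf",
    ".txt": "text",
    ".md": "text",
    ".html": "html",
    ".htm": "html",
    ".xml": "xml",
    ".docx": "docx",
    ".json": "json",
    ".jsonl": "jsonl",
    ".csv": "csv",
    ".tsv": "tsv",
}


def detect_format(path, head=b""):
    """Guess a format label from the extension, then from the leading bytes."""
    suffix = Path(path).suffix.lower()
    if suffix in EXTENSION_FORMATS:
        return EXTENSION_FORMATS[suffix]
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"PK\x03\x04"):
        # DOCX files are ZIP archives, so this needs further inspection.
        return "zip-container"
    stripped = head.lstrip()
    if stripped.startswith(b"<"):
        return "html" if b"<html" in head.lower() else "xml"
    if stripped.startswith((b"{", b"[")):
        return "json"
    return "unknown-text"


def normalize_text(text):
    """Conservative cleanup: NFKC, drop control characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Keep newlines and tabs for now; remove every other control character.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    text = re.sub(r"[ \t]+", " ", text)      # runs of spaces/tabs become one space
    text = re.sub(r" *\n *", "\n", text)     # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)   # at most one blank line in a row
    return text.strip()
```

In a pipeline along these lines, `detect_format` would run once per input file before any extraction, and `normalize_text` would run on the extracted text before it is written out as TXT. Note that NFKC also folds ligatures and full-width characters, which may or may not be desirable for a given corpus.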