# math-extractor

> Extracts strictly mathematical terms (Definitions, Theorems, Lemmas, Propositions, Proofs) from documents (PDF, MD, TEX, TXT), handling PDF conversion and AI-based cleaning. Use when the user wants to extract math content from a file.

- Author: Develata
- Repository: Develata/Deve-Skills
- Version: 20260129203735
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/Develata/Deve-Skills
- Web: https://mule.run/skillshub/@@Develata/Deve-Skills~math-extractor:20260129203735

---

---
name: math-extractor
description: Extracts strictly mathematical terms (Definitions, Theorems, Lemmas, Propositions, Proofs) from documents (PDF, MD, TEX, TXT), handling PDF conversion and AI-based cleaning. Use when the user wants to extract math content from a file.
---

# Math Extractor

This skill extracts mathematical definitions, theorems, lemmas, propositions, and proofs from documents.

## Input Schema

```xml
Path to the source file (pdf/md/tex/txt)
```

## Logic & Workflow

The Agent must follow this Chain of Thought (CoT):

1. **Env Check**: First, verify that `scripts/processor.py` can access the necessary API keys (MinerU & LLM) from the environment. If any are missing, return a configuration error.
2. **Validation**: Check the file extension. If it is not .pdf/.md/.tex/.txt, return "不支持当前文件格式" ("The current file format is not supported").
3. **Conversion**:
    * If the file is a PDF: Call `convert_pdf`. The script internally uses the pre-configured MinerU key.
    * If conversion fails (or the key is missing), return "未设定好pdf转化为md的工具" ("The PDF-to-Markdown conversion tool is not configured").
4. **Preprocessing**:
    * Call `clean_and_chunk` (implemented in `clean_content`).
    * Aggressively remove images, TOCs, and References to save tokens.
5. **Extraction (Batch AI)**:
    * Call `batch_extract_math` (implemented in `batch_extract`).
    * The script uses the pre-configured LLM credentials to process chunks in parallel.
6. **Merge & Output**:
    * Save to `{filename}_extracted.md` and return the path.

## Usage

To use this skill, execute the Python script with the file path.

**Required Environment Variables:**

* `EXTRACTION_API_KEY`: API Key for LLM (e.g., OpenAI, DeepSeek).
* `EXTRACTION_BASE_URL`: Base URL for LLM API (default: `https://api.openai.com/v1`).

**Optional Environment Variables:**

* `MINERU_API_KEY`: Required only for PDF conversion.
* `MINERU_BASE_URL`: Base URL for MinerU API (default: `https://api.mineru.com/v1`).
* `LLM_MODEL`: Model name to use (default: `gpt-4o`).

```bash
python scripts/processor.py
```

## Features

* **Robust PDF Conversion**: Uses MinerU for high-quality PDF to Markdown conversion.
* **Smart Chunking**: Splits text by paragraphs to avoid breaking math formulas.
* **Cost Optimization**: Heuristically filters out non-math chunks to save tokens.
* **Math Protection**: Whitelists safe HTML tags to prevent accidental deletion of math inequalities (e.g., `a < b`).
* **Encoding Fallback**: Automatically tries UTF-8, GBK, and Latin-1 encodings.
* **Retry Logic**: Built-in retries for API calls to handle network instability.
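
The extension check, encoding fallback, and paragraph-based chunking described in this document could be sketched roughly as follows. This is a minimal illustration, not the code in `scripts/processor.py`: the function names `is_supported`, `decode_with_fallback`, and `chunk_by_paragraphs`, as well as the `max_chars` parameter and its default, are hypothetical and chosen only for this example.

```python
import re
from pathlib import Path

# Extensions and encoding order taken from the document; everything else
# below (names, max_chars) is an assumption made for illustration.
SUPPORTED_EXTENSIONS = {".pdf", ".md", ".tex", ".txt"}
FALLBACK_ENCODINGS = ("utf-8", "gbk", "latin-1")


def is_supported(path: str) -> bool:
    """Return True if the file extension is one the skill accepts."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def decode_with_fallback(raw: bytes) -> str:
    """Try each encoding in order and return the first successful decode."""
    for encoding in FALLBACK_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Not expected to run, since latin-1 maps every byte; kept as a guard.
    return raw.decode("utf-8", errors="replace")


def chunk_by_paragraphs(text: str, max_chars: int = 2000) -> list[str]:
    """Group whole paragraphs into chunks of at most max_chars characters.

    Paragraphs are never split, so a formula inside one paragraph stays
    intact; a single oversized paragraph simply becomes its own chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for paragraph in paragraphs:
        # +2 accounts for the blank line used when rejoining paragraphs.
        added = len(paragraph) + (2 if current else 0)
        if current and current_len + added > max_chars:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
            added = len(paragraph)
        current.append(paragraph)
        current_len += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting only on blank lines is the simplest way to respect the "never break a formula" goal, at the cost of uneven chunk sizes. Display math that itself contains blank lines would need extra handling, which this sketch does not attempt.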