# math-extractor
> Extracts strictly mathematical terms (Definitions, Theorems, Lemmas, Propositions, Proofs) from documents (PDF, MD, TEX, TXT), handling PDF conversion and AI-based cleaning. Use when the user wants to extract math content from a file.
- Author: Develata
- Repository: Develata/Deve-Skills
- Version: 20260129203735
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/Develata/Deve-Skills
- Web: https://mule.run/skillshub/@@Develata/Deve-Skills~math-extractor:20260129203735
---
---
name: math-extractor
description: Extracts strictly mathematical terms (Definitions, Theorems, Lemmas, Propositions, Proofs) from documents (PDF, MD, TEX, TXT), handling PDF conversion and AI-based cleaning. Use when the user wants to extract math content from a file.
---
# Math Extractor
This skill extracts mathematical definitions, theorems, lemmas, propositions, and proofs from documents.
## Input Schema
```xml
Path to the source file (pdf/md/tex/txt)
```
## Logic & Workflow
The Agent must follow this Chain of Thought (CoT):
1. **Env Check**: First, verify that `scripts/processor.py` can read the necessary API keys (MinerU and LLM) from the environment. If a key is missing, return a configuration error.
2. **Validation**: Check the file extension. If it is not `.pdf`, `.md`, `.tex`, or `.txt`, return "不支持当前文件格式" ("The current file format is not supported").
3. **Conversion**:
   * If the file is a PDF, call `convert_pdf`. The script internally uses the pre-configured MinerU key.
   * If conversion fails (or the key is missing), return "未设定好pdf转化为md的工具" ("The PDF-to-Markdown conversion tool has not been configured").
4. **Preprocessing**:
   * Call `clean_and_chunk` (implemented in `clean_content`).
   * Aggressively remove images, tables of contents, and reference lists to save tokens.
5. **Extraction (Batch AI)**:
   * Call `batch_extract_math` (implemented in `batch_extract`).
   * The script uses the pre-configured LLM credentials to process chunks in parallel.
6. **Merge & Output**:
   * Save the merged results to `{filename}_extracted.md` and return the path.
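The validation and output steps above can be sketched as follows. This is a minimal illustration, not the code in `scripts/processor.py`: the helper names (`validate_extension`, `needs_pdf_conversion`, `output_path_for`) are hypothetical, and it assumes that `{filename}` means the source file name without its extension and that the output is written next to the source file.

```python
from pathlib import Path
from typing import Optional

# Extensions listed in the Validation step.
SUPPORTED_EXTENSIONS = {".pdf", ".md", ".tex", ".txt"}

# Error string returned by the skill ("The current file format is not supported").
UNSUPPORTED_FORMAT_MSG = "不支持当前文件格式"


def validate_extension(path: str) -> Optional[str]:
    """Return the error message for an unsupported extension, else None."""
    if Path(path).suffix.lower() not in SUPPORTED_EXTENSIONS:
        return UNSUPPORTED_FORMAT_MSG
    return None


def needs_pdf_conversion(path: str) -> bool:
    """Only PDF inputs go through the conversion step."""
    return Path(path).suffix.lower() == ".pdf"


def output_path_for(path: str) -> Path:
    """Build `{filename}_extracted.md` (assumed: stem of the source, same folder)."""
    source = Path(path)
    return source.with_name(f"{source.stem}_extracted.md")
```

With these helpers, `output_path_for("papers/analysis.pdf")` yields a path ending in `analysis_extracted.md`, and `validate_extension("slides.docx")` returns the unsupported-format message.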
## Usage
To use this skill, run the Python script with the path to the source file.
**Required Environment Variables:**
* `EXTRACTION_API_KEY`: API Key for LLM (e.g., OpenAI, DeepSeek).
* `EXTRACTION_BASE_URL`: Base URL for LLM API (default: `https://api.openai.com/v1`).
**Optional Environment Variables:**
* `MINERU_API_KEY`: Required only for PDF conversion.
* `MINERU_BASE_URL`: Base URL for MinerU API (default: `https://api.mineru.com/v1`).
* `LLM_MODEL`: Model name to use (default: `gpt-4o`).
```bash
python scripts/processor.py <file_path>
```
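As a rough sketch of how the variables listed above might be resolved, the snippet below reads them from a mapping (by default `os.environ`) and applies the documented defaults. The function name `load_config`, the `need_pdf` flag, and the use of `RuntimeError` for the configuration error are assumptions for illustration; the actual script may handle this differently.

```python
import os
from typing import Dict, Mapping, Optional

# Defaults as documented in the variable list.
DEFAULTS = {
    "EXTRACTION_BASE_URL": "https://api.openai.com/v1",
    "MINERU_BASE_URL": "https://api.mineru.com/v1",
    "LLM_MODEL": "gpt-4o",
}


def load_config(
    env: Optional[Mapping[str, str]] = None, need_pdf: bool = False
) -> Dict[str, Optional[str]]:
    """Collect settings from the environment and fail early on missing keys."""
    env = os.environ if env is None else env

    # Fall back to the documented default when a variable is unset or empty.
    config: Dict[str, Optional[str]] = {
        name: env.get(name) or default for name, default in DEFAULTS.items()
    }
    config["EXTRACTION_API_KEY"] = env.get("EXTRACTION_API_KEY")
    config["MINERU_API_KEY"] = env.get("MINERU_API_KEY")

    missing = []
    if not config["EXTRACTION_API_KEY"]:
        missing.append("EXTRACTION_API_KEY")
    # The MinerU key is only needed when a PDF has to be converted.
    if need_pdf and not config["MINERU_API_KEY"]:
        missing.append("MINERU_API_KEY")
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return config
```

Passing the environment in as a mapping keeps the check easy to test without modifying the real process environment.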
## Features
* **Robust PDF Conversion**: Uses MinerU for high-quality PDF-to-Markdown conversion.
* **Smart Chunking**: Splits text by paragraphs to avoid breaking math formulas.
* **Cost Optimization**: Heuristically filters out non-math chunks to save tokens.
* **Math Protection**: Restricts HTML stripping to a whitelist of known tags, so that math inequalities (e.g., `a < b`) are not mistaken for markup and deleted.
* **Encoding Fallback**: Automatically tries UTF-8, GBK, and Latin-1 encodings.
* **Retry Logic**: Built-in retries for API calls to handle network instability.
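Three of the features above (encoding fallback, paragraph-based chunking, and retries) can be illustrated with a short sketch. All function names, the `max_chars` limit, the number of attempts, and the exponential backoff are assumptions made for this example; they are not taken from `scripts/processor.py`.

```python
import re
import time
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def decode_with_fallback(
    data: bytes, encodings: Sequence[str] = ("utf-8", "gbk", "latin-1")
) -> str:
    """Try each encoding in order and return the first successful decode."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode input with any candidate encoding")


def chunk_by_paragraphs(text: str, max_chars: int = 4000) -> List[str]:
    """Group whole paragraphs into chunks of roughly max_chars characters.

    Paragraphs are never split, so a formula inside a paragraph stays intact;
    a single paragraph longer than max_chars becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def call_with_retries(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on any exception with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

Because Latin-1 maps every byte to a character, placing it last means the fallback chain always produces some text, at the cost of possibly mis-decoding files in other encodings.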