# docx

> DOCX(.docx) creation, editing, and analysis with support for tracked changes, comments, formatting preservation, and text extraction. Use this skill when working with professional Word documents: creating new docs, editing existing docs, doing redlines/tracked changes, or extracting/inspecting content.

- Author: liubaopeng
- Repository: sdlbp/my_cursor
- Version: 20260129144140
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/sdlbp/my_cursor
- Web: https://mule.run/skillshub/@@sdlbp/my_cursor~docx:20260129144140

---

---
name: docx
description: "DOCX(.docx) creation, editing, and analysis with support for tracked changes, comments, formatting preservation, and text extraction. Use this skill when working with professional Word documents: creating new docs, editing existing docs, doing redlines/tracked changes, or extracting/inspecting content."
license: Proprietary. LICENSE.txt has complete terms
---

# DOCX creation, editing, and analysis

## Overview

A user may ask you to create, edit, or analyze the contents of a .docx file. A .docx file is essentially a ZIP archive containing XML files and other resources that you can read or edit. You have different tools and workflows available for different tasks.

## Inputs / Outputs (Cursor-friendly contract)

### Inputs to collect (ask only if missing)

- **Target file**: path to the `.docx` (or “create from scratch”).
- **Intent**: `read/analyze` | `create` | `edit` | `redline` (tracked changes) | `comments`.
- **Change policy**: preserve formatting? accept/reject existing tracked changes?
- **Verification expectation**: text-only verification (markdown diff) vs visual verification (images).

### Outputs to produce

- **Change plan**: a batched checklist (3–10 changes per batch) with location hints.
- **Artifacts** (when applicable): `current.md`, `verification.md`, and/or page images.
- **Verification report**: what was checked, and what passed/failed.
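
To make the Overview's point concrete (a .docx is a ZIP archive of XML parts), here is a minimal Python sketch using only the standard library. It builds a tiny in-memory archive that mimics the `word/document.xml` layout, then lists its parts and pulls out paragraph text. The helper names (`build_sample_docx`, `list_parts`, `paragraph_texts`) are hypothetical and are not part of this repo's scripts, and the sample archive is deliberately incomplete, so Word would not open it.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w: prefix in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_sample_docx() -> bytes:
    """Build a tiny ZIP that mimics a .docx layout (illustration only, not a valid Word file)."""
    body = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>The term is </w:t></w:r><w:r><w:t>30 days.</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph.</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", body)
    return buf.getvalue()


def list_parts(docx_bytes: bytes) -> list:
    """Return the names of the files stored inside the archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        return zf.namelist()


def paragraph_texts(docx_bytes: bytes) -> list:
    """Join the text of all <w:t> nodes in each <w:p> paragraph."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        # Assumption: input is trusted. For untrusted files, a hardened parser
        # such as defusedxml (listed under Dependencies) is the safer choice.
        root = ET.fromstring(zf.read("word/document.xml"))
    return [
        "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        for p in root.iter(f"{{{W_NS}}}p")
    ]


if __name__ == "__main__":
    sample = build_sample_docx()
    print(list_parts(sample))
    print(paragraph_texts(sample))
```

Note how one sentence is split across two runs in the sample; real documents split text the same way, which is why text has to be joined per paragraph before searching it.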
## Workflow Decision Tree

### Reading/Analyzing Content

Use the "Text extraction" or "Raw XML access" sections below.

### Creating New Document

Use the "Creating a new Word document" workflow.

### Editing Existing Document

- **Your own document + simple changes**
  Use the "Editing an existing Word document" workflow (basic OOXML editing)
- **Someone else's document**
  Use the **"Redlining workflow"** (recommended default)
- **Legal, academic, business, or government docs**
  Use the **"Redlining workflow"** (required)

## Reading and analyzing content

### Text extraction

If you only need to read a document's text, convert it to markdown using pandoc. Pandoc preserves document structure well and can show tracked changes:

```bash
# Convert document to markdown with tracked changes
pandoc --track-changes=all path-to-file.docx -o output.md
# Options: --track-changes=accept/reject/all
```

#### Dependency check + fallback

- If `pandoc` is unavailable, fall back to **Raw XML access** and extract text from `word/document.xml` (structure is noisier but works for locating content).

### Raw XML access

You need raw XML access for comments, complex formatting, document structure, embedded media, and metadata. For any of these features, unpack the document and read its raw XML contents.

#### Unpacking a file

`python ooxml/scripts/unpack.py `

#### Key file structures

* `word/document.xml` - Main document contents
* `word/comments.xml` - Comments referenced in document.xml
* `word/media/` - Embedded images and media files
* Tracked changes use `<w:ins>` (insertions) and `<w:del>` (deletions) tags

## Creating a new Word document

When creating a new Word document from scratch, use **docx-js**, which allows you to create Word documents using JavaScript/TypeScript.

### Workflow

1. **Read only what you need**: Open [`docx-js.md`](docx-js.md) and jump to sections relevant to the requested output (basic paragraphs, lists, tables, images, headers/footers).
   Avoid “read everything” unless you’re blocked.
2. Create a JavaScript/TypeScript file using `Document`, `Paragraph`, `TextRun` components (if dependencies are missing, see **Dependencies** below).
3. Export as .docx using `Packer.toBuffer()`

## Editing an existing Word document

When editing an existing Word document, prefer the repo’s **Python OOXML helpers** (see `docx/scripts/document.py` and the patterns in `ooxml.md`). These helpers provide both high-level methods for common operations and direct DOM access for complex scenarios.

### Workflow

1. **Read only what you need**: Open [`ooxml.md`](ooxml.md) and focus on:
   - Document helper/API usage (how to load/save, find nodes)
   - Tracked change patterns (`<w:ins>`, `<w:del>`, rsid handling)
   - Comments structure (`word/comments.xml`, references in `document.xml`)
2. Unpack the document: `python ooxml/scripts/unpack.py `
3. Create and run a Python script using the repo’s OOXML helpers/patterns (see relevant sections in `ooxml.md`, and `docx/scripts/document.py` for implementation).
4. Pack the final document: `python ooxml/scripts/pack.py `

## Redlining workflow for document review

This workflow allows you to plan comprehensive tracked changes using markdown before implementing them in OOXML. **CRITICAL**: For complete tracked changes, you must implement ALL changes systematically.

**Batching Strategy**: Group related changes into batches of 3-10 changes. This makes debugging manageable while maintaining efficiency. Test each batch before moving to the next.

**Principle: Minimal, Precise Edits**

When implementing tracked changes, only mark text that actually changes. Repeating unchanged text makes edits harder to review and appears unprofessional. Break replacements into: [unchanged text] + [deletion] + [insertion] + [unchanged text].
Preserve the original run's RSID for unchanged text by extracting the `<w:r>` element from the original and reusing it.

Example - Changing "30 days" to "60 days" in a sentence:

```python
# BAD - Replaces entire sentence
'<w:del><w:r><w:delText>The term is 30 days.</w:delText></w:r></w:del><w:ins><w:r><w:t>The term is 60 days.</w:t></w:r></w:ins>'

# GOOD - Only marks what changed, preserves original <w:r> for unchanged text
'<w:r><w:t>The term is </w:t></w:r><w:del><w:r><w:delText>30</w:delText></w:r></w:del><w:ins><w:r><w:t>60</w:t></w:r></w:ins><w:r><w:t> days.</w:t></w:r>'
```

### Tracked changes workflow

1. **Get markdown representation**: Convert document to markdown with tracked changes preserved:

   ```bash
   pandoc --track-changes=all path-to-file.docx -o current.md
   ```

2. **Identify and group changes**: Review the document and identify ALL changes needed, organizing them into logical batches:

   **Location methods** (for finding changes in XML):
   - Section/heading numbers (e.g., "Section 3.2", "Article IV")
   - Paragraph identifiers if numbered
   - Grep patterns with unique surrounding text
   - Document structure (e.g., "first paragraph", "signature block")
   - **DO NOT use markdown line numbers** - they don't map to XML structure

   **Batch organization** (group 3-10 related changes per batch):
   - By section: "Batch 1: Section 2 amendments", "Batch 2: Section 5 updates"
   - By type: "Batch 1: Date corrections", "Batch 2: Party name changes"
   - By complexity: Start with simple text replacements, then tackle complex structural changes
   - Sequential: "Batch 1: Pages 1-3", "Batch 2: Pages 4-6"

3. **Read documentation and unpack**:
   - Read the necessary sections of [`ooxml.md`](ooxml.md) (Document helper/API + tracked change patterns). Avoid “read everything” unless you’re blocked.
   - **Unpack the document**: `python ooxml/scripts/unpack.py `
   - **Note the suggested RSID**: The unpack script will suggest an RSID to use for your tracked changes. Copy this RSID for use in step 4b.

4. **Implement changes in batches**: Group changes logically (by section, by type, or by proximity) and implement them together in a single script.
   This approach:
   - Makes debugging easier (smaller batch = easier to isolate errors)
   - Allows incremental progress
   - Maintains efficiency (batch size of 3-10 changes works well)

   **Suggested batch groupings:**
   - By document section (e.g., "Section 3 changes", "Definitions", "Termination clause")
   - By change type (e.g., "Date changes", "Party name updates", "Legal term replacements")
   - By proximity (e.g., "Changes on pages 1-3", "Changes in first half of document")

   For each batch of related changes:

   **a. Map text to XML**: Grep for text in `word/document.xml` to verify how text is split across `<w:r>` elements.

   **b. Create and run script**: Use `get_node` to find nodes, implement changes, then `doc.save()`. See the **"Document Library"** section in ooxml.md for patterns.

   **Note**: Always grep `word/document.xml` immediately before writing a script to get current line numbers and verify text content. Line numbers change after each script run.

5. **Pack the document**: After all batches are complete, convert the unpacked directory back to .docx:

   ```bash
   python ooxml/scripts/pack.py unpacked reviewed-document.docx
   ```

6. **Final verification**: Do a comprehensive check of the complete document:
   - Convert final document to markdown:

     ```bash
     pandoc --track-changes=all reviewed-document.docx -o verification.md
     ```

   - Verify ALL changes were applied correctly:

     ```bash
     grep "original phrase" verification.md  # Should NOT find it outside deletion markup
     grep "replacement phrase" verification.md  # Should find it
     ```

   - Check that no unintended changes were introduced

## Converting Documents to Images

To visually analyze Word documents, convert them to images using a two-step process:

1. **Convert DOCX to PDF**:

   ```bash
   soffice --headless --convert-to pdf document.docx
   ```

2. **Convert PDF pages to JPEG images**:

   ```bash
   pdftoppm -jpeg -r 150 document.pdf page
   ```

This creates files like `page-1.jpg`, `page-2.jpg`, etc.
Options:
- `-r 150`: Sets resolution to 150 DPI (adjust for quality/size balance)
- `-jpeg`: Output JPEG format (use `-png` for PNG if preferred)
- `-f N`: First page to convert (e.g., `-f 2` starts from page 2)
- `-l N`: Last page to convert (e.g., `-l 5` stops at page 5)
- `page`: Prefix for output files

Example for specific range:

```bash
pdftoppm -jpeg -r 150 -f 2 -l 5 document.pdf page  # Converts only pages 2-5
```

## Code Style Guidelines

**IMPORTANT**: When generating code for DOCX operations:
- Write concise code
- Avoid verbose variable names and redundant operations
- Avoid unnecessary print statements

## Dependencies

Required dependencies (install if not available). Prefer platform-appropriate commands:

### macOS (Homebrew)

- **pandoc**: `brew install pandoc` (text extraction)
- **LibreOffice**: `brew install --cask libreoffice` (DOCX → PDF via `soffice`)
- **Poppler**: `brew install poppler` (PDF → images via `pdftoppm`)
- **docx (JS library)**: prefer project-local install (e.g. `npm add docx`) over global; use global only if required.

### Debian/Ubuntu

- **pandoc**: `sudo apt-get install pandoc`
- **LibreOffice**: `sudo apt-get install libreoffice`
- **Poppler**: `sudo apt-get install poppler-utils`

### Python

- **defusedxml**: `pip install defusedxml` (secure XML parsing)

### Quick checks

Use these to confirm availability before relying on a tool:

```bash
command -v pandoc
command -v soffice
command -v pdftoppm
python -c "import defusedxml; print('defusedxml ok')"
```
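
When a workflow is already driven from a Python script, the same quick checks can be run without shelling out. The following is a minimal sketch using only the standard library; `missing_dependencies` is a hypothetical helper name, not something provided by this repo.

```python
import importlib.util
import shutil

# Tools and modules named in the Dependencies section above
CLI_TOOLS = ("pandoc", "soffice", "pdftoppm")
PY_MODULES = ("defusedxml",)


def missing_dependencies(tools=CLI_TOOLS, modules=PY_MODULES):
    """Return CLI tools not found on PATH plus top-level Python modules that cannot be located."""
    missing = [t for t in tools if shutil.which(t) is None]
    missing += [m for m in modules if importlib.util.find_spec(m) is None]
    return missing


if __name__ == "__main__":
    gaps = missing_dependencies()
    print("missing: " + ", ".join(gaps) if gaps else "all dependencies found")
```

Returning a list rather than exiting lets the caller decide whether to install the missing tool or switch to a fallback such as raw XML access when `pandoc` is absent.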