# ehr-semantic-compressor

> AI-powered EHR summarization using Transformer architecture to extract key clinical information from lengthy medical records

- Author: Rowtion
- Repository: aipoch/skills-collection
- Version: 20260210095832
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-10
- Source: https://github.com/aipoch/skills-collection
- Web: https://mule.run/skillshub/@@aipoch/skills-collection~ehr-semantic-compressor:20260210095832

---

---
name: ehr-semantic-compressor
description: AI-powered EHR summarization using Transformer architecture to extract
  key clinical information from lengthy medical records
version: 1.0.0
category: Clinical
tags: []
author: AIPOCH
license: MIT
status: Draft
risk_level: Medium
skill_type: Tool/Script
owner: AIPOCH
reviewer: ''
last_updated: '2026-02-06'
---

# EHR Semantic Compressor

## Overview

AI-powered EHR summarization using Transformer architecture to extract key clinical information from lengthy medical records. This skill processes lengthy Electronic Health Record (EHR) documents and generates structured, clinically accurate summaries.

**Technical Difficulty**: High

## When to Use

- Input contains lengthy EHR documents (1600+ words) requiring summarization
- Clinical records need structured extraction of key information
- Quick review of patient history, medications, allergies, or diagnoses is needed
- Medical documentation requires compression while maintaining accuracy

## Core Features

1. **Fast Processing**: Process lengthy EHR documents (1600+ words) in 10-20 seconds
2. **Structured Summaries**: Generate bullet-point summaries (200-300 words)
3. **Critical Information Extraction**:
   - Patient allergies and adverse reactions
   - Family medical history
   - Current and past medications
   - Diagnoses and conditions
   - Vital signs and lab results
   - Procedures and surgeries
4. **Clinical Accuracy**: Maintains completeness of medical information

## Usage

### Basic Usage

```bash
python scripts/main.py --input ehr_document.txt --output summary.json
```

### Input Format

```json
{
  "ehr_text": "Full EHR document text...",
  "max_length": 300,
  "extract_sections": ["allergies", "medications", "diagnoses", "family_history"]
}
```

### Output Format

```json
{
  "status": "success",
  "data": {
    "summary": "Structured bullet-point summary...",
    "extracted_sections": {
      "allergies": [...],
      "medications": [...],
      "diagnoses": [...],
      "family_history": [...]
    },
    "metadata": {
      "original_length": 2500,
      "summary_length": 280,
      "compression_ratio": 0.89
    }
  }
}
```

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| ehr_text | string | Yes | Full EHR document text |
| max_length | number | No | Maximum summary length in words (default: 300) |
| extract_sections | array | No | Sections to extract (default: all) |

## Technical Details

### Architecture

- **Base Model**: Transformer-based encoder-decoder architecture
- **Medical Domain Adaptation**: Fine-tuned on clinical text corpora
- **Section Extraction**: Rule-based + ML hybrid approach for structured data
- **Processing Pipeline**: Text segmentation -> Summarization -> Section extraction -> Output formatting

### Dependencies

See `references/requirements.txt` for complete list.

Key dependencies:
- transformers >= 4.30.0
- torch >= 2.0.0
- spacy >= 3.6.0
- scispacy >= 0.5.3

### Performance

- **Processing Time**: 10-20 seconds for 1600+ word documents
- **Memory**: Requires ~2GB RAM
- **Output Length**: 200-300 words (configurable)
- **Compression Ratio**: ~85-90%

## References

- `references/requirements.txt` - Python dependencies
- `references/guidelines.md` - Clinical summarization guidelines
- `references/sample_input.json` - Example input format
- `references/sample_output.json` - Example output format

## Safety & Compliance

- No external API calls or service dependencies
- All processing performed locally
- No patient data transmitted outside the system
- Error messages are semantic and do not expose technical details

## Testing

Run unit tests:
```bash
cd scripts
python test_main.py
```

## Error Handling

All errors return semantic messages:

```json
{
  "status": "error",
  "error": {
    "type": "input_validation_error",
    "message": "EHR text is empty or too short",
    "suggestion": "Provide EHR text with at least 100 words"
  }
}
```

## Risk Assessment

| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

## Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
## Prerequisites

```bash
# Python dependencies
pip install -r requirements.txt
```

## Evaluation Criteria

### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time

## Lifecycle Status

- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**: 
  - Performance optimization
  - Additional feature support