# plagiarism-checker-pre-screener

> **Use when:** The user provides text or a document and asks to check originality, detect plagiarism, assess similarity, or rewrite high-duplicate content.
>
> **Triggers:** "check plagiarism", "originality check", "similarity detection", "改写重复内容", "降重", "查重", "原创性检测", "抄袭检查"
>
> **Input:** Text content or document (txt, md; docx supported via text extraction)
>
> **Output:** Originality score, highlighted duplicate/similar paragraphs, paraphrasing suggestions

- Author: Rowtion
- Repository: aipoch/skills-collection
- Version: 20260210095832
- Last Updated: 2026-02-10
- Source: https://github.com/aipoch/skills-collection
- Web: https://mule.run/skillshub/@@aipoch/skills-collection~plagiarism-checker-pre-screener:20260210095832

---

---
name: plagiarism-checker-pre-screener
description: "Use when: User provides text/document and asks to check originality,\
  \ \ndetect plagiarism, assess similarity, or rewrite high-duplicate content.\nTriggers:\
  \ \"check plagiarism\", \"originality check\", \"similarity detection\",\n\"改写重复内容\"\
  , \"降重\", \"查重\", \"原创性检测\", \"抄袭检查\"\nInput: Text content or document (txt, md,\
  \ docx support via text extraction)\nOutput: Originality score, highlighted duplicate/similar\
  \ paragraphs, paraphrasing suggestions"
version: 1.0.0
category: Research
tags: []
author: AIPOCH
license: MIT
status: Draft
risk_level: Medium
skill_type: Tool/Script
owner: AIPOCH
reviewer: ''
last_updated: '2026-02-06'
---

# Plagiarism Checker Pre-Screener

Pre-screens text for potential plagiarism by detecting similarity patterns and providing paraphrasing suggestions for high-duplicate sections.

## Technical Difficulty: High ⚠️
> **AI self-acceptance status**: Manual review required
>
> This skill uses advanced NLP techniques. Review the results manually before submission.

## Features

1. **Text Similarity Detection**: Identifies potentially plagiarized or highly similar text segments
2. **Originality Scoring**: Provides overall originality percentage (0-100%)
3. **Paraphrasing Suggestions**: Offers AI-powered rewriting for flagged sections
4. **Segment Analysis**: Breaks text into sentences/paragraphs for granular checking
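
The segment analysis step can be pictured with a short sketch. This is not the code in `scripts/main.py`; `split_segments` is a hypothetical helper, and the punctuation-based sentence rule is a deliberately naive assumption (it will also split after abbreviations such as "e.g.").

```python
import re


def split_segments(text: str, unit: str = "sentence") -> list[str]:
    """Split text into analysis units (hypothetical helper, not the skill's API)."""
    if unit == "paragraph":
        # Paragraphs are separated by one or more blank lines.
        parts = re.split(r"\n\s*\n", text)
    elif unit == "sentence":
        # Naive rule: break after . ! ? followed by whitespace,
        # or directly after the CJK full stops 。！？
        parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text)
    else:
        raise ValueError(f"unknown unit: {unit!r}")
    # Drop empty pieces and surrounding whitespace.
    return [p.strip() for p in parts if p and p.strip()]


if __name__ == "__main__":
    for segment in split_segments("First claim. Second claim! Third claim?"):
        print(segment)
```

The two `unit` values mirror the `sentence`/`paragraph` choices of the `--segments` parameter.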

## Usage

### Basic Check
```bash
python scripts/main.py --input "Your text here" --threshold 0.75
```

### File Analysis
```bash
python scripts/main.py --file document.txt --output report.json
```

### With Paraphrasing
```bash
python scripts/main.py --input "text" --paraphrase --style academic
```

## Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `--input` | string | - | Direct text input (alternative to --file) |
| `--file` | path | - | Path to text file to analyze |
| `--threshold` | float | 0.70 | Similarity threshold (0.0-1.0) for flagging |
| `--paraphrase` | flag | false | Enable paraphrasing suggestions |
| `--style` | string | neutral | Paraphrasing style: academic/formal/casual/neutral |
| `--output` | path | stdout | Output file path (JSON format) |
| `--segments` | string | sentence | Analysis unit: sentence/paragraph |
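
As an illustration of the table above, the options could be declared with `argparse` roughly as follows. The real parser in `scripts/main.py` may differ. Two points here are assumptions rather than documented behavior: `--input` and `--file` are treated as mutually exclusive, and the threshold is range-checked by a hypothetical `threshold_value` helper.

```python
import argparse


def threshold_value(raw: str) -> float:
    """Parse a similarity threshold and enforce the documented 0.0-1.0 range."""
    value = float(raw)
    if not 0.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError("threshold must be between 0.0 and 1.0")
    return value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="main.py")
    # Assumption: exactly one of --input / --file must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--input", help="Direct text input")
    source.add_argument("--file", help="Path to text file to analyze")
    parser.add_argument("--threshold", type=threshold_value, default=0.70,
                        help="Similarity threshold for flagging")
    parser.add_argument("--paraphrase", action="store_true",
                        help="Enable paraphrasing suggestions")
    parser.add_argument("--style", default="neutral",
                        choices=["academic", "formal", "casual", "neutral"])
    parser.add_argument("--output", default=None,
                        help="Output file path (JSON); stdout when omitted")
    parser.add_argument("--segments", default="sentence",
                        choices=["sentence", "paragraph"])
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```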

## Output Format

```json
{
  "originality_score": 85.5,
  "total_segments": 12,
  "flagged_segments": 2,
  "segments": [
    {
      "index": 1,
      "text": "Original sentence text...",
      "similarity_score": 0.92,
      "flagged": true,
      "paraphrase_suggestion": "Rewritten version..."
    }
  ],
  "summary": "Text shows high originality with minor flagged sections"
}
```
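
A report in this shape is straightforward to post-process. The sketch below pulls out the flagged segments. `SAMPLE_REPORT` is made-up illustrative data that follows the field names shown above, and `flagged_segments` and `format_line` are hypothetical helpers, not part of the skill.

```python
import json

# Illustrative data only; field names follow the documented output format.
SAMPLE_REPORT = """
{
  "originality_score": 85.5,
  "total_segments": 2,
  "flagged_segments": 1,
  "segments": [
    {"index": 0, "text": "An unflagged sentence.",
     "similarity_score": 0.12, "flagged": false},
    {"index": 1, "text": "Original sentence text...",
     "similarity_score": 0.92, "flagged": true,
     "paraphrase_suggestion": "Rewritten version..."}
  ],
  "summary": "Text shows high originality with minor flagged sections"
}
"""


def flagged_segments(report: dict) -> list[dict]:
    """Return only the segments whose 'flagged' field is true."""
    return [seg for seg in report.get("segments", []) if seg.get("flagged")]


def format_line(segment: dict) -> str:
    """One-line summary: index, similarity score, and text."""
    return f"[{segment['index']}] {segment['similarity_score']:.2f} {segment['text']}"


if __name__ == "__main__":
    report = json.loads(SAMPLE_REPORT)
    print(f"Originality: {report['originality_score']}%")
    for seg in flagged_segments(report):
        print(format_line(seg))
```

To process a real run, read the file written by `--output` with `json.load` in place of the inline sample.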

## Implementation Notes

- Uses TF-IDF + Cosine Similarity for local similarity detection
- Employs semantic embeddings for meaning-based comparison
- Paraphrasing uses transformer-based models
- No external API calls required; runs locally
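
To make the first note concrete, here is a minimal standard-library sketch of TF-IDF weighting with cosine similarity. It is an illustration under stated assumptions, not the skill's implementation: it compares each segment against a small local list of reference texts, uses a smoothed IDF, flags a segment when its best score is at or above the threshold, and leaves out the embedding-based comparison entirely. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens; a simple stand-in for real preprocessing."""
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Build sparse TF-IDF vectors (term -> weight) for a list of documents."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF so that terms present in every document keep a nonzero weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(tokens).items()}
            for tokens in tokenized]


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def flag_segments(segments: list[str], references: list[str],
                  threshold: float = 0.70) -> list[dict]:
    """Score each segment by its best match among the reference texts."""
    vectors = tfidf_vectors(segments + references)
    seg_vecs = vectors[:len(segments)]
    ref_vecs = vectors[len(segments):]
    results = []
    for i, (text, vec) in enumerate(zip(segments, seg_vecs)):
        best = max((cosine(vec, ref) for ref in ref_vecs), default=0.0)
        results.append({
            "index": i,  # assumption: 0-based position in the input list
            "text": text,
            "similarity_score": round(best, 2),
            "flagged": best >= threshold,
        })
    return results


if __name__ == "__main__":
    refs = ["The quick brown fox jumps over the lazy dog."]
    segs = ["The quick brown fox jumps over the lazy dog.",
            "Completely unrelated sentence about databases."]
    for row in flag_segments(segs, refs):
        print(row)
```

TF-IDF rewards shared vocabulary, so it catches near-verbatim overlap but not heavily reworded passages; that gap is what the meaning-based comparison in the second note is meant to address.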

## References

- `references/algorithm.md` - Technical algorithm details
- `references/paraphrasing_guide.md` - Paraphrasing methodology

## Limitations

1. Cannot access external databases; comprehensive checking requires an internet search, which this tool does not perform
2. Detects local similarity only, so it will not catch plagiarism from external sources
3. Paraphrasing quality depends on input text complexity
4. Processing time increases with document length

## Safety & Privacy

- All processing is local - no text sent to external APIs
- Suitable for sensitive/confidential documents
- No data retention after analysis completes

## Risk Assessment

| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

## Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
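
The path-related checklist items (no `../` traversal, output restricted to the workspace) could be enforced with a check along these lines. This is a sketch only; `resolve_inside` is a hypothetical helper, and the workspace directory name is an assumption.

```python
from pathlib import Path


def resolve_inside(workspace: str, user_path: str) -> Path:
    """Resolve user_path relative to workspace; reject anything that escapes it."""
    root = Path(workspace).resolve()
    # resolve() collapses ".." components and follows existing symlinks.
    candidate = (root / user_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"path escapes workspace: {user_path!r}")
    return candidate


if __name__ == "__main__":
    print(resolve_inside("workspace", "reports/report.json"))
    try:
        resolve_inside("workspace", "../secret.txt")
    except ValueError as exc:
        print("rejected:", exc)
```

Absolute paths are covered as well: joining an absolute `user_path` onto `root` yields that absolute path unchanged, so it is rejected unless it already lies inside the workspace.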

## Prerequisites

```bash
# Python dependencies
pip install -r requirements.txt
```

## Evaluation Criteria

### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time

## Lifecycle Status

- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**: 
  - Performance optimization
  - Additional feature support