# tfidf-search

> Implements TF-IDF based search engines for text datasets using vector space models and cosine similarity. Use when building search functionality, finding similar documents, ranking text by relevance, or working with text retrieval systems.

- Author: Kartheek Akella
- Repository: asvskartheek/awesome-claude-skills
- Version: 20260117232058
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/asvskartheek/awesome-claude-skills
- Web: https://mule.run/skillshub/@@asvskartheek/awesome-claude-skills~tfidf-search:20260117232058

---

---
name: tfidf-search
description: Implements TF-IDF based search engines for text datasets using vector space models and cosine similarity. Use when building search functionality, finding similar documents, ranking text by relevance, or working with text retrieval systems.
allowed-tools: Read, Write, Bash, Grep, Glob
---

# TF-IDF Search Engine

Implement search engines using Term Frequency-Inverse Document Frequency (TF-IDF) vectorization and cosine similarity for ranking documents by relevance to a query.

## When to Use This Skill

- Building search functionality for text datasets
- Finding similar documents or passages
- Ranking documents by relevance to a query
- Implementing information retrieval systems
- Analyzing song lyrics, articles, documents, or any text corpus

## Core Concepts

**Vector Space Model (VSM)**: Represents text as vectors where each dimension corresponds to a unique word in the corpus.

**TF-IDF Score**: Combines term frequency (how often a word appears in a document) with inverse document frequency (how unique the word is across all documents). Common words like "the" get lower scores; rare, distinctive words get higher scores.

**Cosine Similarity**: Measures the angle between two vectors to determine document similarity. Range: -1 to 1, where 1 means identical direction (most similar).

## Implementation Workflow

### Step 1: Install Required Packages

Check project instructions for package management. For uv-based projects:

```bash
uv add numpy pandas scikit-learn
```

For pip-based projects:

```bash
pip install numpy pandas scikit-learn
```

### Step 2: Prepare Your Dataset

Your dataset should be in CSV format with at least one text column containing the documents to search.

Example structure:
```
song,artist,text
"Song Title","Artist Name","lyrics text here..."
```

### Step 3: Use the Helper Script

Run the TF-IDF search implementation:

```bash
python .claude/skills/tfidf-search/scripts/tfidf_search.py <csv_file> <text_column> "<query>" [--top_k 10]
```

Parameters:
- `csv_file`: Path to your CSV dataset
- `text_column`: Name of the column containing text to search
- `query`: Search query string (in quotes)
- `--top_k`: Number of top results to return (default: 10)

**Example**:
```bash
python .claude/skills/tfidf-search/scripts/tfidf_search.py songdata.csv text "Take it easy with me, please" --top_k 10
```

### Step 4: Custom Implementation

For custom implementations or integration into existing code:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load dataset
df = pd.read_csv('your_data.csv')

# Create and fit vectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['text_column'])

# Transform query
query = "your search query"
query_vec = vectorizer.transform([query])

# Calculate similarities
results = cosine_similarity(X, query_vec)

# Get top results
top_indices = results.argsort(axis=0)[-10:][::-1].flatten()
for idx in top_indices:
    print(f"Score: {results[idx][0]:.4f} - {df.iloc[idx]['title']}")
```

## Key Implementation Details

**Query as List**: The `transform()` method expects a list of documents, even for a single query:
```python
query_vec = vectorizer.transform([query])  # Note the brackets
```

**Shape Verification**: Use `.shape` to verify dimensions:
```python
print(f"Corpus shape: {X.shape}")  # (n_documents, n_features)
print(f"Query shape: {query_vec.shape}")  # (1, n_features)
```

**Sorting Results**: Get top-k results using argsort:
```python
# For single query (results is 2D array)
top_k = 10
top_indices = results.argsort(axis=0)[-top_k:][::-1].flatten()
```

## Advanced Options

### TfidfVectorizer Parameters

Customize the vectorizer for better results:

```python
vectorizer = TfidfVectorizer(
    max_features=5000,        # Limit vocabulary size
    min_df=2,                 # Ignore terms appearing in < 2 docs
    max_df=0.8,               # Ignore terms appearing in > 80% of docs
    ngram_range=(1, 2),       # Include unigrams and bigrams
    stop_words='english'      # Remove common English words
)
```

### Handling Large Datasets

For large datasets, consider:
1. Using sparse matrix operations (scikit-learn handles this automatically)
2. Limiting vocabulary with `max_features`
3. Processing in batches if memory is constrained

### Improving Search Quality

**Preprocessing**: Clean text before vectorization:
```python
df['text'] = df['text'].str.lower()  # Lowercase
df['text'] = df['text'].str.replace('[^a-zA-Z\s]', '', regex=True)  # Remove punctuation
```

**N-grams**: Include phrases, not just single words:
```python
vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
```

**Stop Words**: Remove common words that don't help distinguish documents:
```python
vectorizer = TfidfVectorizer(stop_words='english')
```

## Common Issues

**Low Similarity Scores**: Normal for TF-IDF. Scores of 0.1-0.3 can still indicate relevant matches. Focus on relative ranking, not absolute scores.

**Out of Vocabulary**: Query words not in the training corpus get zero weight. Preprocess queries the same way as documents.

**Memory Errors**: Reduce `max_features` or process smaller batches.

## References

For implementation examples and variations, see [examples.md](examples.md).

## Performance Considerations

- **Vectorization**: O(n × m) where n = documents, m = avg words per doc
- **Query Processing**: O(m) for single query transformation
- **Similarity Calculation**: O(n) for comparing query against n documents
- **Memory**: Sparse matrices keep memory usage manageable for large vocabularies