# mteb-retrieve

> This skill provides guidance for semantic similarity retrieval tasks using embedding models (e.g., MTEB benchmarks, document ranking). It should be used when computing embeddings for documents/queries, ranking documents by similarity, or identifying top-k similar items. Covers data preprocessing, model selection, similarity computation, and result verification.

- Author: Cameron
- Repository: letta-ai/skills
- Version: 20260121103813
- Stars: 49
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/letta-ai/skills
- Web: https://mule.run/skillshub/@@letta-ai/skills~mteb-retrieve:20260121103813

---

---
name: mteb-retrieve
description: This skill provides guidance for semantic similarity retrieval tasks using embedding models (e.g., MTEB benchmarks, document ranking). It should be used when computing embeddings for documents/queries, ranking documents by similarity, or identifying top-k similar items. Covers data preprocessing, model selection, similarity computation, and result verification.
---

# MTEB Retrieve

## Overview

This skill guides semantic similarity retrieval tasks where documents must be ranked by their similarity to a query using embedding models. These tasks typically involve loading documents, computing embeddings, calculating similarity scores, and identifying documents at specific ranks.

## Workflow

### Step 1: Data Inspection and Preprocessing

Before computing embeddings, thoroughly inspect the input data format:

1. **Examine raw file contents** - Read a sample of lines to understand the actual format
2. **Identify formatting artifacts** - Look for:
   - Line number prefixes (e.g., `1→`, `2→`, `11→`)
   - Index markers or delimiters
   - Whitespace padding or alignment characters
   - Header rows or metadata lines
3. **Clean the data** - Remove any non-semantic content:
   - Strip line numbers and prefixes using regex (e.g., `re.sub(r'^\s*\d+→', '', line)`)
   - Remove leading/trailing whitespace
   - Filter empty lines
4. **Validate preprocessing** - Print sample cleaned documents to verify they contain only semantic content

Example preprocessing pattern:

```python
import re

def clean_line(line):
    # Remove a line-number prefix such as "  1→" or "11→" (arrow or tab delimiter)
    cleaned = re.sub(r'^\s*\d+[→\t]', '', line)
    return cleaned.strip()

# Clean each line once and drop lines that are empty after cleaning
documents = [doc for doc in map(clean_line, raw_lines) if doc]
```

### Step 2: Model Selection

Select an appropriate embedding model for the content language and domain:

1. **Check model language** - Models often have language indicators in their names:
   - `zh` = Chinese (e.g., `bge-small-zh-v1.5`)
   - `en` = English (e.g., `bge-small-en-v1.5`)
   - No suffix often means multilingual or English; check the model card to confirm
2. **Match model to content** - Using a Chinese-optimized model for English text (or vice versa) produces suboptimal embeddings
3. **Consider model size** - Larger models generally produce better embeddings but are slower

### Step 3: Embedding Computation

When computing embeddings:

1. **Normalize embeddings** - Use `normalize_embeddings=True` so that the dot product of two embeddings equals their cosine similarity
2. **Batch processing** - For large document sets, process in batches to manage memory
3. **Verify dimensions** - Confirm embedding dimensions match expectations for the model

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('model-name')

# Encode documents and query with the same model and the same normalization
doc_embeddings = model.encode(documents, normalize_embeddings=True)
query_embedding = model.encode(query, normalize_embeddings=True)
```

### Step 4: Similarity Computation and Ranking

1. **Compute similarities** - Use dot product for normalized embeddings (equivalent to cosine similarity)
2. **Handle ties** - Be aware that identical similarity scores produce arbitrary ordering
3. **Use correct indexing** - For k-th highest, use index `k-1` after sorting in descending order

```python
import numpy as np

similarities = np.dot(doc_embeddings, query_embedding)
sorted_indices = np.argsort(similarities)[::-1]  # Descending order

# For 5th highest: index 4 (0-indexed)
fifth_highest_idx = sorted_indices[4]
fifth_highest_doc = documents[fifth_highest_idx]
```

### Step 5: Result Verification

Before writing final results, verify correctness:

1. **Print document count** - Confirm the expected number of documents was loaded
2. **Show sample documents** - Display the first few cleaned documents to verify preprocessing
3. **Display top-k results** - Print at least the top 5-10 documents with their similarity scores
4. **Cross-check output format** - Ensure the output contains only the semantic content, not formatting artifacts

```python
# Verification checklist
print(f"Total documents: {len(documents)}")
print(f"Sample document: {documents[0][:100]}...")
print("\nTop 10 by similarity:")
for i in range(min(10, len(sorted_indices))):
    idx = sorted_indices[i]
    print(f"  {i+1}. [{similarities[idx]:.4f}] {documents[idx][:50]}...")
```

## Common Pitfalls

### Data Format Issues

- **Line number prefixes** - Input files often include line numbers (e.g., `1→Text`) that corrupt embeddings if not removed
- **Invisible characters** - Watch for tabs, non-breaking spaces, or Unicode formatting characters
- **Mixed encodings** - Explicitly specify file encoding (`encoding='utf-8'`)

### Model Mismatches

- **Language mismatch** - Using a language-specific model on content in a different language
- **Version confusion** - Ensure the model revision matches the expected behavior

### Indexing Errors

- **Off-by-one errors** - k-th highest uses index `k-1` in 0-indexed arrays
- **Original vs sorted indices** - Track the mapping between sorted positions and original document indices

### Verification Gaps

- **No sanity checks** - Always verify document count, sample content, and score distribution
- **Missing tie handling** - Document when ties exist and how they affect results
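
The indexing and tie-handling pitfalls above can be guarded against with a small helper. What follows is a minimal sketch, not part of the original skill: the function names `rank_descending` and `kth_highest`, the tolerance `tol=1e-6`, and the toy scores are all illustrative assumptions. It uses plain Python lists so it runs without dependencies, ranks by score with ties broken by original index, and reports whether the requested rank sits next to a tied score.

```python
def rank_descending(scores):
    """Return original indices ordered by score, highest first.

    Ties are broken by original index (lowest first) so the ordering
    is reproducible across runs.
    """
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))


def kth_highest(scores, k, tol=1e-6):
    """Return (original_index, score, tied) for the k-th highest score.

    k is 1-indexed. `tied` is True when a neighbouring rank has a score
    within `tol` (an assumed tolerance), meaning the k-th position is
    not uniquely determined.
    """
    if not 1 <= k <= len(scores):
        raise ValueError(f"k must be between 1 and {len(scores)}, got {k}")
    order = rank_descending(scores)
    pos = k - 1  # k-th highest sits at 0-indexed position k-1
    idx = order[pos]
    neighbours = [order[p] for p in (pos - 1, pos + 1) if 0 <= p < len(order)]
    tied = any(abs(scores[n] - scores[idx]) <= tol for n in neighbours)
    return idx, scores[idx], tied


# Toy scores standing in for real similarities (hypothetical values).
scores = [0.10, 0.90, 0.50, 0.90, 0.30]
print(rank_descending(scores))   # original indices, best first
print(kth_highest(scores, 2))    # rank 2 shares its score with rank 1
print(kth_highest(scores, 3))    # rank 3 is unambiguous
```

With NumPy arrays, `np.argsort(-similarities, kind='stable')` should give the same ordering as `rank_descending`, since a stable sort keeps tied entries in their original order. When `tied` is true, it is worth recording that in the output rather than silently picking one document.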