# region-aware-matching

> Spatial region-aware cell matching for CODEX/scRNAseq integration

- Author: smith6jt-cop
- Repository: smith6jt-cop/Skills_Registry
- Version: 20260121184736
- Stars: 1
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/smith6jt-cop/Skills_Registry
- Web: https://mule.run/skillshub/@@smith6jt-cop/Skills_Registry~region-aware-matching:20260121184736

---

---
name: region-aware-matching
description: "Spatial region-aware cell matching for CODEX/scRNAseq integration"
author: smith6jt
date: 2024-12-27
---

# Region-Aware Matching - Research Notes

## Experiment Overview
| Item | Details |
|------|---------|
| **Date** | 2024-12-27 |
| **Goal** | Incorporate tissue heterogeneity (B cell follicles, T cell zones, etc.) into MaxFuse matching |
| **Environment** | Python 3.12, MaxFuse, scanpy, sklearn |
| **Status** | Implemented (3 approaches) |

## Context
Standard MaxFuse treats tissue as homogeneous during cell matching. In reality, tissues such as the spleen have distinct regions (B cell follicles, T cell zones, red pulp), and each cell type should preferentially match to cells in the regions where it is expected. Without region awareness, B cells from scRNAseq can be incorrectly matched to CODEX cells in T cell zones.

## Verified Workflow

### Three-Pronged Approach
1. **Prior-weighted distance interpolation** - Encode biological expectations
2. **Neighborhood-augmented features** - Add spatial context to CODEX cells
3. **Post-hoc filtering** - Remove biologically implausible matches

### Key Functions Added to spatial_utils.py

```python
def detect_tissue_regions(locations, marker_expression, marker_names,
                          marker_to_region, n_neighbors=30, min_cluster_size=10,
                          eps_quantile=0.1):
    """
    Auto-detect tissue regions using:
    1. Classify cells by dominant marker expression (z-score > 0.5)
    2. Spatially cluster cells of each type using DBSCAN
    3. Assign region labels based on marker identity + spatial coherence
    """

def compute_region_celltype_prior(celltype_to_region_weights, rna_labels,
                                   spatial_regions, default_weight=1.0):
    """
    Build prior distance matrix for interpolation.
    Lower weight = more compatible (e.g., 0.1 for B cells in B follicles)
    Higher weight = less compatible (e.g., 5.0 for B cells in red pulp)
    """

def compute_neighborhood_augmented_features(features, locations, labels,
                                             n_neighbors=15, wt_on_features=0.7):
    """
    Augment features with spatial neighborhood composition.
    Cells near B cell follicles will have high B_cell neighbor counts.
    """
```
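
As a rough illustration of the third function, the sketch below appends each cell's k-nearest-neighbor label composition to its feature vector. It is not the code in spatial_utils.py: the name `neighborhood_augmented_features_sketch`, the brute-force neighbor search, and the way `wt_on_features` scales the two blocks before concatenation are all assumptions made for this example.

```python
import numpy as np

def neighborhood_augmented_features_sketch(features, locations, labels,
                                           n_neighbors=15, wt_on_features=0.7):
    """Hypothetical sketch: append k-NN label composition to each cell's features."""
    features = np.asarray(features, dtype=float)
    locations = np.asarray(locations, dtype=float)
    labels = np.asarray(labels)
    n_cells = features.shape[0]
    k = min(n_neighbors, n_cells - 1)

    # Brute-force pairwise Euclidean distances. This is O(n^2) in memory,
    # so a KD-tree or similar index is needed for real CODEX-sized data.
    diff = locations[:, None, :] - locations[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # exclude the cell itself
    nbr_idx = np.argsort(dist, axis=1)[:, :k]

    # Fraction of neighbors carrying each label, one column per label.
    unique_labels, label_codes = np.unique(labels, return_inverse=True)
    composition = np.zeros((n_cells, unique_labels.size))
    for code in range(unique_labels.size):
        composition[:, code] = (label_codes[nbr_idx] == code).mean(axis=1)

    # Assumed weighting: scale expression and neighborhood blocks, then concatenate.
    augmented = np.hstack([wt_on_features * features,
                           (1.0 - wt_on_features) * composition])
    return augmented, unique_labels
```

Using neighbor fractions rather than raw counts keeps the neighborhood block on a 0 to 1 scale regardless of `n_neighbors`, which makes `wt_on_features` easier to interpret.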

### Usage Pattern
```python
# 1. Detect tissue regions from CODEX markers
marker_to_region = {
    'CD20': 'B_follicle',
    'CD3e': 'T_zone',
    'CD68': 'Red_pulp'
}
regions, region_info = detect_tissue_regions(
    locations, marker_expression, marker_names, marker_to_region
)

# 2. Define prior weights
celltype_to_region_weights = {
    'B_cell': {'B_follicle': 0.1, 'T_zone': 2.0, 'Red_pulp': 5.0},
    'T_cell': {'B_follicle': 2.0, 'T_zone': 0.1, 'Red_pulp': 3.0},
}

# 3. Compute prior distance matrix
prior_dist = compute_region_celltype_prior(
    celltype_to_region_weights, rna_labels, regions
)

# 4. Interpolate with embedding distance
# final_dist = (1 - wt_on_base_dist) * embed_dist + wt_on_base_dist * prior_dist
```
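
Steps 3 and 4 can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the spatial_utils.py implementation: `region_celltype_prior_sketch` and `interpolate_distances` are hypothetical names, and the optional `log1p` transform is only one possible reading of how the prior weights might be log-transformed.

```python
import numpy as np

def region_celltype_prior_sketch(celltype_to_region_weights, rna_labels,
                                 spatial_regions, default_weight=1.0,
                                 log_transform=False):
    """Hypothetical sketch: (n_rna, n_codex) matrix of prior distances."""
    rna_labels = np.asarray(rna_labels)
    spatial_regions = np.asarray(spatial_regions)
    # Pairs with no stated expectation fall back to the neutral default weight.
    prior = np.full((rna_labels.size, spatial_regions.size), float(default_weight))
    for cell_type, region_weights in celltype_to_region_weights.items():
        rows = np.flatnonzero(rna_labels == cell_type)
        for region, weight in region_weights.items():
            cols = np.flatnonzero(spatial_regions == region)
            prior[np.ix_(rows, cols)] = weight
    # Assumption: log1p is one way to compress the 0.1 to 5.0 weight range.
    return np.log1p(prior) if log_transform else prior

def interpolate_distances(embed_dist, prior_dist, wt_on_base_dist=0.3):
    """final = (1 - w) * embedding distance + w * prior distance."""
    embed_dist = np.asarray(embed_dist, dtype=float)
    prior_dist = np.asarray(prior_dist, dtype=float)
    if embed_dist.shape != prior_dist.shape:
        raise ValueError("embed_dist and prior_dist must have the same shape")
    return (1.0 - wt_on_base_dist) * embed_dist + wt_on_base_dist * prior_dist
```

Because the result is a plain weighted sum, the two matrices need to be on comparable scales for `wt_on_base_dist` to mean what it says. If the embedding distances are much larger or smaller than the prior weights, rescaling one of them first is worth considering.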

## Failed Attempts (Critical)

| Attempt | Why it Failed | Lesson Learned |
|---------|---------------|----------------|
| Hard region filtering | Too restrictive, lost valid matches | Use soft priors instead of hard constraints |
| Simple k-means on locations | Didn't capture irregular region shapes | DBSCAN better for tissue regions |
| Global marker thresholds | Batch effects across tissue | Use z-score normalization per marker |
| Matching only within regions | Some cell types span regions | Allow cross-region matches with penalty |

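The lesson about per-marker z-scores can be illustrated with a short sketch. It standardizes each marker across cells and assigns a cell to the region of its strongest marker only when that z-score exceeds the threshold. The function name `classify_by_dominant_marker` and the `'Unassigned'` label are hypothetical, chosen for this example only.

```python
import numpy as np

def classify_by_dominant_marker(marker_expression, marker_names,
                                marker_to_region, z_threshold=0.5):
    """Hypothetical sketch: label cells by their highest z-scored region marker."""
    expr = np.asarray(marker_expression, dtype=float)

    # Z-score each marker (column) separately so a marker with globally
    # high intensity does not win purely because of its scale.
    mean = expr.mean(axis=0)
    std = expr.std(axis=0)
    std[std == 0] = 1.0  # constant markers get z = 0 instead of NaN
    z = (expr - mean) / std

    # Keep only the markers that map to a region.
    cols = [i for i, name in enumerate(marker_names) if name in marker_to_region]
    region_names = np.array([marker_to_region[marker_names[i]] for i in cols])
    z_sub = z[:, cols]

    best = z_sub.argmax(axis=1)
    best_z = z_sub[np.arange(z_sub.shape[0]), best]
    return np.where(best_z > z_threshold, region_names[best], 'Unassigned')
```

Cells whose best marker stays below the threshold are left unassigned rather than forced into a region, which fits the soft-prior approach in the table above.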
## Final Parameters

```yaml
# Region detection
n_neighbors: 30          # For k-NN density estimation
min_cluster_size: 10     # Minimum cells to form a region
eps_quantile: 0.1        # DBSCAN eps from k-NN distance distribution
z_score_threshold: 0.5   # Marker expression threshold

# Prior weights (tune per dataset)
compatible_weight: 0.1   # Expected cell type in region
neutral_weight: 1.0      # No prior knowledge
incompatible_weight: 5.0 # Unexpected cell type in region

# Distance interpolation
wt_on_base_dist: 0.3     # Weight on prior (0.2-0.4 works well)

# Neighborhood features
spatial_n_neighbors: 15
wt_on_features: 0.7      # Weight on expression vs neighborhood
```
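
One way the region-detection parameters could fit together is sketched below with scikit-learn. It takes each cell's distance to its `n_neighbors`-th nearest neighbor, uses the `eps_quantile` quantile of those distances as the DBSCAN radius, and passes `min_cluster_size` as `min_samples`. That mapping, and the name `cluster_region_cells`, are assumptions for this example rather than a description of the actual `detect_tissue_regions` internals.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def cluster_region_cells(locations, n_neighbors=30, min_cluster_size=10,
                         eps_quantile=0.1):
    """Hypothetical sketch: DBSCAN with eps taken from the k-NN distance distribution."""
    locations = np.asarray(locations, dtype=float)
    k = min(n_neighbors, len(locations) - 1)

    # Ask for k + 1 neighbors because each query point is returned as its
    # own nearest neighbor at distance 0.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(locations)
    distances, _ = nn.kneighbors(locations)
    kth_dist = distances[:, -1]

    # A low quantile picks a radius typical of the densest parts of the tissue.
    eps = float(np.quantile(kth_dist, eps_quantile))

    # DBSCAN labels noise as -1 and clusters as 0, 1, 2, ...
    labels = DBSCAN(eps=eps, min_samples=min_cluster_size).fit_predict(locations)
    return labels, eps
```

Deriving `eps` from the data rather than fixing it in microns or pixels lets the same parameters transfer across images with different resolution or cell density. Under this reading, a higher `eps_quantile` gives a larger radius and therefore larger, more merged regions.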

## Key Insights
- Prior weights are log-transformed for smoother distance scaling
- Region detection works best with 2-4 markers per region
- Neighborhood features help even without explicit priors
- Post-hoc filtering catches remaining errors but loses some matches
- Start with weak priors (wt_on_base_dist=0.2), increase if needed

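Post-hoc filtering can be sketched as a pass over the matched pairs that drops any pair whose cell type and region combination carries a prior weight above a cutoff. The function name `filter_implausible_matches` and the `max_weight` cutoff are hypothetical. The notes above do not state the criterion actually used, so treat this as one plausible rule.

```python
import numpy as np

def filter_implausible_matches(rna_idx, codex_idx, rna_labels, spatial_regions,
                               celltype_to_region_weights, max_weight=3.0,
                               default_weight=1.0):
    """Hypothetical sketch: drop matches whose prior weight exceeds max_weight."""
    rna_idx = np.asarray(rna_idx)
    codex_idx = np.asarray(codex_idx)
    keep = np.ones(rna_idx.size, dtype=bool)
    for n, (i, j) in enumerate(zip(rna_idx, codex_idx)):
        # Look up the weight for this (cell type, region) pair; pairs with
        # no stated expectation are treated as neutral.
        region_weights = celltype_to_region_weights.get(rna_labels[i], {})
        weight = region_weights.get(spatial_regions[j], default_weight)
        keep[n] = weight <= max_weight
    return rna_idx[keep], codex_idx[keep], keep
```

Returning the boolean mask alongside the filtered indices makes it easy to report how many matches were discarded, which is the trade-off noted in the insights above.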
## References
- MaxFuse paper: Cross-modal matching with fuzzy smoothed embedding
- DBSCAN: Density-based spatial clustering
- Spleen tissue organization: B follicles, T zones, red/white pulp