# cross-modal-normalization

> Scale alignment for RNA-protein cross-modal integration - BOTH modalities must be z-scored

- Author: smith6jt-cop
- Repository: smith6jt-cop/Skills_Registry
- Version: 20260121184736
- Last Updated: 2026-02-06
- Source: https://github.com/smith6jt-cop/Skills_Registry
- Web: https://mule.run/skillshub/@@smith6jt-cop/Skills_Registry~cross-modal-normalization:20260121184736

---

---
name: cross-modal-normalization
description: "Scale alignment for RNA-protein cross-modal integration - BOTH modalities must be z-scored"
author: Claude Code
date: 2025-01-20
---

# Cross-Modal Normalization - Research Notes

## Experiment Overview
| Item | Details |
|------|---------|
| **Date** | 2025-01-20 |
| **Goal** | Fix matching quality issues caused by scale mismatch between RNA and protein |
| **Environment** | Python 3.10, scanpy 1.9+, sklearn |
| **Status** | Success |

## Context

MaxFuse matching quality was poor: cells matched across clusters, and the "B-cell outliers" were not actually B cells. Investigation revealed a **~100x variance mismatch** (10x in standard deviation) between modalities:

- **RNA**: normalize_total → log1p → z-score → mean≈0, std≈0.8, range [-1.5, 5]
- **Protein**: Used as-is ("pre-scaled") → mean≈0.4, std≈0.08, range [0, 1]

The matching algorithm over-weighted RNA features because their variance was ~100x higher.

## The Problem: Scale Mismatch

```
RNA variance:     ~0.64 (std² = 0.8²)
Protein variance: ~0.006 (std² = 0.08²)
Ratio:           ~100x (RNA dominates matching)
```

When computing distances for matching, features with higher variance dominate. If RNA has 100x the variance of protein, the matching effectively ignores protein information.
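The effect can be illustrated with a minimal sketch on synthetic Gaussian data that reuses the standard deviations quoted above. The arrays, sizes, and the helper name `mean_sq_diff` are illustrative assumptions, not the real dataset or MaxFuse internals; the sketch only shows how much each feature would contribute to a squared Euclidean distance at each scale.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_features = 2000, 10

# Synthetic stand-ins using the std values quoted above (not real data).
rna = rng.normal(loc=0.0, scale=0.8, size=(n_cells, n_features))
protein = rng.normal(loc=0.4, scale=0.08, size=(n_cells, n_features))


def mean_sq_diff(x, rng, n_pairs=5000):
    """Average squared per-feature difference over random pairs of cells."""
    i = rng.integers(0, x.shape[0], size=n_pairs)
    j = rng.integers(0, x.shape[0], size=n_pairs)
    return float(((x[i] - x[j]) ** 2).mean())


rna_contrib = mean_sq_diff(rna, rng)
protein_contrib = mean_sq_diff(protein, rng)
ratio = rna_contrib / protein_contrib

print(f"RNA per-feature contribution:     {rna_contrib:.4f}")
print(f"Protein per-feature contribution: {protein_contrib:.4f}")
print(f"Ratio: {ratio:.0f}x")
```

The printed ratio should land near the 100x variance ratio, because the expected squared difference between two independent draws is twice the variance.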

## Verified Workflow

### Correct Normalization (BOTH z-scored)

```python
import numpy as np
import scanpy as sc
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# ============================================================
# RNA NORMALIZATION (standard pipeline)
# ============================================================
sc.pp.normalize_total(rna_shared_adata, target_sum=1e4)
sc.pp.log1p(rna_shared_adata)
sc.pp.scale(rna_shared_adata, zero_center=True, max_value=5)
rna_shared_normalized = rna_shared_adata.X.copy()
if sparse.issparse(rna_shared_normalized):
    rna_shared_normalized = rna_shared_normalized.toarray()

# ============================================================
# PROTEIN NORMALIZATION - MUST Z-SCORE TO MATCH RNA
# ============================================================
protein_shared_raw = protein_shared_adata.X.copy()
if sparse.issparse(protein_shared_raw):
    protein_shared_raw = protein_shared_raw.toarray()

# Z-score protein to match RNA scale
scaler = StandardScaler()
protein_shared_normalized = scaler.fit_transform(protein_shared_raw)
protein_shared_normalized = np.clip(protein_shared_normalized, -5, 5)

# ============================================================
# VERIFICATION - CRITICAL
# ============================================================
print("RNA:")
print(f"  Mean: {rna_shared_normalized.mean():.4f}")
print(f"  Std:  {rna_shared_normalized.std():.4f}")
print(f"  Range: [{rna_shared_normalized.min():.2f}, {rna_shared_normalized.max():.2f}]")

print("Protein:")
print(f"  Mean: {protein_shared_normalized.mean():.4f}")
print(f"  Std:  {protein_shared_normalized.std():.4f}")
print(f"  Range: [{protein_shared_normalized.min():.2f}, {protein_shared_normalized.max():.2f}]")

# EXPECTED OUTPUT:
# RNA:     Mean: ~0, Std: ~0.8-1.0, Range: [-1.5, 5.0]
# Protein: Mean: ~0, Std: ~0.8-1.0, Range: [-5.0, 5.0]
```

## Failed Attempts (Critical)

| Attempt | Why it Failed | Lesson Learned |
|---------|---------------|----------------|
| Protein "pre-scaled" used as-is | ~100x variance mismatch → RNA dominated matching | ALWAYS z-score both modalities |
| Assuming gated data is normalized | Gating ≠ normalization; gated data still needs z-scoring | Check actual statistics, don't assume |
| Only checking mean (not std) | Means were close (≈0 vs ≈0.4), but std differed 10x (0.8 vs 0.08) | Always check BOTH mean AND std |

## Key Insights

1. **Both modalities MUST have similar scales**: mean≈0, std≈1, range ~[-5, 5]

2. **"Pre-scaled" or "normalized" doesn't mean z-scored**: Always verify with actual statistics

3. **Check the VERIFICATION output**: The normalization cell should print statistics for both modalities. If they don't match, the integration will fail.

4. **Symptoms of scale mismatch**:
   - Cells matching across different UMAP clusters
   - "Outliers" that don't match expected cell types
   - Very high canonical correlations (overfitting to dominant modality)
   - Poor downstream validation metrics

5. **Clip both to same range**: Using `np.clip(..., -5, 5)` ensures neither modality has extreme outliers that dominate matching.
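One way to keep the two modalities on the same footing is to route both through a single helper. The sketch below is a NumPy-only approximation of per-feature z-scoring followed by clipping; the function name `zscore_and_clip` and the `eps` guard for constant features are assumptions made for illustration, and it is not a drop-in replacement for `sc.pp.scale` or `StandardScaler`.

```python
import numpy as np


def zscore_and_clip(x, clip=5.0, eps=1e-8):
    """Per-feature z-score (columns = features), then clip to [-clip, clip]."""
    x = np.asarray(x, dtype=float)
    mu = x.mean(axis=0)
    sd = x.std(axis=0)  # population std (ddof=0)
    # eps keeps constant features at 0 instead of dividing by zero.
    z = (x - mu) / np.maximum(sd, eps)
    return np.clip(z, -clip, clip)


# Synthetic "pre-scaled" protein-like input (illustrative values only).
rng = np.random.default_rng(1)
protein_like = rng.normal(loc=0.4, scale=0.08, size=(500, 6))
protein_z = zscore_and_clip(protein_like)

print(f"mean={protein_z.mean():.3f}, std={protein_z.std():.3f}")
print(f"range=[{protein_z.min():.2f}, {protein_z.max():.2f}]")
```

Applying the same function with the same `clip` value to both matrices makes it harder for one modality to skip a step by accident.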

## Diagnostic Checks

### Before Integration
```python
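# Both should be similar (arrays produced by the normalization step)
for name, data in [("RNA", rna_shared_normalized), ("Protein", protein_shared_normalized)]:
    print(f"{name}: mean={data.mean():.3f}, std={data.std():.3f}, "
          f"range=[{data.min():.2f}, {data.max():.2f}]")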

# WARNING signs:
# - std differs by >2x between modalities
# - mean is far from 0 for either modality
# - range is very different (e.g., [0,1] vs [-5,5])
```
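The warning signs above can also be turned into an automated check. The following is a minimal sketch: the function name `check_scale_alignment` is hypothetical, the 2x std threshold comes from the list above, and the thresholds for "far from 0" (0.25) and "very different range" (2x) are assumptions that should be tuned for the data at hand.

```python
import numpy as np


def check_scale_alignment(rna, protein, max_std_ratio=2.0,
                          max_abs_mean=0.25, max_range_ratio=2.0):
    """Return a list of warning strings; an empty list means nothing was flagged."""
    stats = {}
    for name, data in [("RNA", rna), ("Protein", protein)]:
        data = np.asarray(data, dtype=float)
        stats[name] = {
            "mean": float(data.mean()),
            "std": float(data.std()),
            "range": float(data.max() - data.min()),
        }

    flags = []
    for name, s in stats.items():
        if abs(s["mean"]) > max_abs_mean:
            flags.append(f"{name} mean {s['mean']:.3f} is far from 0")

    eps = 1e-12  # guards against division by zero for constant inputs
    for key, limit in [("std", max_std_ratio), ("range", max_range_ratio)]:
        a, b = stats["RNA"][key], stats["Protein"][key]
        ratio = max(a, b) / max(min(a, b), eps)
        if ratio > limit:
            flags.append(f"{key} differs by {ratio:.1f}x between modalities")
    return flags


# Synthetic examples (illustrative values only).
rng = np.random.default_rng(0)
mismatched = check_scale_alignment(rng.normal(0.0, 0.8, (1000, 8)),
                                   rng.normal(0.4, 0.08, (1000, 8)))
aligned = check_scale_alignment(rng.normal(0.0, 1.0, (1000, 8)),
                                rng.normal(0.0, 1.0, (1000, 8)))
print("mismatched:", mismatched)
print("aligned:", aligned)
```

Failing fast on a non-empty list before running the matching step is cheaper than diagnosing bad matches afterwards.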

### After Integration
```python
# Check if matched pairs make biological sense
# - Same cell type markers should be correlated
# - Cells should match within clusters, not across
# - Permutation test should show significance (p < 0.01)
```
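The notes do not specify which permutation test was used, so the following is only one possible sketch. It assumes each matched pair carries a cell-type or cluster label on both sides (`rna_labels` and `protein_labels` are hypothetical inputs) and tests whether the fraction of pairs with agreeing labels is higher than expected when the protein labels are shuffled.

```python
import numpy as np


def label_agreement_permutation_test(rna_labels, protein_labels,
                                     n_perm=1000, seed=0):
    """Return (observed agreement, one-sided permutation p-value)."""
    rna_labels = np.asarray(rna_labels)
    protein_labels = np.asarray(protein_labels)
    rng = np.random.default_rng(seed)

    observed = float(np.mean(rna_labels == protein_labels))
    # Null distribution: agreement after breaking the pairing by shuffling.
    null = np.array([
        np.mean(rna_labels == rng.permutation(protein_labels))
        for _ in range(n_perm)
    ])
    # Add-one correction so the p-value is never exactly zero.
    p_value = float((1 + np.sum(null >= observed)) / (1 + n_perm))
    return observed, p_value


# Toy example: 300 matched pairs, 30 of them deliberately mislabeled.
rna_labels = np.repeat(["B", "T", "NK"], 100)
protein_labels = rna_labels.copy()
protein_labels[:10] = "T"
protein_labels[100:110] = "NK"
protein_labels[200:210] = "B"

observed, p_value = label_agreement_permutation_test(rna_labels, protein_labels)
print(f"agreement={observed:.2f}, p={p_value:.4f}")
```

With `n_perm=1000` the smallest attainable p-value is about 0.001, so the permutation count needs to be high enough to resolve the p < 0.01 threshold mentioned above.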

## References

- Integration notebook: `notebooks/2_integration.ipynb`, cell starting with "# Normalize shared features"
- MaxFuse matching uses correlation distance, which is scale-sensitive
- sklearn StandardScaler: per-feature z-scoring