# pca-decomposition

> Reduce dimensionality of multivariate data using PCA with varimax rotation. Use when you have many correlated variables and need to identify underlying factors or reduce collinearity.

- Author: Xiangyi Li
- Repository: likaixin2000/skillsbench-exp
- Version: 20260127062443
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/likaixin2000/skillsbench-exp
- Web: https://mule.run/skillshub/@@likaixin2000/skillsbench-exp~pca-decomposition:20260127062443

---

---
name: pca-decomposition
description: Reduce dimensionality of multivariate data using PCA with varimax rotation. Use when you have many correlated variables and need to identify underlying factors or reduce collinearity.
license: MIT
---

# PCA Decomposition Guide

## Overview

Principal Component Analysis (PCA) reduces many correlated variables into fewer uncorrelated components. Varimax rotation makes components more interpretable by maximizing variance.

## When to Use PCA

- Many correlated predictor variables
- Need to identify underlying factor groups
- Reduce multicollinearity before regression
- Exploratory data analysis

## Basic PCA with Varimax Rotation
```python
from sklearn.preprocessing import StandardScaler
from factor_analyzer import FactorAnalyzer

# Standardize data first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# PCA with varimax rotation
fa = FactorAnalyzer(n_factors=4, rotation='varimax')
fa.fit(X_scaled)

# Get factor loadings
loadings = fa.loadings_

# Get component scores for each observation
scores = fa.transform(X_scaled)
```

## Workflow for Attribution Analysis

When using PCA for contribution analysis with predefined categories:

1. **Combine ALL variables first**, then do PCA together:
```python
# Include all variables from all categories in one matrix
all_vars = ['AirTemp', 'NetRadiation', 'Precip', 'Inflow', 'Outflow',
            'WindSpeed', 'DevelopedArea', 'AgricultureArea']
X = df[all_vars].values

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# PCA on ALL variables together
fa = FactorAnalyzer(n_factors=4, rotation='varimax')
fa.fit(X_scaled)
scores = fa.transform(X_scaled)
```

2. **Interpret loadings** to map factors to categories (optional for understanding)

3. **Use factor scores directly** for R² decomposition

**Important**: Do NOT run separate PCA for each category. Run one global PCA on all variables, then use the resulting factor scores for contribution analysis.

## Interpreting Factor Loadings

Loadings show correlation between original variables and components:

| Loading | Interpretation |
|---------|----------------|
| > 0.7 | Strong association |
| 0.4 - 0.7 | Moderate association |
| < 0.4 | Weak association |

## Example: Economic Indicators
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from factor_analyzer import FactorAnalyzer

# Variables: gdp, unemployment, inflation, interest_rate, exports, imports
df = pd.read_csv('economic_data.csv')
variables = ['gdp', 'unemployment', 'inflation',
             'interest_rate', 'exports', 'imports']

X = df[variables].values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

fa = FactorAnalyzer(n_factors=3, rotation='varimax')
fa.fit(X_scaled)

# View loadings
loadings_df = pd.DataFrame(
    fa.loadings_,
    index=variables,
    columns=['RC1', 'RC2', 'RC3']
)
print(loadings_df.round(2))
```

## Choosing Number of Factors

### Option 1: Kaiser Criterion
```python
# Check eigenvalues
eigenvalues, _ = fa.get_eigenvalues()

# Keep factors with eigenvalue > 1
n_factors = sum(eigenvalues > 1)
```

### Option 2: Domain Knowledge

If you know how many categories your variables should group into, specify directly:
```python
# Example: health data with 3 expected categories (lifestyle, genetics, environment)
fa = FactorAnalyzer(n_factors=3, rotation='varimax')
```

## Common Issues

| Issue | Cause | Solution |
|-------|-------|----------|
| Loadings all similar | Too few factors | Increase n_factors |
| Negative loadings | Inverse relationship | Normal, interpret direction |
| Low variance explained | Data not suitable for PCA | Check correlations first |

## Best Practices

- Always standardize data before PCA
- Use varimax rotation for interpretability
- Check factor loadings to name components
- Use Kaiser criterion or domain knowledge for n_factors
- For attribution analysis, run ONE global PCA on all variables