# terraform-state-recovery

> Recover from Terraform state issues after infrastructure recreation. Handles orphaned resources, state drift, and cluster recovery. Use when terraform apply fails with resource conflicts.

- Author: matchpoint-ai-bot
- Repository: Matchpoint-AI/matchpoint-github-runners-helm
- Version: 20260106171828
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/Matchpoint-AI/matchpoint-github-runners-helm
- Web: https://mule.run/skillshub/@@Matchpoint-AI/matchpoint-github-runners-helm~terraform-state-recovery:20260106171828

---

---
name: terraform-state-recovery
description: Recover from Terraform state issues after infrastructure recreation. Handles orphaned resources, state drift, and cluster recovery. Use when terraform apply fails with resource conflicts.
allowed-tools:
  - Bash
  - Read
  - Grep
---

# Terraform State Recovery Skill

## Overview

When Kubernetes clusters are recreated (e.g., Rackspace Spot cloudspace deleted and recreated), Terraform state contains references to resources that no longer exist. This causes apply failures that require manual state cleanup.

## The Problem

### Scenario: Cluster Recreated

1. Initial state: Cluster A with ArgoCD, namespaces, secrets
2. Cluster deleted (spot preemption, manual deletion, provider issue)
3. Cluster B created with same name
4. `terraform apply` fails with an error along these lines (exact wording varies by provider and version):

```
Error: Resource already exists in state

Resource kubernetes_namespace.arc_runners already exists in state, but the
underlying Kubernetes cluster has been recreated. The resource exists in
terraform state but not in the actual cluster.
```

### Why This Happens

Terraform maps each resource address to the ID of a remote object recorded in state. It does not know which cluster that object lived in:

```hcl
resource "kubernetes_namespace" "arc_runners" {
  # State: terraform_resource_id_123
  # Points to: Cluster A (no longer exists)
}
```

When Cluster B is created:
- Terraform state still references Cluster A resources
- Terraform treats those resources as already created, because they are recorded in state
- `terraform apply` fails when it tries to read or update them (they don't exist in Cluster B)

## Common Orphaned Resources

After cluster recreation, these resources are typically orphaned:

### Kubernetes Resources

```bash
# Check for orphaned Kubernetes resources
terraform state list | grep kubernetes_

# Common orphans (example output, not commands):
#   kubernetes_namespace.arc_runners
#   kubernetes_namespace.arc_systems
#   kubernetes_secret.github_token
#   kubernetes_secret.argocd_secret
#   kubernetes_config_map_v1_data.argocd_cm
```

### Helm Releases

```bash
# Check for orphaned Helm releases
terraform state list | grep helm_release

# Common orphans (example output, not commands):
#   helm_release.argocd
#   helm_release.arc_controller
```

### kubectl_manifest Resources

```bash
# Check for orphaned kubectl manifests
terraform state list | grep kubectl_manifest

# Common orphans (example output, not commands):
#   kubectl_manifest.argocd_bootstrap
#   kubectl_manifest.argocd_app_arc_controller
```

## Recovery Procedure

### Step 1: Identify Orphaned Resources

```bash
cd terraform
export TF_HTTP_PASSWORD="<github-token>"
terraform init

# List all resources in state
terraform state list > /tmp/state-resources.txt

# Check which cluster the state references
terraform state show module.cloudspace.spot_cloudspace.main | grep cloudspace_id

# Compare with actual cloudspace
spotctl cloudspaces list --org matchpoint-ai -o table
```
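
Step 2 rewrites the remote state, so it may be worth saving a copy before removing anything. The sketch below is an addition, not part of the original procedure: `backup_state_to` and the `./state-backups` directory are hypothetical names. It reads state JSON on stdin and writes it to a timestamped file that only the current user can read.

```bash
# Hypothetical helper: save state JSON from stdin into a timestamped file.
# Usage: terraform state pull | backup_state_to ./state-backups
backup_state_to() {
  local dir file
  dir="$1"
  mkdir -p "$dir"
  file="$dir/terraform-$(date +%Y%m%d-%H%M%S).tfstate"
  # State can contain secrets, so keep the copy private to the current user.
  (umask 077 && cat > "$file")
  # Refuse to report success for an empty backup.
  if [ ! -s "$file" ]; then
    echo "backup is empty: $file" >&2
    return 1
  fi
  # Print the path so the caller can record it.
  echo "$file"
}
```

If a removal turns out to be wrong, the copy can be pushed back with `terraform state push`; Terraform may ask for `-force` because the backup carries an older serial than the remote state.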

### Step 2: Remove Orphaned Resources

**CRITICAL:** Only remove resources that belonged to the OLD cluster. Do NOT remove:
- `spot_cloudspace.main` (the cluster itself)
- `spot_nodepool.*` (node pools)
- `data.spot_kubeconfig.*` (kubeconfig data sources)

```bash
# Remove Helm releases (they don't exist in new cluster)
terraform state rm helm_release.argocd
terraform state rm helm_release.arc_controller

# Remove Kubernetes namespaces
terraform state rm kubernetes_namespace.arc_runners
terraform state rm kubernetes_namespace.arc_systems

# Remove Kubernetes secrets
terraform state rm kubernetes_secret.github_token
terraform state rm kubernetes_secret.argocd_secret

# Remove ConfigMaps
terraform state rm kubernetes_config_map_v1_data.argocd_cm

# Remove kubectl manifests
terraform state rm kubectl_manifest.argocd_bootstrap
terraform state rm kubectl_manifest.argocd_app_arc_controller
```
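
The keep-list above can also be enforced mechanically instead of by eye. This is a minimal sketch under assumptions: `filter_removable` is a hypothetical helper, and the patterns assume the resource types named in this document (`spot_*`, `kubernetes_*`, `helm_release`, `kubectl_manifest`). It reads `terraform state list` output on stdin and prints only the addresses that are candidates for `terraform state rm`.

```bash
# Hypothetical helper: read `terraform state list` output on stdin and print
# only the addresses that are candidates for removal after a cluster recreation.
# Usage: terraform state list | filter_removable
filter_removable() {
  # First drop the keep-list (data sources and Spot resources), then keep
  # managed Kubernetes, Helm, and kubectl resources at any module depth.
  grep -Ev '(^|\.)(data\.|spot_)' \
    | grep -E '(^|\.)(kubernetes_[a-z0-9_]+|helm_release|kubectl_manifest)\.' \
    || true
}
```

The filter errs on the side of keeping entries: any address with a `spot_` or `data.` segment is left out of the removal list, even if it would otherwise match.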

### Step 3: Verify State is Clean

```bash
# List remaining resources
terraform state list

# Should see:
# - module.cloudspace.spot_cloudspace.main
# - module.nodepool.spot_nodepool.*
# - data.spot_kubeconfig.this
# Should NOT see:
# - kubernetes_* resources
# - helm_release.* resources
# - kubectl_manifest.* resources
```
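
The check above can be turned into a pass/fail gate, for example before running Step 4 in CI. A minimal sketch, with `assert_state_clean` as a hypothetical helper name; it reads `terraform state list` output on stdin and ignores data sources:

```bash
# Hypothetical helper: succeed only if stdin (the output of
# `terraform state list`) contains no managed Kubernetes-level resources.
# Usage: terraform state list | assert_state_clean
assert_state_clean() {
  local leftovers
  leftovers=$(grep -E '(^|\.)(kubernetes_[a-z0-9_]+|helm_release|kubectl_manifest)\.' \
    | grep -Ev '(^|\.)data\.' || true)
  if [ -n "$leftovers" ]; then
    # Report what is still tracked so it can be removed or investigated.
    echo "Still in state:" >&2
    echo "$leftovers" >&2
    return 1
  fi
  echo "State contains no Kubernetes-level resources"
}
```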

### Step 4: Re-Apply

```bash
# Plan should show creating all Kubernetes resources
terraform plan -var-file=prod.tfvars

# Apply to recreate resources in new cluster
terraform apply -var-file=prod.tfvars
```

## Automated Recovery Script

```bash
#!/bin/bash
# terraform/scripts/clean-orphaned-state.sh

set -euo pipefail

echo "🔍 Identifying orphaned Kubernetes resources..."

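# After a cluster recreation, every Kubernetes-level resource in state is stale.
# Select managed resources only (data sources are skipped), at any module depth.
ORPHANED=$(terraform state list | grep -Ev '(^|\.)data\.' | grep -E '(^|\.)(kubernetes_|helm_release\.|kubectl_manifest\.)' || true)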

if [ -z "$ORPHANED" ]; then
  echo "✅ No orphaned resources found"
  exit 0
fi

echo "📋 Found orphaned resources:"
echo "$ORPHANED"
echo ""

read -r -p "Remove these resources from state? (yes/no): " CONFIRM

if [ "$CONFIRM" != "yes" ]; then
  echo "❌ Aborted"
  exit 1
fi

echo "$ORPHANED" | while read -r resource; do
  echo "🗑️  Removing: $resource"
  # Redirect stdin so terraform cannot consume the rest of the list
  terraform state rm "$resource" < /dev/null
done

echo "✅ State cleanup complete"
echo ""
echo "Next steps:"
echo "1. Run: terraform plan -var-file=prod.tfvars"
echo "2. Verify plan shows creating resources (not updating)"
echo "3. Run: terraform apply -var-file=prod.tfvars"
```

**Usage:**
```bash
cd terraform
export TF_HTTP_PASSWORD="<github-token>"
terraform init
./scripts/clean-orphaned-state.sh
```

## Diagnosis: Is State Orphaned?

### Check 1: Cluster ID Mismatch

```bash
# Get cloudspace ID from terraform state
terraform state show module.cloudspace.spot_cloudspace.main | grep cloudspace_id

# Get actual cloudspace ID
spotctl cloudspaces get --name matchpoint-runners-prod --org matchpoint-ai -o json | jq -r .cloudspaceId

# If different → cluster was recreated
```
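
The comparison can be scripted so that the two IDs are not compared by eye. The helpers below are a sketch with hypothetical names (`state_attr`, `check_cluster_id`); they assume `terraform state show` prints the attribute as `cloudspace_id = "<value>"`, as the `grep` above implies.

```bash
# Hypothetical helper: pull one quoted attribute value out of
# `terraform state show` output supplied on stdin.
state_attr() {
  sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*\"\(.*\)\"[[:space:]]*\$/\1/p" | head -n 1
}

# Hypothetical helper: compare the ID recorded in state ($1) with the live ID ($2).
# Returns 0 on match, 1 on mismatch, 2 if either ID is missing.
check_cluster_id() {
  if [ -z "${1:-}" ] || [ -z "${2:-}" ]; then
    echo "cannot compare: one of the IDs is empty" >&2
    return 2
  fi
  if [ "$1" = "$2" ]; then
    echo "match: state points at the current cluster"
  else
    echo "mismatch: state=$1 live=$2 (cluster was recreated)"
    return 1
  fi
}

# Usage:
#   state_id=$(terraform state show module.cloudspace.spot_cloudspace.main | state_attr cloudspace_id)
#   live_id=$(spotctl cloudspaces get --name matchpoint-runners-prod --org matchpoint-ai -o json | jq -r .cloudspaceId)
#   check_cluster_id "$state_id" "$live_id"
```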

### Check 2: Resource Shows "Not Found" in Plan

```bash
terraform plan -var-file=prod.tfvars

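# Look for signs that state and cluster disagree, for example:
# - refresh errors mentioning "not found" or an unreachable API server
# - resources already in state that the plan wants to create (+) again
# Exact wording depends on the Terraform and provider versions.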
```

### Check 3: kubectl Confirms Resources Don't Exist

```bash
# Get fresh kubeconfig
terraform output -raw kubeconfig_raw > /tmp/kubeconfig.yaml
export KUBECONFIG=/tmp/kubeconfig.yaml

# Check if resources exist
kubectl get namespace arc-runners
# Error from server (NotFound): namespaces "arc-runners" not found → Orphaned in state

# But terraform state shows:
terraform state show kubernetes_namespace.arc_runners
# Shows resource in state → State is stale
```

## Prevention Strategies

### Strategy 1: Use Data Sources Where Possible

Instead of managing resources in Terraform, reference them as data sources:

```hcl
# FRAGILE - resource managed by Terraform
resource "kubernetes_namespace" "arc_runners" {
  metadata {
    name = "arc-runners"
  }
}

# ROBUST - reference namespace created by ArgoCD
data "kubernetes_namespace" "arc_runners" {
  metadata {
    name = "arc-runners"
  }
}
```

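Data sources are re-read on every plan instead of being tracked as managed objects, so a stale entry cannot block an apply. The namespace must then be created by something else (here, ArgoCD) before Terraform reads it.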

### Strategy 2: Let ArgoCD Manage Application Resources

```hcl
# Terraform manages infrastructure
resource "spot_cloudspace" "main" { }
resource "helm_release" "argocd" { }

# ArgoCD manages applications
# - Namespaces
# - Secrets (via SealedSecrets or external-secrets)
# - ConfigMaps
# - Deployments
```

This separation means:
- Cluster recreation only affects Terraform resources (infrastructure)
- Application resources are recreated automatically by ArgoCD sync

### Strategy 3: Use Remote State Locking

Prevent concurrent applies that can corrupt state:

```hcl
# backend.tf
terraform {
  backend "http" {
    address        = "https://state.tfstate.dev/github/v1"
    lock_address   = "https://state.tfstate.dev/github/v1/lock"
    unlock_address = "https://state.tfstate.dev/github/v1/lock"
  }
}
```

## Troubleshooting

### Error: "Resource not found"

**Symptom:**
```
Error: reading Kubernetes Namespace "arc-runners": namespaces "arc-runners" not found
```

**Cause:** Resource exists in state but not in cluster

**Fix:**
```bash
terraform state rm kubernetes_namespace.arc_runners
terraform apply -var-file=prod.tfvars
```

### Error: "State lock timeout"

**Symptom:**
```
Error: Error acquiring the state lock
Lock Info:
  ID:        abc-123-def
  Operation: OperationTypeApply
  Who:       user@host
  Created:   2024-01-01 12:00:00 UTC
```

**Cause:** Previous terraform apply crashed or was interrupted

**Fix:**
```bash
# Verify no terraform process is still running (check CI jobs too).
# The [t] keeps grep from matching its own process.
ps aux | grep '[t]erraform'

# Force unlock (only if safe)
terraform force-unlock abc-123-def
```
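
Copying the lock ID by hand is error-prone when it is a long UUID. As a small sketch (the helper name `lock_id_from_error` and the `apply.log` file are hypothetical), the ID can be pulled from saved error output, assuming the `Lock Info` layout shown in the symptom above:

```bash
# Hypothetical helper: print the lock ID found in Terraform's lock error
# text supplied on stdin (the first line whose first field is "ID:").
# Usage: lock_id_from_error < apply.log
lock_id_from_error() {
  awk '$1 == "ID:" { print $2; exit }'
}
```

The printed value is what `terraform force-unlock` expects as its argument; the same safety check applies before using it.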

### Error: "Provider configuration changed"

**Symptom:**
```
Error: Provider configuration changed

The provider configuration for provider["kubernetes"] has changed. This may
be because the kubeconfig references a different cluster.
```

**Cause:** Kubeconfig points to new cluster but state references old cluster

**Fix:**
```bash
# Get fresh kubeconfig for current cluster
terraform output -raw kubeconfig_raw > /tmp/kubeconfig.yaml
export KUBECONFIG=/tmp/kubeconfig.yaml

# Remove orphaned Kubernetes resources
./scripts/clean-orphaned-state.sh

# Re-apply
terraform apply -var-file=prod.tfvars
```

## Advanced Recovery: Import Resources

If resources exist in the NEW cluster but not in state:

```bash
# Import namespace
terraform import kubernetes_namespace.arc_runners arc-runners

# Import secret
terraform import kubernetes_secret.github_token arc-runners/arc-org-github-secret

# Import Helm release
terraform import helm_release.argocd argocd/argocd
```

**When to use import:**
- Resources manually created in cluster
- Need to bring them under Terraform management
- Alternative to destroying and recreating

**When NOT to use import:**
- Resources don't exist (use `terraform state rm` instead)
- Resources managed by ArgoCD (let ArgoCD manage them)
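
The two lists, together with the `terraform state rm` procedure earlier, amount to a small decision table. A sketch of that table as a shell function (the name `recovery_action` and its yes/no arguments are hypothetical):

```bash
# Hypothetical helper: suggest a recovery action for one resource.
# $1 = "yes" if the address is in Terraform state, else "no"
# $2 = "yes" if the object exists in the current cluster, else "no"
recovery_action() {
  case "${1:-}:${2:-}" in
    yes:yes) echo "none (state and cluster agree)" ;;
    yes:no)  echo "terraform state rm" ;;
    no:yes)  echo "terraform import" ;;
    no:no)   echo "terraform apply" ;;
    *)       echo "usage: recovery_action yes|no yes|no" >&2; return 2 ;;
  esac
}

# Example: the namespace is in state but missing from the new cluster.
recovery_action yes no
```

For the `no:yes` case, skip the import when ArgoCD owns the resource, as noted in the list above.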

## Diagnostic Commands

```bash
# List all resources in state
terraform state list

# Show specific resource details
terraform state show kubernetes_namespace.arc_runners

# Pull current state to local file
terraform state pull > /tmp/terraform.tfstate

# Inspect state JSON
jq '.resources[] | select(.type == "kubernetes_namespace")' /tmp/terraform.tfstate

# Check cluster connectivity
terraform output -raw kubeconfig_raw > /tmp/kubeconfig.yaml
export KUBECONFIG=/tmp/kubeconfig.yaml
kubectl cluster-info

# Verify state backend connection
terraform init -backend-config="password=$TF_HTTP_PASSWORD"
```

## State File Forensics

### Understanding State Structure

```json
{
  "resources": [
    {
      "mode": "managed",
      "type": "kubernetes_namespace",
      "name": "arc_runners",
      "provider": "provider[\"registry.terraform.io/hashicorp/kubernetes\"]",
      "instances": [
        {
          "attributes": {
            "metadata": [{"name": "arc-runners"}]
          }
        }
      ]
    }
  ]
}
```

**Key fields:**
- `mode: "managed"` - Terraform manages this resource
- `mode: "data"` - Terraform only reads this resource
- `instances[].attributes` - Attribute values recorded at the last apply or refresh (not necessarily what the live cluster contains)

### Finding Orphaned Resources

```bash
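# Extract all managed Kubernetes resources, including any module prefix
terraform state pull | jq -r '.resources[] | select(.mode == "managed" and (.type | startswith("kubernetes_"))) | (if .module then .module + "." else "" end) + .type + "." + .name'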

# Compare with actual cluster resources
kubectl get namespaces -o name
kubectl get secrets -A -o name
```
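
The comparison itself is a set difference: names recorded in state minus names present in the cluster. A minimal sketch using `sort` and `comm`; the helper name and the file paths in the usage comment are hypothetical:

```bash
# Hypothetical helper: print names that appear in the first file (names
# recorded in state) but not in the second (names found in the cluster).
# Both files hold one name per line; order does not matter.
names_missing_from_cluster() {
  local state_sorted cluster_sorted
  state_sorted=$(mktemp)
  cluster_sorted=$(mktemp)
  sort -u "$1" > "$state_sorted"
  sort -u "$2" > "$cluster_sorted"
  # comm -23 keeps only the lines unique to the first sorted input.
  comm -23 "$state_sorted" "$cluster_sorted"
  rm -f "$state_sorted" "$cluster_sorted"
}

# Usage (namespace names follow the state layout shown above):
#   terraform state pull \
#     | jq -r '.resources[] | select(.mode == "managed" and .type == "kubernetes_namespace") | .instances[].attributes.metadata[0].name' \
#     > /tmp/state-ns.txt
#   kubectl get namespaces -o name | sed 's|^namespace/||' > /tmp/cluster-ns.txt
#   names_missing_from_cluster /tmp/state-ns.txt /tmp/cluster-ns.txt
```

Every name it prints is tracked in state but absent from the cluster, which makes it a candidate for `terraform state rm`.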

## Related Skills

- [arc-terraform-deployment](../arc-terraform-deployment/SKILL.md) - Avoiding orphaned state
- [infrastructure-cd](../infrastructure-cd/SKILL.md) - Automated terraform workflow
- [argocd-bootstrap](../argocd-bootstrap/SKILL.md) - Separating app management

## Related Issues

- #115 - Cluster DNS not resolving (cluster recreation scenario)
- #121 - State cleanup after ApplicationSet fix
- #119 - Namespace creation conflicts

## References

- [Terraform State](https://developer.hashicorp.com/terraform/language/state)
- [Terraform State Commands](https://developer.hashicorp.com/terraform/cli/commands/state)
- [Kubernetes Provider](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs)