# spark-docker

> Manage Docker Spark infrastructure for local development or production (S3). Use this skill to start, stop, or check status of Spark clusters. Invoke with /spark-docker.

- Author: Daniel Sim
- Repository: wellcomecollection/wc_simd
- Version: 20260109153212
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/wellcomecollection/wc_simd
- Web: https://mule.run/skillshub/@@wellcomecollection/wc_simd~spark-docker:20260109153212

---

---
name: spark-docker
description: Manage Docker Spark infrastructure for local development or production (S3). Use this skill to start, stop, or check status of Spark clusters. Invoke with /spark-docker.
---

# Spark Docker Management

This skill manages the Docker-based Spark infrastructure for the wc_simd project.

> **Use `spark_docker_s3/` for all Spark work.** The HDFS-based `spark_docker/` is deprecated.

## Primary Stack: `spark_docker_s3/`

- Spark 3.5.5 + S3 warehouse + RDS MySQL metastore
- Uses EC2 instance profile for S3 access
- Configure `.env` from `.env.example` with S3 bucket and RDS credentials

## Commands

### Start Spark
```bash
cd spark_docker_s3 && docker compose up -d --build
```

First time setup requires `INIT_HIVE_SCHEMA=true` in `.env` to create metastore tables.

### Stop Spark
```bash
cd spark_docker_s3 && docker compose down
```

### Check Status
```bash
cd spark_docker_s3 && docker compose ps
```

### View Logs
```bash
cd spark_docker_s3 && docker compose logs -f spark
```

### Access Spark Shell
```bash
docker exec -it spark /opt/spark/bin/spark-sql
```

## Troubleshooting

### OOM Errors
Add to Spark config:
- `spark.sql.orc.enableVectorizedReader=false`
- `spark.sql.parquet.columnarReaderBatchSize=256`

### Derby Lock Issues
Use MySQL-backed metastore (already configured in Docker stacks).

### Host Service Access from Spark
Use Docker gateway IP `172.19.0.1` instead of `localhost`.

## Deprecated: `spark_docker/` (HDFS-based)

> **DEPRECATED**: This stack is no longer maintained. Use `spark_docker_s3/` instead.

The old HDFS-based local stack in `spark_docker/` required:
- `127.0.0.1 hadoop-namenode` in `/etc/hosts`
- Local HDFS storage (not portable across machines)