# kosmos-e2e-testing

> Comprehensive end-to-end testing automation for the Kosmos autonomous AI scientist project. Supports local models (Ollama), external APIs (Anthropic/OpenAI), and Docker sandbox for full workflow testing.

- Author: Jim McMillan
- Repository: Zeeeepa/Kosmos
- Version: 20260125204901
- Stars: 1
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/Zeeeepa/Kosmos
- Web: https://mule.run/skillshub/@@Zeeeepa/Kosmos~kosmos-e2e-testing:20260125204901

---

# Kosmos E2E Testing Skill

Comprehensive end-to-end testing automation for the Kosmos autonomous AI scientist project. Supports local models (Ollama), external APIs (Anthropic/OpenAI), and Docker sandbox for full workflow testing.

## Triggers

| Trigger | Description |
|---------|-------------|
| `kosmos test` | Run Kosmos tests with auto-detected provider |
| `kosmos e2e` | Set up and run E2E testing |
| `test workflow` | Test the ResearchWorkflow component |
| `local testing` | Configure local model testing |
| `provider switch` | Switch between test providers |
| `benchmark models` | Compare local vs API performance |
| `setup docker` | Set up Docker sandbox for Gap 4 |

## Quick Start

### 1. Check Environment
```bash
# Run health check to see what's available
.claude/skills/kosmos-e2e-testing/scripts/health-check.sh
```

### 2. Run Tests by Tier

```bash
# Sanity tests (~30s) - Quick validation with fast model
./scripts/run-tests.sh sanity

# Smoke tests (~2min) - Component checks
./scripts/run-tests.sh smoke

# E2E tests (~10min) - Full workflow with reasoning model
./scripts/run-tests.sh e2e

# Full suite (~20min) - Everything with coverage
./scripts/run-tests.sh full
```

### 3. Specify Provider

```bash
# Use fast local model (Qwen3 4B)
./scripts/run-tests.sh sanity local-fast

# Use reasoning model (DeepSeek-R1 8B)
./scripts/run-tests.sh e2e local-reasoning

# Use Anthropic API
./scripts/run-tests.sh e2e anthropic

# Auto-detect best available
./scripts/run-tests.sh e2e auto
```

## Test Tiers

| Tier | Duration | Provider | What It Tests |
|------|----------|----------|---------------|
| **Sanity** | ~30s | Fast local | Basic imports, config loading, mock workflow |
| **Smoke** | ~2min | Fast local | Unit tests + smoke tests |
| **E2E** | ~10min | Reasoning | Full research workflow, all gaps |
| **Production** | ~20min | External API | Final validation with Claude/GPT-4 |

## Provider Configuration

### Local Models (Ollama)

**Fast Model (qwen3:4b)**
- Speed: 30-40 tok/s
- VRAM: 2-3 GB
- Use for: Sanity, smoke, rapid iteration

**Reasoning Model (deepseek-r1:8b)**
- Speed: 6-7 tok/s
- VRAM: 5-6 GB
- Use for: E2E, complex reasoning, validation

### Setup Local Models
```bash
# Install models
ollama pull qwen3:4b
ollama pull deepseek-r1:8b

# Verify
ollama list
```

### External APIs

Set in `.env` or use config files:

```bash
# Anthropic
export ANTHROPIC_API_KEY=sk-ant-...

# OpenAI
export OPENAI_API_KEY=sk-...

# Or source config file
source .claude/skills/kosmos-e2e-testing/configs/anthropic.env
```

## Docker Sandbox (Gap 4)

Required for full E2E testing that involves code execution.

### Auto-Setup
```bash
.claude/skills/kosmos-e2e-testing/scripts/setup-docker.sh
```

### Manual Setup
```bash
# Build sandbox image
cd docker/sandbox
docker build -t kosmos-sandbox:latest .

# Verify
docker run --rm kosmos-sandbox:latest python3 -c "import pandas; print('OK')"
```

## Integration with local-llm Skill

This skill works with the global `local-llm` skill for model management:

```bash
# Use local-llm triggers for model operations
# "How do I manage Ollama models?" → local-llm skill
# "Run Kosmos E2E tests" → this skill

# Shared model names
# qwen3:4b - Same as local-llm fast-model template
# deepseek-r1:8b - Same as local-llm reasoning-model template
```

## Python API

```python
from lib.provider_detector import detect_all, recommend_test_tier
from lib.test_runner import run_tests
from lib.config_manager import load_config, switch_provider

# Check what's available
status = detect_all()
print(f"Ollama: {status['ollama']}")
print(f"Docker: {status['docker_sandbox']}")
print(f"Recommended tier: {recommend_test_tier(status)}")

# Run tests programmatically
results = run_tests(tier='e2e', provider='local-reasoning')
print(f"Passed: {results['passed']}/{results['total']}")
```

## Directory Structure

```
.claude/skills/kosmos-e2e-testing/
├── SKILL.md              # This file
├── CHEATSHEET.md         # Quick reference
├── reference.md          # Technical details
├── examples.md           # Usage examples
├── configs/              # Provider configurations
│   ├── local-fast.env
│   ├── local-reasoning.env
│   ├── anthropic.env
│   └── openai.env
├── templates/            # Test scripts
│   ├── sanity-test.py
│   ├── smoke-test.py
│   ├── e2e-runner.py
│   └── benchmark.py
├── scripts/              # Shell automation
│   ├── run-tests.sh
│   ├── setup-docker.sh
│   ├── switch-provider.sh
│   └── health-check.sh
└── lib/                  # Python library
    ├── provider_detector.py
    ├── test_runner.py
    ├── config_manager.py
    └── report_generator.py
```

## Service Availability Matrix

| Test Category | Anthropic | Docker | Neo4j | Redis | ChromaDB |
|---------------|-----------|--------|-------|-------|----------|
| Unit (gap modules) | Mock | No | No | No | No |
| Unit (literature) | Mock | No | No | No | No |
| Unit (knowledge) | Mock | No | Yes | No | Yes |
| Unit (execution) | No | Yes | No | No | No |
| Integration | Real/Mock | No | Mock | Mock | Mock |
| E2E | Real | Yes | Optional | Optional | Optional |

## Known Issues & Limitations

1. **arxiv package incompatibility**: Fails on Python 3.11+ due to `sgmllib3k` dependency. Literature search features limited.

2. **Docker requirement**: Gap 4 execution environment requires Docker. Without it, code execution uses mock/direct implementations.

3. **Database model issues**: Some tests skip due to "Hypothesis model ID missing autoincrement=True" - model definition issue.

4. **Complex agent setup**: Some agents (ExperimentDesigner, DataAnalyst) require complex object initialization.

5. **API mismatches**: Some integration tests have API mismatches with current implementation.

6. **No R support**: Paper references R packages; implementation is Python-only.

7. **Single-user**: No multi-tenancy or user isolation.

## Troubleshooting

### Ollama Not Responding
```bash
# Check if running
curl http://localhost:11434/api/tags

# Start service
ollama serve

# Check logs
journalctl -u ollama -f
```

### Docker Issues
```bash
# Check Docker daemon
docker info

# Start Docker
sudo systemctl start docker

# Rebuild sandbox if corrupted
docker rmi kosmos-sandbox:latest
./scripts/setup-docker.sh
```

### Tests Timing Out
```bash
# Increase timeout
pytest tests/e2e/ -v --timeout=900

# Or run with reasoning model (slower but more reliable)
./scripts/run-tests.sh e2e local-reasoning
```

### API Key Issues
```bash
# Verify key is set
echo $ANTHROPIC_API_KEY

# Check .env file
cat .env | grep API_KEY

# Re-source config
source .claude/skills/kosmos-e2e-testing/configs/anthropic.env
```

### Python 3.11+ Package Issues
```bash
# If arxiv package fails
# Option 1: Use mock for literature search
export MOCK_LITERATURE_SEARCH=true

# Option 2: Install alternative client
pip install arxiv-python

# Option 3: Pin Python to 3.10
pyenv install 3.10.12
pyenv local 3.10.12
```

### Generate Dependency Report
```bash
# Generate E2E_TESTING_DEPENDENCY_REPORT.md
python -c "from lib.report_generator import generate_dependency_report; generate_dependency_report()"
```

## See Also

- `CHEATSHEET.md` - Quick command reference
- `reference.md` - Technical API documentation
- `examples.md` - Detailed usage examples
- `~/.claude/skills/local-llm/` - Local model management
- `E2E_TESTING_GUIDE.md` - General E2E testing guide