# subdomain-discover

> Discover subdomains using DNS bruteforce, Certificate Transparency logs, and sitemap parsing. Use when you need to map a brand's digital footprint, perform security reconnaissance, or prepare for crawler policy analysis.

- Author: seanbetts
- Repository: seanbetts/agent-smith
- Version: 20260206215413
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/seanbetts/agent-smith
- Web: https://mule.run/skillshub/@@seanbetts/agent-smith~subdomain-discover:20260206215413

---

---
name: subdomain-discover
description: Discover subdomains using DNS bruteforce, Certificate Transparency logs, and sitemap parsing. Use when you need to map a brand's digital footprint, perform security reconnaissance, or prepare for crawler policy analysis.
---

# subdomain-discover

Discover subdomains for a target domain using multiple reconnaissance techniques.

## Description

Performs comprehensive subdomain discovery using DNS bruteforce, Certificate Transparency logs, and sitemap parsing. Includes intelligent filtering to remove internal infrastructure domains and redirect-only subdomains, focusing on consumer-facing and public domains.

## When to Use

- Map the digital footprint of a brand or organization
- Discover public-facing web properties before analysis
- Security reconnaissance and infrastructure mapping
- Input for robots.txt/crawler policy analysis
- DNS management and subdomain inventory

## Requirements

- **dnspython** - DNS resolution (installed via pyproject.toml)
- **aiohttp** - Async HTTP requests (installed via pyproject.toml)
- **lxml** - XML parsing for sitemaps (installed via pyproject.toml)

## Scripts

### discover_subdomains.py
Discovers subdomains using multiple techniques with intelligent filtering.

```bash
python discover_subdomains.py DOMAIN [--wordlist FILE] [--timeout SECONDS] [--dns-timeout SECONDS] [--json] [--verbose]
```

**Arguments**:
- `DOMAIN`: Target domain to discover subdomains for (required)

**Options**:
- `--wordlist FILE`: Custom wordlist for DNS bruteforce (default: built-in list)
- `--timeout SECONDS`: HTTP request timeout (default: 10.0)
- `--dns-timeout SECONDS`: DNS resolution timeout (default: 5.0)
- `--json`: Output results in JSON format
- `--verbose`: Show detailed progress information
- `--no-filter`: Skip filtering internal/redirect domains (show all results)

**Discovery Techniques**:

1. **DNS Bruteforce**
   - Tests common subdomain names against DNS
   - Default wordlist includes 100+ high-value subdomains
   - Custom wordlist support for targeted discovery
   - Covers: www, api, docs, support, blog, shop, developer, etc.

2. **Certificate Transparency Logs**
   - Queries crt.sh and certspotter.com APIs
   - Discovers subdomains from SSL/TLS certificates
   - Often reveals historical and internal domains
   - Timeout: 30 seconds (CT APIs can be slow)

3. **Sitemap Parsing**
   - Extracts URLs from sitemap.xml and sitemap indexes
   - Parses robots.txt for sitemap references
   - Recursively processes sitemap indexes
   - Discovers cross-domain references

**Intelligent Filtering**:

The script applies multiple filtering passes to focus on public-facing domains:

- **Source Confidence Filtering**: Multi-source validation
  - Domains found by multiple methods = highest confidence
  - DNS bruteforce hits = high confidence (specifically tested)
  - Sitemap discoveries = medium-high confidence
  - CT-only domains = requires additional validation

- **Internal Domain Filtering**: Removes infrastructure
  - Development/testing: *-dev, *-test, *-staging, *-qa
  - Server identifiers: srv01, node123, us-east1
  - Cryptic codes: Short consonant-heavy subdomains
  - Wildcard and invalid domains

- **Consumer-Facing Prioritization**: Keeps valuable domains
  - E-commerce: store, shop, cart, checkout
  - Support: help, docs, support, knowledge
  - Content: blog, news, media, downloads
  - Developer: api, developer, console, docs
  - Business: partners, business, enterprise

- **Redirect Filtering**: Removes redirect-only domains
  - Tests each subdomain for HTTP 301/302/303/307/308
  - Filters out domains that don't serve actual content
  - Keeps main domain variants (www, non-www)

**Examples**:

```bash
# Basic discovery
python discover_subdomains.py example.com

# JSON output for piping to other tools
python discover_subdomains.py example.com --json

# Custom wordlist with verbose output
python discover_subdomains.py example.com --wordlist custom_subdomains.txt --verbose

# Skip filtering to see all discovered domains
python discover_subdomains.py example.com --no-filter --json

# Longer timeouts for slow networks
python discover_subdomains.py example.com --timeout 20 --dns-timeout 10
```

## Output Format

**Human-readable output** shows discovery progress and results:

```
Starting subdomain discovery for: example.com
DNS Bruteforce found: 15 domains
Certificate Transparency found: 42 domains
Sitemap Parsing found: 8 domains
Checking for redirect subdomains...
Filtered out 12 redirect subdomains

Found 23 domains/subdomains:
  - example.com
  - www.example.com
  - api.example.com
  - docs.example.com
  - support.example.com
  ...
```

**JSON output** provides structured data:

```json
{
  "success": true,
  "data": {
    "domain": "example.com",
    "discovered_count": 23,
    "domains": [
      "example.com",
      "www.example.com",
      "api.example.com",
      "docs.example.com",
      "support.example.com"
    ],
    "discovery_stats": {
      "dns_bruteforce": 15,
      "certificate_transparency": 42,
      "sitemap_parsing": 8,
      "filtered_redirects": 12,
      "final_count": 23
    },
    "duration_seconds": 45
  }
}
```

## Output Locations

Results are written to stdout (console or JSON). To save results:

```bash
# Save human-readable output
python discover_subdomains.py example.com > subdomains.txt

# Save JSON output
python discover_subdomains.py example.com --json > subdomains.json

# Pipe to other tools
python discover_subdomains.py example.com --json | jq '.data.domains[]'
```

## Performance Notes

- DNS bruteforce: ~100 domains/second (depends on DNS server)
- Certificate Transparency: 10-30 seconds (API dependent)
- Sitemap parsing: 2-5 seconds per sitemap
- Redirect checking: ~10 domains/second (batched requests)
- Total time: 30-90 seconds for typical domains

## Error Handling

**DNS Errors**:
- NXDOMAIN: Subdomain doesn't exist (normal, filtered out)
- Timeout: Increase `--dns-timeout` for slow DNS servers
- Rate limiting: DNS servers may rate limit, results may be incomplete

**HTTP Errors**:
- Connection failures: Domain may be unreachable (kept in results)
- Timeouts: Increase `--timeout` for slow servers
- SSL errors: Domain may have certificate issues (kept in results)

**API Errors**:
- CT API failures: Script continues with other methods
- Sitemap parsing errors: Malformed XML ignored

**Error Output**:
```json
{
  "success": false,
  "error": {
    "type": "DNSError|NetworkError|ValueError",
    "message": "Detailed error message",
    "suggestions": [
      "Check domain spelling",
      "Verify DNS server connectivity",
      "Try increasing timeout values"
    ]
  }
}
```

## Advanced Usage

### Custom Wordlist

Create a custom wordlist for targeted subdomain discovery:

```bash
# wordlist.txt
admin
staging
beta
test
dev
internal
vpn
```

```bash
python discover_subdomains.py example.com --wordlist wordlist.txt
```

### Pipeline Integration

Use with other sideBar skills:

```bash
# Discover subdomains, then analyze robots.txt policies
python discover_subdomains.py example.com --json > domains.json
python ../web-crawler-policy/scripts/analyze_policies.py \
  --domains $(jq -r '.data.domains[]' domains.json) \
  --output analysis.csv
```

### Filtering Control

```bash
# See all discovered domains (no filtering)
python discover_subdomains.py example.com --no-filter

# Compare filtered vs unfiltered
python discover_subdomains.py example.com --json > filtered.json
python discover_subdomains.py example.com --no-filter --json > unfiltered.json
diff <(jq '.data.domains' filtered.json) <(jq '.data.domains' unfiltered.json)
```

## Tips

- Use `--verbose` to understand which discovery methods are working
- Custom wordlists should focus on your target industry/organization
- CT logs are great for historical subdomains but may include outdated ones
- Redirect filtering removes many internal redirects but may filter legitimate domains
- For large organizations, expect 20-50 public-facing subdomains
- DNS bruteforce is most reliable but limited to tested names
- Sitemap parsing discovers cross-domain references missed by other methods

## Limitations

- DNS bruteforce only finds subdomains in wordlist
- CT logs may include expired or historical subdomains
- Sitemap parsing requires public sitemaps
- Cannot discover subdomains behind authentication
- Private/internal networks not accessible
- Rate limiting may affect completeness
- Wildcard DNS can cause false positives (filtered by default)

## Related Skills

- **web-crawler-policy** - Analyze robots.txt policies on discovered domains
- **dns-lookup** - Detailed DNS record analysis (future skill)
- **ssl-check** - SSL certificate analysis (future skill)