# wordlift-kg-builder

> Build and maintain Knowledge Graphs from webpages using WordLift APIs. Use when importing pages from sitemaps via WordLift Sitemap Import API, creating product catalogs with GS1 Digital Link identifiers (GTIN-based), generating slug-based entity IDs for organizations/people/webpages, creating JSON-LD markup programmatically, or performing daily sync workflows with batch operations and PATCH updates. Handles entity lifecycle management with proper JSON-LD structure.

- Author: cyberandy
- Repository: wordlift/wordlift-gemini-cli-extension
- Version: 20260112192805
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/wordlift/wordlift-gemini-cli-extension
- Web: https://mule.run/skillshub/@@wordlift/wordlift-gemini-cli-extension~wordlift-kg-builder:20260112192805

---

---
name: wordlift-kg-builder
description: Build and maintain Knowledge Graphs from webpages using WordLift APIs. Use when importing pages from sitemaps via WordLift Sitemap Import API, creating product catalogs with GS1 Digital Link identifiers (GTIN-based), generating slug-based entity IDs for organizations/people/webpages, creating JSON-LD markup programmatically, or performing daily sync workflows with batch operations and PATCH updates. Handles entity lifecycle management with proper JSON-LD structure.
---

# WordLift Knowledge Graph Builder

Build and maintain Knowledge Graphs from webpages using WordLift's Sitemap Import API, with focus on product catalogs and e-commerce data.

## Core Capabilities

1. **Sitemap Import API**: Direct import of URLs from sitemap.xml or URL lists
2. **Template Configuration**: Interactive workflow to validate markup templates before bulk imports
3. **GS1 Digital Link for Products**: `{dataset_uri}/01/{GTIN-14}` identifiers
4. **Slug-based IDs for Other Entities**: `{dataset_uri}/{entity_type}/{slug}` format (⚠️ **MUST use recognized patterns**)
5. **Entity Reuse via GraphQL**: Prevents duplicates by checking for existing entities (Organizations, Brands, People)
6. **SHACL Validation**: Ensures data quality before upload with built-in shapes for Products, Organizations, WebPages, etc.
7. **JSON-LD Creation**: Programmatic creation of schema.org markup with EntityBuilder
8. **Entity Upgrading**: Post-import type changes and property updates using Fetch-Modify-Update pattern
9. **Entity Verification**: Verify entities are actually persisted (not just 200 OK)
10. **Daily Sync Workflows**: Full replacement or incremental PATCH updates
11. **Batch Operations**: Efficient bulk create/update operations

## Quick Start

### 1. Import Pages from Sitemap

Use the Sitemap Import API to jumpstart your Knowledge Graph:

**API Endpoint:** `POST https://api.wordlift.io/sitemap-imports`

```python
from scripts.wordlift_client import WordLiftClient

client = WordLiftClient(api_key)

# Import from sitemap.xml
results = client.import_from_sitemap("https://example.com/sitemap.xml")
print(f"Imported {len(results)} pages")

# Or import specific URLs
results = client.import_from_urls([
    "https://example.com/page1.html",
    "https://example.com/page2.html"
])
```

The API returns NDJSON (newline-delimited JSON) with details about each imported page.

**Important:** The endpoint is `/sitemap-imports` (plural), not `/sitemap/import` or `/sitemap-import`.

### 2. Query Imported Data

After import, query the data via GraphQL:

```python
result = client.graphql_query("""
  query {
    entities(page: 0, rows: 1000) {
      id: iri
      headline: string(name: "schema:headline")
      text: string(name: "schema:text")
      url: string(name: "schema:url")
    }
  }
""")
```

### 3. Enhance with Proper Product Entities

For e-commerce, create products with GS1 Digital Link IDs:

```python
from scripts.entity_builder import EntityBuilder

builder = EntityBuilder("https://data.wordlift.io/wl123")

product = builder.build_product({
    'gtin': '12345678901231',
    'name': 'Product Name',
    'brand': 'Brand Name',
    'price': '29.99',
    'currency': 'USD'
})

client.create_or_update_entity(product)
```

## Entity ID Generation

### Products (GS1 Digital Link)

Products use GS1 Digital Link format with GTIN-14:

```python
from scripts.id_generator import generate_product_id

# Basic product
product_id = generate_product_id("https://data.wordlift.io/wl123", "12345678901231")
# Result: https://data.wordlift.io/wl123/01/12345678901231

# With serial number
product_id = generate_product_id("https://data.wordlift.io/wl123", "12345678901231", serial="SN123")
# Result: https://data.wordlift.io/wl123/01/12345678901231/21/SN123
```

GTINs are automatically:
- Normalized to 14 digits (left-padded with zeros)
- Validated using GS1 check digit algorithm

### Other Entities (Slug-based)

Non-product entities use descriptive slug-based IDs:

```python
from scripts.id_generator import generate_entity_id

# Organization
org_id = generate_entity_id("https://data.wordlift.io/wl123", "organization", "Acme Corporation")
# Result: https://data.wordlift.io/wl123/organization/acme-corporation

# Person
person_id = generate_entity_id("https://data.wordlift.io/wl123", "person", "John Doe")
# Result: https://data.wordlift.io/wl123/person/john-doe

# WebPage (slug from URL path or title)
page_id = generate_entity_id("https://data.wordlift.io/wl123", "webpage", "About Us")
# Result: https://data.wordlift.io/wl123/webpage/about-us

# WebPage homepage
homepage_id = generate_entity_id("https://data.wordlift.io/wl123", "webpage", "homepage")
# Result: https://data.wordlift.io/wl123/webpage/homepage

# State-specific service
service_id = generate_entity_id("https://data.wordlift.io/wl123", "service", "debt-consolidation-alaska")
# Result: https://data.wordlift.io/wl123/service/debt-consolidation-alaska
```

Slug generation:
- Converts to lowercase
- Replaces spaces with hyphens
- Removes non-alphanumeric characters
- Handles consecutive hyphens

**Important:** The page URL goes in the `url` property, while the @id uses the slug-based pattern within your dataset URI.


## ⚠️ IRI Pattern Requirements (CRITICAL)

**CRITICAL**: WordLift requires specific IRI path patterns. The API will **return 200 OK for invalid patterns** but **entities will NOT be persisted** (silent failure).

### Valid Patterns Only

| Entity Type | Required Pattern | Example |
|------------|------------------|---------|
| Products | `/01/{GTIN-14}` | `https://data.wordlift.io/wl123/01/12345678901234` |
| Organizations | `/organization/{slug}` | `https://data.wordlift.io/wl123/organization/acme` |
| Places | `/place/{slug}` | `https://data.wordlift.io/wl123/place/italy` |
| People | `/person/{slug}` | `https://data.wordlift.io/wl123/person/john-doe` |
| Destinations | `/destination/{slug}` | `https://data.wordlift.io/wl123/destination/venice` |
| Articles | `/article/{slug}` | `https://data.wordlift.io/wl123/article/news` |

**Invalid patterns** (accepted by API but NOT persisted):
- ❌ `/sejour/country/destination` (auto-generated from sitemap)
- ❌ `/custom/nested/path` (arbitrary nesting)
- ❌ `/mytype/{slug}` (unrecognized entity type)

### Always Verify Entity Persistence

```python
from scripts.entity_verifier import verify_entity_persisted

# After creating entity
is_persisted, message = verify_entity_persisted(entity['@id'], wait_seconds=2)

if not is_persisted:
    print(f"⚠️  CRITICAL: Entity not persisted! Reason: {message}")
    # Check IRI pattern and recreate with valid pattern
```

**The `generate_entity_id()` function now validates patterns and will raise ValueError for invalid patterns.**

See `references/iri-patterns-and-verification.md` for complete guide.

## Creating JSON-LD Entities

### Build Entities Programmatically

Use `EntityBuilder` to create schema.org JSON-LD entities:

```python
from scripts.entity_builder import EntityBuilder

builder = EntityBuilder("https://data.wordlift.io/wl92832")

# Create a Product
product = builder.build_product({
    'gtin': '12345678901231',
    'name': 'Product Name',
    'description': 'Product description',
    'brand': 'Nike',
    'price': '99.99',
    'currency': 'USD',
    'availability': 'InStock',
    'image': 'https://example.com/product.jpg'
})

# Upload to KG
client.create_or_update_entity(product)
```

### Validate Before Upload

Always validate entities before uploading:

```python
from scripts.shacl_validator import SHACLValidator

validator = SHACLValidator()

# Validate
is_valid, errors, warnings = validator.validate(product, strict=True)

if is_valid:
    print("✓ Valid! Safe to upload")
    client.create_or_update_entity(product)
else:
    print(f"✗ Validation errors: {errors}")
```

The validator checks:
- Required fields (@context, @type, @id)
- Entity-specific requirements (Product needs name, gtin14)
- Proper URL formats
- GS1 Digital Link format for products
- Offer structure (price, currency, availability)

### Supported Entity Types

```python
# Organization
org = builder.build_organization({
    'name': 'Company Name',
    'url': 'https://example.com',
    'logo': 'https://example.com/logo.png'
})

# Person
person = builder.build_person({
    'name': 'John Doe',
    'jobTitle': 'CEO',
    'email': 'john@example.com'
})

# WebPage
webpage = builder.build_webpage({
    'url': 'https://example.com/about',
    'name': 'About Us',
    'description': 'Learn about our company'
})
```

## Entity Reuse (Preventing Duplicates)

### Problem
When creating multiple products or articles, you often reference the same entities:
- **Brands** (e.g., "Nike" across 100 products)
- **Publishers** (e.g., "Acme Corporation" across articles)
- **Authors** (e.g., "John Doe" across blog posts)

Without checking, you'd create duplicates every time, fragmenting your data.

### Solution: EntityReuseManager

The `EntityReuseManager` uses GraphQL queries to check if entities already exist:

```python
from scripts.entity_reuse import EntityReuseManager
from scripts.entity_builder import EntityBuilder

client = WordLiftClient(api_key)
reuse_manager = EntityReuseManager(client, "https://data.wordlift.io/wl123")

# Preload cache for fast lookups
reuse_manager.preload_cache()
# Output:
#   Loaded 45 organizations
#   Loaded 230 brands
#   Loaded 12 people

# Create builder with reuse manager
builder = EntityBuilder(dataset_uri, reuse_manager=reuse_manager)

# Build products - brands are automatically reused
product1 = builder.build_product({'gtin': '12345', 'brand': 'Nike', ...})
# Output: + Creating new brand: Nike

product2 = builder.build_product({'gtin': '67890', 'brand': 'Nike', ...})
# Output: ✓ Reusing existing brand: Nike

# Both products reference the same Nike brand entity!
```

### Supported Entity Types

```python
# Organizations (Publishers)
publisher_iri = reuse_manager.get_or_create_organization({
    'name': 'Acme Corporation',
    'url': 'https://acme.com',
    'logo': 'https://acme.com/logo.png'
})

# People (Authors)
author_iri = reuse_manager.get_or_create_person({
    'name': 'John Doe',
    'jobTitle': 'Senior Writer'
})

# Brands
brand = reuse_manager.get_or_create_brand('Nike')
```

### How It Works

1. **Cache Check** - Fast in-memory lookup
2. **IRI Check** - Query KG for expected IRI via GraphQL
3. **Name Check** - Query KG by name (in case different slug)
4. **Create Only If Not Found** - Avoids duplicates

See `references/entity-reuse-and-validation.md` for complete documentation.

## SHACL Validation (Data Quality)

### Problem
Invalid data breaks your Knowledge Graph:
- Missing required fields (name, GTIN)
- Invalid formats (wrong GTIN length, bad URLs)
- Incorrect structure (missing Offer in Product)

### Solution: SHACLValidator

Built-in SHACL shapes validate entities before upload:

```python
from scripts.shacl_validator import SHACLValidator

validator = SHACLValidator()

# Validate single entity
is_valid, errors, warnings = validator.validate(product)

if is_valid:
    print("✓ Valid! Safe to upload")
    client.create_or_update_entity(product)
else:
    print(f"✗ Invalid: {errors}")
```

### Built-in Shapes

**Product:**
- Required: `@id`, `@type`, `name`, `gtin14`
- Recommended: `description`, `brand`, `offers`, `image`
- Validates: GTIN format, GS1 Digital Link IRI, Offer structure

**Organization:**
- Required: `@id`, `@type`, `name`
- Recommended: `url`, `logo`, `description`

**WebPage:**
- Required: `@id`, `@type`, `url`, `name`
- Recommended: `description`, `datePublished`

**Offer:**
- Required: `@type`, `price`, `priceCurrency`
- Validates: Currency code (3 chars), availability URL format

### Batch Validation

```python
validator = SHACLValidator()

results = validator.validate_batch(entities)

print(f"Valid: {results['valid']}")
print(f"Invalid: {results['invalid']}")

# Get detailed report
report = validator.get_validation_report(results)
print(report)

# Filter valid entities
from scripts.shacl_validator import validate_before_upload

valid_entities, invalid_entities = validate_before_upload(entities)

# Upload only valid
client.batch_create_or_update(valid_entities)
```

### Validation Modes

**Normal Mode** (warnings for recommended fields):
```python
validator.validate(entity, strict=False)
```

**Strict Mode** (errors for recommended fields):
```python
validator.validate(entity, strict=True)
```

See `references/entity-reuse-and-validation.md` for complete documentation.

## Integration in Sync Workflows

Both features are enabled by default:

```python
from scripts.kg_sync import KGSyncOrchestrator

orchestrator = KGSyncOrchestrator(
    api_key=api_key,
    dataset_uri="https://data.wordlift.io/wl123",
    enable_validation=True,  # SHACL validation
    enable_reuse=True        # Entity reuse
)

# During sync:
# 1. Preloads entity cache (organizations, brands, people)
# 2. Reuses existing entities automatically
# 3. Validates all entities with SHACL shapes
# 4. Uploads only valid entities

stats = orchestrator.sync_products(products_data)
```

**Command-line:**
```bash
# With validation and reuse (default)
python scripts/kg_sync.py \
  --api-key YOUR_KEY \
  --dataset-uri https://data.wordlift.io/wl123 \
  --input products.json

# Disable validation (not recommended)
python scripts/kg_sync.py \
  --input products.json \
  --no-validation

# Disable entity reuse (not recommended)
python scripts/kg_sync.py \
  --input products.json \
  --no-reuse
```


### Product Entity

```python
from scripts.entity_builder import EntityBuilder

builder = EntityBuilder("https://data.wordlift.io/wl123")

product = builder.build_product({
    'gtin': '12345678901231',
    'name': 'Product Name',
    'description': 'Product description',
    'brand': 'Brand Name',
    'price': '29.99',
    'currency': 'USD',
    'sku': 'SKU-001',
    'image': 'https://example.com/image.jpg',
    'availability': 'InStock'
})
```

Result is proper JSON-LD with:
- GS1 Digital Link @id
- schema.org vocabulary
- Validated structure

### Organization Entity

```python
org = builder.build_organization({
    'name': 'Acme Corporation',
    'url': 'https://acme.com',
    'logo': 'https://acme.com/logo.png',
    'email': 'info@acme.com'
})
# ID: https://data.wordlift.io/wl123/organization/acme-corporation
```

### Web Page Entity

```python
webpage = builder.build_webpage({
    'url': 'https://example.com/about',
    'name': 'About Us',
    'description': 'Learn about our company',
    'datePublished': '2024-01-01'
})
# @id: https://data.wordlift.io/wl123/webpage/about-us
# url: https://example.com/about (in the url property)

# With custom slug
webpage = builder.build_webpage({
    'url': 'https://example.com/contact',
    'name': 'Contact Us',
    'slug': 'contact'  # Custom slug
})
# @id: https://data.wordlift.io/wl123/webpage/contact

# Homepage
homepage = builder.build_webpage({
    'url': 'https://example.com/',
    'name': 'Homepage',
    'slug': 'homepage'
})
# @id: https://data.wordlift.io/wl123/webpage/homepage
```

The @id uses a slug-based pattern within your dataset URI, while the actual page URL is stored in the `url` property.

## Syncing to WordLift

### Batch Create/Update

```python
from scripts.wordlift_client import WordLiftClient
from scripts.entity_builder import EntityBuilder

client = WordLiftClient(api_key)
builder = EntityBuilder("https://data.wordlift.io/wl123")

entities = [
    builder.build_product({...}),
    builder.build_product({...}),
    builder.build_organization({...})
]

# Batch operation (upsert - creates or updates)
client.batch_create_or_update(entities)
```

### Incremental Updates (PATCH)

For daily syncs where only some fields change:

```python
# Patch specific fields only
client.patch_entity(
    entity_id="https://data.wordlift.io/wl123/01/12345678901231",
    patches=[
        {"op": "replace", "path": "/https://schema.org/offers/https://schema.org/price", "value": "34.99"},
        {"op": "add", "path": "/https://schema.org/image", "value": "https://example.com/new.jpg"}
    ]
)
```

## Querying the KG

### Check Existing Products

```python
# Get all products
products = client.get_products(limit=100)

# Get all existing GTINs
existing_gtins = client.get_all_product_gtins()

# Check if entity exists
exists = client.entity_exists("https://data.wordlift.io/wl123/01/12345678901231")
```

### Custom GraphQL Queries

See `references/graphql_queries.md` for common patterns.

```python
# Get imported pages with SEO keywords
result = client.graphql_query("""
  query {
    entities(page: 0, rows: 100) {
      id: iri
      url: string(name: "schema:url")
      seoKeywords: strings(name: "seovoc:seoKeywords")
      topKeywords: topN(
        name: "seovoc:seoKeywords"
        sort: { field: "seovoc:3MonthsImpressions", direction: DESC }
        limit: 3
      ) {
        name: string(name: "seovoc:name")
        impressions: int(name: "seovoc:3MonthsImpressions")
      }
    }
  }
""")
```

## Workflow Patterns

### Post-Import Entity Upgrading

**After importing pages, upgrade entity types and add properties:**

```python
from scripts.entity_upgrader import upgrade_entity, upgrade_batch
from scripts.wordlift_client import WordLiftClient

client = WordLiftClient(api_key)

# Single entity upgrade
upgrade_entity(
    client,
    "https://data.wordlift.io/wl92832/webpage/my-post",
    new_type="Article",
    new_props={
        "author": {
            "@type": "Person",
            "@id": "https://data.wordlift.io/wl92832/person/john-doe",
            "name": "John Doe"
        }
    }
)

# Batch upgrade: WebPage → Article
result = client.graphql_query("""
  query {
    entities(query: { typeConstraint: { in: ["http://schema.org/WebPage"] } }) {
      iri
    }
  }
""")

iris = [e['iri'] for e in result['entities']]
stats = upgrade_batch(client, iris, new_type="Article")
```

**Why Entity Upgrader?**
- ✅ Changes entity types (PATCH can't do this)
- ✅ Preserves existing properties automatically
- ✅ Handles complex nested objects
- ✅ Validates complete entity before upload

**Command-line:**
```bash
# Single entity
python scripts/entity_upgrader.py <IRI> --type Article

# Batch from file
python scripts/entity_upgrader.py --batch-file iris.txt --type Article --props '{...}'
```

See `references/entity-upgrading.md` for complete guide.

### Template Configuration (Before Bulk Import)

**CRITICAL**: Before importing hundreds of pages, configure and validate your markup template using samples.

```python
from scripts.template_configurator import interactive_template_configuration
from scripts.wordlift_client import WordLiftClient

# Select 2-3 representative sample pages
sample_urls = [
    "https://yoursite.com/blog/post-1",
    "https://yoursite.com/blog/post-2",
    "https://yoursite.com/about"
]

client = WordLiftClient(api_key)

# Run interactive configuration
template_config = interactive_template_configuration(
    client,
    dataset_uri,
    sample_urls
)

# Review proposed markup:
# - Entity type (BlogPosting, Article, WebPage)
# - Required properties (author, publisher, datePublished)
# - Metadata extraction (headline, description, image)
# - ID pattern (slug generation)

# User approves template → Proceed with bulk import
```

**Why this is critical:**
- ❌ Without: Import 700 pages with wrong @type, have to delete and re-import
- ✅ With: Get it right the first time, validate on samples before bulk operation

See `references/template-configuration.md` for complete workflow guide.

### Initial Import from Sitemap

1. **Import pages** using Sitemap Import API
2. **Query imported data** to see what was created
3. **Enhance with products** by creating proper Product entities with GS1 IDs
4. **Validate** entity counts and structure

```python
# Step 1: Import
results = client.import_from_sitemap("https://example.com/sitemap.xml")

# Step 2: Query
entities = client.graphql_query("""{ entities(rows: 10) { iri url: string(name: "schema:url") } }""")

# Step 3: Create products
for product_data in products_list:
    product = builder.build_product(product_data)
    client.create_or_update_entity(product)
```

### Daily Sync Strategy

1. **Extract** product data from your source
2. **Query** existing products to identify what's new/changed
3. **Sync** using orchestrator:
   - New products → batch create
   - Existing products → batch update or PATCH
4. **Validate** sync completed successfully

See `references/workflows.md` for detailed workflow patterns.
**For automated scheduling**, see `references/scheduling.md` for cron, GitHub Actions, Docker, and cloud function setups.

```bash
python scripts/kg_sync.py \
  --api-key YOUR_API_KEY \
  --dataset-uri https://data.wordlift.io/wl123 \
  --input products.json \
  --batch-size 50
```

For incremental updates:
```bash
python scripts/kg_sync.py \
  --api-key YOUR_API_KEY \
  --dataset-uri https://data.wordlift.io/wl123 \
  --input products.json \
  --incremental
```

### Handling Large Catalogs

For catalogs >10,000 products:
- Use batch_size=25-50 to avoid timeouts
- Use incremental PATCH for daily updates
- Schedule syncs during off-peak hours
- Monitor import progress with NDJSON streaming

## Script Reference

### `entity_verifier.py`
Verify entity persistence (prevent silent failures):
- `verify_entity_persisted()` - Check if entity is dereferenceable (2 seconds)
- `verify_via_graphql()` - Check GraphQL indexing (10+ seconds)
- `verify_entity_complete()` - Complete verification suite
- `check_iri_pattern()` - Validate IRI follows WordLift patterns
- **CRITICAL**: Always verify after creation - API returns 200 OK even for invalid IRIs

### `entity_upgrader.py`
Upgrade existing entities (Fetch-Modify-Update pattern):
- Change entity types (WebPage → Article)
- Add complex nested properties (author, publisher)
- Preserve existing data automatically
- Batch upgrade from file
- Safer than PATCH for structural changes

### `template_configurator.py`
Configure markup templates before bulk imports:
- `TemplateConfigurator.analyze_sample_pages()` - Analyze sample pages
- `TemplateConfigurator.display_configuration_summary()` - Show analysis summary
- `TemplateConfigurator.generate_configuration_questions()` - Generate config questions
- `TemplateConfigurator.save_template()` - Save approved template
- `interactive_template_configuration()` - Full interactive workflow

### `id_generator.py`
Generate entity IDs:
- `generate_product_id()` - GS1 Digital Link for products
- `generate_entity_id()` - Slug-based for other entities
- `generate_slug()` - Convert text to URL-friendly slug
- `normalize_gtin()` - Convert any GTIN to GTIN-14
- `validate_gtin_check_digit()` - Validate GTIN

### `entity_builder.py`
Build JSON-LD entities:
- `EntityBuilder.build_product()` - Create Product entity
- `EntityBuilder.build_organization()` - Create Organization
- `EntityBuilder.build_webpage()` - Create WebPage
- `create_product_from_scraped_data()` - Auto-map scraped fields

### `entity_reuse.py`
Prevent duplicate entities:
- `EntityReuseManager.get_or_create_organization()` - Reuse organizations
- `EntityReuseManager.get_or_create_person()` - Reuse people
- `EntityReuseManager.get_or_create_brand()` - Reuse brands
- `EntityReuseManager.preload_cache()` - Load existing entities for fast lookup
- `EntityReuseManager.get_existing_entities_by_type()` - Query entities by type

### `shacl_validator.py`
Validate data quality:
- `SHACLValidator.validate()` - Validate single entity
- `SHACLValidator.validate_batch()` - Validate multiple entities
- `SHACLValidator.get_validation_report()` - Generate report
- `validate_before_upload()` - Filter valid/invalid entities

### `wordlift_client.py`
Interact with WordLift APIs:
- `import_from_sitemap()` - Import from sitemap.xml
- `import_from_urls()` - Import specific URLs
- `graphql_query()` - Execute GraphQL queries
- `create_or_update_entity()` - Upsert single entity
- `batch_create_or_update()` - Batch operations
- `patch_entity()` - Incremental updates
- `get_products()`, `get_all_product_gtins()` - Query helpers

### `markup_validator.py`
Validate JSON-LD markup:
- `MarkupValidator.validate()` - Validate single markup
- `MarkupValidator.validate_batch()` - Validate multiple markups
- `validate_json_ld_string()` - Validate JSON-LD from string

### `kg_sync.py`
Orchestrate sync workflows:
- `KGSyncOrchestrator.sync_products()` - Full sync
- `KGSyncOrchestrator.incremental_update()` - PATCH-based sync
- Command-line interface for daily automation
- Flags: `--no-validation`, `--no-reuse` to disable features

### `extract_products.py`
Extract products from data sources:
- `extract_from_database()` - PostgreSQL example
- `extract_from_csv()` - CSV file parsing
- `extract_from_json()` - JSON file parsing
- `extract_from_api()` - REST API example
- `extract_from_shopify()` - Shopify integration
- `extract_from_woocommerce()` - WooCommerce integration

## Dataset URI Structure

WordLift uses account-specific base URIs:

**Format**: `https://data.wordlift.io/wl{account_id}/`

**Examples**:
- Staging: `https://data.wordlift.io/wl1505540/`
- Production: `https://data.wordlift.io/wl1506865/`

All entity IDs are prefixed with this base URI.

## Entity ID Patterns

### Products
`{dataset_uri}/01/{GTIN-14}[/21/{serial}][/10/{lot}]`

### Organizations
`{dataset_uri}/organization/{slug}`

### People
`{dataset_uri}/person/{slug}`

### Web Pages
`{dataset_uri}/webpage/{slug}`

Note: The @id uses this pattern, while the actual page URL is stored in the `url` property.

### Services
`{dataset_uri}/service/{slug}`

### States/Locations
`{dataset_uri}/state/{slug}`

## Error Handling

### Sitemap Import Errors

```python
try:
    results = client.import_from_sitemap(sitemap_url)
    print(f"Successfully imported {len(results)} pages")
except requests.HTTPError as e:
    print(f"Import failed: {e.response.status_code}")
    print(f"Details: {e.response.text}")
```

### Markup Validation Errors

```python
is_valid, errors, markup = validate_markup_from_agent(agent_output)

if not is_valid:
    print("Validation errors:")
    for error in errors:
        print(f"  - {error}")
    # Fix errors before uploading
```

### Invalid GTIN

```python
from scripts.id_generator import normalize_gtin

try:
    gtin_14 = normalize_gtin(user_input)
except ValueError as e:
    print(f"Invalid GTIN: {e}")
```

## Best Practices

1. **Dataset URI**: Use your WordLift account URI (`https://data.wordlift.io/wl{account_id}/`)
2. **IRI Patterns**: ONLY use recognized patterns (organization, place, person, destination, article, etc.)
3. **Always Verify**: Verify entity persistence after creation (API returns 200 OK even for invalid IRIs)
4. **Template Configuration**: ALWAYS configure and validate markup template on sample pages before bulk imports
5. **Entity Reuse**: Always enable entity reuse to prevent duplicate Organizations, Brands, and People
6. **Preload Cache**: Call `reuse_manager.preload_cache()` at start for performance
7. **SHACL Validation**: Always validate entities before upload (enabled by default)
8. **GTIN Quality**: Validate GTINs before sync to prevent ID conflicts
9. **Slug Uniqueness**: Ensure natural keys generate unique slugs
10. **Batch Sizing**: Start with batch_size=50, adjust based on success rate
11. **Validation Mode**: Use strict mode in production for high-quality data
12. **Incremental Syncs**: Use PATCH for daily updates when <20% of products change
13. **Structural Changes**: Use Entity Upgrader (not PATCH) for type changes and complex updates
14. **Monitoring**: Track sync statistics, reuse rates, and validation results
15. **Query After Import**: Verify entity counts after sitemap import
16. **Test Before Bulk**: Import 10-20 pages first to verify configuration
17. **Custom Data**: Use `additionalProperty` instead of custom namespaces

## Common Issues

**Q: Why use Sitemap Import API instead of scraping?**
A: The Sitemap Import API is the recommended way to jumpstart a Knowledge Graph. It:
- Handles pagination and large sitemaps
- Returns structured NDJSON responses
- Automatically extracts structured data from pages
- Respects robots.txt and rate limits

**Q: How do slug-based IDs work?**
A: Slugs are URL-friendly versions of natural keys:
- "Acme Corporation" → "acme-corporation"
- "New York" → "new-york"
- "John Doe" → "john-doe"

This makes IDs human-readable and predictable.

**Q: When to use GS1 Digital Link vs slug-based IDs?**
A: Use GS1 Digital Link ONLY for products with GTINs. Use slug-based IDs for:
- Organizations
- People
- Locations
- Services
- Other non-product entities

**Q: Why is entity reuse important?**
A: Without entity reuse, you create duplicate entities:
- Brand "Nike" created 100 times (once per product)
- Publisher "Acme Corp" created 50 times (once per article)
- Author "John Doe" created 30 times (once per blog post)

Entity reuse via GraphQL ensures you reference the same entity IRI, maintaining data integrity.

**Q: How do I know if entities are being reused?**
A: Check the sync output:
```
✓ Reusing existing brand: Nike
+ Creating new brand: Adidas
✓ Reusing existing organization: Acme Corp
```

Also track reuse statistics in your logs.

**Q: What happens if validation fails?**
A: Invalid entities are filtered out and not uploaded. Check the validation report:
```
✗ Product missing required field: gtin14
✗ Offer: Missing required field: priceCurrency
```

Fix the errors and re-run the sync.

**Q: How do I create JSON-LD markup?**
A: Use the `EntityBuilder` to create entities programmatically:
```python
from scripts.entity_builder import EntityBuilder
builder = EntityBuilder(dataset_uri)
product = builder.build_product({...})
```
Always validate with `SHACLValidator` before uploading.

**Q: Why does the API return 200 OK but my entity isn't persisted?**
A: WordLift requires specific IRI path patterns. The API accepts invalid patterns (returns 200 OK) but doesn't persist them. Always:
1. Use recognized patterns (organization, place, person, destination, etc.)
2. Verify with `verify_entity_persisted()` after creation
3. Check `.html` and `.json` endpoints are accessible

See `references/iri-patterns-and-verification.md` for details.

**Q: When should I use Entity Upgrader vs PATCH?**
A: Use Entity Upgrader (`entity_upgrader.py`) for:
- Changing entity types (WebPage → Article)
- Adding complex nested properties (author, publisher)
- Post-import cleanup/enrichment

Use PATCH (`patch_entity()`) for:
- Daily price/availability updates
- Simple field changes
- Large catalogs with <20% daily changes

**Q: What if sitemap has >1000 URLs?**
A: The Sitemap Import API handles large sitemaps automatically. Monitor the NDJSON response to track progress.

## Dependencies

```bash
pip install requests --break-system-packages
```

No additional dependencies needed.