# dataset-management

> Use when creating datasets, uploading files, managing schemas, or configuring dataset connections

- Author: dym-ai
- Repository: dym-ai/dataiku-chat-control
- Version: 20260206165337
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/dym-ai/dataiku-chat-control
- Web: https://mule.run/skillshub/@@dym-ai/dataiku-chat-control~dataset-management:20260206165337

---

---
name: dataset-management
description: "Use when creating datasets, uploading files, managing schemas, or configuring dataset connections"
---

# Dataset Management Patterns

Reference patterns for creating and managing Dataiku datasets via the Python API.

## Dataset Types

| Type | Use When | Creation Method |
|------|----------|-----------------|
| **Managed** | Output of recipes, stored in a connection (SQL, HDFS, etc.) | `project.new_managed_dataset(name)` |
| **Uploaded** | Importing local files (CSV, Excel, etc.) | `project.create_dataset(name, "UploadedFiles", ...)` |
| **SQL Table** | Pointing to an existing database table | `project.create_dataset(name, "Snowflake", ...)` |

## Create a Managed Dataset

```python
builder = project.new_managed_dataset("MY_OUTPUT")
builder.with_store_into("connection_name")
ds = builder.create()

# Configure table location (SQL databases)
settings = ds.get_settings()
raw = settings.get_raw()
raw["params"]["schema"] = "MY_SCHEMA"
raw["params"]["table"] = "MY_OUTPUT"
settings.save()
```

## Upload a File

```python
ds = project.create_dataset(
    "my_dataset",
    "UploadedFiles",
    params={"uploadConnection": "filesystem_managed"}
)

# uploaded_add_file expects an open binary file object plus the target filename
with open("path/to/data.csv", "rb") as fp:
    ds.uploaded_add_file(fp, "data.csv")

# Auto-detect format and schema from file contents.
# autodetect_settings is called on the dataset and returns the detected settings.
settings = ds.autodetect_settings(infer_storage_types=True)
settings.save()
```

## Common Column Types

| Dataiku Type | Description |
|--------------|-------------|
| `string` | Text |
| `int` / `bigint` | Integer / Large integer |
| `double` / `float` | Decimal numbers |
| `boolean` | True/False |
| `date` | Date only |

See [references/column-types.md](references/column-types.md) for the full type table.

## Core Schema Operations

### Get Schema

```python
ds = project.get_dataset("my_dataset")
schema = ds.get_schema()  # dict with a "columns" list
for col in schema["columns"]:
    print(f"{col['name']}: {col['type']}")
```

### Set Schema

```python
# set_schema is called on the dataset handle and applies immediately
ds.set_schema({"columns": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": "double"},
]})
```

### Auto-detect Schema

```python
# Keep the settings object returned by autodetect_settings and save it;
# fetching fresh settings afterwards would discard the detection result
settings = ds.autodetect_settings()
settings.save()
```

See [references/schema-operations.md](references/schema-operations.md) for join compatibility checks, helper functions, and advanced operations.

## SQL Schema Rule

Output datasets for SQL-based recipes **MUST** have schemas set before building. Without a schema, Dataiku generates `CREATE TABLE () ...`, which fails.

For SQL databases (Snowflake, BigQuery), use **UPPERCASE** column names. Lowercase names get quoted, causing "invalid identifier" errors.

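Both rules above can be checked before a build. The following is a minimal sketch, assuming the schema is the plain dict shape used elsewhere in this document (`{"columns": [{"name": ..., "type": ...}]}`). `check_sql_schema` is a hypothetical helper written for illustration; it is not part of the Dataiku API.

```python
def check_sql_schema(schema):
    """Return a list of problems that would break a SQL-backed output dataset.

    Hypothetical helper: `schema` is assumed to be a Dataiku-style schema dict.
    """
    columns = schema.get("columns", [])
    if not columns:
        # An empty schema is the case that leads to `CREATE TABLE () ...`
        return ["schema has no columns"]

    problems = []
    for col in columns:
        name = col["name"]
        # Names that are not fully uppercase end up quoted in the generated SQL
        if name != name.upper():
            problems.append(f"column {name!r} is not uppercase")
    return problems


# Example: one lowercase column is reported, the uppercase one passes
example_schema = {"columns": [
    {"name": "id", "type": "string"},
    {"name": "AMOUNT", "type": "double"},
]}
problems = check_sql_schema(example_schema)
```

An empty result means the schema passes both checks; any reported lowercase names can then be normalized to uppercase.
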
```python
# Normalize column names to uppercase for SQL
settings = ds.get_settings()
raw = settings.get_raw()
for col in raw.get("schema", {}).get("columns", []):
    col["name"] = col["name"].upper()
settings.save()
```

## List Datasets in Project

```python
datasets = project.list_datasets()
for ds in datasets:
    print(f"- {ds['name']} ({ds.get('type', 'unknown')})")
```

## Common Issues

| Issue | Cause | Solution |
|-------|-------|----------|
| Schema mismatch | Recipe output doesn't match the dataset schema | Run `autodetect_settings()` |
| Join fails | Key type mismatch | Check types, cast if needed |
| Missing columns | Schema not updated | Rebuild dataset, update schema |
| Parse errors | Wrong type detection | Manually set schema |

## Detailed References

- [references/column-types.md](references/column-types.md): Full column type table with Python equivalents
- [references/schema-operations.md](references/schema-operations.md): All schema operations, join compatibility checks, helper functions
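
The "Join fails" entry under Common Issues comes down to comparing key column types across two schemas. A minimal sketch of such a check follows, assuming both schemas are plain dicts of the form `{"columns": [{"name": ..., "type": ...}]}`. `find_join_key_mismatches` is a hypothetical name and is not necessarily the helper described in `references/schema-operations.md`. It treats any type difference as a mismatch, which may be stricter than a given database requires.

```python
def find_join_key_mismatches(left_schema, right_schema, keys):
    """Compare join key types across two Dataiku-style schema dicts.

    Returns (key, left_type, right_type) tuples for every key whose types
    differ, or which is missing on one side (reported as None).
    """
    left_types = {c["name"]: c["type"] for c in left_schema.get("columns", [])}
    right_types = {c["name"]: c["type"] for c in right_schema.get("columns", [])}

    mismatches = []
    for key in keys:
        left_type = left_types.get(key)
        right_type = right_types.get(key)
        if left_type is None or right_type is None or left_type != right_type:
            mismatches.append((key, left_type, right_type))
    return mismatches


# Example schemas (illustrative data, not taken from a real project)
orders = {"columns": [
    {"name": "CUSTOMER_ID", "type": "string"},
    {"name": "AMOUNT", "type": "double"},
]}
customers = {"columns": [
    {"name": "CUSTOMER_ID", "type": "bigint"},
]}

mismatches = find_join_key_mismatches(orders, customers, ["CUSTOMER_ID"])
```

Each reported tuple names a key that needs a cast on one side before the join recipe is built.
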