# data-ingestion > parsing CSV/Excel measurement files, inspecting file structures, or ingesting raw semiconductor data via pandas. - Author: tianyi - Repository: Tiannyi/Sharpener_Cognitive-Architecture-for-AI-agents - Version: 20260130002634 - Stars: 0 - Forks: 0 - Last Updated: 2026-02-06 - Source: https://github.com/Tiannyi/Sharpener_Cognitive-Architecture-for-AI-agents - Web: https://mule.run/skillshub/@@Tiannyi/Sharpener_Cognitive-Architecture-for-AI-agents~data-ingestion:20260130002634 --- --- description: parsing CSV/Excel measurement files, inspecting file structures, or ingesting raw semiconductor data via pandas. --- # Sub-Skill #1: File Structure & Data Ingestion > **Priority sub-skill — always load this first. Claude must understand the raw data before any analysis.** ## Core Philosophy **Claude cannot directly read CSV/Excel files reliably.** Always use pandas via the terminal to inspect and process data. Never attempt to raw-read large data files. Work through the terminal, read the output, then reason about it. ## Step 1: Inspect File Names First Before opening any file, extract metadata from the filename itself. Typical naming patterns encode: - **Lot ID** (e.g., `LOT2024A`, `N12345`) - **Wafer number** (e.g., `W03`, `wafer_12`) - **Die coordinates** (e.g., `X5Y3`, `die_0503`) - **Device type** (e.g., `NMOS`, `PNP_5x10`, `MIM_cap`) - **Test condition** (e.g., `25C`, `125C`, `stress_1000s`) - **Date/timestamp** (e.g., `20240115`, `2024-01-15`) - **Measurement type** (e.g., `IdVg`, `IdVd`, `CV`, `Spar`) **Action:** List the files first (`ls` or `glob`), then parse the filenames to understand what you're working with before opening anything. ## Step 2: Inspection-First Workflow For every data file, run these pandas commands in the terminal **before making any analysis decisions**: ```python import pandas as pd # For CSV: df = pd.read_csv('path/to/file.csv', nrows=5) # Read only 5 rows first # For Excel: df = pd.read_excel('path/to/file.xlsx', nrows=5) # Mandatory inspection sequence: print("Shape:", df.shape) print("\nColumns:", df.columns.tolist()) print("\nData types:\n", df.dtypes) print("\nFirst 5 rows:\n", df.head()) print("\nBasic stats:\n", df.describe()) print("\nNull counts:\n", df.isnull().sum()) ``` **Never skip this step.** The inspection output tells you everything you need to plan the analysis. ## Step 3: Large File Strategy Files can be massive (100K+ rows). Rules: 1. **Read headers first** — use `nrows=0` or `nrows=5` to understand structure without loading everything 2. **Check file size** — use `df.shape` on a small read, or `wc -l` in terminal for CSVs 3. **Never print full DataFrames** — always use `.head()`, `.tail()`, or slicing 4. **Filter before displaying** — select specific columns/rows/conditions before printing 5. **Summarize, don't dump** — use `.value_counts()`, `.nunique()`, `.describe()`, `.groupby().agg()` 6. **Chunked reading** for truly massive files: ```python chunks = pd.read_csv('file.csv', chunksize=10000) for chunk in chunks: # process each chunk ``` 7. **For Excel with multiple sheets:** ```python xls = pd.ExcelFile('file.xlsx') print("Sheet names:", xls.sheet_names) # Read one sheet at a time df = pd.read_excel(xls, sheet_name='Sheet1', nrows=5) ``` ## Step 4: Data Structure Recognition After inspection, classify the data type based on columns and filename: | If you see columns like... | It's likely... | |---------------------------|---------------| | Vg, Id, Ig (or Gate Voltage, Drain Current) | Id-Vg sweep | | Vd, Id (or Drain Voltage, Drain Current) | Id-Vd sweep | | Frequency, Capacitance, Conductance | CV measurement | | Frequency, S11, S21, S12, S22 | S-parameter data | | Time, Voltage, Current with stress labels | Reliability/stress data | | Lot, Wafer, Die, Parameter, Value | Summary/parametric data | ## Step 5: Metadata Extraction Pipeline Systematically extract from both filename and file contents: ```python # From filename import re filename = "LOT2024A_W03_X5Y3_NMOS_5x10_IdVg_25C.csv" # Parse with regex or split # From file contents (headers, first rows, or dedicated metadata rows) # Some measurement tools put metadata in the first N rows before the actual data # Check if the first few rows are metadata vs. column headers ``` ## Step 6: Data Cleaning Rules - **Missing values:** Identify and decide — drop, fill, or flag? (Domain-dependent) - **Malformed rows:** Measurement tools sometimes insert status lines mid-file. Detect and skip. - **Unit inconsistencies:** Check if currents are in A, mA, or uA. Voltages in V or mV. Normalize. - **Duplicate measurements:** Check for repeated die/condition combinations. Decide: keep all, average, or flag. ## Step 7: Cross-Check and Confirmation **Never trust a single pass.** After any parsing or transformation: 1. **Print a summary** of the result (shape, head, dtypes) 2. **Spot-check** — pick 2-3 raw values and verify they match the processed output 3. **Row count audit** — confirm no silent row drops (compare raw count vs. processed count) 4. **Group count check** — verify grouping/filtering didn't accidentally exclude data 5. **Show the user a sanity-check summary** before proceeding: ``` Parsed: 3 lots, 12 wafers, 5 device types, 2 conditions Total measurements: 14,400 rows Any nulls: 23 (0.16%) — in column 'Ig' Ready to proceed? [show summary table] ``` ## Output After completing ingestion, you should have: - A clean pandas DataFrame (or set of DataFrames) - A metadata dictionary (lot, wafer, die, device, condition info) - A classification of the measurement type - Confidence that the data is correctly parsed (verified via cross-checks) **Next:** Route to `condition-logic/SKILL.md` to determine what comparisons are meaningful.