# xlsx > Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. Provides intelligent header detection and semantic column mapping for robust data import. - Author: GrupoUS - Repository: GrupoUS/gpus - Version: 20260122001702 - Stars: 1 - Forks: 0 - Last Updated: 2026-02-06 - Source: https://github.com/GrupoUS/gpus - Web: https://mule.run/skillshub/@@GrupoUS/gpus~xlsx:20260122001702 --- --- name: xlsx description: "Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. Provides intelligent header detection and semantic column mapping for robust data import." license: Proprietary. LICENSE.txt has complete terms --- # Intelligence Capabilities (New) When working with imperfect input files (e.g., user uploads, legacy system exports), apply the following intelligent processing strategies. ## 1. Intelligent Header Detection **Full Guide**: [header-detection.md](intelligence/header-detection.md) Never assume the first row is the header. Use the following heuristics: - **Semantic Analysis**: Look for row content matching known keywords using Levenshtein distance (e.g., "Telefome" ≈ "Telefone"). - **Data Pattern Analysis**: The header is typically the last text-heavy row before data-heavy rows (dates, emails, CPFs). - **Context**: Headers have high density (few empty cells) and are often below metadata rows (titles, dates). ## 2. Semantic Column Mapping **Full Guide**: [column-mapping.md](intelligence/column-mapping.md) Map source columns to target fields using specific scoring: - **Semantic (40%)**: Normalization + Synonyms (e.g., "Zap" -> "Phone"). - **Pattern (30%)**: Data validation samples (e.g., column has `\d{11}` -> "CPF"). - **Context (30%)**: Position and frequency of occurrence. ## 3. Real-World Examples **See**: [examples.md](intelligence/examples.md) for handling: - Metadata headers (Reports with titles) - Ambiguous columns (Date formats) - Unknown columns --- # Requirements for Outputs ## All Excel files ### Zero Formula Errors - Every Excel model MUST be delivered with ZERO formula errors (#REF!, #DIV/0!, #VALUE!, #N/A, #NAME?) ### Preserve Existing Templates (when updating templates) - Study and EXACTLY match existing format, style, and conventions when modifying files - Never impose standardized formatting on files with established patterns - Existing template conventions ALWAYS override these guidelines ## Financial models ### Color Coding Standards Unless otherwise stated by the user or existing template #### Industry-Standard Color Conventions - **Blue text (RGB: 0,0,255)**: Hardcoded inputs, and numbers users will change for scenarios - **Black text (RGB: 0,0,0)**: ALL formulas and calculations - **Green text (RGB: 0,128,0)**: Links pulling from other worksheets within same workbook - **Red text (RGB: 255,0,0)**: External links to other files - **Yellow background (RGB: 255,255,0)**: Key assumptions needing attention or cells that need to be updated ### Number Formatting Standards #### Required Format Rules - **Years**: Format as text strings (e.g., "2024" not "2,024") - **Currency**: Use $#,##0 format; ALWAYS specify units in headers ("Revenue ($mm)") - **Zeros**: Use number formatting to make all zeros "-", including percentages (e.g., "$#,##0;($#,##0);-") - **Percentages**: Default to 0.0% format (one decimal) - **Multiples**: Format as 0.0x for valuation multiples (EV/EBITDA, P/E) - **Negative numbers**: Use parentheses (123) not minus -123 ### Formula Construction Rules #### Assumptions Placement - Place ALL assumptions (growth rates, margins, multiples, etc.) in separate assumption cells - Use cell references instead of hardcoded values in formulas - Example: Use =B5*(1+$B$6) instead of =B5*1.05 #### Formula Error Prevention - Verify all cell references are correct - Check for off-by-one errors in ranges - Ensure consistent formulas across all projection periods - Test with edge cases (zero values, negative numbers) - Verify no unintended circular references #### Documentation Requirements for Hardcodes - Comment or in cells beside (if end of table). Format: "Source: [System/Document], [Date], [Specific Reference], [URL if applicable]" - Examples: - "Source: Company 10-K, FY2024, Page 45, Revenue Note, [SEC EDGAR URL]" - "Source: Company 10-Q, Q2 2025, Exhibit 99.1, [SEC EDGAR URL]" - "Source: Bloomberg Terminal, 8/15/2025, AAPL US Equity" - "Source: FactSet, 8/20/2025, Consensus Estimates Screen" # XLSX creation, editing, and analysis ## Overview A user may ask you to create, edit, or analyze the contents of an .xlsx file. You have different tools and workflows available for different tasks. ## Important Requirements **LibreOffice Required for Formula Recalculation**: You can assume LibreOffice is installed for recalculating formula values using the `recalc.py` script. The script automatically configures LibreOffice on first run ## Reading and analyzing data ### Data analysis with pandas For data analysis, visualization, and basic operations, use **pandas** which provides powerful data manipulation capabilities: ```python import pandas as pd # Read Excel df = pd.read_excel('file.xlsx') # Default: first sheet all_sheets = pd.read_excel('file.xlsx', sheet_name=None) # All sheets as dict # Analyze df.head() # Preview data df.info() # Column info df.describe() # Statistics # Write Excel df.to_excel('output.xlsx', index=False) ``` ## Excel File Workflows ## CRITICAL: Use Formulas, Not Hardcoded Values **Always use Excel formulas instead of calculating values in Python and hardcoding them.** This ensures the spreadsheet remains dynamic and updateable. ### ❌ WRONG - Hardcoding Calculated Values ```python # Bad: Calculating in Python and hardcoding result total = df['Sales'].sum() sheet['B10'] = total # Hardcodes 5000 # Bad: Computing growth rate in Python growth = (df.iloc[-1]['Revenue'] - df.iloc[0]['Revenue']) / df.iloc[0]['Revenue'] sheet['C5'] = growth # Hardcodes 0.15 # Bad: Python calculation for average avg = sum(values) / len(values) sheet['D20'] = avg # Hardcodes 42.5 ``` ### ✅ CORRECT - Using Excel Formulas ```python # Good: Let Excel calculate the sum sheet['B10'] = '=SUM(B2:B9)' # Good: Growth rate as Excel formula sheet['C5'] = '=(C4-C2)/C2' # Good: Average using Excel function sheet['D20'] = '=AVERAGE(D2:D19)' ``` This applies to ALL calculations - totals, percentages, ratios, differences, etc. The spreadsheet should be able to recalculate when source data changes. ## Common Workflow 1. **Choose tool**: pandas for data, openpyxl for formulas/formatting 2. **Create/Load**: Create new workbook or load existing file 3. **Modify**: Add/edit data, formulas, and formatting 4. **Save**: Write to file 5. **Recalculate formulas (MANDATORY IF USING FORMULAS)**: Use the recalc.py script ```bash python recalc.py output.xlsx ``` 6. **Verify and fix any errors**: - The script returns JSON with error details - If `status` is `errors_found`, check `error_summary` for specific error types and locations - Fix the identified errors and recalculate again - Common errors to fix: - `#REF!`: Invalid cell references - `#DIV/0!`: Division by zero - `#VALUE!`: Wrong data type in formula - `#NAME?`: Unrecognized formula name ### Creating new Excel files ```python # Using openpyxl for formulas and formatting from openpyxl import Workbook from openpyxl.styles import Font, PatternFill, Alignment wb = Workbook() sheet = wb.active # Add data sheet['A1'] = 'Hello' sheet['B1'] = 'World' sheet.append(['Row', 'of', 'data']) # Add formula sheet['B2'] = '=SUM(A1:A10)' # Formatting sheet['A1'].font = Font(bold=True, color='FF0000') sheet['A1'].fill = PatternFill('solid', start_color='FFFF00') sheet['A1'].alignment = Alignment(horizontal='center') # Column width sheet.column_dimensions['A'].width = 20 wb.save('output.xlsx') ``` ### Editing existing Excel files ```python # Using openpyxl to preserve formulas and formatting from openpyxl import load_workbook # Load existing file wb = load_workbook('existing.xlsx') sheet = wb.active # or wb['SheetName'] for specific sheet # Working with multiple sheets for sheet_name in wb.sheetnames: sheet = wb[sheet_name] print(f"Sheet: {sheet_name}") # Modify cells sheet['A1'] = 'New Value' sheet.insert_rows(2) # Insert row at position 2 sheet.delete_cols(3) # Delete column 3 # Add new sheet new_sheet = wb.create_sheet('NewSheet') new_sheet['A1'] = 'Data' wb.save('modified.xlsx') ``` ## Recalculating formulas Excel files created or modified by openpyxl contain formulas as strings but not calculated values. Use the provided `recalc.py` script to recalculate formulas: ```bash python recalc.py [timeout_seconds] ``` Example: ```bash python recalc.py output.xlsx 30 ``` The script: - Automatically sets up LibreOffice macro on first run - Recalculates all formulas in all sheets - Scans ALL cells for Excel errors (#REF!, #DIV/0!, etc.) - Returns JSON with detailed error locations and counts - Works on both Linux and macOS ## Formula Verification Checklist Quick checks to ensure formulas work correctly: ### Essential Verification - [ ] **Test 2-3 sample references**: Verify they pull correct values before building full model - [ ] **Column mapping**: Confirm Excel columns match (e.g., column 64 = BL, not BK) - [ ] **Row offset**: Remember Excel rows are 1-indexed (DataFrame row 5 = Excel row 6) ### Common Pitfalls - [ ] **NaN handling**: Check for null values with `pd.notna()` - [ ] **Far-right columns**: FY data often in columns 50+ - [ ] **Multiple matches**: Search all occurrences, not just first - [ ] **Division by zero**: Check denominators before using `/` in formulas (#DIV/0!) - [ ] **Wrong references**: Verify all cell references point to intended cells (#REF!) - [ ] **Cross-sheet references**: Use correct format (Sheet1!A1) for linking sheets ### Formula Testing Strategy - [ ] **Start small**: Test formulas on 2-3 cells before applying broadly - [ ] **Verify dependencies**: Check all cells referenced in formulas exist - [ ] **Test edge cases**: Include zero, negative, and very large values ### Interpreting recalc.py Output The script returns JSON with error details: ```json { "status": "success", // or "errors_found" "total_errors": 0, // Total error count "total_formulas": 42, // Number of formulas in file "error_summary": { // Only present if errors found "#REF!": { "count": 2, "locations": ["Sheet1!B5", "Sheet1!C10"] } } } ``` ## Best Practices ### Library Selection - **pandas**: Best for data analysis, bulk operations, and simple data export - **openpyxl**: Best for complex formatting, formulas, and Excel-specific features ### Working with openpyxl - Cell indices are 1-based (row=1, column=1 refers to cell A1) - Use `data_only=True` to read calculated values: `load_workbook('file.xlsx', data_only=True)` - **Warning**: If opened with `data_only=True` and saved, formulas are replaced with values and permanently lost - For large files: Use `read_only=True` for reading or `write_only=True` for writing - Formulas are preserved but not evaluated - use recalc.py to update values ### Working with pandas - Specify data types to avoid inference issues: `pd.read_excel('file.xlsx', dtype={'id': str})` - For large files, read specific columns: `pd.read_excel('file.xlsx', usecols=['A', 'C', 'E'])` - Handle dates properly: `pd.read_excel('file.xlsx', parse_dates=['date_column'])` ## Code Style Guidelines **IMPORTANT**: When generating Python code for Excel operations: - Write minimal, concise Python code without unnecessary comments - Avoid verbose variable names and redundant operations - Avoid unnecessary print statements **For Excel files themselves**: - Add comments to cells with complex formulas or important assumptions - Document data sources for hardcoded values - Include notes for key calculations and model sections --- # Intelligent Spreadsheet Import and Analysis ## Overview When importing or analyzing spreadsheets, especially for data import workflows, intelligent detection and mapping capabilities are essential for handling real-world spreadsheets that may have: - Non-standard layouts (metadata rows, empty rows, merged cells) - Varied column naming (synonyms, abbreviations, typos) - Multiple data types requiring pattern recognition - Contextual relationships between columns This section provides advanced techniques for intelligent header detection, column mapping, and data pattern analysis. --- ## Intelligent Header Detection ### Overview Header detection is the process of identifying which row in a spreadsheet contains the column names. Real-world spreadsheets often have: - Title rows or metadata at the top - Empty separator rows - Multi-line headers (merged cells) - Headers that don't start at row 1 ### Detection Strategy: Multi-Factor Analysis Use a combination of semantic analysis, pattern recognition, and contextual clues to identify headers with high confidence. #### 1. Semantic Analysis **String Similarity Algorithms** Use similarity metrics to compare row cells against known header keywords. Implement Levenshtein distance, Jaro-Winkler similarity, and normalization functions to handle: - Accent removal (á → a) - Case insensitivity - Whitespace normalization - Synonym matching **Synonym Recognition** Maintain a knowledge base of header synonyms and variations in multiple languages (Portuguese, English) and handle: - Regional variations ("celular" vs "móvel") - Common typos ("E-mail" vs "Email") - Abbreviations ("Tel" = "Telefone") - Context-dependent names ("Cliente" = "Nome" in CRM context) #### 2. Pattern-Based Data Type Detection Analyze sample data values (first 5-10 rows) to infer column types: - **CPF**: 11 digits, format XXX.XXX.XXX-XX - **Email**: Contains @ and valid domain - **Phone**: 10-11 digits, Brazilian formats - **Date**: Multiple formats (DD/MM/YYYY, MM/DD/YYYY, Excel serial) - **Monetary**: R$, points, commas (Brazilian format) - **CEP**: 8 digits, format XXXXX-XXX Use pattern matching with regex and validation functions to detect types with confidence scores. #### 3. Contextual Analysis Consider multiple contextual factors: - **Position**: Headers usually in first 5 rows - **Density**: Headers have many filled cells - **Consistency**: Headers are text, data rows vary in type - **Structure**: Headers have similar length/format ### Multi-Factor Scoring Calculate header likelihood score using: - Text ratio (30%): Headers are usually text, not numbers - Keyword matches (40%): Match against known header keywords - Fill ratio (20%): Headers usually have many filled cells - Position bonus (10%): Headers usually in first rows ### Thresholds and Confidence Levels - **Confidence ≥ 0.8**: High confidence - use automatically - **Confidence 0.6-0.8**: Medium confidence - show to user for confirmation - **Confidence < 0.6**: Low confidence - require user selection ### Handling Edge Cases - **Multiple Header Rows**: Detect merged cells spanning multiple rows - **Empty Rows**: Skip empty separator rows before data - **Metadata Rows**: Identify and skip title/metadata rows at top --- ## Intelligent Column Mapping ### Overview Column mapping matches spreadsheet column names to system fields. Real-world spreadsheets use varied naming that requires intelligent matching. ### Multi-Factor Scoring System Use weighted scoring to determine best mapping: - **Semantic Similarity (40%)**: String similarity and synonym matching - **Data Pattern Match (30%)**: Type inferred from actual values - **Position Context (15%)**: Expected position of field - **Frequency/Common Name (15%)**: Most common names get priority ### Expanded Knowledge Base Maintain comprehensive synonym mappings including: - Portuguese and English variations - Regional differences - Common typos and abbreviations - Context-dependent synonyms **Example Knowledge Base Structure:** ```python # Comprehensive header knowledge base for student/CRM imports EXPANDED_HEADER_KNOWLEDGE = { # Name variations (regional, typos, languages) 'name': [ # Portuguese standard 'nome', 'nome completo', 'aluno', 'cliente', 'pessoa', 'nome do aluno', 'nome do cliente', 'razao social', 'razão social', # English 'name', 'full name', 'client name', 'student name', 'person', 'fullname', # Abbreviations 'nm', 'nome compl', 'nome compl.', # Common typos 'nme', 'nomee', 'nom', 'nomes', # Context variations 'cadastro', 'registro', 'identificação' ], # Email variations 'email': [ # Portuguese 'email', 'e-mail', 'correio eletrônico', 'correio eletronico', 'correio', 'endereço de email', 'endereco de email', 'endereço email', # English 'mail', 'e-mail address', 'email address', 'mail address', # Abbreviations 'e-mail', 'mail addr', 'email addr', # Common typos 'emial', 'emal', 'eail', 'emai', # Variations 'correio eletrônico', 'correio eletr.' ], # Phone variations 'phone': [ # Portuguese 'telefone', 'celular', 'whatsapp', 'contato', 'telefone celular', 'telefone fixo', 'fone', 'tel', 'telefone residencial', # English 'phone', 'mobile', 'cell phone', 'telephone', 'contact number', 'mobile phone', 'cell', # Regional 'móvel', 'movel', 'cel', # Abbreviations 'tel', 'fone', 'cel', 'whats', 'wpp', # Common typos 'telefne', 'telefonee', 'celularr' ], # CPF variations 'cpf': [ 'cpf', 'documento', 'documento cpf', 'cpf/cnpj', 'cpf cnpj', 'identificação', 'identificacao', 'id', 'rg', 'documento de identidade', 'doc', 'document', 'identifier', 'cpf cnpj', 'documento único', 'documento unico', 'cpf documento', 'número cpf', 'numero cpf' ], # Address variations 'address': [ 'endereço', 'endereco', 'rua', 'logradouro', 'street', 'address', 'endereço completo', 'endereco completo', 'full address', 'endereço residencial', 'endereco residencial', 'logradouro completo', 'rua endereço' ], 'addressNumber': [ 'número', 'numero', 'nº', 'n°', 'num', 'número do endereço', 'numero do endereco', 'address number', 'house number', 'nro' ], 'complement': [ 'complemento', 'complement', 'apto', 'apartamento', 'bloco', 'bloco apto', 'andar', 'sala', 'complemento endereço' ], 'neighborhood': [ 'bairro', 'neighborhood', 'distrito', 'bairro residencial', 'bairro endereço', 'zona', 'região', 'regiao' ], 'city': [ 'cidade', 'municipio', 'município', 'city', 'municipality', 'munic', 'cid', 'município de residência', 'cidade residência' ], 'state': [ 'estado', 'uf', 'state', 'province', 'est', 'uf estado', 'estado uf', 'unidade federativa', 'estado brasileiro' ], 'zipCode': [ 'cep', 'código postal', 'codigo postal', 'zip', 'postal code', 'cep código', 'postal', 'cep código postal', 'zip code', 'código postal cep' ], 'country': [ 'país', 'pais', 'country', 'nacionalidade', 'país de origem', 'pais de origem', 'country of origin' ], # Professional fields 'profession': [ 'profissão', 'profissao', 'graduação', 'graduacao', 'formação', 'formacao', 'profession', 'graduation', 'formation', 'curso', 'área de atuação', 'area de atuacao', 'especialidade', 'especialização' ], 'hasClinic': [ 'tem clínica', 'tem clinica', 'possui clínica', 'possui clinica', 'tem estabelecimento', 'has clinic', 'own clinic', 'clinic owner', 'clínica própria', 'clinica propria' ], 'clinicName': [ 'nome clínica', 'nome clinica', 'clínica nome', 'clinica nome', 'nome estabelecimento', 'nome do estabelecimento', 'clinic name', 'estabelecimento', 'nome da clínica' ], 'clinicCity': [ 'cidade clínica', 'cidade clinica', 'cidade estabelecimento', 'município clínica', 'municipio clinica', 'clinic city', 'cidade da clínica' ], # Status and dates 'status': [ 'status', 'situação', 'situacao', 'estado', 'condição', 'condicao', 'status aluno', 'situação aluno', 'status do aluno' ], 'birthDate': [ 'data nasc', 'data de nascimento', 'nascimento', 'data nascimento', 'data nasc.', 'birth date', 'date of birth', 'dob', 'data nascimento aluno', 'nasc', 'data nascimento aluno' ], 'saleDate': [ 'data da venda', 'data venda', 'data de venda', 'venda data', 'data contrato', 'data do contrato', 'sale date', 'contract date', 'data assinatura', 'data de assinatura' ], 'startDate': [ 'data início', 'data inicio', 'data_inicio', 'start date', 'inicio', 'data de início', 'data de inicio', 'data início curso', 'início curso', 'inicio curso' ], # Sales and marketing 'salesperson': [ 'vendedor', 'vendedora', 'vendedor responsável', 'vendedor responsavel', 'salesperson', 'seller', 'vendedor da venda', 'responsável venda', 'responsavel venda' ], 'leadSource': [ 'origem lead', 'origem', 'fonte', 'lead source', 'source', 'origem do lead', 'fonte do lead', 'como conheceu', 'origem cliente', 'canal', 'canal de origem', 'fonte de origem' ], 'cohort': [ 'turma', 'cohort', 'turma curso', 'turma do curso', 'grupo', 'turma aluno', 'classe', 'batch', 'lote' ], # Financial fields 'totalValue': [ 'valor total', 'valortotal', 'valor_total', 'total', 'amount', 'valor', 'preço', 'preco', 'price', 'total value', 'total amount', 'valor do curso', 'valor curso', 'valor contrato', 'valor total curso', 'montante', 'valor pago', 'valor investido' ], 'installments': [ 'parcelas', 'número de parcelas', 'numero de parcelas', 'qtd parcelas', 'quantidade de parcelas', 'installments', 'parc', 'qtd parc', 'número parcelas', 'numero parcelas', 'total parcelas', 'qtd de parcelas', 'quantidade parcelas' ], 'installmentValue': [ 'valor parcela', 'valor_parcela', 'valorparcela', 'installment value', 'valor da parcela', 'valor de parcela', 'parcela valor', 'valor mensal', 'mensalidade', 'valor por parcela' ], 'paymentStatus': [ 'status pagamento', 'status_pagamento', 'payment status', 'pagamento', 'status de pagamento', 'situação pagamento', 'situacao pagamento', 'payment', 'status pag', 'situação pag', 'condição pagamento', 'condicao pagamento' ], 'paidInstallments': [ 'parcelas pagas', 'parcelas_pagas', 'paid installments', 'parcelas quitadas', 'parcelas pagas total', 'qtd parcelas pagas', 'quantidade parcelas pagas', 'parcelas já pagas', 'parcelas ja pagas' ], # Professional ID 'professionalId': [ 'registro', 'registro profissional', 'coren', 'cro', 'crm', 'crf', 'número registro', 'numero registro', 'registro número', 'professional id', 'professional registration', 'número conselho', 'numero conselho', 'conselho profissional', 'registro conselho' ], # Contract status 'contractStatus': [ 'status contrato', 'status do contrato', 'situação contrato', 'situacao contrato', 'contract status', 'status do contrato', 'condição contrato', 'condicao contrato', 'estado contrato' ] } def find_best_match(column_header: str, knowledge_base: Dict[str, List[str]]) -> List[Tuple[str, float]]: """ Find best matching schema fields for a column header. Returns list of (field_name, confidence_score) tuples sorted by score. """ normalized_header = normalize_header(column_header) matches = [] for field_name, synonyms in knowledge_base.items(): best_score = 0.0 for synonym in synonyms: normalized_synonym = normalize_header(synonym) # Exact match if normalized_header == normalized_synonym: best_score = 1.0 break # Contains match elif normalized_synonym in normalized_header or normalized_header in normalized_synonym: best_score = max(best_score, 0.9) else: # Similarity match similarity = jaro_winkler_similarity(normalized_header, normalized_synonym) if similarity > 0.7: best_score = max(best_score, similarity * 0.85) if best_score > 0.5: # Only include reasonable matches matches.append((field_name, best_score)) # Sort by score descending matches.sort(key=lambda x: x[1], reverse=True) return matches ``` ### Intelligent Suggestions When confidence is low (< 70%), provide multiple mapping options: - Show top 3-5 suggestions ordered by score - Include reasoning for each suggestion - Highlight pattern mismatches (e.g., column named "ID" but contains CPF values) --- ## Data Pattern Analysis ### Overview Pattern analysis identifies data types and formats by examining actual values, not just column names. This is crucial for: - Validating mappings - Detecting errors - Normalizing data formats - Providing user feedback ### Pattern Recognition Define comprehensive patterns for common data types: - **CPF**: Brazilian tax ID with validation algorithm - **Email**: Standard email format validation - **Phone (Brazilian)**: 10-11 digits with formatting - **Date**: Multiple format support (DD/MM/YYYY, Excel serial) - **Monetary (Brazilian)**: R$ format with comma decimal separator - **CEP**: Brazilian postal code (8 digits) - **UF**: Brazilian state abbreviations (2 letters) ### Validation and Normalization For each pattern type: - **Regex matching**: Initial pattern detection - **Validator function**: Domain-specific validation (e.g., CPF check digits) - **Normalizer function**: Convert to standard format - **Formatter function**: Display in user-friendly format ### Pattern Detection Process 1. Sample first 10-20 rows of data 2. Test each value against pattern regexes 3. Run validator functions for matched patterns 4. Calculate match ratio and confidence 5. Return best match with confidence score --- ## Integration with Existing Code ### Integration with xlsx-helper.ts Enhance `detectHeaderRow()` function in `src/lib/xlsx-helper.ts`: **Current Implementation:** ```typescript // Current basic implementation export function detectHeaderRow(rows: unknown[][], maxRowsToScan = 15): HeaderDetectionResult { // Basic keyword matching and scoring } ``` **Enhanced Implementation:** ```typescript // Enhanced with multi-factor analysis export function detectHeaderRowEnhanced( rows: unknown[][], maxRowsToScan = 15 ): HeaderDetectionResult { const candidates: Array<{ rowIndex: number; score: number; headers: string[]; }> = []; // Factor 1: Text ratio (headers are usually text) const calculateTextRatio = (row: unknown[]): number => { const textCells = row.filter( cell => cell && !String(cell).trim().match(/^\d+([.,]\d+)?$/) ); return textCells.length / Math.max(row.length, 1); }; // Factor 2: Keyword matches const countKeywordMatches = (row: unknown[]): number => { const normalizedRow = row.map(cell => normalizeHeader(String(cell || '')) ); return normalizedRow.filter(cell => HEADER_KEYWORDS.some(kw => cell.includes(kw) || kw.includes(cell)) ).length; }; // Factor 3: Fill ratio const calculateFillRatio = (row: unknown[]): number => { const nonEmpty = row.filter(cell => cell && String(cell).trim()); return nonEmpty.length / Math.max(row.length, 1); }; // Factor 4: Pattern consistency (compare with next row) const checkPatternConsistency = ( row: unknown[], nextRow: unknown[], index: number ): number => { if (index + 1 >= rows.length) return 0; const rowTypes = row.map(cell => typeof cell); const nextTypes = nextRow.map(cell => typeof cell); // Headers should have different types than data const typeOverlap = rowTypes.filter((t, i) => t === nextTypes[i]).length; return typeOverlap < rowTypes.length / 2 ? 5 : 0; }; // Calculate scores for each candidate row for (let i = 0; i < Math.min(rows.length, maxRowsToScan); i++) { const row = rows[i]; if (!row || row.length === 0) continue; let score = 0; // Factor 1: Text ratio (30 points max) score += calculateTextRatio(row) * 30; // Factor 2: Keyword matches (40 points max) score += Math.min(countKeywordMatches(row) * 10, 40); // Factor 3: Fill ratio (20 points max) score += calculateFillRatio(row) * 20; // Factor 4: Position bonus (10 points) if (i < 5) score += 10; // Factor 5: Pattern consistency (5 points) score += checkPatternConsistency(row, rows[i + 1] || [], i); const headers = row .map(h => String(h ?? '').trim()) .filter(Boolean); if (headers.length >= 2) { candidates.push({ rowIndex: i, score, headers, }); } } if (candidates.length === 0) { return { headerRowIndex: 0, confidence: 0, headers: rows[0]?.map(h => String(h ?? '').trim()).filter(Boolean) || [], candidates: [], }; } candidates.sort((a, b) => b.score - a.score); const best = candidates[0]; // Calculate confidence let confidence: number; if (candidates.length === 1) { confidence = Math.min(best.score / 100, 1); } else { const scoreDiff = best.score - candidates[1].score; const relativeConfidence = scoreDiff / Math.max(best.score, 1); const absoluteConfidence = best.score / 100; confidence = Math.min((relativeConfidence + absoluteConfidence) / 2, 1); } return { headerRowIndex: best.rowIndex, confidence: Math.round(confidence * 100) / 100, headers: best.headers, candidates: candidates.slice(0, 5), }; } // Helper function for header normalization function normalizeHeader(header: string): string { return header .toLowerCase() .normalize('NFD') .replace(/[\u0300-\u036f]/g, '') // Remove accents .trim() .replace(/\s+/g, ' '); } ``` ### Integration with csv-validator.ts Enhance `mapCSVHeaders()` function: **Current Implementation:** ```typescript // Current static mapping export function mapCSVHeaders(headers: string[]): Record { const mapping: Record = {}; for (const header of headers) { const normalized = header.trim().toLowerCase(); if (HEADER_MAP[normalized]) { mapping[header] = HEADER_MAP[normalized]; } } return mapping; } ``` **Enhanced Implementation:** ```typescript // Enhanced with pattern analysis and fuzzy matching export interface ColumnMappingResult { mapping: Record; confidence: Record; suggestions: Record>; } export function mapCSVHeadersIntelligent( headers: string[], sampleRows: Record[], maxSuggestions = 3 ): ColumnMappingResult { const mapping: Record = {}; const confidence: Record = {}; const suggestions: Record> = {}; for (const header of headers) { // Get column data const columnData = sampleRows .map(row => row[header]) .filter(v => v !== undefined && v !== null) .slice(0, 10); // Sample first 10 rows // Detect pattern const pattern = detectColumnPattern(columnData); // Find best matches const matches = findBestSchemaMatches(header, pattern, headers.indexOf(header)); if (matches.length > 0 && matches[0].score >= 0.5) { mapping[header] = matches[0].field; confidence[header] = matches[0].score; } // Store suggestions for low confidence matches if (matches.length > 0 && matches[0].score < 0.7) { suggestions[header] = matches.slice(0, maxSuggestions).map(m => ({ field: m.field, score: m.score, reason: m.reason, })); } } return { mapping, confidence, suggestions }; } // Pattern detection function function detectColumnPattern(values: unknown[]): { type: string; confidence: number; matchRatio: number; } { if (values.length === 0) { return { type: 'empty', confidence: 0, matchRatio: 0 }; } // CPF pattern const cpfPattern = /^\d{3}[.\-]?\d{3}[.\-]?\d{3}[.\-]?\d{2}$/; const cpfMatches = values.filter(v => cpfPattern.test(String(v).replace(/\s/g, ''))); if (cpfMatches.length / values.length > 0.7) { return { type: 'cpf', confidence: cpfMatches.length / values.length, matchRatio: cpfMatches.length / values.length, }; } // Email pattern const emailPattern = /^[^\s@]+@[^\s@]+\.[^\s@]+$/; const emailMatches = values.filter(v => emailPattern.test(String(v).toLowerCase())); if (emailMatches.length / values.length > 0.7) { return { type: 'email', confidence: emailMatches.length / values.length, matchRatio: emailMatches.length / values.length, }; } // Phone pattern (Brazilian) const phonePattern = /^[\d\s\(\)\-\+]{10,15}$/; const phoneMatches = values.filter(v => { const digits = String(v).replace(/\D/g, ''); return phonePattern.test(String(v)) && digits.length >= 10 && digits.length <= 11; }); if (phoneMatches.length / values.length > 0.7) { return { type: 'phone', confidence: phoneMatches.length / values.length, matchRatio: phoneMatches.length / values.length, }; } return { type: 'text', confidence: 0.3, matchRatio: 0 }; } // Find best schema matches with scoring function findBestSchemaMatches( header: string, pattern: { type: string; confidence: number }, columnIndex: number ): Array<{ field: string; score: number; reason: string }> { const normalizedHeader = normalizeHeader(header); const matches: Array<{ field: string; score: number; reason: string }> = []; // Check against expanded knowledge base for (const [field, synonyms] of Object.entries(EXPANDED_HEADER_KNOWLEDGE)) { let semanticScore = 0; let reason = ''; // Check synonyms for (const synonym of synonyms) { const normalizedSynonym = normalizeHeader(synonym); if (normalizedHeader === normalizedSynonym) { semanticScore = 1.0; reason = `Exact match with '${synonym}'`; break; } else if (normalizedSynonym.includes(normalizedHeader) || normalizedHeader.includes(normalizedSynonym)) { semanticScore = Math.max(semanticScore, 0.9); reason = `Contains keyword '${synonym}'`; } else { // Fuzzy match const similarity = jaroWinklerSimilarity(normalizedHeader, normalizedSynonym); if (similarity > 0.7) { semanticScore = Math.max(semanticScore, similarity * 0.85); reason = `Similar to '${synonym}' (${Math.round(similarity * 100)}% similarity)`; } } } // Pattern match bonus const patternScore = getPatternMatchScore(field, pattern.type); const totalScore = semanticScore * 0.4 + patternScore * 0.3 + getPositionScore(field, columnIndex) * 0.15 + getFrequencyScore(field, normalizedHeader) * 0.15; if (totalScore >= 0.3) { matches.push({ field, score: totalScore, reason: reason || `Pattern matches: ${pattern.type}`, }); } } return matches.sort((a, b) => b.score - a.score); } // Jaro-Winkler similarity (simplified) function jaroWinklerSimilarity(s1: string, s2: string): number { // Use a library like 'string-similarity' in production // Simplified version here if (s1 === s2) return 1.0; if (s1.length === 0 || s2.length === 0) return 0.0; // Simple character overlap const longer = s1.length > s2.length ? s1 : s2; const shorter = s1.length > s2.length ? s2 : s1; const matches = shorter.split('').filter(c => longer.includes(c)).length; return matches / Math.max(s1.length, s2.length); } ``` ### Recommended Enhancements 1. **Add pattern detection**: Create `detectColumnPattern()` function in `xlsx-helper.ts` 2. **Expand knowledge base**: Add all synonyms from `EXPANDED_HEADER_KNOWLEDGE` to `HEADER_MAP` in `csv-validator.ts` 3. **Add validation feedback**: Show pattern detection results in UI components 4. **Highlight mismatches**: Alert when header name doesn't match data pattern 5. **Add confidence indicators**: Display confidence scores in mapping UI 6. **Store user corrections**: Learn from user mappings to improve future suggestions --- ## Practical Examples ### Example 1: Detecting Header in Spreadsheet with Metadata **Spreadsheet Structure:** ``` Row 0: "Relatório de Alunos - 2024" Row 1: (empty) Row 2: "Nome" | "E-mail" | "Telefone" | "CPF" Row 3: "João Silva" | "joao@email.com" | "11999999999" | "123.456.789-00" Row 4: "Maria Santos" | "maria@email.com" | "21988888888" | "987.654.321-00" ``` **Detection Process:** ```python rows = [ ["Relatório de Alunos - 2024"], [], ["Nome", "E-mail", "Telefone", "CPF"], ["João Silva", "joao@email.com", "11999999999", "123.456.789-00"], ["Maria Santos", "maria@email.com", "21988888888", "987.654.321-00"] ] result = detect_header_row(rows, max_rows_to_scan=15) # Result: # { # 'header_row_index': 2, # 'confidence': 0.85, # 'headers': ['Nome', 'E-mail', 'Telefone', 'CPF'], # 'candidates': [ # {'row_index': 2, 'score': 85.0, 'headers': ['Nome', 'E-mail', 'Telefone', 'CPF']}, # {'row_index': 0, 'score': 15.0, 'headers': ['Relatório de Alunos - 2024']} # ] # } ``` **Scoring Breakdown for Row 2:** - Text ratio: 100% (all cells are text) → 30 points - Keyword matches: 4 matches (nome, email, telefone, cpf) → 40 points - Fill ratio: 100% (all cells filled) → 20 points - Position bonus: Row 2 (< 5) → 10 points - **Total: 100 points → Confidence: 0.85** **Mappings with Pattern Analysis:** ```python # Analyze each column mappings = map_columns_intelligently( headers=['Nome', 'E-mail', 'Telefone', 'CPF'], rows=rows, header_row_index=2 ) # Results: # { # 'Nome': ('name', 0.95), # Exact keyword match + text pattern # 'E-mail': ('email', 0.90), # Synonym match + email pattern # 'Telefone': ('phone', 0.92), # Keyword match + phone pattern # 'CPF': ('cpf', 0.88) # Keyword match + CPF pattern # } ``` ### Example 2: Ambiguous Column Name **Column:** "Contato" **Sample Data:** ``` "Contato" "11999999999" "21988888888" "(11) 3333-4444" ``` **Analysis Process:** ```python column_header = "Contato" column_data = ["11999999999", "21988888888", "(11) 3333-4444"] column_index = 2 # Step 1: Semantic analysis semantic_score = find_synonym_match("Contato", "phone") # Result: 0.85 (contato ≈ telefone/phone) # Step 2: Pattern detection pattern = detect_data_pattern(column_data) # Result: {'type': 'phone', 'confidence': 0.90, 'format': 'brazilian'} # Step 3: Position context position_score = 0.70 # Column 2, expected phone position 2 # Step 4: Calculate total score total_score = ( semantic_score * 0.40 + # 0.85 * 0.40 = 0.34 pattern['confidence'] * 0.30 + # 0.90 * 0.30 = 0.27 position_score * 0.15 + # 0.70 * 0.15 = 0.105 0.6 * 0.15 # Frequency score = 0.09 ) # Total: 0.805 → Map to 'phone' with 80.5% confidence ``` **Result:** Map to `phone` (0.82 total score) **Alternative Scenario - If pattern was emails:** ```python column_data = ["joao@email.com", "maria@email.com", "pedro@email.com"] pattern = detect_data_pattern(column_data) # Result: {'type': 'email', 'confidence': 0.95} # Semantic score for email: 0.60 (contato can mean contact/email) total_score = 0.60 * 0.40 + 0.95 * 0.30 + 0.70 * 0.15 + 0.4 * 0.15 # Total: 0.72 → Map to 'email' with 72% confidence ``` ### Example 3: Type Detection by Pattern (Header Name Mismatch) **Column:** "ID" **Values:** ``` "ID" "123.456.789-00" "987.654.321-00" "111.222.333-44" ``` **Analysis:** ```python column_header = "ID" column_data = ["123.456.789-00", "987.654.321-00", "111.222.333-44"] # Step 1: Header name suggests numeric ID # But pattern analysis reveals CPF format pattern = detect_data_pattern(column_data) # Result: { # 'type': 'cpf', # 'confidence': 0.95, # 'match_ratio': 1.0, # 'format': 'brazilian_tax_id' # } # Step 2: Validate CPF format for value in column_data: is_valid = validate_cpf(value) # All values pass CPF validation # Step 3: Mapping decision # Header name: "ID" → suggests 'id' field (numeric) # But pattern: CPF → suggests 'cpf' field # Pattern confidence (0.95) > Semantic confidence (0.30) # Decision: Map to 'cpf' based on pattern evidence ``` **Result:** Map to `cpf` (0.95 confidence), not numeric `id` **User Feedback:** ``` ⚠️ Column "ID" appears to contain CPF values (confidence: 95%) Suggested mapping: CPF Reason: Data pattern matches CPF format, not numeric ID ``` ### Example 4: Low Confidence - Multiple Suggestions **Column:** "Dados" **Sample Data:** ``` "Dados" "João Silva" "Maria Santos" "Pedro Oliveira" ``` **Analysis:** ```python column_header = "Dados" column_data = ["João Silva", "Maria Santos", "Pedro Oliveira"] column_index = 0 # Pattern detection pattern = detect_data_pattern(column_data) # Result: {'type': 'text', 'confidence': 0.5} # Find all potential matches matches = find_best_match("Dados", EXPANDED_HEADER_KNOWLEDGE) # Results: # [ # ('name', 0.45), # Position suggests name, text pattern # ('email', 0.35), # Could be email if pattern matched # ('phone', 0.25), # Low match # ] # Generate suggestions with reasoning suggestions = [ { 'field': 'name', 'score': 0.45, 'confidence': 'medium', 'reasons': [ 'Position suggests name field (first column)', 'Text pattern matches name format', 'Values appear to be full names' ] }, { 'field': 'email', 'score': 0.35, 'confidence': 'low', 'reasons': [ 'Could be email if pattern matched', 'Position sometimes used for email' ] }, { 'field': '_skip', 'score': 0.30, 'confidence': 'low', 'reasons': [ 'Unclear mapping - consider skipping', 'May be metadata or notes column' ] } ] ``` **UI Action:** Show dropdown with suggestions for user selection **TypeScript Implementation:** ```typescript // In mapping UI component const suggestions = getMappingSuggestions("Dados", columnData, 0); // Display in UI ``` ### Example 5: Regional Variations **Column:** "Móvel" (Portuguese for mobile/phone) **Sample Data:** ``` "Móvel" "11999999999" "21988888888" "11987654321" ``` **Analysis:** ```python column_header = "Móvel" column_data = ["11999999999", "21988888888", "11987654321"] # Step 1: Semantic analysis with accent handling normalized_header = normalize_header("Móvel") # "movel" synonyms = EXPANDED_HEADER_KNOWLEDGE['phone'] # ['telefone', 'celular', 'móvel', 'movel', 'whatsapp', ...] # Check synonym chain: móvel → movel → celular → telefone → phone semantic_score = 0.88 # Strong match through synonym chain # Step 2: Pattern detection pattern = detect_data_pattern(column_data) # Result: {'type': 'phone', 'confidence': 0.92, 'format': 'brazilian'} # Step 3: Total score total_score = 0.88 * 0.40 + 0.92 * 0.30 + 0.70 * 0.15 + 0.8 * 0.15 # Total: 0.90 → Map to 'phone' with 90% confidence ``` **Result:** Map to `phone` (0.90 confidence) **Key Learning:** Accent normalization is crucial for Portuguese text matching. ### Example 6: Handling Multiple Data Blocks **Spreadsheet Structure:** ``` Row 0: "Relatório de Vendas - Q1 2024" Row 1: (empty) Row 2: "Nome" | "Email" | "Telefone" Row 3: "João" | "joao@..." | "11999..." Row 4: "Maria" | "maria@..." | "21988..." Row 5: (empty) Row 6: "Resumo" | "Total" | "Média" Row 7: "Vendas" | "15000" | "7500" ``` **Detection Strategy:** ```python # First pass: Detect main data header result1 = detect_header_row(rows[:10]) # Detects row 2 as header (confidence: 0.85) # Second pass: Detect summary section result2 = detect_header_row(rows[5:10]) # Detects row 6 as summary header (confidence: 0.75) # Handle separately main_data = extract_data_with_headers(rows, result1.header_row_index) summary_data = extract_data_with_headers(rows[5:], result2.header_row_index) ``` ### Example 7: Merged Cells / Multi-line Headers **Spreadsheet Structure:** ``` Row 0: "Dados Pessoais" (merged across columns A-D) Row 1: "Nome" | "Email" | "Telefone" | "CPF" Row 2: "João" | "joao@..." | "11999..." | "123..." ``` **Detection:** ```python # Detect multi-line header def detect_multiline_header(rows, start_row): # Check if next row also looks like header current_row = rows[start_row] next_row = rows[start_row + 1] if start_row + 1 < len(rows) else [] # If both rows are mostly text and similar structure if is_header_like(current_row) and is_header_like(next_row): return start_row + 1 # Header spans 2 rows return start_row # Result: Header spans rows 0-1, data starts at row 2 header_row_index = 1 # Use second row as actual column headers ``` ### Example 8: Error Detection and Validation **Column:** "CPF" **Values:** ``` "CPF" "123.456.789-00" ✓ Valid "987.654.321-00" ✓ Valid "111.111.111-11" ✗ Invalid (all same digits) "123.456.789" ✗ Invalid (missing check digits) ``` **Validation Process:** ```python column_data = [ "123.456.789-00", "987.654.321-00", "111.111.111-11", "123.456.789" ] validation_results = [] for i, value in enumerate(column_data): is_valid, normalized, error = validate_and_normalize_value(value, 'cpf') validation_results.append({ 'row': i + 2, # +2 for header row and 0-index 'value': value, 'valid': is_valid, 'normalized': normalized, 'error': error }) # Results: # [ # {'row': 2, 'valid': True, 'normalized': '12345678900', 'error': None}, # {'row': 3, 'valid': True, 'normalized': '98765432100', 'error': None}, # {'row': 4, 'valid': False, 'normalized': None, 'error': 'CPF with all same digits'}, # {'row': 5, 'valid': False, 'normalized': None, 'error': 'CPF missing check digits'} # ] # User feedback: # ⚠️ Row 4: Invalid CPF - all same digits # ⚠️ Row 5: Invalid CPF - missing check digits ``` ### Example 9: Learning from User Corrections **Scenario:** User corrects mapping "Contato" → `email` instead of suggested `phone` **Learning System:** ```typescript // Store user correction interface UserCorrection { columnHeader: string; suggestedField: string; userSelectedField: string; columnData: unknown[]; timestamp: number; } // Learn from correction function learnFromCorrection(correction: UserCorrection) { // Analyze why user chose different field const pattern = detectColumnPattern(correction.columnData); // If pattern matches user's choice better, update knowledge base if (pattern.type === 'email' && correction.userSelectedField === 'email') { // Add "contato" as email synonym in specific contexts updateKnowledgeBase('email', { context: 'when_pattern_is_email', synonyms: [...existingSynonyms, 'contato'] }); } // Track for future suggestions trackCorrection(correction); } ``` **Future Improvement:** Next time "Contato" with email pattern appears, suggest `email` first. --- ## Best Practices Summary 1. **Always use multi-factor analysis** - Don't rely on single method 2. **Sample data for pattern detection** - Use first 10-20 rows, not all data 3. **Provide fallbacks** - When confidence is low, show suggestions 4. **Validate patterns** - Use domain-specific validators (CPF, CEP, etc.) 5. **Normalize early** - Clean and normalize data during detection 6. **User feedback** - Show confidence scores and reasoning 7. **Learn from corrections** - Track user corrections to improve future mappings 8. **Handle edge cases** - Empty rows, merged cells, multiple data blocks --- ## Performance Considerations - **Limit scanning**: Only scan first 15 rows for headers - **Sample data**: Use first 10-20 rows for pattern detection - **Cache results**: Store detection results to avoid re-computation - **Lazy evaluation**: Only analyze columns that need mapping - **Parallel processing**: Analyze multiple columns simultaneously if possible