XML to CSV Converter
A robust Python tool for extracting structured data from XML documents and converting it to CSV format, with advanced handling of hierarchical structures, tables, and repeated sections.
Features
Hierarchical Data Extraction: Intelligently processes complex XML structures while preserving parent-child relationships
Table Processing Engine: Specialized handling for HTML/XML tables with header extraction and alignment
Section Pattern Recognition: Identifies and processes repeated section patterns as arrays/lists
Advanced Node Classification System: Three-phase classification pipeline with heuristic algorithms
Breadth-First Tree Construction: Complete document representation with path notation system
Multi-Parser Support: Primary lxml-xml parser with fallback mechanisms for various XML formats
Comprehensive Error Handling: Graceful handling of file access issues and malformed XML
Hierarchical CSV Output: Dot notation and indexed notation for clear data relationships
Installation
Option 1: Direct Python Installation
Clone the repository
Install dependencies: pip install beautifulsoup4 lxml pandas
Option 2: Docker Installation
Build the Docker image: docker build -t xml-processor .
Usage
Basic Command
python xml_to_csv.py path/to/your/xml_file.xml [optional_output_path.csv]
With Logging
python xml_to_csv.py sample.xml output.csv logfile.log
Using Docker
docker run --rm -v "$(pwd):/data" xml-processor /data/sample.xml
Running Tests
python -m unittest test_xml_to_csv.py
Processing Workflow
- Input and Setup Phase
Input Preparation: Prepares XML files with proper structure and access permissions
Execution Configuration: Handles command-line arguments for input, output, and logging paths
Output Planning: Generates unique output filenames based on document structure hash if not specified
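One way the structure-based output name could work is sketched below. This is only an illustration: the function names, the `output` prefix, the use of SHA-256, and the choice to hash tag names and depth (ignoring text and attributes) are all assumptions, and the standard library's ElementTree stands in for the tool's own parser.

```python
import hashlib
import xml.etree.ElementTree as ET


def structure_signature(root):
    """Return 'depth:tag' strings in document order (tags only, no text)."""
    signature = []
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        signature.append(f"{depth}:{node.tag}")
        # Push children in reverse so they are popped in document order.
        for child in reversed(list(node)):
            stack.append((child, depth + 1))
    return signature


def default_output_name(xml_text, prefix="output"):
    """Build a CSV filename from a short hash of the document's tag structure."""
    root = ET.fromstring(xml_text)
    joined = "|".join(structure_signature(root))
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    return f"{prefix}_{digest[:12]}.csv"


doc_a = "<report><section><field>1</field></section></report>"
doc_b = "<report><section><field>2</field></section></report>"
doc_c = "<report><section><note>1</note></section></report>"
name_a = default_output_name(doc_a)
name_b = default_output_name(doc_b)
name_c = default_output_name(doc_c)
print(name_a == name_b, name_a == name_c)
```

Under these assumptions, two documents that share the same layout but differ in values map to the same filename, while a change in tag structure produces a different one.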
- Document Loading and Parsing
File Access: Opens and reads XML with UTF-8 encoding
Parser Selection: Primary lxml-xml parser with fallback to standard lxml parser
Validation: Checks for root element and basic XML structure
Error Handling: Comprehensive error capture for file access and parsing issues
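The primary-then-fallback pattern described above can be sketched with the standard library alone. The two parser functions here are stand-ins, not the lxml-xml and lxml parsers the tool uses, and all names are hypothetical; the point is only the control flow of trying parsers in order and reporting every failure.

```python
import xml.etree.ElementTree as ET


def strict_parse(text):
    """Stand-in for the primary parser: fails on any malformed input."""
    return ET.fromstring(text)


def lenient_parse(text):
    """Stand-in for the fallback parser: skips junk before the first '<'."""
    start = text.find("<")
    if start == -1:
        raise ValueError("no markup found")
    return ET.fromstring(text[start:])


def parse_with_fallback(text, parsers):
    """Try each (name, parse) pair in order; return the first root obtained."""
    failures = []
    for name, parse in parsers:
        try:
            root = parse(text)
        except (ET.ParseError, ValueError) as exc:
            failures.append(f"{name}: {exc}")
            continue
        return name, root
    raise ValueError("all parsers failed: " + "; ".join(failures))


PARSERS = [("strict", strict_parse), ("lenient", lenient_parse)]

used_clean, root_clean = parse_with_fallback("<doc><a>1</a></doc>", PARSERS)
used_dirty, root_dirty = parse_with_fallback("junk <doc><a>1</a></doc>", PARSERS)
print(used_clean, used_dirty, root_dirty.tag)
```

Collecting the failure messages from every attempt keeps the final error useful when no parser succeeds.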
- Tree Construction
BFS Implementation: Builds a dictionary-based tree representation using breadth-first search
Metadata Storage: Records tag name, node type, hierarchical level, and unique path notation
Path Notation System: Creates unique identifiers (e.g., "0.1.2") for each node
Relationship Mapping: Establishes parent-child-sibling relationships between nodes
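A minimal sketch of a breadth-first build with path notation follows. It assumes the root is labelled "0" and children are numbered from 0 under their parent's path; the dictionary keys and the use of ElementTree are illustrative choices, and node type is left out because it belongs to the classification step.

```python
from collections import deque
import xml.etree.ElementTree as ET


def build_tree(root):
    """Return {path: metadata} for every element, visited breadth-first."""
    tree = {}
    queue = deque([(root, "0", None, 0)])  # (element, path, parent path, level)
    while queue:
        element, path, parent, level = queue.popleft()
        children = list(element)
        child_paths = [f"{path}.{i}" for i in range(len(children))]
        tree[path] = {
            "tag": element.tag,
            "level": level,
            "parent": parent,
            "children": child_paths,
        }
        for child, child_path in zip(children, child_paths):
            queue.append((child, child_path, path, level + 1))
    return tree


root = ET.fromstring("<doc><head><title/></head><body><p/><p/></body></doc>")
tree = build_tree(root)
for path, node in tree.items():
    print(path, node["tag"], node["level"])
```

Siblings of a node can be read from its parent's `children` list, so no separate sibling map is needed in this sketch.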
- Node Classification
Initial Classification: Identifies nodes based on HTML semantic tags
Classification Pipeline:
Phase 1: Collects unclassified nodes via BFS traversal
Phase 2: Applies heuristic classification algorithms
Phase 3: Propagates section/subsection relationships
Element Recognition: Identifies sections, subsections, headings, content blocks, fields, lists, tables, forms, and specialized elements
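The shape of such a three-phase pipeline might look like the sketch below. The tag table, the heuristics in Phase 2, the propagation rule in Phase 3, and the layout of the `tree` dictionary are all hypothetical; the tool's actual rules are not documented here.

```python
from collections import deque

# Hypothetical tag-to-type table for the initial, tag-based classification.
SEMANTIC_TAGS = {"section": "section", "table": "table", "ul": "list", "h1": "heading"}


def classify(tree, root_path="0"):
    """Annotate each node dict in `tree` with 'type' and 'section' keys."""
    for node in tree.values():
        node["type"] = SEMANTIC_TAGS.get(node["tag"])

    # Phase 1: collect still-unclassified nodes in breadth-first order.
    order, pending = [], []
    queue = deque([root_path])
    while queue:
        path = queue.popleft()
        order.append(path)
        if tree[path]["type"] is None:
            pending.append(path)
        queue.extend(tree[path]["children"])

    # Phase 2: illustrative heuristics based only on the children's tags.
    for path in pending:
        child_tags = [tree[c]["tag"] for c in tree[path]["children"]]
        if not child_tags:
            tree[path]["type"] = "field"
        elif len(child_tags) > 1 and len(set(child_tags)) == 1:
            tree[path]["type"] = "list"
        else:
            tree[path]["type"] = "content"

    # Phase 3: push section context downwards; nested sections become subsections.
    tree[root_path]["section"] = None
    for path in order:
        node = tree[path]
        inside = node["section"]
        if node["type"] == "section" and inside is not None:
            node["type"] = "subsection"
        own = path if node["type"] in ("section", "subsection") else inside
        for child in node["children"]:
            tree[child]["section"] = own
    return tree


tree = {
    "0": {"tag": "doc", "children": ["0.0"]},
    "0.0": {"tag": "section", "children": ["0.0.0", "0.0.1"]},
    "0.0.0": {"tag": "name", "children": []},
    "0.0.1": {"tag": "section", "children": ["0.0.1.0"]},
    "0.0.1.0": {"tag": "items", "children": ["0.0.1.0.0", "0.0.1.0.1"]},
    "0.0.1.0.0": {"tag": "item", "children": []},
    "0.0.1.0.1": {"tag": "item", "children": []},
}
classify(tree)
for path, node in tree.items():
    print(path, node["tag"], node["type"], node["section"])
```

Because Phase 3 walks the nodes in breadth-first order, every parent is resolved before its children, which is what lets the section context flow down in a single pass.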
- Table Processing
Table Detection: Locates all table elements in the document
Header Extraction: Identifies and processes table headers
Row Processing: Extracts and normalizes table row data
Column Width Optimization: Calculates optimal column widths for text representation
Formatting: Creates properly aligned text representation with separators
Hierarchy Integration: Assigns unique identifiers based on document position
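The row normalization, column width, and formatting steps can be illustrated with a small pure-Python sketch. The function name, the " | " separator, and the rule that short rows are padded with empty cells are assumptions made for the example.

```python
def format_table(headers, rows):
    """Render headers and rows as aligned text with ' | ' separators."""
    width = len(headers)
    # Normalize rows: stringify cells, strip whitespace, trim or pad to header width.
    cleaned = [[str(cell).strip() for cell in row][:width] for row in rows]
    cleaned = [row + [""] * (width - len(row)) for row in cleaned]
    # Each column is as wide as its longest cell, header included.
    widths = [max(len(line[i]) for line in [headers] + cleaned) for i in range(width)]

    def render(cells):
        return " | ".join(cell.ljust(w) for cell, w in zip(cells, widths)).rstrip()

    rule = "-+-".join("-" * w for w in widths)
    return "\n".join([render(headers), rule] + [render(row) for row in cleaned])


table_text = format_table(["Item", "Qty"], [["Widget", 2], [" Gear ", 10], ["Bolt"]])
print(table_text)
```

Here the first column is padded to the width of "Widget", and the incomplete last row still lines up because it is padded with an empty cell before rendering.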
- Data Extraction
Extraction Pipeline Coordination: Manages the entire extraction process
Field Processing: Extracts explicit key-value pairs with hierarchical context
Section Content Processing: Handles regular sections and their content
Repeated Pattern Handling: Processes arrays and lists with indexed notation
Table Integration: Incorporates tables with proper sectional context
Hierarchical Context Maintenance: Preserves proper nesting in naming conventions
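To show how hierarchical context and indexed notation can combine, here is a sketch that flattens an already-extracted nested structure into named values. The nested dict/list input shape and the `flatten` helper are assumptions for illustration; indices start at 1 to match the "Section[1]" form used in this document.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts and lists into {header: text} with dot and index notation."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            name = f"{prefix}.{key}" if prefix else key
            flat.update(flatten(child, name))
    elif isinstance(value, list):
        # Repeated sections are numbered from 1.
        for index, child in enumerate(value, start=1):
            flat.update(flatten(child, f"{prefix}[{index}]"))
    else:
        flat[prefix] = "" if value is None else str(value)
    return flat


extracted = {
    "Order": {"Customer": {"Name": "Ada"}},
    "Item": [{"Details": {"Sku": "A-1"}}, {"Details": {"Sku": "B-2"}}],
}
flat = flatten(extracted)
for header, text in flat.items():
    print(header, "=", text)
```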
- CSV Generation
File Creation: Generates CSV file with proper UTF-8 encoding
Header Structure:
Simple fields use dot notation: "Section.Subsection.Field"
Repeated sections use indexed notation: "Section[1].Subsection.Field"
Tables have special naming: "Section.Table_1"
Data Organization: Writes all values in a single row with columns corresponding to hierarchical headers
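A single-row layout of this kind can be produced with the standard `csv` module, as in the sketch below. The header names and values are made up, and the helper name is hypothetical; an in-memory buffer is used so the example runs without touching the file system.

```python
import csv
import io


def write_single_row(flat, stream):
    """Write one header row and one value row, in the dictionary's insertion order."""
    writer = csv.DictWriter(stream, fieldnames=list(flat), lineterminator="\n")
    writer.writeheader()
    writer.writerow(flat)


flat = {
    "Invoice.Customer.Name": "Ada",
    "Invoice.Line[1].Item.Sku": "A-1",
    "Invoice.Line[2].Item.Sku": "B-2",
    "Invoice.Table_1": "Item | Qty\nBolt | 4",
}
buffer = io.StringIO()
write_single_row(flat, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to disk instead, opening the file with `open(path, "w", newline="", encoding="utf-8")` gives UTF-8 output and lets the `csv` module control line endings. Multi-line table text survives because the writer quotes any field that contains a newline.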
- Output and Completion
Visual Feedback: Displays document structure for debugging purposes
Confirmation: Provides completion message with CSV file location
Process Termination: Ensures proper cleanup and resource release
Output Format Details
The generated CSV uses a hierarchical header structure:
Simple Fields: Section.Subsection.Field
Nested Fields: Section.Subsection.SubSubsection.Field
Repeated Sections: Section[1].Subsection.Field, Section[2].Subsection.Field
Tables: Section.Table_1, Section.SubSection.Table_2
Table Cells: Section.Table_1.Row[1].Column[2]
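The table cell naming can be illustrated with a short helper. It assumes rows and columns are both counted from 1 and that the rows passed in are data rows only; the function name and arguments are hypothetical.

```python
def table_cell_headers(section_path, table_number, rows):
    """Map each cell of a table to a 'Row[r].Column[c]' header, counting from 1."""
    base = f"{section_path}.Table_{table_number}"
    cells = {}
    for r, row in enumerate(rows, start=1):
        for c, value in enumerate(row, start=1):
            cells[f"{base}.Row[{r}].Column[{c}]"] = value
    return cells


cells = table_cell_headers("Section", 1, [["Bolt", "4"], ["Nut", "9"]])
for header, value in cells.items():
    print(header, "=", value)
```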
Error Handling and Logging
File Access Issues: Catches and reports permissions and I/O errors
Parsing Failures: Implements parser fallback mechanisms with detailed error messaging
Structure Validation: Verifies critical document structures with clear error messages
Logging System: Optional detailed logging of the entire process for debugging
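A rough sketch of how file access errors, parse errors, and optional logging could fit together is shown below. The logger name, function names, and log format are assumptions, and ElementTree again stands in for the tool's parser.

```python
import logging
import os
import tempfile
import xml.etree.ElementTree as ET

logger = logging.getLogger("xml_to_csv_sketch")


def configure_logging(log_path=None):
    """Log to `log_path` when given, otherwise to the console."""
    if log_path:
        handler = logging.FileHandler(log_path, encoding="utf-8")
    else:
        handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)


def load_root(path):
    """Return the root element, or None after logging why loading failed."""
    try:
        with open(path, encoding="utf-8") as handle:
            text = handle.read()
    except (OSError, UnicodeDecodeError) as exc:  # missing file, permissions, bad encoding
        logger.error("Cannot read %s: %s", path, exc)
        return None
    try:
        return ET.fromstring(text)
    except ET.ParseError as exc:
        logger.error("Cannot parse %s: %s", path, exc)
        return None


configure_logging()
with tempfile.TemporaryDirectory() as folder:
    good_path = os.path.join(folder, "good.xml")
    bad_path = os.path.join(folder, "bad.xml")
    with open(good_path, "w", encoding="utf-8") as handle:
        handle.write("<doc><a>1</a></doc>")
    with open(bad_path, "w", encoding="utf-8") as handle:
        handle.write("<doc><a>1</doc>")
    good_root = load_root(good_path)
    bad_root = load_root(bad_path)
    missing_root = load_root(os.path.join(folder, "missing.xml"))
```

Returning `None` after logging keeps the caller's control flow simple; a command-line entry point could turn that into a non-zero exit code.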