A powerful Python utility that transforms structured XML data into CSV format while preserving hierarchical relationships. Features advanced handling of complex document structures including nested sections, tables, and repeated elements.

nirbhay221/xml-to-csv-processor

XML to CSV Converter

A robust Python tool for extracting structured data from XML documents and converting it to CSV format, with advanced handling of hierarchical structures, tables, and repeated sections.

Features

Hierarchical Data Extraction: Intelligently processes complex XML structures while preserving parent-child relationships
Table Processing Engine: Specialized handling for HTML/XML tables with header extraction and alignment
Section Pattern Recognition: Identifies and processes repeated section patterns as arrays/lists
Advanced Node Classification System: Three-phase classification pipeline with heuristic algorithms
Breadth-First Tree Construction: Complete document representation with path notation system
Multi-Parser Support: Primary lxml-xml parser with fallback mechanisms for various XML formats
Comprehensive Error Handling: Graceful handling of file access issues and malformed XML
Hierarchical CSV Output: Dot notation and indexed notation for clear data relationships

Installation

Option 1: Direct Python Installation

Clone the repository

Install dependencies: pip install beautifulsoup4 lxml pandas

Option 2: Docker Installation

Build the Docker image: docker build -t xml-processor .

Usage

Basic Command

python xml_to_csv.py path/to/your/xml_file.xml [optional_output_path.csv]

With Logging

python xml_to_csv.py sample.xml output.csv logfile.log

Using Docker

docker run --rm -v "$(pwd):/data" xml-processor /data/sample.xml

Running Tests

python -m unittest test_xml_to_csv.py

Processing Workflow

  1. Input and Setup Phase

Input Preparation: Prepares XML files with proper structure and access permissions
Execution Configuration: Handles command-line arguments for input, output, and logging paths
Output Planning: Generates unique output filenames based on document structure hash if not specified
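The setup phase can be illustrated with a minimal sketch. It assumes the three positional arguments shown under Usage and a default filename derived from a hash of the tag structure. The helper names (`structure_hash`, `resolve_paths`), the `output_<hash>.csv` pattern, and the choice of SHA-256 are illustrative assumptions, not the project's actual implementation.

```python
import hashlib
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple


def _walk(elem, depth=0):
    """Yield 'depth:tag' for every element, in document order."""
    yield f"{depth}:{elem.tag}"
    for child in elem:
        yield from _walk(child, depth + 1)


def structure_hash(xml_text: str) -> str:
    """Hash the tag layout only, so documents with equal structure share a name."""
    root = ET.fromstring(xml_text)
    signature = "/".join(_walk(root))
    return hashlib.sha256(signature.encode("utf-8")).hexdigest()[:12]


def resolve_paths(argv: List[str], xml_text: str) -> Tuple[str, str, Optional[str]]:
    """Return (input_path, output_path, log_path) from positional arguments."""
    if len(argv) < 2:
        raise SystemExit("usage: xml_to_csv.py INPUT.xml [OUTPUT.csv] [LOGFILE.log]")
    input_path = argv[1]
    # Hypothetical default name: fall back to a structure-based filename.
    output_path = argv[2] if len(argv) > 2 else f"output_{structure_hash(xml_text)}.csv"
    log_path = argv[3] if len(argv) > 3 else None
    return input_path, output_path, log_path
```

Because the hash ignores text content, two files with the same element layout but different values would map to the same default filename under this sketch.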

  2. Document Loading and Parsing

File Access: Opens and reads XML with UTF-8 encoding
Parser Selection: Primary lxml-xml parser with fallback to standard lxml parser
Validation: Checks for root element and basic XML structure
Error Handling: Comprehensive error capture for file access and parsing issues
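The loading step might look roughly like the sketch below. To stay dependency-free it substitutes the standard library's `xml.etree.ElementTree` for the lxml-based parsers the project uses, so the parser chain here contains a single entry. `XmlLoadError`, `read_text`, `parse_with_fallback`, and `load_xml` are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET


class XmlLoadError(Exception):
    """Raised when a file cannot be read or does not contain usable XML."""


def read_text(path):
    """Read the file as UTF-8, turning access problems into XmlLoadError."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except (OSError, UnicodeDecodeError) as exc:
        raise XmlLoadError(f"cannot read {path}: {exc}") from exc


def parse_with_fallback(text, parsers=(ET.fromstring,)):
    """Try each parser in order and return the first root element produced."""
    failures = []
    for parser in parsers:
        try:
            root = parser(text)
        except (ET.ParseError, ValueError) as exc:
            failures.append(f"{parser.__name__}: {exc}")
            continue
        if root is not None:
            return root
        failures.append(f"{parser.__name__}: no root element")
    raise XmlLoadError("all parsers failed: " + "; ".join(failures))


def load_xml(path):
    """Read and parse a file, raising XmlLoadError on any failure."""
    return parse_with_fallback(read_text(path))
```

Additional parser callables can be appended to the `parsers` tuple; each is only tried if every earlier one failed.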

  3. Tree Construction

BFS Implementation: Builds a dictionary-based tree representation using breadth-first search
Metadata Storage: Records tag name, node type, hierarchical level, and unique path notation
Path Notation System: Creates unique identifiers (e.g., "0.1.2") for each node
Relationship Mapping: Establishes parent-child-sibling relationships between nodes
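As a rough illustration of the tree construction described above, the following sketch builds a path-keyed dictionary with a breadth-first traversal. It assumes the root is labelled "0" and that each child appends its zero-based position, which matches the "0.1.2" example. The dictionary layout and the `build_tree` name are assumptions, and `xml.etree.ElementTree` stands in for the project's parser.

```python
from collections import deque
import xml.etree.ElementTree as ET


def build_tree(root):
    """Return {path: node_info} for every element, visited breadth-first."""
    tree = {}
    queue = deque([(root, "0", None, 0)])  # (element, path, parent path, level)
    while queue:
        elem, path, parent, level = queue.popleft()
        tree[path] = {
            "tag": elem.tag,
            "level": level,
            "parent": parent,
            "children": [],
            "text": (elem.text or "").strip(),
        }
        for index, child in enumerate(elem):
            child_path = f"{path}.{index}"
            tree[path]["children"].append(child_path)
            queue.append((child, child_path, path, level + 1))
    return tree


tree = build_tree(ET.fromstring("<doc><a><x>hi</x></a><b/></doc>"))
for path, node in tree.items():
    print(path, node["tag"], node["level"])
```

Siblings of a node are the other entries in its parent's `children` list, so the parent-child-sibling relationships can all be read from the path keys.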

  4. Node Classification

Initial Classification: Identifies nodes based on HTML semantic tags

Classification Pipeline:

Phase 1: Collects unclassified nodes via BFS traversal
Phase 2: Applies heuristic classification algorithms
Phase 3: Propagates section/subsection relationships

Element Recognition: Identifies sections, subsections, headings, content blocks, fields, lists, tables, forms, and specialized elements
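The three phases can be sketched on a path-keyed dictionary tree. The tag map and the heuristics below (a container becomes a section, a text leaf becomes a field, a section nested in another section becomes a subsection) are illustrative guesses, since the README does not spell out the actual rules. `SEMANTIC_TAGS` and `classify` are hypothetical names.

```python
from collections import deque

# Assumed mapping from semantic tags to node types (not the project's list).
SEMANTIC_TAGS = {
    "section": "section", "h1": "heading", "h2": "heading", "p": "content",
    "table": "table", "ul": "list", "ol": "list", "form": "form",
}


def classify(tree, root_path="0"):
    """Annotate every node in `tree` with a 'type' key, in place."""
    # Initial classification from semantic tags.
    for node in tree.values():
        node["type"] = SEMANTIC_TAGS.get(node["tag"].lower())

    # Phase 1: collect nodes that are still unclassified, breadth-first.
    pending, queue = [], deque([root_path])
    while queue:
        path = queue.popleft()
        if tree[path]["type"] is None:
            pending.append(path)
        queue.extend(tree[path]["children"])

    # Phase 2: simple heuristics for the leftovers.
    for path in pending:
        node = tree[path]
        if path == root_path:
            node["type"] = "document"
        elif node["children"]:
            node["type"] = "section"
        else:
            node["type"] = "field" if node["text"] else "empty"

    # Phase 3: a section nested inside another section becomes a subsection.
    queue = deque([(root_path, False)])
    while queue:
        path, inside_section = queue.popleft()
        node = tree[path]
        if node["type"] == "section" and inside_section:
            node["type"] = "subsection"
        nested = inside_section or node["type"] in ("section", "subsection")
        queue.extend((child, nested) for child in node["children"])
    return tree
```

Each node is expected to carry `tag`, `text`, and `children` keys; any tree with that shape can be passed in.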

  5. Table Processing

Table Detection: Locates all table elements in the document
Header Extraction: Identifies and processes table headers
Row Processing: Extracts and normalizes table row data
Column Width Optimization: Calculates optimal column widths for text representation
Formatting: Creates properly aligned text representation with separators
Hierarchy Integration: Assigns unique identifiers based on document position
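The width calculation and aligned text output could work along these lines. This is a sketch that assumes headers and rows have already been extracted as lists; the separator characters and the `format_table` name are assumptions rather than the project's actual format.

```python
def format_table(headers, rows):
    """Render headers and rows as left-aligned text columns with separators."""
    width = len(headers)
    # Normalize rows: stringify cells, trim whitespace, pad or cut to header width.
    normalized = [
        [str(cell).strip() for cell in row[:width]] + [""] * (width - len(row))
        for row in rows
    ]
    # Each column is as wide as its longest cell, header included.
    widths = [
        max([len(header)] + [len(row[i]) for row in normalized])
        for i, header in enumerate(headers)
    ]

    def line(cells):
        joined = " | ".join(cell.ljust(w) for cell, w in zip(cells, widths))
        return joined.rstrip()

    separator = "-+-".join("-" * w for w in widths)
    return "\n".join([line(headers), separator] + [line(row) for row in normalized])


print(format_table(["Name", "Qty"], [["Apple", 3], ["Kiwi"]]))
```

Short rows are padded with empty cells and long rows are truncated, so every line has the same number of columns as the header.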

  6. Data Extraction

Extraction Pipeline Coordination: Manages the entire extraction process
Field Processing: Extracts explicit key-value pairs with hierarchical context
Section Content Processing: Handles regular sections and their content
Repeated Pattern Handling: Processes arrays and lists with indexed notation
Table Integration: Incorporates tables with proper sectional context
Hierarchical Context Maintenance: Preserves proper nesting in naming conventions
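To show how hierarchical context and indexed notation fit together, here is a minimal sketch that flattens nested data into column names. It assumes extracted content is held as nested dicts (sections) and lists (repeated sections) and uses 1-based indices as in "Section[1]". The `flatten` helper is illustrative, not the project's extraction code.

```python
def flatten(value, prefix=""):
    """Map nested dicts and lists to {'Section[1].Field': value} columns."""
    columns = {}
    if isinstance(value, dict):
        for key, child in value.items():
            name = f"{prefix}.{key}" if prefix else key
            columns.update(flatten(child, name))
    elif isinstance(value, list):
        # Repeated sections get a 1-based index appended to the shared prefix.
        for index, child in enumerate(value, start=1):
            columns.update(flatten(child, f"{prefix}[{index}]"))
    else:
        columns[prefix] = value
    return columns


record = {"Order": {"Id": "42", "Item": [{"Name": "Pen"}, {"Name": "Ink"}]}}
for name, value in flatten(record).items():
    print(name, "=", value)
```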

  7. CSV Generation

File Creation: Generates CSV file with proper UTF-8 encoding

Header Structure:

Simple fields use dot notation: "Section.Subsection.Field"
Repeated sections use indexed notation: "Section[1].Subsection.Field"
Tables have special naming: "Section.Table_1"

Data Organization: Writes all values in a single row with columns corresponding to hierarchical headers
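A single-row CSV with hierarchical headers can be written with the standard `csv` module, as sketched below. The column names and values are made-up sample data, and `write_single_row` is a hypothetical helper name.

```python
import csv


def write_single_row(path, columns):
    """Write one header row and one value row, UTF-8 encoded."""
    # newline="" lets the csv module control line endings, which keeps
    # multi-line values (such as text tables) intact inside quoted cells.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(columns.keys())
        writer.writerow(columns.values())


columns = {
    "Report.Title": "Quarterly Summary",
    "Report.Item[1].Name": "Pen",
    "Report.Item[2].Name": "Ink",
    "Report.Table_1": "Name | Qty\n-----+----\nPen  | 3",
}
write_single_row("example_output.csv", columns)
```

Dictionaries preserve insertion order, so the header row and the value row stay aligned column for column.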

  8. Output and Completion

Visual Feedback: Displays document structure for debugging purposes
Confirmation: Provides completion message with CSV file location
Process Termination: Ensures proper cleanup and resource release

Output Format Details

The generated CSV uses a hierarchical header structure:

Simple Fields: Section.Subsection.Field
Nested Fields: Section.Subsection.SubSubsection.Field
Repeated Sections: Section[1].Subsection.Field, Section[2].Subsection.Field
Tables: Section.Table_1, Section.SubSection.Table_2
Table Cells: Section.Table_1.Row[1].Column[2]

Error Handling and Logging

File Access Issues: Catches and reports permissions and I/O errors
Parsing Failures: Implements parser fallback mechanisms with detailed error messaging
Structure Validation: Verifies critical document structures with clear error messages
Logging System: Optional detailed logging of the entire process for debugging
