The purpose of this project is to design and implement a CSV data analysis program in standard C capable of:
- Reading data from one or more CSV files with minimal assumptions on delimiting or formatting.
- Preprocessing data fields (e.g., removing extra whitespace, repeated delimiters, inconsistent hyphens, etc.).
- Identifying numeric (plottable) fields vs. non-numeric fields.
- Extracting numeric data fields into separate files for further statistical processing, histogram generation, normality tests, etc.
- Computing standard statistics (mean, std. dev., skewness) and normality tests (Anderson-Darling).
- Automatically generating MATLAB scripts to visualize histograms and normal fits for each numeric field.

The final outcome is a reproducible pipeline for CSV data ingestion, cleaning, analysis, and script generation for easy plotting.
2.1 Math Analysis
Core mathematical components include:
- Statistical computations:
- Mean, standard deviation, skewness, and the Anderson–Darling normality test (a sketch of these computations follows this list).
- Freedman–Diaconis rule (based on the IQR) to determine an optimal bin width for histograms.
- Sorting (e.g., radix sort on double-precision bit patterns).
- Histogram binning:
- Freedman–Diaconis rule to compute the bin width.
- Counting occurrences in each bin to form histograms.
- Reading/writing CSV:
- Parsing CSV data into arrays.
- Handling missing fields and non-numeric tokens.

An optional MATLAB .m script automatically overlays normal distribution curves (using the computed μ and σ) on the histograms.
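To make these quantities concrete, here is a minimal C sketch of the mean, standard deviation, skewness, and Freedman–Diaconis bin width. The function and helper names (basic_stats, percentile, cmp_double) are illustrative assumptions rather than the project's actual API; conventions for the standard deviation and skewness vary, so this sketch uses the sample (n−1) variance and moment skewness, assumes n ≥ 2, and sorts with qsort rather than the radix sort mentioned above.

    #include <math.h>
    #include <stdlib.h>

    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* Linear-interpolation percentile on sorted data (one of several conventions). */
    static double percentile(const double *sorted, size_t n, double p)
    {
        double idx = p * (double)(n - 1);
        size_t lo = (size_t)idx;
        double frac = idx - (double)lo;
        return (lo + 1 < n) ? sorted[lo] + frac * (sorted[lo + 1] - sorted[lo])
                            : sorted[lo];
    }

    /* Mean, sample standard deviation, moment skewness, and Freedman-Diaconis bin width. */
    void basic_stats(const double *data, size_t n,
                     double *mean, double *std_dev, double *skew, double *bin_width)
    {
        double sum = 0.0, m2 = 0.0, m3 = 0.0;
        for (size_t i = 0; i < n; i++) sum += data[i];
        *mean = sum / (double)n;

        for (size_t i = 0; i < n; i++) {
            double d = data[i] - *mean;
            m2 += d * d;
            m3 += d * d * d;
        }
        *std_dev = sqrt(m2 / (double)(n - 1));                 /* sample std. dev. */
        *skew = (m3 / (double)n) / pow(m2 / (double)n, 1.5);   /* g1 = m3 / m2^(3/2) */

        /* Freedman-Diaconis: bin width = 2 * IQR / n^(1/3), computed on a sorted copy. */
        double *sorted = malloc(n * sizeof *sorted);
        if (!sorted) { *bin_width = 0.0; return; }             /* allocation failure guard */
        for (size_t i = 0; i < n; i++) sorted[i] = data[i];
        qsort(sorted, n, sizeof *sorted, cmp_double);
        double iqr = percentile(sorted, n, 0.75) - percentile(sorted, n, 0.25);
        *bin_width = 2.0 * iqr / cbrt((double)n);
        free(sorted);
    }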
2.2 Program Structure and Flow
The program is divided into modular components:
- CommonDefinitions.c/h: Contains constants and macros (array sizes, date/time format arrays).
- GeneralUtilities.c/h: Memory allocation and array-copy functions. May include sorting (e.g., radix_sort_doubles).
- StringUtilities.c/h: Functions for trimming strings, identifying delimiters, handling repeated commas, etc.
- FileUtilities.c/h: File and directory operations (counting lines, reading contents, creating directories, etc.).
- DataExtraction.c/h: Extracts CSV fields, identifies missing vs. valid values, and separates numeric (plottable) from non-numeric fields.
- Integrators.c/h: (Optional) Example numerical integrators: trapezoidal, Romberg, Euler’s method, etc.
- StatisticalMethods.c/h: Statistical computations (mean, std. dev., skewness, A–D normality test, histogram binning).
- PlottingMethods.c/h: Generation of MATLAB scripts (histogram overlays, normal PDF curves).
- DataAnalysis.c/h: Higher-level orchestration: reading CSV, calling formatting steps, writing numeric fields to disk, and performing the full analysis.
- DebuggingUtilities.c/h: Functions that help print internal data structures and debug information.
- main.c: The entry point; orchestrates reading CSV files, calling data extraction and analysis, and generating plots.

2.3 Algorithmic Overview
Below is simplified pseudocode illustrating the high-level approach:

Function PreprocessData(filePathName):
    fileContents ← read_file_contents(filePathName)
    delimiter ← identify_delimiter(fileContents)
    extract_and_format_data_set(fileContents, delimiter)
    write_data_set(fileContents, filePathName)
    return directory with separate fields
Function AnalyzeData(preparedDirectory):
    perform_statistical_analysis_on_plottable_data(preparedDirectory)
    generate_matlab_scripts()

An alternative version with more detail:

Function ProcessDataSetForAnalysis(filePath):
    lines ← ReadFile(filePath)
    delimiter ← IdentifyDelimiter(lines)
    preprocessed ← ExtractAndFormatData(lines, delimiter)
    plottableDir ← WriteDataSet(preprocessed, filePath, …)
    Print("Data set preprocessed and stored in plottableDir")
    return plottableDir
Function PerformFullAnalysisAndModeling(plottableDir):
    fileList ← GetFilePathnamesInDirectory(plottableDir)
    for file in fileList:
        data ← LoadDoubleData(file)
        compute mean, std, skewness, histogram, AD_stat
        WriteStats("field_analysis.txt", …)
        WriteHistogram("field_histogram.txt", …)
        GenerateMATLABscripts(…)
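The AD_stat in the pseudocode refers to the Anderson–Darling statistic. As a minimal sketch of one common formulation, the following C function computes A² against a normal distribution whose μ and σ were estimated from the sample, with the usual small-sample correction; the names anderson_darling and normal_cdf are illustrative rather than the project's actual functions, and the data must already be sorted in ascending order.

    #include <math.h>
    #include <stddef.h>

    /* Normal CDF via the complementary error function (C99 erfc). */
    static double normal_cdf(double x, double mu, double sigma)
    {
        return 0.5 * erfc(-(x - mu) / (sigma * sqrt(2.0)));
    }

    /* A^2 statistic for normality; 'sorted' must be ascending.  This sketch has no
     * guard against log(0) for values far out in the tails. */
    double anderson_darling(const double *sorted, size_t n, double mu, double sigma)
    {
        double a2 = 0.0;
        for (size_t i = 0; i < n; i++) {
            double fi = normal_cdf(sorted[i], mu, sigma);
            double fj = normal_cdf(sorted[n - 1 - i], mu, sigma);
            a2 += (2.0 * (double)(i + 1) - 1.0) * (log(fi) + log(1.0 - fj));
        }
        a2 = -(double)n - a2 / (double)n;
        /* Correction commonly applied when mu and sigma are estimated from the data. */
        return a2 * (1.0 + 0.75 / (double)n + 2.25 / ((double)n * (double)n));
    }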
2.4 Logic Flow and Workflow
- Initial Capture: Read the entire CSV file line by line into a string array.
- Delimiter Determination: Inspect likely delimiters (commas, tabs, etc.) across sample lines to guess the file’s delimiter (see the sketch after this list).
- Extraction/Formatting
- Tokenize lines.
- Clean repeated delimiters.
- Trim whitespace.
- Convert date/time fields to Unix time (if applicable).
- Identify numeric vs. non-numeric tokens.
- Plottable Data Writing: Store each numeric field in its own FieldName.txt file in a _Plottable_Fields directory.
- Statistical Analysis
- Calculate mean, standard deviation, skewness.
- Perform Anderson–Darling (A–D) normality test.
- Compute Freedman–Diaconis bin width and histogram.
- MATLAB Script Generation: Produce .m scripts for each numeric field, plotting a histogram overlaid with a normal PDF, as sketched below.
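One way to realize the script-generation step is sketched below in C: the function writes a small .m script that loads a single numeric field from its text file, draws a probability-density histogram with a supplied bin width, and overlays the normal PDF built from the computed μ and σ. The function name, argument list, and the exact MATLAB commands emitted are illustrative assumptions, not the project's actual output format.

    #include <stdio.h>

    /* Writes a MATLAB script that plots one field's histogram with a normal overlay. */
    int write_matlab_histogram_script(const char *script_path, const char *field_file,
                                      const char *field_name,
                                      double mu, double sigma, double bin_width)
    {
        FILE *fp = fopen(script_path, "w");
        if (!fp) return -1;

        fprintf(fp, "data = load('%s');\n", field_file);
        fprintf(fp, "edges = min(data):%g:max(data);\n", bin_width);
        fprintf(fp, "histogram(data, edges, 'Normalization', 'pdf');\n");
        fprintf(fp, "hold on;\n");
        fprintf(fp, "x = linspace(min(data), max(data), 200);\n");
        fprintf(fp, "plot(x, exp(-(x - %g).^2 / (2*%g^2)) / (%g*sqrt(2*pi)), 'LineWidth', 2);\n",
                mu, sigma, sigma);
        fprintf(fp, "title('%s');\n", field_name);
        fprintf(fp, "hold off;\n");

        fclose(fp);
        return 0;
    }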
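The delimiter-determination step earlier in this list can likewise be sketched in C. The version below simply counts a few candidate delimiters over the sampled lines and returns the most frequent one; the candidate set, the signature, and the frequency-based scoring are simplifying assumptions, and a real implementation might instead favor the candidate that yields the most consistent field count per line.

    #include <stddef.h>

    /* Picks the candidate delimiter that occurs most often across the sampled lines. */
    char identify_delimiter(char **lines, size_t line_count)
    {
        const char candidates[] = {',', '\t', ';', '|'};
        size_t best_total = 0;
        char best = ',';                      /* default if no candidate is found */

        for (size_t c = 0; c < sizeof candidates; c++) {
            size_t total = 0;
            for (size_t i = 0; i < line_count; i++)
                for (const char *p = lines[i]; *p; p++)
                    if (*p == candidates[c]) total++;
            if (total > best_total) {
                best_total = total;
                best = candidates[c];
            }
        }
        return best;
    }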
Several test CSV files were used to verify correctness:
- weather_measurements.csv: Contains numeric columns (e.g., temperature, humidity) and textual date/time fields.
- particle_experiment.csv: Mixture of mass, charge, and string descriptors.

Expected Results:
- A _Plottable_Fields directory containing numeric columns as individual .txt files.
- A _Full_Analysis_Results/ directory with:
- A histogram for each numeric column.
- Statistical summaries (mean, std. dev., skewness, A–D statistic).
- MATLAB scripts that plot histograms with an overlaid normal curve.
By modularizing the program into distinct C source/include files, we gain maintainability, clarity, and extensibility. The core pipeline is:
- Parsing/Cleaning CSV Data
- Identifying Numeric vs. Non-numeric Fields
- Computing Standard Statistics and Normality Testing
- Generating MATLAB Scripts for Easy Visualization

This structure can be extended to more complex transformations, different file formats, or advanced plotting routines.