\chapter{Evaluation}
\label{chap:evaluation}
One of the goals outlined in \autoref{chap:analysis} was for the Python implementation of the Ataccama Expression Language to be viable for real-world applications. As a metric for this viability, we have chosen to evaluate the performance of the Python implementation against similar solutions, Soda Core and Great Expectations. This chapter presents the performance testing methodology, the test environment setup, the test cases, the performance analysis, and a discussion of the results.
\section{Introduction to Performance Testing}
\subsection{Purpose of Testing}
Performance testing is crucial for assessing the viability of the Python implementation of the Ataccama Expression Language, in particular for ensuring that it can handle data quality rules within Python environments efficiently and effectively, i.e. comparably to similar solutions. The goal is not to match the performance of the similar solutions, Soda Core and Great Expectations, but to ensure that the Python prototype is efficient enough for practical use. The aim is to determine whether the Python implementation performs within acceptable limits: a slowdown by a factor of up to 10 compared to the similar solutions might be considered tolerable for deployment, whereas a 1000-fold slowdown would indicate serious efficiency issues that could render the solution impractical. By establishing these performance benchmarks, we can validate that the Python implementation meets the minimum requirements for real-world applications, ensuring it is a viable alternative for data engineers who require programmatic access to Ataccama's data quality tools.
\subsection{Testing Framework}
\subsubsection{Tools and Setup}
The performance testing utilizes a structured approach where a specific data quality rule—represented in three different formats: an original Ataccama expression, a Soda Core implementation, and a Great Expectations setup—is executed across a range of dataset sizes. This method allows for a direct comparison of how well the Python implementation scales with increasing data volumes, a critical factor in many data engineering tasks.
\subsubsection{Methodology}
\paragraph{Dataset Sizes} Tests are conducted on datasets of varying sizes, starting from 10 records and scaling up to 1 or 10 million records depending on test case complexity. This range is chosen to simulate different real-world scenarios, from small, manageable datasets to large-scale data processing tasks.
\paragraph{Execution Repetition} Each test is repeated ten times to ensure consistency and reliability in the results. This repetition helps mitigate any anomalies or outliers that could affect the accuracy of the performance assessment.
The first repetition is treated as a warm-up that lets the Python runtime prepare before the actual performance metrics are recorded, so that the measurements reflect the optimized execution of the code. Although Python is often perceived as an interpreted language that executes code directly from its high-level syntax, in practice Python first compiles the source code into bytecode, a lower-level, platform-independent representation, which is then executed by the interpreter. The warm-up run therefore keeps one-time costs such as bytecode compilation and cache population out of the recorded measurements; recent CPython versions additionally apply runtime optimizations to the bytecode, such as type specialization and inline caching, which benefit from already-executed code paths.
\paragraph{Process Isolation} Each test run is executed in a freshly started Python process to avoid any potential interference from memory leaks, memory layout, residual data, or other artifacts from previous executions. This approach ensures that each test is conducted in a clean state, providing accurate and unbiased performance measurements.
\paragraph{Measurement Metrics} The key performance metric collected during the tests is execution time. This metric provides a direct measure of how long it takes for the Python implementation to process the data quality rule on datasets of different sizes, which is essential for assessing the viability and scalability of the Python implementation.
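The measurement procedure can be sketched as follows. This is only an illustration of the methodology described above, not the actual benchmark harness; in particular, the \texttt{run\_case.py} script and its command-line interface are assumptions made for the example, and the launched script is expected to measure a single rule evaluation and print the elapsed time in seconds.
\begin{verbatim}
# Illustrative benchmark driver following the methodology above.
# run_case.py and its command-line interface are assumed for this
# example; the script is expected to evaluate the rule once, time the
# evaluation, and print the elapsed seconds.
import statistics
import subprocess

DATASET_SIZES = [10, 100, 1_000, 10_000, 100_000, 1_000_000]
REPETITIONS = 10  # the first repetition is a warm-up

def run_once(size):
    # a freshly started Python process for every single run
    out = subprocess.run(
        ["python", "run_case.py", "--rows", str(size)],
        check=True, capture_output=True, text=True)
    return float(out.stdout.strip())

for size in DATASET_SIZES:
    times = [run_once(size) for _ in range(REPETITIONS)]
    measured = times[1:]  # discard the warm-up run
    print(size, statistics.mean(measured), statistics.stdev(measured))
\end{verbatim}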
\section{Test Environment Setup}
% % Hardware Specifications: Describe the computer systems on which the tests are conducted, including processor speeds, memory, and network configurations if relevant.
% % Software Configuration: Detail the versions of Python, Java, and other relevant software tools or libraries used during the tests.
This section details the hardware and software specifications of the test environment to ensure that the performance results are reproducible and relevant to typical data engineering scenarios.
\subsection{Hardware Specifications}
The performance tests were conducted on an M1 MacBook Pro with the following specifications:
\begin{itemize}
\item Processor: Apple M1 10-core CPU
\item Memory: 32 GB
\item Operating System: macOS Sonoma 14.4.1
\end{itemize}
\subsection{Software Configuration}
\paragraph{Python Version} Python 3.10 is used for all Python-related tests.
\paragraph{Testing Frameworks} For execution time measurements, the \texttt{timeit} module is used.
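Inside the benchmarked process, a single execution can be timed with \texttt{timeit} roughly as follows; \texttt{evaluate\_rule} is a placeholder workload standing in for one full evaluation of a data quality rule, not the actual API of any of the compared tools.
\begin{verbatim}
# Illustrative per-run measurement inside the benchmarked process;
# evaluate_rule() is a placeholder, not the API of any compared tool.
import timeit

def evaluate_rule():
    # stand-in workload for one full evaluation of the data quality
    # rule over the loaded dataset
    return sum(i * i for i in range(1_000_000))

elapsed = timeit.timeit(evaluate_rule, number=1)
print(elapsed)
\end{verbatim}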
\section{Test Cases}
% % Selection Rationale: Explain why specific test cases were chosen to reflect real-world usage scenarios that data engineers might encounter.
% % Test Case Descriptions: Briefly introduce each test case before a detailed analysis.
\paragraph{Test Case Descriptions and Rationale}
The choice of test cases for evaluating the performance and viability of the reimplemented Ataccama Expression Language in Python is designed to reflect a range of real-world scenarios that data engineers commonly encounter. These test cases are selected to cover a spectrum of complexity, from relatively simple checks to more involved, multi-condition validations that interact with external data sources and complex logic. This selection ensures that the testing not only assesses basic functionality but also gauges the performance under more demanding processing conditions.
Two test cases were selected for the performance evaluation, representing different levels of complexity in data quality rule validation. This scope allows for a comprehensive assessment of the Python implementation's performance across a range of scenarios, from basic data validation to more intricate customer data checks. The test cases are designed to simulate common data quality tasks that data engineers encounter in their daily work, providing a realistic basis for evaluating the Python implementation's efficiency and effectiveness in handling these tasks.
Both of the test cases include enumerating the failed records, which is a common requirement in \glsxtrshort{dqm} tasks. This feature is essential for identifying and addressing data quality issues efficiently, making it a key aspect of the performance evaluation.
\subsection{Test Case 1: Simple Continent Validation}
\paragraph{Description} This test involves evaluating a relatively straightforward expression that checks if the value of a 'continent' field does not belong to a predefined list of continent names.
\paragraph{Relevance} This test case is chosen for its simplicity and its commonality in data validation tasks. It represents typical scenarios where fields within datasets are validated against a fixed set of allowable values; the simplicity of the case lies in its single condition. Testing this case helps verify the Python implementation's ability to handle basic inclusion checks efficiently, a frequent requirement in data cleaning and standardization processes. It also serves as a check of the system's ability to process simple expressions quickly and accurately, possibly revealing any performance overhead and bottlenecks in the compilation process itself.
\subsubsection{Ataccama Expression Language Implementation}
\begin{verbatim}
( NOT ( lower(continent) in { "asia", "africa", "europe",
    "north america", "south america", "oceania", "antarctica" } ) )
\end{verbatim}
\subsubsection{Great Expectations Implementation}
\begin{verbatim}
{
  "data_asset_type": "Dataset",
  "expectation_suite_name": "default",
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_be_in_set",
      "kwargs": {
        "column": "continent_lower",
        "mostly": 1,
        "value_set": [
          "asia",
          "africa",
          "europe",
          "north america",
          "south america",
          "oceania",
          "antarctica"
        ]
      },
      "meta": {}
    }
  ],
  "ge_cloud_id": null,
  "meta": {
    "great_expectations_version": "0.18.12"
  }
}
\end{verbatim}
\subsubsection{Soda Core Implementation}
\begin{verbatim}
checks for continents:
  - invalid_count(continent_lower) = 0:
      valid values: ["asia", "africa", "europe",
                     "north america", "south america",
                     "oceania", "antarctica"]
\end{verbatim}
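For reference, a minimal pandas sketch of the same check is shown below. It only illustrates the semantics of the rule and the enumeration of failed records; the benchmark itself runs the tool implementations listed above, and the input DataFrame is assumed to have a \texttt{continent} column, as in the expression.
\begin{verbatim}
import pandas as pd

CONTINENTS = {"asia", "africa", "europe", "north america",
              "south america", "oceania", "antarctica"}

# assumed input: a DataFrame with a `continent` column
df = pd.DataFrame({"continent": ["Asia", "Europe", "Atlantis", None]})

# a record fails the rule when its lowercased continent is not in the set
failed = df[~df["continent"].str.lower().isin(CONTINENTS)]
print(len(failed), "failed records")
\end{verbatim}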
\subsection{Test Case 2: Complex Customer Validation}
\paragraph{Description} This test case is more complex: its complexity lies in applying multiple conditions to validate customer data, involving null checks, file lookups, and regular expression pattern matching.
Below is the Ataccama Expression Language expression for the test case:
\begin{verbatim}
(customernumber IS NULL)
OR ( NOT ( isInFile(contactlastname, "surnames.lkp") ) )
OR (NOT isInFile(contactfirstname, "acc_first_names.lkp"))
OR (contactfirstname IS NULL)
OR ( NOT ( isInFile(contactfirstname, "first_names.lkp") ) )
OR (indexOf(upper(email), "NOREPLY") is not null)
OR ( NOT ( matches(@"^[^@]+@[A-z0-9._-]+\.+[A-z._-]+$", email) ) )
OR (email IS NULL)
OR (trim(email) is in {'NULL', 'Null', 'null', '.', ',',
'-', '_', '', 'N/A', 'n/a', 'na', 'NA'})
\end{verbatim}
The expression references three lookup files, each of which contains a list of valid values for the corresponding field, amounting to more than 1M rows in total.
The Great Expectations and Soda Core implementations are omitted for brevity, but they follow the same pattern as the simple continent validation test case.
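To make the semantics of the rule concrete, a rough pandas sketch of the failed-record condition is shown below. The column names follow the expression above, while the helper names and the assumption that each \texttt{.lkp} lookup file can be loaded as a plain set of values are illustrative only; the sketch is not the actual implementation of any of the compared tools.
\begin{verbatim}
import pandas as pd

PLACEHOLDERS = {"NULL", "Null", "null", ".", ",", "-", "_", "",
                "N/A", "n/a", "na", "NA"}
EMAIL_PATTERN = r"^[^@]+@[A-z0-9._-]+\.+[A-z._-]+$"

def load_lookup(path):
    # assumption: a lookup file is a plain list of values, one per line
    with open(path) as f:
        return {line.strip() for line in f}

def failed_mask(df, surnames, acc_first_names, first_names):
    # one boolean per record; True marks a failed record
    email = df["email"].fillna("")
    return (
        df["customernumber"].isna()
        | ~df["contactlastname"].isin(surnames)
        | ~df["contactfirstname"].isin(acc_first_names)
        | df["contactfirstname"].isna()
        | ~df["contactfirstname"].isin(first_names)
        | email.str.upper().str.contains("NOREPLY")
        | ~email.str.match(EMAIL_PATTERN)
        | df["email"].isna()
        | email.str.strip().isin(PLACEHOLDERS)
    )
\end{verbatim}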
\paragraph{Relevance} This test case is designed to simulate the complex validation processes often required in customer data management, where multiple fields need to be verified against various conditions. It tests the system’s capacity to execute multiple, diverse operations — from static reference data lookups to regular expressions and string manipulations — which are typical in scenarios involving data integration and compliance checks. It provides a robust challenge to the system, testing its performance and accuracy under load and complex logic conditions.
\section{Performance Analysis}
% Metrics Collected: List the metrics that will be evaluated, such as execution time, memory usage, and CPU utilization.
% Comparative Analysis: Present a detailed comparison of the performance metrics obtained from the Python implementation against those from the original Java version.
For the purpose of this performance analysis, the primary metric that will be evaluated is execution time. Execution time is chosen as the focus metric because it directly impacts the user experience and operational efficiency in real-world applications of the Python implementation of the Ataccama Expression Language. The efficiency with which the system processes data validations directly affects throughput and responsiveness, which are critical factors for data processing workflows.
\subsection{Expected Outcomes}
This analysis aims to validate that the Python implementation, while possibly slower than the original implementation and the similar solutions, remains within an acceptable range of performance efficiency. If the Python version performs within a factor of 10 to 20 times slower than the similar solutions, it may still be considered viable for scenarios where Python's ease of use, integration capabilities, and the provided feature set offer significant value over raw execution speed.
By clearly outlining and adhering to this analytical framework, the performance analysis will provide stakeholders with the critical information needed to make informed decisions about the viability and further development considerations of the Python implementation of the Ataccama Expression Language.
\subsection{Performance Results}
\subsubsection{Test Case 1: Simple Continent Validation}
Table \ref{tab:time-continents} shows the execution time comparison for test case n. 1: Continent Validation. The results are also visualized in Figure \ref{fig:time-comparison-continents}.
\begin{table}[h]
\footnotesize
\centering
\caption{Execution Time Comparison for test case n. 1: Continent Validation}
\label{tab:time-continents}
\input{result-analysis/plots/execution_time_comparison_continents}
\end{table}
\begin{figure}[htbp]
\centering
\includegraphics[width=1.0\columnwidth]{result-analysis/plots/execution_time_comparison_continents.png}
\caption{Execution Time Comparison for test case n. 1: Continent Validation}
\label{fig:time-comparison-continents}
\end{figure}
More important than the absolute execution time is the performance relative to the similar solutions. Table \ref{tab:time-continents-relative} shows the execution times of the similar solutions relative to our solution for test case n. 1: Continent Validation. The results are also visualized in Figure \ref{fig:time-comparison-continents-relative}.
\begin{table}[h]
\centering
\caption{Execution times in test case n. 1 of similar solutions relative to the Python implementation}
\label{tab:time-continents-relative}
\input{result-analysis/plots/relative_time_comparison_continents.tex}
\end{table}
\begin{figure}[htbp]
\centering
\includegraphics[width=1.0\columnwidth]{result-analysis/plots/relative_speedup_comparison_continents.png}
\caption{Execution times in test case n. 1 of similar solutions relative to the Python implementation}
\label{fig:time-comparison-continents-relative}
\end{figure}
In this simple test case, even for 10M records, the Python implementation stays within an acceptable range of performance efficiency compared to the similar solutions. Its execution time is longer than that of both Soda Core and Great Expectations, but the difference is no more than 4x.
\subsubsection{Test Case 2: Complex Customer Validation}
Table \ref{tab:time-customers} shows the execution time comparison for test case n. 2: Customer Validation. The results are also visualized in Figure \ref{fig:time-comparison-customers}.
\begin{table}[h]
\footnotesize
\centering
\caption{Execution time comparison for test case n. 2: Customer Validation}
\label{tab:time-customers}
\input{result-analysis/plots/execution_time_comparison_customers}
\end{table}
\begin{figure}[htbp]
\centering
\includegraphics[width=1.0\columnwidth]{result-analysis/plots/execution_time_comparison_customers.png}
\caption{Execution time comparison for test case n. 2: Customer Validation}
\label{fig:time-comparison-customers}
\end{figure}
More important than the absolute execution time is the performance relative to the similar solutions. Table \ref{tab:time-customers-relative} shows the execution times of the similar solutions relative to our solution for test case n. 2: Customer Validation. The results are also visualized in Figure \ref{fig:time-comparison-customers-relative}.
\begin{table}[h]
\centering
\caption{Execution times in test case n. 2 of similar solutions relative to the Python implementation}
\label{tab:time-customers-relative}
\input{result-analysis/plots/relative_time_comparison_customers.tex}
\end{table}
\begin{figure}[htbp]
\centering
\includegraphics[width=1.0\columnwidth]{result-analysis/plots/relative_speedup_comparison_customers.png}
\caption{Execution times in test case n. 2 of similar solutions relative to the Python implementation}
\label{fig:time-comparison-customers-relative}
\end{figure}
In the more complex test case involving reference data lookups, even for 1M records, the Python implementation stays within an acceptable range of performance efficiency compared to the similar solutions. Its execution time is longer than that of both Soda Core and Great Expectations, but the difference is below 10x.
\section{Discussion}
The performance analysis of the Python implementation of the Ataccama Expression Language shows that the system falls within an acceptable range of performance efficiency for real-world applications. The Python version, while up to 6 times slower on 1M records than the similar tested solutions Soda Core and Great Expectations, remains competitive in terms of execution time, even for large datasets. The results indicate that the Python implementation can handle data quality rules efficiently and effectively, making it a viable alternative for data engineers who require programmatic access to Ataccama's data quality tools.
The performance gap is acceptable, especially considering the more comprehensive feature set the implementation provides compared to the other two solutions and its pluggability with data quality rules defined in Ataccama.
The Python implementation's performance is likely to improve further with optimizations such as parallel processing, caching, and JIT compilation, which could help narrow the gap with the similar solutions. These optimizations can be explored in future development of the Python implementation, but gains could also come from improvements in the Python interpreter itself.
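As an illustration of one such direction, the following sketch shows how the evaluation of a rule could be parallelized over chunks of records using only the standard library; \texttt{evaluate\_chunk} is a hypothetical stand-in for evaluating the compiled expression over a batch of records and is not part of the current implementation.
\begin{verbatim}
from concurrent.futures import ProcessPoolExecutor

def evaluate_chunk(chunk):
    # hypothetical stand-in: evaluate the compiled expression over a
    # chunk of records and return the failed ones
    return [record for record in chunk if record.get("email") is None]

def evaluate_parallel(records, workers=4, chunk_size=100_000):
    # split the dataset into chunks and evaluate them in worker processes
    chunks = [records[i:i + chunk_size]
              for i in range(0, len(records), chunk_size)]
    failed = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for part in pool.map(evaluate_chunk, chunks):
            failed.extend(part)
    return failed
\end{verbatim}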