To modify existing Great Expectations code, you complete the following tasks:
To discuss your code change before you implement it, join the Great Expectations Slack community and make your suggestion in the #contributing channel.
To request a documentation change, or a change that doesn't require local testing, see the README in the docs
repository.
To create and submit a Custom Expectation to Great Expectations for consideration, see CONTRIBUTING_EXPECTATIONS in the great_expectations
repository.
To submit a custom package to Great Expectations for consideration, see CONTRIBUTING_PACKAGES in the great_expectations
repository.
-
A GitHub account.
-
A working version of Git on your computer. See Getting Started - Installing Git.
-
A new SSH (Secure Shell Protocol) key. See Generating a new SSH key and adding it to the ssh-agent.
-
The latest Python version installed and configured. See Python downloads.
-
Open a browser and go to the Great Expectations repository.
-
Click Fork and then Create Fork.
-
Click Code and then select the HTTPS or SSH tabs.
-
Copy the URL, open a Git terminal, and then run the following command:
git clone <url>
-
Run the following command to specify a new remote upstream repository that will be synced with the fork:
git remote add upstream git@github.com:great-expectations/great_expectations.git
-
Run the following command to create a branch for your changes:
git checkout -b <branch-name>
A virtual environment allows you to install an independent set of Python packages to their own site directory, isolated from the base/system install of Python.
Great Expectations requires a Python version from 3.8 to 3.11.
-
Run the following command to create a virtual environment named
gx_dev
:python3 -m venv <path_to_environments_folder>/gx_dev
-
Run the following command to activate the virtual environment:
source <path_to_environments_folder>/gx_dev/bin/activate
-
Run the following command to upgrade pip, the "package installer" for Python:
pip install --upgrade pip
-
Run the following command to create a virtual environment named
gx_dev
:conda create --name gx_dev
-
Run the following command to activate the virtual environment:
conda activate gx_dev
-
Run the following command from the root of the
great_expectations
repository to install Great Expectations in editable mode, with extra requirements for testing:pip install -c constraints-dev.txt -e ".[test]"
To specify other dependencies, add a comma after
test
and enter the dependency name(s). For example, ".[test, postgresql, trino]".The supported extra dependencies include:
arrow
,athena
,aws_secrets
,azure
,azure_secrets
,bigquery
,clickhouse
,cloud
,dremio
,excel
,gcp
,hive
,mssql
,mysql
,pagerduty
,postgresql
,redshift
,s3
,snowflake
,spark
,teradata
,test
,trino
,vertica
.Check below to see if any of your desired dependencies need system packages installed, before
pip install
. -
Optional. If you're using Amazon Redshift (
redshift
) or PostgreSQL (postgresql
), run one of the following commands to install thepg_config
executable, which is required to install thepsycopg2-binary
Python package:sudo apt-get install -y libpq-dev
or
brew install postgresql
-
Optional. If you're using Microsoft SQL Server (
mssql
) or Dremio (dremio
), run one of the following commands to installunixodbc
, which is required to install thepyodbc
Python package:sudo apt-get install -y unixodbc-dev
or
brew install unixodbc
If your Mac computer has an Apple Silicon chip, you might need to
- specify additional compiler or linker options. For example:
export LDFLAGS="-L/opt/homebrew/Cellar/unixodbc/[your version]/lib" export CPPFLAGS="-I/opt/homebrew/Cellar/unixodbc/[your version]/include"`
- reinstall pyodbc:
python -m pip install --force-reinstall --no-binary :all: pyodbc python -c "import pyodbc; print(pyodbc.version)"
-
install the ODBC 17 driver: https://learn.microsoft.com/en-us/sql/connect/odbc/linux-mac/install-microsoft-odbc-driver-sql-server-macos?view=sql-server-ver15
-
see the following resources for more information:
-
Add
ulimit -n 4096
to the~/.zshrc
or~/.bashrc
files to preventOSError: [Errno 24] Too many open files
errors. -
Run the following command to confirm pandas and SQLAlchemy with SQLite tests are passing:
ulimit -n 4096 pytest -v
Some Great Expectations features require specific backends for local testing.
- Docker (PostgreSQL and MySQL). See Get Docker.
-
CD to
assets/docker/postgresql
in yourgreat_expectations
repository, and then and run the following command:docker-compose up -d
-
Run the following command to verify the PostgreSQL container is running:
docker-compose ps
The command should return results similar to the following example:
Name Command State Ports ——————————————————————————————————————————— postgresql_travis_db_1 docker-entrypoint.sh postgres Up 0.0.0.0:5432->5432/tcp
-
Run the following command to run tests on the PostgreSQL container:
pytest -v --postgresql
-
When you finish testing, run the following command to shut down the PostgreSQL container:
docker-compose down
Errors similar to the following are returned when you try to start the PostgreSQL container and another service is using port 5432:
psycopg2.OperationalError: could not connect to server: Connection refused
Is the server running on host "localhost" (::1) and accepting
TCP/IP connections on port 5432?
could not connect to server: Connection refused
Is the server running on host "localhost" (127.0.0.1) and accepting
TCP/IP connections on port 5432?
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL: database "test_ci" does not exist
(Background on this error at: http://sqlalche.me/e/e3q8)
To resolve these errors, configure Docker to run on another port and confirm the server details are correct.
If another service is using port 3306, Docker might start the container but silently fail to set up the port.
-
CD to
assets/docker/mysql
in yourgreat_expectations
repository, and then and run the following command:docker-compose up -d
-
Run the following command to verify the MySQL container is running:
docker-compose ps
The command should return results similar to the following example:
Name Command State Ports ------------------------------------------------------------------------------------------ mysql_mysql_db_1 docker-entrypoint.sh mysqld Up 0.0.0.0:3306->3306/tcp, 33060/tcp
-
Run the following command to run tests on the MySQL container:
pytest -v --mysql
-
When you finish testing, run the following command to shut down the MySQL container:
docker-compose down
Use the following information to use Spark for code testing.
-
Java. See Java downloads.
-
The PATH or JAVA_HOME environment variables set correctly. See Setting Java variables in Windows or Setting Java variables in Linux.
On Mac, run the following commands to set the PATH and JAVA_HOME environment variables:
export JAVA_HOME=`/usr/libexec/java_home`
export PATH=$PATH:$JAVA_HOME/bin
When you install PySpark, Spark is also installed. See Spark Overview.
Run the following command to install PySpark and Apache Spark:
pip install pyspark
Great Expectations production code must be thoroughly tested, and you must perform unit testing on all branches of every method, including likely error states. Most new feature contributions should include multiple unit tests. Contributions that modify or extend existing features should include a test of the new behavior.
Most contributions do not require new integration tests, unless they change the Great Expectations CLI.
Great Expectations code is not tested against all SQL database types. Continuous Integration (CI) testing for SQL is limited to PostgreSQL, SQLite, MS SQL, and BigQuery.
To perform unit testing, run pytest
in the great_expectations
directory root. By default, tests are run against pandas
and sqlite
. You can use pytest
flags to test additional backends like postgresql
, spark
, and mssql
. For example, to run a test against PostgreSQL backend, you run pytest --postgresql
.
The following are the supported pytest
flags for general testing:
--spark
: Execute tests against Spark backend.--postgresql
: Execute tests against PostgreSQL.--mysql
: Execute tests against MySql.--mssql
: Execute tests against Microsoft SQL Server.--bigquery
: Execute tests against Google BigQuery (requires additional set up).--aws
: Execute tests against AWS resources such as Amazon S3, Amazon Redshift, and Athena (requires additional setup).
To skip all local backend tests (except pandas), run pytest --no-sqlalchemy
.
Testing can generate warning messages. These warnings are often caused by dependencies such as pandas or SQLAlchemy. Run pytest --no-sqlalchemy --disable-pytest-warnings
to suppress these warnings.
-
In your project, create a BigQuery dataset named
test_ci
and set the dataset default table expiration to.1
day. -
Run the following command to test your project with the
GE_TEST_BIGQUERY_PROJECT
andGE_TEST_BIGQUERY_DATASET
environment variables:GE_TEST_BIGQUERY_PROJECT=<YOUR_GOOGLE_CLOUD_PROJECT> GE_TEST_BIGQUERY_DATASET=test_ci pytest tests/test_definitions/test_expectations_cfe.py --bigquery
One of the most significant features of an Expectation is that it produces the same result on all supported execution environments including pandas, SQLAlchemy, and Spark. To accomplish this, Great Expectations encapsulates unit tests for Expectations as JSON files. These files are used as fixtures and executed using a specialized test runner that executes tests against all execution environments.
The following is the test fixture file structure:
{
"expectation_type" : "expect_column_max_to_be_between",
"datasets" : [{
"data" : {...},
"schemas" : {...},
"tests" : [...]
}]
}
Below datasets
are three entries: data
, schemas
, and tests
.
The data
parameter defines a DataFrame of sample data to apply Expectations against. The DataFrame is defined as a dictionary of lists, with keys containing column names and values containing lists of data entries. All lists within a dataset must have the same length. For example:
"data" : {
"w" : [1, 2, 3, 4, 5, 5, 4, 3, 2, 1],
"x" : [2, 3, 4, 5, 6, 7, 8, 9, null, null],
"y" : [1, 1, 1, 2, 2, 2, 3, 3, 3, 4],
"z" : ["a", "b", "c", "d", "e", null, null, null, null, null],
"zz" : ["1/1/2016", "1/2/2016", "2/2/2016", "2/2/2016", "3/1/2016", "2/1/2017", null, null, null, null],
"a" : [null, 0, null, null, 1, null, null, 2, null, null],
},
The schema
parameter defines the types to be used when instantiating tests against different execution environments, including different SQL dialects. Each schema is defined as a dictionary with column names and types as key-value pairs. If the schema isn’t specified for a given execution environment, Great Expectations introspects values and attempts to identify the schema. For example:
"schemas": {
"sqlite": {
"w" : "INTEGER",
"x" : "INTEGER",
"y" : "INTEGER",
"z" : "VARCHAR",
"zz" : "DATETIME",
"a" : "INTEGER",
},
"postgresql": {
"w" : "INTEGER",
"x" : "INTEGER",
"y" : "INTEGER",
"z" : "TEXT",
"zz" : "TIMESTAMP",
"a" : "INTEGER",
}
},
The tests
parameter defines the tests to be executed against the DataFrame. Each item in tests
must include title
, exact_match_out
, in
, and out
. The test runner executes the named Expectation once for each item, with the values in in
supplied as kwargs.
The test passes if the values in the expectation Validation Result correspond with the values in out
. If exact_match_out
is true, then every field in the Expectation output must have a corresponding, matching field in out
. If it’s false, then only the fields specified in out
need to match. For most use cases, false is a better result, because it allows narrower targeting of the relevant output.
suppress_test_for
is an optional parameter to disable an Expectation for a specific list of backends. For example:
"tests" : [{
"title": "Basic negative test case",
"exact_match_out" : false,
"in": {
"column": "w",
"result_format": "BASIC",
"min_value": null,
"max_value": 4
},
"out": {
"success": false,
"observed_value": 5
},
"suppress_test_for": ["sqlite"]
},
...
]
The test fixture files are stored in subdirectories of tests/test_definitions/
corresponding to the class of Expectation:
- column_map_expectations
- column_aggregate_expectations
- column_pair_map_expectations
- column_distributional_expectations
- multicolumn_map_expectations
- other_expectations
By convention, the name of the file is the name of the Expectation, with a .json suffix. Creating a new JSON file automatically adds the new Expectation tests to the test suite.
If you are implementing a new Expectation, but don’t plan to immediately implement it for all execution environments, you should add the new test to the appropriate lists in the candidate_test_is_on_temporary_notimplemented_list_v2_api
method within tests/test_utils.py
.
You can run just the Expectation tests with pytest tests/test_definitions/test_expectations.py
.
Test the performance of code changes to determine they perform as expected. BigQuery is required to complete performance testing.
-
Run the following command to set up the data for testing:
GE_TEST_BIGQUERY_PEFORMANCE_DATASET=<YOUR_GCP_PROJECT> tests/performance/setup_bigquery_tables_for_performance_test.sh
-
Run the following command to start the performance test:
pytest tests/performance/test_bigquery_benchmarks.py \ --bigquery --performance-tests \ -k 'test_taxi_trips_benchmark[1-True-V3]' \ --benchmark-json=tests/performance/results/`date "+%H%M"`_${USER}.json \ -rP -vv
Some benchmarks take significant time to complete. In the previous example, only the
test_taxi_trips_benchmark[1-True-V3]
benchmark runs. The output should appear similar to the following:--------------------------------------------------- benchmark: 1 tests ------------------------------------------------------ Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations ----------------------------------------------------------------------------------------------------------------------------- test_taxi_trips_benchmark[1-True-V3] 5.0488 5.0488 5.0488 0.0000 5.0488 0.0000 0;0 0.1981 1 1 -----------------------------------------------------------------------------------------------------------------------------
-
Run the following command to compare the test results:
$ py.test-benchmark compare --group-by name tests/performance/results/initial_baseline.json tests/performance/results/*${USER}.json
The output should appear similar to the following:
---------------------------------------------------------------------------- benchmark 'test_taxi_trips_benchmark[1-True-V3]': 2 tests --------------------------------------------------------------------------- Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ test_taxi_trips_benchmark[1-True-V3] (initial_base) 5.0488 (1.0) 5.0488 (1.0) 5.0488 (1.0) 0.0000 (1.0) 5.0488 (1.0) 0.0000 (1.0) 0;0 0.1981 (1.0) 1 1 test_taxi_trips_benchmark[1-True-V3] (2114_work) 6.4675 (1.28) 6.4675 (1.28) 6.4675 (1.28) 0.0000 (1.0) 6.4675 (1.28) 0.0000 (1.0) 0;0 0.1546 (0.78) 1 1 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-
Optional. If your change is intended to improve performance, run the following command to generate the benchmark results that confirm the performance improvement:
$ tests/performance/run_benchmark_multiple_times.sh minimal_multithreading
The name for the tests should include the first argument provided to the script. In the previous example, this was
tests/performance/results/minimal_multithreading_*.json
.
-
Push your changes to the remote fork of your repository.
-
Create a pull request from your fork. See Creating a pull request from a fork.
-
Add a meaningful title and description for your pull request (PR). Provide a detailed explanation of what you changed and why. To help identify the type of issue you’re submitting, add one of the following identifiers to the pull request (PR) title:
-
[BUGFIX] for PRs that address minor bugs without changing behavior.
-
[FEATURE] for significant PRs that add a new feature likely to require being added to our feature maturity matrix.
-
[MAINTENANCE] for PRs that focus on updating repository settings or related changes.
-
[CONTRIB] for the contribution of Custom Expectations and supporting work into the
contrib/
directory. -
[HACKATHON] for submissions to an active Great Expectations Hackathon.
In the section for design review, include a description of any prior discussion or coordination on the features in the PR, such as mentioning the number of the issue where discussion has taken place. For example: Closes #123”, linking to a relevant discuss or slack article, citing a team meeting, or even noting that no discussion is relevant because the issue is small.
-
-
If this is your first Great Expectations contribution, you'll be prompted to complete the Contributor License Agreement (CLA). Complete the CLA and add
@cla-bot check
as a comment to the pull request (PR) to indicate that you’ve completed it. -
Wait for the Continuous Integration (CI) checks to complete and then correct any syntax or formatting issues.
A Great Expectations team member reviews, approves, and merges your PR. Depending on your GitHub notification settings, you'll be notified when there are comments or when your changes are successfully merged.
Great Expectations uses a stalebot
to automatically tag issues without activity as stale
, and closes them when a response is not received within a week. To prevent stalebot
from closing an issue, you can add the stalebot-exempt
tag.
Additionally, Great Expectations adds the following tags to indicate issue status:
-
The
help wanted
tag identifies useful issues that require help from community contributors to accelerate development. -
The
enhacement
andexpectation-request
tags identify new Great Expectations features that require additional investigation and discussion. -
The
good first issue
tag identifies issues that provide an introduction to the Great Expectations contribution process.