feat(datasets): Implement `spark.GBQQueryDataset` for reading data from BigQuery as a Spark DataFrame using a SQL query (#971)
base: main
Conversation
So this is a great first start - the credentials resolution looks complicated, but also well thought out. I think we'd need to see some tests for this to go in as is; you can take some inspiration from the pandas equivalent. That being said, we could also look at contributing this to the experimental part of `kedro-datasets`.
Thank you @abhi8893, this is very promising! We will look at this soon.
Thanks @datajoely, @astrojuanlu!

**Credentials Handling**

Yes, the credentials handling is a bit different than for the rest of the datasets. Earlier, I had been reading BigQuery tables with `spark.SparkDataset` like this:

```yaml
my_dataset:
  type: spark.SparkDataset
  file_format: bigquery
  filepath: /tmp/my_table.parquet  # just because it's a non-optional arg of type `str`
  load_args:
    table: "<my_project_id>.<my_dataset>.<my_table>"
```

with credentials supplied either through the environment variable

```bash
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
```

or through the Spark configuration:

```yaml
spark.hadoop.google.cloud.auth.service.account.enable: true
spark.hadoop.google.cloud.auth.service.account.json.keyfile: /path/to/credentials.json
```

With the above dataset in mind, I wanted to allow passing credentials directly to the dataset. But it seems we may have to standardize this a little bit across the other Kedro datasets for the GCP case.

**Implementing tests**

Let me take a look at how tests can be implemented for this. Initial thoughts: since this doesn't involve a BigQuery client, the method of mocking used in the pandas equivalent may not carry over directly.

**Moving to experimental**

For moving this to experimental, let me know and I'll lift and shift it there.
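As a side note, the two `spark.hadoop.google.cloud.auth.*` settings shown above can equivalently be set on the `SparkSession` builder; a minimal sketch (the keyfile path is a placeholder):

```python
# Minimal sketch: setting the GCS service-account auth options programmatically
# instead of in the Spark configuration file. The keyfile path is a placeholder.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config(
        "spark.hadoop.google.cloud.auth.service.account.json.keyfile",
        "/path/to/credentials.json",
    )
    .getOrCreate()
)
```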
Added some tests. Related to credentials: one thing that isn't implemented here is reading the SQL query from a text file, which may be placed anywhere, hence reading the SQL text file itself may require credentials. This is already implemented in other datasets. So this could necessitate providing two types of credentials? Or maybe just one could suffice, and use it to authenticate with BigQuery as well as any storage backend?
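To make the "two types of credentials" idea concrete, one hypothetical shape for such a configuration could look like the sketch below; the key names are purely illustrative and not the PR's actual API.

```python
# Hypothetical credentials layout (illustrative only): one entry for the
# Spark BigQuery connector and one for the filesystem holding the .sql file.
credentials = {
    "bigquery": {"file": "/path/to/service-account.json"},
    "filesystem": {"token": "/path/to/service-account.json"},
}
```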
So, now I have implemented passing separate credentials for the BigQuery read and for the storage backend that holds the SQL file. I feel the credentials handling is a bit unstandardized across datasets; however, I have done what suits this case best. I also found a few Kedro discussions on credentials that touch on what makes this tricky.

Happy to contribute to standardizing credentials handling in Kedro as we move forward. However, I do realise this would require quite a bit of research (and user surveys) to perfect.
Hi @abhi8893, thank you for this contribution! I haven't used BigQuery myself, but generally the code looks fine to me. There are still some failing builds, which will need to pass before we can put this up for a vote. Otherwise, like @datajoely suggested, this could be contributed as experimental, meaning the build checks are less strict.
@merelcht Finally got around to picking this up again. The test coverage is now 100% and all tests pass. If the Kedro team votes to move this to experimental, I'll shift it there 🙂
Pull Request Overview
This pull request introduces a new read‐only dataset, GBQQueryDataset, to load data from Google BigQuery into Spark DataFrames using a SQL query. Key changes include the implementation of the dataset class with support for reading a SQL query from a string or file, handling various BigQuery credential formats, and adding comprehensive tests.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `kedro_datasets/spark/spark_gbq_dataset.py` | Added the implementation of `GBQQueryDataset` with error handling and credential processing |
| `kedro_datasets/spark/__init__.py` | Registered the new `GBQQueryDataset` in the spark module |
| `tests/spark/test_spark_gbq_dataset.py` | Added tests covering dataset initialization, credential handling, load behavior, and error conditions |
Comments suppressed due to low confidence (2)
**kedro-datasets/kedro_datasets/spark/spark_gbq_dataset.py:86**

- The docstring refers to `SparkGBQDataSet` while the class is named `GBQQueryDataset`. Please update the docstring for consistency.

```python
    Creates a new instance of ``SparkGBQDataSet`` pointing to a specific table in Google BigQuery.
```

**kedro-datasets/kedro_datasets/spark/spark_gbq_dataset.py:126**

- Similarly, the concatenated error message strings here are missing a space between sentences. Adding a space would improve readability.

```python
        raise DatasetError(
            "'sql' and 'filepath' arguments cannot both be empty."
            "Please provide a sql query or path to a sql query file."
        )
```
""" | ||
if sql and filepath: | ||
raise DatasetError( | ||
"'sql' and 'filepath' arguments cannot both be provided." |
The concatenated error message strings lack a space between sentences, leading to unclear output. Consider adding a space at the end of the first string.
"'sql' and 'filepath' arguments cannot both be provided." | |
"'sql' and 'filepath' arguments cannot both be provided. " |
I left a couple more minor comments, but otherwise this looks good 👍
Approving with one caveat: I haven't been able to test this on GCP, but the code is easy to follow and written cleanly.
```python
        >>> df.show()
    """

    _VALID_CREDENTIALS_KEYS = {"base64", "file", "json"}
```
I don't see this used anywhere; what is it for?
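One guess at what a constant like this could be used for (not taken from the PR itself) is validating the keys of the credentials mapping up front:

```python
# Sketch only: reject unexpected credentials keys early with a clear error.
_VALID_CREDENTIALS_KEYS = {"base64", "file", "json"}


def validate_credentials_keys(credentials: dict) -> None:
    unknown = set(credentials) - _VALID_CREDENTIALS_KEYS
    if unknown:
        raise ValueError(
            f"Unsupported credentials key(s) {sorted(unknown)}; "
            f"expected one of {sorted(_VALID_CREDENTIALS_KEYS)}."
        )
```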
```python
            self._sql = sql
            self._filepath = None
        else:
            # TODO: Add protocol specific handling cases for different filesystems.
```
There's still a TODO here.
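For what it's worth, one possible direction for that TODO (purely a sketch; passing `credentials` as fsspec storage options is an assumption, not the PR's implementation) would be to delegate protocol handling to fsspec:

```python
# Sketch only: resolve the SQL file with fsspec so local, gs://, s3:// and
# other filesystems are handled uniformly.
from typing import Optional

import fsspec


def load_sql_from_filepath(filepath: str, credentials: Optional[dict] = None) -> str:
    with fsspec.open(filepath, mode="r", **(credentials or {})) as f:
        return f.read()
```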
`load_args={"credentialsFile": "/path/to/your/credentials.json"}` | ||
|
||
When passing as a json object: | ||
NOT SUPPORTED |
It says here that JSON credentials are not supported, but the code below does accept `json`. Is this meant to be updated?
```python
        self._metadata = metadata

    def _get_spark_bq_credentials(self) -> dict[str, str]:
```
Sorry, I'm not super familiar with Google BigQuery, but I'm wondering why the credential handling is different from `pandas.GBQTableDataset`? Would the same way not work?
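A plausible reason (an assumption, not confirmed in this PR) is that the Spark BigQuery connector takes credentials as plain reader options, `credentials` (base64-encoded service-account JSON) or `credentialsFile` (a path), rather than as google-auth objects like the pandas dataset does. A rough sketch of such a mapping:

```python
# Rough sketch only (not necessarily the PR's _get_spark_bq_credentials):
# map the three accepted credential forms onto Spark BigQuery read options.
import base64
import json


def spark_bq_credentials(credentials: dict) -> dict:
    if "base64" in credentials:
        return {"credentials": credentials["base64"]}
    if "file" in credentials:
        return {"credentialsFile": credentials["file"]}
    if "json" in credentials:
        encoded = base64.b64encode(json.dumps(credentials["json"]).encode("utf-8"))
        return {"credentials": encoded.decode("utf-8")}
    return {}
```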
Description

- `spark.SparkDataset` does not support reading data from BigQuery using a SQL query.
- Extending `spark.SparkDataset` to support this may not comply with `kedro_datasets` design principles, as it requires making `filepath` an optional argument.
- Like `pandas.GBQQueryDataset`, the `spark.GBQQueryDataset` is also a read-only dataset, hence it is a better-suited implementation for maintaining the overall design of datasets.

Development notes
To test the dataset manually:

1. Set up a GCP project (`project_id`).
2. Create a BigQuery dataset `<project_id>.<test_dataset>`.
3. Create a materialization dataset `<project_id>.<test_mat_dataset>` (used to materialize query results).
4. Create a test table `<project_id>.<test_dataset>.<test_table>` with some data.
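For reference, a minimal usage sketch for such a manual test is shown below; the constructor arguments (`sql`, `materialization_dataset`, `credentials`) are assumed from the discussion in this PR and may not match the final API exactly.

```python
# Manual-test sketch; argument names are assumed from the PR discussion and
# may differ from the merged implementation.
from kedro_datasets.spark import GBQQueryDataset

dataset = GBQQueryDataset(
    sql="SELECT * FROM `<project_id>.<test_dataset>.<test_table>` LIMIT 10",
    materialization_dataset="<test_mat_dataset>",  # BigQuery dataset used to materialize query results
    credentials={"file": "/path/to/credentials.json"},
)

df = dataset.load()  # returns a pyspark.sql.DataFrame
df.show()
```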
Checklist

- Updated `jsonschema/kedro-catalog-X.XX.json` if necessary
- Added a description of this change in the relevant `RELEASE.md` file