feat(datasets): Implement `spark.GBQQueryDataset` for reading data from BigQuery as a Spark DataFrame using a SQL query (#971)
base: main
Conversation
So this is a great first start - the credentials resolution looks complicated, but also well thought out. I think we'd need to see some tests for this to go in as is; you can take some inspiration from the pandas equivalent. That being said, we could also look at contributing this to the experimental part of `kedro-datasets`.
Thank you @abhi8893, this is very promising! We will look at this soon.
Thanks @datajoely, @astrojuanlu!

**Credentials Handling**

Yes, the credentials handling is a bit different than for the rest of the datasets. Earlier, I had been reading BigQuery tables with `spark.SparkDataset` like this:

```yaml
my_dataset:
  type: spark.SparkDataset
  file_format: bigquery
  filepath: /tmp/my_table.parquet  # just because it's a non-optional arg of type `str`
  load_args:
    table: "<my_project_id>.<my_dataset>.<my_table>"
```

with credentials supplied either through the environment variable

```bash
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
```

or through the Spark configuration:

```yaml
spark.hadoop.google.cloud.auth.service.account.enable: true
spark.hadoop.google.cloud.auth.service.account.json.keyfile: /path/to/credentials.json
```

With the above dataset in mind, I wanted to allow passing credentials directly to the dataset. But it seems we may have to standardize this a little bit across the other Kedro datasets for the GCP case.

**Implementing tests**

Let me take a look at how tests can be implemented for this. Initial thoughts: since this doesn't involve a BigQuery client, the method of mocking used in the pandas equivalent may not carry over directly.

**Moving to experimental**

For moving this to experimental, let me know and I'll lift and shift it there.
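As a side note, the two `spark.hadoop.google.cloud.auth.*` settings shown above can equivalently be set on the `SparkSession` builder; a minimal sketch (the keyfile path is a placeholder):

```python
# Minimal sketch: setting the GCS service-account auth options programmatically
# instead of in the Spark configuration file. The keyfile path is a placeholder.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config(
        "spark.hadoop.google.cloud.auth.service.account.json.keyfile",
        "/path/to/credentials.json",
    )
    .getOrCreate()
)
```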
Added some tests. Related to credentials: one thing that isn't implemented here is reading the SQL query from a text file, which may be placed anywhere, hence reading the SQL text file itself may require credentials. This is already implemented in other datasets. So this could necessitate providing two types of credentials? Or maybe just one could suffice, and use it to authenticate with BigQuery as well as any storage backend?
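To make the "two types of credentials" idea concrete, one hypothetical shape for such a configuration could look like the sketch below; the key names are purely illustrative and not the PR's actual API.

```python
# Hypothetical credentials layout (illustrative only): one entry for the
# Spark BigQuery connector and one for the filesystem holding the .sql file.
credentials = {
    "bigquery": {"file": "/path/to/service-account.json"},
    "filesystem": {"token": "/path/to/service-account.json"},
}
```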
So, now I have implemented passing separate credentials for the BigQuery read and for the storage backend that holds the SQL file. I feel the credentials handling is a bit unstandardized across datasets; however, I have done what suits this case best. I also found a few Kedro discussions on credentials that touch on what makes this tricky.

Happy to contribute to standardizing credentials handling in Kedro as we move forward. However, I do realise this would require quite a bit of research (and user surveys) to perfect.
Hi @abhi8893, thank you for this contribution! I haven't used BigQuery myself, but generally the code looks fine to me. There are still some failing builds, which will need to pass before we can put this up for a vote. Otherwise, like @datajoely suggested, this could be contributed as experimental, meaning the build checks are less strict.
@merelcht Finally got around to picking this up again. The test coverage is now 100% and all tests pass. If the Kedro team votes to move this to experimental, I'll shift it there 🙂
Pull Request Overview
This pull request introduces a new read‐only dataset, GBQQueryDataset, to load data from Google BigQuery into Spark DataFrames using a SQL query. Key changes include the implementation of the dataset class with support for reading a SQL query from a string or file, handling various BigQuery credential formats, and adding comprehensive tests.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `kedro_datasets/spark/spark_gbq_dataset.py` | Added the implementation of `GBQQueryDataset` with error handling and credential processing |
| `kedro_datasets/spark/__init__.py` | Registered the new `GBQQueryDataset` in the spark module |
| `tests/spark/test_spark_gbq_dataset.py` | Added tests covering dataset initialization, credential handling, load behavior, and error conditions |
Comments suppressed due to low confidence (2)
**kedro-datasets/kedro_datasets/spark/spark_gbq_dataset.py:86**

- The docstring refers to `SparkGBQDataSet` while the class is named `GBQQueryDataset`. Please update the docstring for consistency.

```python
    Creates a new instance of ``SparkGBQDataSet`` pointing to a specific table in Google BigQuery.
```

**kedro-datasets/kedro_datasets/spark/spark_gbq_dataset.py:126**

- Similarly, the concatenated error message strings here are missing a space between sentences. Adding a space would improve readability.

```python
        raise DatasetError(
            "'sql' and 'filepath' arguments cannot both be empty."
            "Please provide a sql query or path to a sql query file."
        )
```
""" | ||
if sql and filepath: | ||
raise DatasetError( | ||
"'sql' and 'filepath' arguments cannot both be provided." |
The concatenated error message strings lack a space between sentences, leading to unclear output. Consider adding a space at the end of the first string.
"'sql' and 'filepath' arguments cannot both be provided." | |
"'sql' and 'filepath' arguments cannot both be provided. " |
I left a couple more minor comments, but otherwise this looks good 👍
Approving with one caveat: I haven't been able to test this on GCP, but the code is easy to follow and written cleanly.
```python
        >>> df.show()
    """

    _VALID_CREDENTIALS_KEYS = {"base64", "file", "json"}
```
I don't see this used anywhere; what is it for?
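One guess at what a constant like this could be used for (not taken from the PR itself) is validating the keys of the credentials mapping up front:

```python
# Sketch only: reject unexpected credentials keys early with a clear error.
_VALID_CREDENTIALS_KEYS = {"base64", "file", "json"}


def validate_credentials_keys(credentials: dict) -> None:
    unknown = set(credentials) - _VALID_CREDENTIALS_KEYS
    if unknown:
        raise ValueError(
            f"Unsupported credentials key(s) {sorted(unknown)}; "
            f"expected one of {sorted(_VALID_CREDENTIALS_KEYS)}."
        )
```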
```python
            self._sql = sql
            self._filepath = None
        else:
            # TODO: Add protocol specific handling cases for different filesystems.
```
There's still a TODO here.
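For what it's worth, one possible direction for that TODO (purely a sketch; passing `credentials` as fsspec storage options is an assumption, not the PR's implementation) would be to delegate protocol handling to fsspec:

```python
# Sketch only: resolve the SQL file with fsspec so local, gs://, s3:// and
# other filesystems are handled uniformly.
from typing import Optional

import fsspec


def load_sql_from_filepath(filepath: str, credentials: Optional[dict] = None) -> str:
    with fsspec.open(filepath, mode="r", **(credentials or {})) as f:
        return f.read()
```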
`load_args={"credentialsFile": "/path/to/your/credentials.json"}` | ||
|
||
When passing as a json object: | ||
NOT SUPPORTED |
It says here that JSON credentials are not supported, but the code below does accept `json`. Is this meant to be updated?
```python
        self._metadata = metadata

    def _get_spark_bq_credentials(self) -> dict[str, str]:
```
Sorry, I'm not super familiar with Google BigQuery, but I'm wondering why the credential handling is different from `pandas.GBQTableDataset`? Would the same way not work?
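A plausible reason (an assumption, not confirmed in this PR) is that the Spark BigQuery connector takes credentials as plain reader options, `credentials` (base64-encoded service-account JSON) or `credentialsFile` (a path), rather than as google-auth objects like the pandas dataset does. A rough sketch of such a mapping:

```python
# Rough sketch only (not necessarily the PR's _get_spark_bq_credentials):
# map the three accepted credential forms onto Spark BigQuery read options.
import base64
import json


def spark_bq_credentials(credentials: dict) -> dict:
    if "base64" in credentials:
        return {"credentials": credentials["base64"]}
    if "file" in credentials:
        return {"credentialsFile": credentials["file"]}
    if "json" in credentials:
        encoded = base64.b64encode(json.dumps(credentials["json"]).encode("utf-8"))
        return {"credentials": encoded.decode("utf-8")}
    return {}
```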
Description

- `spark.SparkDataset` does not support reading data from BigQuery using a SQL query.
- Extending `spark.SparkDataset` to support this may not comply with `kedro_datasets` design principles, as it requires making `filepath` an optional argument.
- Like `pandas.GBQQueryDataset`, the `spark.GBQQueryDataset` is also a read-only dataset, hence it is a better-suited implementation for maintaining the overall design of datasets.

Development notes
To test the dataset manually:

1. Set up a GCP project (`project_id`).
2. Create a BigQuery dataset `<project_id>.<test_dataset>`.
3. Create a materialization dataset `<project_id>.<test_mat_dataset>` (used to materialize query results).
4. Create a test table `<project_id>.<test_dataset>.<test_table>` with some data.
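For reference, a minimal usage sketch for such a manual test is shown below; the constructor arguments (`sql`, `materialization_dataset`, `credentials`) are assumed from the discussion in this PR and may not match the final API exactly.

```python
# Manual-test sketch; argument names are assumed from the PR discussion and
# may differ from the merged implementation.
from kedro_datasets.spark import GBQQueryDataset

dataset = GBQQueryDataset(
    sql="SELECT * FROM `<project_id>.<test_dataset>.<test_table>` LIMIT 10",
    materialization_dataset="<test_mat_dataset>",  # BigQuery dataset used to materialize query results
    credentials={"file": "/path/to/credentials.json"},
)

df = dataset.load()  # returns a pyspark.sql.DataFrame
df.show()
```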
Checklist

- Updated `jsonschema/kedro-catalog-X.XX.json` if necessary
- Added a description of this change in the relevant `RELEASE.md` file