A popular data engineering pattern is to use Azure Databricks and Delta Lake format for preparing data for analytics. Currently, there is no official support for reading this data in Azure Machine Learning. This repository highlights a proposed work-around to leverage the underlying Parquet files as a FileDataset in Azure Machine Learning.
As answered in the FAQ section of the Delta Lake and Delta Engine guide, external reads of Delta tables are possible. However, there is a critical step for ensuring the validity of the data when accessing it with tools that cannot interpret the transaction log. You can run `VACUUM` with a retention of **ZERO HOURS** to clean up any stale Parquet files that are no longer part of the table. This operation puts the Parquet files present in DBFS into a consistent state such that they can now be read by external tools.
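To see why this matters, recall how the Delta transaction log works: the `_delta_log` directory holds JSON commit files whose `add` and `remove` actions determine which Parquet files are logically part of the table, while removed files remain physically on disk until `VACUUM` deletes them. A minimal sketch of that replay logic (the `active_files` helper and the commit entries are hypothetical, for illustration only):

```python
import json

def active_files(commit_lines):
    """Replay simplified Delta commit actions to find the Parquet files
    that are currently part of the table."""
    active = set()
    for line in commit_lines:
        action = json.loads(line)
        if "add" in action:
            active.add(action["add"]["path"])
        elif "remove" in action:
            active.discard(action["remove"]["path"])
    return active

# Hypothetical log entries for a table that was written and then overwritten:
log = [
    '{"add": {"path": "part-0000.parquet"}}',
    '{"add": {"path": "part-0001.parquet"}}',
    '{"remove": {"path": "part-0000.parquet"}}',
]

# Only part-0001.parquet is logically part of the table, but part-0000.parquet
# is still on disk; an external Parquet reader would pick up both files until
# VACUUM with zero retention physically deletes the removed one.
print(sorted(active_files(log)))
```

An external reader that simply globs `*.parquet` sees both files; running `VACUUM` with zero retention makes the files on disk match the set of active files.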
- Azure Data Lake Storage Gen2 storage account
- Azure Databricks workspace and cluster
- Mount the ADLS Gen2 storage account to your Databricks cluster
- Azure Machine Learning workspace
- Azure Machine Learning compute instance
- Register the ADLS Gen2 storage account as a Datastore in the Azure Machine Learning workspace
| ❗ To ensure validity of the Delta data, you must run `VACUUM` with a retention of **ZERO HOURS**. |
|---|
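In a Databricks notebook, this can be done with the `VACUUM` command. Delta's safety check rejects retention windows shorter than the configured minimum (7 days by default), so that check must be disabled first. A sketch, assuming a Delta table at a hypothetical mount path:

```python
# Databricks notebook cell (relies on the `spark` session Databricks provides).
# Allow a retention window shorter than the default 7-day minimum:
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# Physically delete all Parquet files no longer referenced by the table
# (the path below is a hypothetical mount point):
spark.sql("VACUUM delta.`/mnt/datalake/sample-delta-table` RETAIN 0 HOURS")
```

Note that disabling the retention check and vacuuming to zero hours will break time travel and any concurrent readers of older table versions, so this is only appropriate for this work-around scenario.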
1. From the Azure Databricks workspace, import the notebook located here: `databricks/create-delta-tables.dbc` and fill in the following values:
   - Client ID
   - Client Secret
   - Tenant ID
   - Container name
   - Storage account name
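These values feed the ADLS Gen2 mount inside the notebook. A sketch of the standard service-principal mount pattern, with placeholder values (the variable names and mount point are hypothetical, not necessarily those used in the notebook):

```python
# Databricks notebook cell (relies on the `dbutils` helper Databricks provides).
# Placeholder values; substitute the ones gathered in the prerequisites.
client_id = "<client-id>"
client_secret = "<client-secret>"
tenant_id = "<tenant-id>"
container_name = "<container-name>"
storage_account_name = "<storage-account-name>"

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": client_id,
    "fs.azure.account.oauth2.client.secret": client_secret,
    "fs.azure.account.oauth2.client.endpoint":
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}

dbutils.fs.mount(
    source=f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/",
    mount_point="/mnt/datalake",  # hypothetical mount point
    extra_configs=configs,
)
```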
2. From the Azure Machine Learning compute instance, import the notebook located here: `azureml/read-delta-data.ipynb` and fill in the following value:
   - Datastore name (the same name used when registering the ADLS Gen2 storage account in the prerequisites)
- This notebook walks through creating a sample Delta table. At various points in the Databricks notebook, switch to the Azure ML notebook to see how the data changes.

| ❗ Without running `VACUUM` with a retention of **ZERO HOURS**, there is a risk that the data will not be accurate. |
|---|
- This notebook walks through creating a FileDataset that points to the Parquet files stored in ADLS Gen2.
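The core of the work-around is pointing a FileDataset at the table's Parquet files through the registered datastore. A minimal sketch using the Azure ML SDK (v1); the datastore name and folder path below are hypothetical:

```python
from azureml.core import Dataset, Datastore, Workspace

# Assumes a config.json for the workspace is present on the compute instance.
ws = Workspace.from_config()

# Hypothetical datastore name (as registered in the prerequisites) and the
# hypothetical folder holding the Delta table's Parquet files in ADLS Gen2.
datastore = Datastore.get(ws, "adls_datastore")
dataset = Dataset.File.from_files(path=(datastore, "sample-delta-table/*.parquet"))

# Download (or mount) the files and read them with any Parquet-capable library.
local_paths = dataset.download(target_path="./data", overwrite=True)
```

Because the dataset glob matches every Parquet file under the folder, the preceding `VACUUM` with zero retention is what guarantees that only files belonging to the current table version are present.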
- Before using this work-around, be sure to run `VACUUM` with a retention of **ZERO HOURS**.