
delta-azure-ml

A popular data engineering pattern is to use Azure Databricks and Delta Lake format for preparing data for analytics. Currently, there is no official support for reading this data in Azure Machine Learning. This repository highlights a proposed work-around to leverage the underlying Parquet files as a FileDataset in Azure Machine Learning.

As answered in the FAQ section of the Delta Lake and Delta Engine guide, external reads of Delta tables are possible. However, there is a critical step to ensure the validity of the data when accessing it with tools that cannot interpret the transaction log: run VACUUM with a retention of ZERO HOURS to clean up any stale Parquet files that are no longer part of the table. This operation puts the Parquet files in DBFS into a consistent state so that they can be read by external tools.

Prerequisites

❗ To ensure the validity of the Delta data, you must run VACUUM with a retention of ZERO HOURS.
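
A minimal sketch of that cleanup step, assuming a Delta table at the hypothetical path /mnt/datalake/sample_table (Delta refuses a retention below the 7-day default unless its safety check is disabled first):

```python
# Run inside a Databricks notebook, where `spark` is predefined.
# The table path below is a placeholder; substitute your own.

# Delta rejects a retention shorter than 168 hours unless this check is off.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# Remove all Parquet files not referenced by the current table version.
spark.sql("VACUUM delta.`/mnt/datalake/sample_table` RETAIN 0 HOURS")
```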

Step 1: Import notebooks

  1. From the Azure Databricks workspace, import the notebook located here: databricks/create-delta-tables.dbc and fill in the following values (see the configuration sketch after this list):

    • Client ID
    • Client Secret
    • Tenant ID
    • Container name
    • Storage account name

  2. From the Azure Machine Learning compute instance, import the notebook located here: azureml/read-delta-data.ipynb and fill in the following value:

    • Datastore name (the same name used when you registered the ADLS Gen2 storage account in the prerequisites)
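
As an illustration of where the Databricks values go, here is a minimal sketch of the Spark configuration for service-principal (OAuth) access to ADLS Gen2; the variable names and placeholder strings are assumptions for illustration, not values from the repository:

```python
# Placeholders for the values collected above.
client_id = "<client-id>"
client_secret = "<client-secret>"  # in practice, read this from a secret scope
tenant_id = "<tenant-id>"
storage_account = "<storage-account-name>"

# Standard ABFS OAuth configuration keys for ADLS Gen2.
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
)
```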

Step 2: Create a sample Delta Table in Databricks

  • This notebook walks through the steps of creating a sample Delta table. At various points in the Databricks notebook, switch to the Azure ML notebook to see how the data changes.
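
For orientation, a minimal sketch of writing a sample Delta table to the container configured above; the schema, path, and placeholder names are illustrative assumptions, not the repository's actual notebook contents:

```python
# `spark` is predefined in Databricks notebooks.
storage_account = "<storage-account-name>"
container = "<container-name>"
path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/sample_table"

# A small sample table: 100 rows with an id and a derived value column.
df = spark.range(0, 100).withColumnRenamed("id", "record_id")
df = df.withColumn("value", df.record_id * 2)

# Write (or overwrite) the data in Delta format.
df.write.format("delta").mode("overwrite").save(path)
```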

Step 3: Read sample Delta data in Azure ML

⚠️ WARNING
Without running VACUUM with a retention of ZERO HOURS, there is a risk that the data will not be accurate.
  • This notebook walks through creating a FileDataset that points to the Parquet files stored in ADLS Gen2.
  • Before using this work-around, be sure to run VACUUM with a retention of ZERO HOURS.
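
A minimal sketch of the FileDataset creation using the azureml-core SDK, assuming the datastore was registered under the hypothetical name adls_gen2_datastore and the table was written under sample_table/:

```python
from azureml.core import Dataset, Datastore, Workspace

# Load the workspace from the config.json available on an Azure ML compute instance.
ws = Workspace.from_config()

# "adls_gen2_datastore" is a placeholder; use the name from the prerequisites.
datastore = Datastore.get(ws, "adls_gen2_datastore")

# Point a FileDataset at the Parquet files that back the Delta table.
dataset = Dataset.File.from_files(path=(datastore, "sample_table/*.parquet"))

# Alternatively, read the same files as a tabular dataset and pull into pandas.
tabular = Dataset.Tabular.from_parquet_files(path=(datastore, "sample_table/*.parquet"))
df = tabular.to_pandas_dataframe()
print(df.head())
```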
