
delta-azure-ml

A popular data engineering pattern is to use Azure Databricks and Delta Lake format for preparing data for analytics. Currently, there is no official support for reading this data in Azure Machine Learning. This repository highlights a proposed work-around to leverage the underlying Parquet files as a FileDataset in Azure Machine Learning.

As answered in the FAQ section of the Delta Lake and Delta Engine guide, external reads of Delta tables are possible. However, there is a critical step to ensure the validity of the data when accessing it with tools that cannot interpret the transaction log: run VACUUM with a retention of ZERO HOURS to clean up any stale Parquet files that are no longer part of the table. This operation puts the Parquet files in DBFS into a consistent state so that they can be read by external tools.

Prerequisites

❗ To ensure the validity of the Delta data, you must run VACUUM with a retention of ZERO HOURS.
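
A minimal sketch of that cleanup step, assuming a Delta table at the hypothetical path /mnt/datalake/sample_table (Delta refuses a retention below the 7-day default unless its safety check is disabled first):

```python
# Run inside a Databricks notebook, where `spark` is predefined.
# The table path below is a placeholder; substitute your own.

# Delta rejects a retention shorter than 168 hours unless this check is off.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# Remove all Parquet files not referenced by the current table version.
spark.sql("VACUUM delta.`/mnt/datalake/sample_table` RETAIN 0 HOURS")
```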

Step 1: Import notebooks

  1. From the Azure Databricks workspace, import the notebook located here: databricks/create-delta-tables.dbc and fill in the following values (see the configuration sketch after this list):

    • Client ID
    • Client Secret
    • Tenant ID
    • Container name
    • Storage account name

  2. From the Azure Machine Learning compute instance, import the notebook located here: azureml/read-delta-data.ipynb and fill in the following value:

    • Datastore name (the same name used when you registered the ADLS Gen2 storage account in the prerequisites)
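
As an illustration of where the Databricks values go, here is a minimal sketch of the Spark configuration for service-principal (OAuth) access to ADLS Gen2; the variable names and placeholder strings are assumptions for illustration, not values from the repository:

```python
# Placeholders for the values collected above.
client_id = "<client-id>"
client_secret = "<client-secret>"  # in practice, read this from a secret scope
tenant_id = "<tenant-id>"
storage_account = "<storage-account-name>"

# Standard ABFS OAuth configuration keys for ADLS Gen2.
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
)
```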

Step 2: Create a sample Delta Table in Databricks

  • This notebook walks through the steps of creating a sample Delta table. At various points in the Databricks notebook, switch to the Azure ML notebook to see how the data changes.
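
For orientation, a minimal sketch of writing a sample Delta table to the container configured above; the schema, path, and placeholder names are illustrative assumptions, not the repository's actual notebook contents:

```python
# `spark` is predefined in Databricks notebooks.
storage_account = "<storage-account-name>"
container = "<container-name>"
path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/sample_table"

# A small sample table: 100 rows with an id and a derived value column.
df = spark.range(0, 100).withColumnRenamed("id", "record_id")
df = df.withColumn("value", df.record_id * 2)

# Write (or overwrite) the data in Delta format.
df.write.format("delta").mode("overwrite").save(path)
```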

Step 3: Read sample Delta data in Azure ML

⚠️ WARNING
Without running VACUUM with a retention of ZERO HOURS, there is a risk that the data will not be accurate.
  • This notebook walks through creating a FileDataset that points to the Parquet files stored in ADLS Gen2.
  • Before using this work-around, be sure to run VACUUM with a retention of ZERO HOURS.
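
A minimal sketch of the FileDataset creation using the azureml-core SDK, assuming the datastore was registered under the hypothetical name adls_gen2_datastore and the table was written under sample_table/:

```python
from azureml.core import Dataset, Datastore, Workspace

# Load the workspace from the config.json available on an Azure ML compute instance.
ws = Workspace.from_config()

# "adls_gen2_datastore" is a placeholder; use the name from the prerequisites.
datastore = Datastore.get(ws, "adls_gen2_datastore")

# Point a FileDataset at the Parquet files that back the Delta table.
dataset = Dataset.File.from_files(path=(datastore, "sample_table/*.parquet"))

# Alternatively, read the same files as a tabular dataset and pull into pandas.
tabular = Dataset.Tabular.from_parquet_files(path=(datastore, "sample_table/*.parquet"))
df = tabular.to_pandas_dataframe()
print(df.head())
```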
