LLM-based PK detector added #543
base: main
Conversation
Thanks for the PR. We should clarify the scope. The primary purpose of PK detection is to use it in the compare_datasets check, for cases where users don't know the primary key columns for comparison. There should also be a way to call this as a standalone method, and the profiler seems to be a good place for it, so a new method callable from the profiler should be added, e.g. detect_primary_keys_with_llm. If we want to generate a uniqueness check from the profiler, it should suggest the existing is_unique check function. Yes, we can add this as another profile and use it for rules generation.
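For illustration, a minimal sketch of what the suggested standalone call could look like; the method name comes from this comment, while the parameters, model name, and return shape are assumptions rather than a settled API:

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.profiler.profiler import DQProfiler

ws = WorkspaceClient()
profiler = DQProfiler(ws)

# Hypothetical standalone call, independent of rule generation.
# The "table"/"model" parameters and the list-of-columns return type are assumed here.
candidate_keys = profiler.detect_primary_keys_with_llm(
    table="catalog.schema.orders",
    model="databricks-meta-llama-3-3-70b-instruct",
)
print(candidate_keys)  # e.g. ["order_id"]
```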
Pull Request Overview
This PR adds LLM-based primary key detection capabilities to the DQX data quality framework. The functionality is completely optional and only activates when explicitly requested by users.
Key changes:
- Implements intelligent primary key detection using Large Language Models via DSPy and Databricks Model Serving
- Adds comprehensive configuration options for LLM-based detection with graceful fallback when dependencies are unavailable
- Integrates seamlessly with existing profiling workflow while maintaining backward compatibility
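To make the "optional, with graceful fallback" behavior concrete, here is a rough sketch of the pattern described above; the module path, class name, and function signature are assumptions, not the PR's exact code:

```python
# Optional-dependency pattern: the LLM path is only imported if available
# and only used when explicitly requested by the caller.
try:
    from databricks.labs.dqx.llm.pk_identifier import LLMPrimaryKeyDetector  # assumed name
    _LLM_AVAILABLE = True
except ImportError:  # DSPy / model-serving dependencies not installed
    LLMPrimaryKeyDetector = None
    _LLM_AVAILABLE = False


def detect_primary_keys(table: str, use_llm: bool = False) -> list[str]:
    """Return candidate PK columns; the LLM path runs only when explicitly requested."""
    if use_llm and _LLM_AVAILABLE:
        return LLMPrimaryKeyDetector().detect(table)
    return []  # graceful fallback when the optional dependencies are unavailable
```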
Reviewed Changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/databricks/labs/dqx/llm/pk_identifier.py | Core LLM detection engine with table metadata analysis and duplicate validation |
| src/databricks/labs/dqx/profiler/profiler.py | Enhanced profiler with LLM detection methods and lazy import handling |
| src/databricks/labs/dqx/profiler/generator.py | Added primary key rule generation with LLM-specific metadata |
| src/databricks/labs/dqx/profiler/runner.py | Updated runner to support table-based profiling with PK detection |
| src/databricks/labs/dqx/config.py | Added LLM configuration fields to ProfilerConfig |
| src/databricks/labs/dqx/check_funcs.py | Implemented is_primary_key validation function |
| tests/unit/test_llm_based_pk_identifier.py | Comprehensive unit tests with graceful dependency handling |
| tests/integration/test_pk_detection_integration.py | End-to-end integration tests for the complete workflow |
| src/databricks/labs/dqx/llm/demo.py | Usage demonstration showing optional LLM activation |
| src/databricks/labs/dqx/llm/README.md | Detailed documentation with examples and best practices |
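For context, a rule produced by the generator using the new is_primary_key check could look roughly like the following; the layout follows the usual DQX check structure, but the argument name and the LLM metadata keys are assumptions rather than the PR's exact output:

```python
# Hypothetical generated rule for a detected key column.
pk_check = {
    "name": "orders_primary_key",
    "criticality": "error",
    "check": {
        "function": "is_primary_key",             # new check function added in this PR
        "arguments": {"columns": ["order_id"]},    # argument name is an assumption
    },
    # LLM-specific metadata mentioned for generator.py; the keys here are assumed.
    "user_metadata": {"pk_detection_method": "llm"},
}
```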
❌ 335/336 passed, 1 flaky, 1 failed, 2 skipped, 3h29m10s total
❌ test_e2e_workflow: pyspark.errors.exceptions.connect.SparkConnectGrpcException: () BAD_REQUEST: session_id is no longer usable. Generate a new session_id by detaching and reattaching the compute and then try again [sessionId=44e61503-95e0-4676-ad4b-5325ef53b894, reason=INACTIVITY_TIMEOUT]. (requestId=282dff53-84a6-475d-9b06-15de584c23af) (10m45.881s)
Flaky tests:
Running from acceptance #2424
Changes
LLM-based PK detector
Linked issues
#484
Resolves #..
Tests