jominjohny
Contributor

Changes

LLM-based PK detector

Linked issues

#484

Resolves #..

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • added end-to-end tests

@jominjohny jominjohny requested a review from a team as a code owner August 25, 2025 05:39
@jominjohny jominjohny requested review from grusin-db and removed request for a team August 25, 2025 05:39
Contributor

@mwojtyczka mwojtyczka left a comment

Thanks for the PR. We should clarify the scope. The primary purpose of PK detection is to support the compare_datasets check in cases where users don't know the primary keys to compare on. It should also be callable as a standalone method, and the profiler seems to be a good place for it, so a new method that can be called from the profiler should be added, e.g. detect_primary_keys_with_llm. If we want to generate a uniqueness check from the profiler, it should suggest the existing is_unique check function. Yes, we can add this as another profile and use it for rule generation.
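
To illustrate the suggestion above, here is a minimal sketch of how a detected primary key candidate could be turned into a suggestion for the existing is_unique check instead of a new check function. The helper names (`has_duplicates`, `suggest_uniqueness_rule`) are hypothetical, the rows are plain Python dicts rather than a Spark DataFrame, and the rule dictionary layout is an assumption that should be verified against the DQX checks format.

```python
from collections import Counter
from typing import Any, Optional


def has_duplicates(rows: list[dict[str, Any]], columns: list[str]) -> bool:
    """Return True if any combination of values in `columns` occurs more than once."""
    counts = Counter(tuple(row[col] for col in columns) for row in rows)
    return any(count > 1 for count in counts.values())


def suggest_uniqueness_rule(
    rows: list[dict[str, Any]], candidate_columns: list[str]
) -> Optional[dict[str, Any]]:
    """Hypothetical helper: map a PK candidate to an is_unique rule suggestion.

    Returns None when the candidate is empty or fails the duplicate check,
    so a wrong guess from the detector never becomes a rule.
    """
    if not candidate_columns or has_duplicates(rows, candidate_columns):
        return None
    # Assumed rule layout; check it against the DQX metadata format before use.
    return {
        "criticality": "error",
        "check": {
            "function": "is_unique",
            "arguments": {"columns": list(candidate_columns)},
        },
    }


sample_rows = [
    {"order_id": 1, "country": "PL"},
    {"order_id": 2, "country": "PL"},
]
rule = suggest_uniqueness_rule(sample_rows, ["order_id"])
rejected = suggest_uniqueness_rule(sample_rows, ["country"])
```

Validating the candidate against the data before emitting a rule keeps the LLM output advisory: only candidates that are actually unique in the sample result in a suggested is_unique check.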

@mwojtyczka mwojtyczka requested a review from Copilot August 29, 2025 10:28
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds LLM-based primary key detection capabilities to the DQX data quality framework. The functionality is completely optional and only activates when explicitly requested by users.

Key changes:

  • Implements intelligent primary key detection using Large Language Models via DSPy and Databricks Model Serving
  • Adds comprehensive configuration options for LLM-based detection with graceful fallback when dependencies are unavailable
  • Integrates seamlessly with existing profiling workflow while maintaining backward compatibility
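
The "graceful fallback when dependencies are unavailable" behaviour listed above can be sketched roughly as follows. This is not the PR's implementation: the function name `detect_primary_keys`, the `module_name` parameter, and the returned status dictionary are assumptions made for illustration, and the actual LLM call is deliberately left out.

```python
import importlib
import importlib.util
import logging

logger = logging.getLogger(__name__)


def detect_primary_keys(table: str, use_llm: bool = False, module_name: str = "dspy") -> dict:
    """Hypothetical entry point showing opt-in activation and lazy import.

    The optional dependency is only looked up when the caller explicitly
    asks for LLM-based detection, so default profiling is unaffected.
    """
    if not use_llm:
        return {"table": table, "status": "skipped"}

    # find_spec returns None for a missing top-level module instead of raising.
    if importlib.util.find_spec(module_name) is None:
        logger.warning(
            "Optional dependency '%s' is not installed; skipping LLM-based PK detection for %s",
            module_name,
            table,
        )
        return {"table": table, "status": "unavailable"}

    # Lazy import: the module is loaded only on this code path.
    importlib.import_module(module_name)
    # The real detection call would go here; it is omitted in this sketch.
    return {"table": table, "status": "ready"}


default_result = detect_primary_keys("main.sales.orders")
missing_result = detect_primary_keys(
    "main.sales.orders", use_llm=True, module_name="module_that_is_not_installed_xyz"
)
```

Checking for the module with `importlib.util.find_spec` before importing avoids wrapping the import in a broad try/except and keeps the fallback path explicit.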

Reviewed Changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/databricks/labs/dqx/llm/pk_identifier.py | Core LLM detection engine with table metadata analysis and duplicate validation |
| src/databricks/labs/dqx/profiler/profiler.py | Enhanced profiler with LLM detection methods and lazy import handling |
| src/databricks/labs/dqx/profiler/generator.py | Added primary key rule generation with LLM-specific metadata |
| src/databricks/labs/dqx/profiler/runner.py | Updated runner to support table-based profiling with PK detection |
| src/databricks/labs/dqx/config.py | Added LLM configuration fields to ProfilerConfig |
| src/databricks/labs/dqx/check_funcs.py | Implemented is_primary_key validation function |
| tests/unit/test_llm_based_pk_identifier.py | Comprehensive unit tests with graceful dependency handling |
| tests/integration/test_pk_detection_integration.py | End-to-end integration tests for the complete workflow |
| src/databricks/labs/dqx/llm/demo.py | Usage demonstration showing optional LLM activation |
| src/databricks/labs/dqx/llm/README.md | Detailed documentation with examples and best practices |

github-actions bot commented Sep 5, 2025

❌ 335/336 passed, 1 flaky, 1 failed, 2 skipped, 3h29m10s total

❌ test_e2e_workflow: pyspark.errors.exceptions.connect.SparkConnectGrpcException: () BAD_REQUEST: session_id is no longer usable. Generate a new session_id by detaching and reattaching the compute and then try again [sessionId=44e61503-95e0-4676-ad4b-5325ef53b894, reason=INACTIVITY_TIMEOUT]. (requestId=282dff53-84a6-475d-9b06-15de584c23af) (10m45.881s)
[gw2] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
14:36 INFO [databricks.labs.dqx.installer.install] Please answer a couple of questions to provide TEST_SCHEMA DQX run configuration. The configuration can also be updated manually after the installation.
14:36 INFO [databricks.labs.dqx.installer.install] DQX will be installed in the TEST_SCHEMA location: '/Users/<your_user>/.dqx'
14:36 INFO [databricks.labs.dqx.installer.install] Installing DQX v0.9.3+2120250919143605
14:36 INFO [databricks.labs.dqx.installer.dashboard_installer] Creating dashboards...
14:36 INFO [databricks.labs.dqx.installer.dashboard_installer] Reading dashboard assets from /home/runner/work/dqx/dqx/src/databricks/labs/dqx/queries/quality/dashboard...
14:36 INFO [databricks.labs.dqx.installer.dashboard_installer] Using 'main.dqx_test.output_table' output table as the source table for the dashboard...
14:36 WARNING [databricks.labs.lsql.dashboards] Parsing : No expression was parsed from ''
14:36 WARNING [databricks.labs.lsql.dashboards] Parsing unsupported field in dashboard.yml: tiles.00_2_dq_error_types.hidden
14:36 WARNING [databricks.labs.lsql.dashboards] Parsing : No expression was parsed from ''
14:36 INFO [databricks.labs.dqx.installer.dashboard_installer] Installing 'DQX_Quality_Dashboard' dashboard in '/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.vMuy/dashboards'
14:36 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=profiler
14:36 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=quality-checker
14:36 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=e2e
14:36 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=589554587133066
14:36 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=589554587133066
14:36 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=589554587133066
14:36 INFO [databricks.labs.dqx.installer.install] Installation completed successfully!
14:36 INFO [databricks.labs.dqx.installer.workflow_installer] Started e2e workflow: https://DATABRICKS_HOST#job/589554587133066/runs/404833260809252
14:46 INFO [databricks.labs.dqx.installer.workflow_installer] Completed e2e workflow run 404833260809252 with state: RunResultState.SUCCESS
14:46 INFO [databricks.labs.dqx.installer.workflow_installer] Completed e2e workflow run 404833260809252 duration: 0:10:08.259000 (2025-09-19 14:36:22.427000+00:00 thru 2025-09-19 14:46:30.686000+00:00)
14:46 INFO [databricks.labs.dqx.installer.workflow_installer] ---------- REMOTE LOGS --------------
14:46 INFO [databricks.labs.dqx:prepare] DQX v0.9.3+2120250919143605 After workflow finishes, see debug logs at /Workspace/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.vMuy/logs/e2e/run-404833260809252-0/prepare.log
14:46 INFO [databricks.labs.dqx.quality_checker.e2e_workflow:prepare] End-to-end: prepare start for run config: TEST_SCHEMA
14:46 INFO [databricks.labs.dqx:finalize] DQX v0.9.3+2120250919143605 After workflow finishes, see debug logs at /Workspace/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.vMuy/logs/e2e/run-404833260809252-0/finalize.log
14:46 INFO [databricks.labs.dqx.quality_checker.e2e_workflow:finalize] End-to-end: finalize complete for run config: TEST_SCHEMA
14:46 INFO [databricks.labs.dqx.quality_checker.e2e_workflow:finalize] For more details please check the run logs of the profiler and quality checker jobs.
14:46 INFO [databricks.labs.dqx.installer.workflow_installer] ---------- END REMOTE LOGS ----------
14:46 INFO [databricks.labs.dqx.checks_storage] Loading quality rules (checks) from '/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.vMuy/checks.yml' in the workspace.
14:46 INFO [databricks.labs.dqx.installer.install] Deleting DQX v0.9.3+2120250919143605 from https://DATABRICKS_HOST
14:46 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=945736945213779, as it is no longer needed
14:46 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=676532413148522, as it is no longer needed
14:46 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=589554587133066, as it is no longer needed
14:46 INFO [databricks.labs.dqx.installer.install] Uninstalling DQX complete

Flaky tests:

  • 🤪 test_e2e_workflow (12m13.531s)

Running from acceptance #2424
