LLM-based PK detector added #543
base: main
Conversation
Thanks for the PR. We should clarify the scope. The primary purpose of PK detection is to use it in the compare_datasets check, for cases where users don't know the primary key columns for comparison. There should also be a way to call this as a standalone method, and the profiler seems to be a good place for it, so a new method callable from the profiler should be added, e.g. detect_primary_keys_with_llm. If we want to generate a uniqueness check from the profiler, it should suggest the existing is_unique check function. Yes, we can add this as another profile and use it for rules generation.
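For illustration, a minimal sketch of what the suggested standalone call could look like; the method name comes from this comment, while the parameters, model name, and return shape are assumptions rather than a settled API:

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.profiler.profiler import DQProfiler

ws = WorkspaceClient()
profiler = DQProfiler(ws)

# Hypothetical standalone call, independent of rule generation.
# The "table"/"model" parameters and the list-of-columns return type are assumed here.
candidate_keys = profiler.detect_primary_keys_with_llm(
    table="catalog.schema.orders",
    model="databricks-meta-llama-3-3-70b-instruct",
)
print(candidate_keys)  # e.g. ["order_id"]
```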
Pull Request Overview
This PR adds LLM-based primary key detection capabilities to the DQX data quality framework. The functionality is completely optional and only activates when explicitly requested by users.
Key changes:
- Implements intelligent primary key detection using Large Language Models via DSPy and Databricks Model Serving
- Adds comprehensive configuration options for LLM-based detection with graceful fallback when dependencies are unavailable
- Integrates seamlessly with existing profiling workflow while maintaining backward compatibility
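To make the "optional, with graceful fallback" behavior concrete, here is a rough sketch of the pattern described above; the module path, class name, and function signature are assumptions, not the PR's exact code:

```python
# Optional-dependency pattern: the LLM path is only imported if available
# and only used when explicitly requested by the caller.
try:
    from databricks.labs.dqx.llm.pk_identifier import LLMPrimaryKeyDetector  # assumed name
    _LLM_AVAILABLE = True
except ImportError:  # DSPy / model-serving dependencies not installed
    LLMPrimaryKeyDetector = None
    _LLM_AVAILABLE = False


def detect_primary_keys(table: str, use_llm: bool = False) -> list[str]:
    """Return candidate PK columns; the LLM path runs only when explicitly requested."""
    if use_llm and _LLM_AVAILABLE:
        return LLMPrimaryKeyDetector().detect(table)
    return []  # graceful fallback when the optional dependencies are unavailable
```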
Reviewed Changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/databricks/labs/dqx/llm/pk_identifier.py | Core LLM detection engine with table metadata analysis and duplicate validation |
| src/databricks/labs/dqx/profiler/profiler.py | Enhanced profiler with LLM detection methods and lazy import handling |
| src/databricks/labs/dqx/profiler/generator.py | Added primary key rule generation with LLM-specific metadata |
| src/databricks/labs/dqx/profiler/runner.py | Updated runner to support table-based profiling with PK detection |
| src/databricks/labs/dqx/config.py | Added LLM configuration fields to ProfilerConfig |
| src/databricks/labs/dqx/check_funcs.py | Implemented is_primary_key validation function |
| tests/unit/test_llm_based_pk_identifier.py | Comprehensive unit tests with graceful dependency handling |
| tests/integration/test_pk_detection_integration.py | End-to-end integration tests for the complete workflow |
| src/databricks/labs/dqx/llm/demo.py | Usage demonstration showing optional LLM activation |
| src/databricks/labs/dqx/llm/README.md | Detailed documentation with examples and best practices |
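For context, a rule produced by the generator using the new is_primary_key check could look roughly like the following; the layout follows the usual DQX check structure, but the argument name and the LLM metadata keys are assumptions rather than the PR's exact output:

```python
# Hypothetical generated rule for a detected key column.
pk_check = {
    "name": "orders_primary_key",
    "criticality": "error",
    "check": {
        "function": "is_primary_key",             # new check function added in this PR
        "arguments": {"columns": ["order_id"]},    # argument name is an assumption
    },
    # LLM-specific metadata mentioned for generator.py; the keys here are assumed.
    "user_metadata": {"pk_detection_method": "llm"},
}
```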
❌ 335/336 passed, 1 flaky, 1 failed, 2 skipped, 3h29m10s total
❌ test_e2e_workflow: pyspark.errors.exceptions.connect.SparkConnectGrpcException: () BAD_REQUEST: session_id is no longer usable. Generate a new session_id by detaching and reattaching the compute and then try again [sessionId=44e61503-95e0-4676-ad4b-5325ef53b894, reason=INACTIVITY_TIMEOUT]. (requestId=282dff53-84a6-475d-9b06-15de584c23af) (10m45.881s)
Flaky tests:
Running from acceptance #2424
Changes
LLM-based PK detector
Linked issues
#484
Resolves #..
Tests