-
Notifications
You must be signed in to change notification settings - Fork 2.4k
Description
Is your feature request related to a problem? Please describe
Implement Windowed Bulk Scoring Approach for ApproximateBooleanQuery
Problem Statement
The current ResumableDISI approach for ApproximateBooleanQuery outlined in this RFC has a fundamental issue: BKD trees traverse documents by point values, not docID order. This causes the conjunction to miss lower docIDs that appear later in the traversal sequence, resulting in fewer hits than expected.
Root Cause
- BKD trees organize leaves by spatial/temporal proximity (point values), not docIDs
- Lower docIDs can appear later in BKD traversal order
- When ResumableDISI stops at 10k documents, it may miss lower docIDs that come later
- ConjunctionDISI positions past these missed docIDs, permanently excluding potential matches
Example Issue
BKD Traversal Order: [737k, 825k, 847k, ..., 698k, 699k, 700k, ...]
↑ ↑
Early stop Dense cluster (missed!)
ConjunctionDISI Position: 737348 → Can never go back to find 698k cluster
Describe the solution you'd like
Implemented Solution: Windowed Approach
Based on Alternative 1 from the RFC, we implemented a windowed bulk scoring approach:
Key Features
- Window Size Heuristic:
min(minCost, maxCost/2^7)
- Progressive Expansion: 3x growth (10k → 30k → 90k → 270k)
- Fresh Iterators: Each window creates completely new iterators (no BKD state)
- CancellableBulkScorer Compatible: Returns lower docIDs when window expands
Implementation Details
- ApproximateBooleanScorerSupplier: Manages windowed expansion logic
- ApproximatePointRangeQuery: Modified to accept dynamic window sizes via
getWithSize()
- Fresh BKD Traversal: Each window starts from root, avoiding ResumableDISI and BKD state
Future Optimization Opportunity
The current windowed approach has repeated work overhead - each window re-processes documents from the beginning.
Proposed Enhancement
Store a copy of the most recent DISI as an instance variable and use public void visit(DocIdSetIterator iterator)
in ApproximatePointRangeQuery to:
- Resume from last position: Add the stored DISI to the currently constructed one using BKDState
- Eliminate repeated collection: Only collect new documents beyond the previous window
- Minimize overhead: Only leapfrogging conjunction work remains, no document re-processing
Benefits
- Performance: Eliminates redundant document collection across windows
- Efficiency: Maintains windowed approach benefits while reducing computational overhead
- Scalability: Better performance for queries requiring large windows
Advanced Expansion Strategies
The current 3x expansion is arbitrary and may not be optimal for all use cases. Future enhancements could include:
1. Cost-Based Fast Expansion
Expand iterators in order of their costs for dense hit scenarios:
- Example: Iterators with costs
(200k, 800k, 10M, 30M)
- Strategy: Expand by 200k first, then 800k, then 10M, etc.
- Benefit: Dense datasets find hits quickly with smaller expansions
2. Hit-Based Adaptive Expansion
Scale expansion based on conjunction hit density:
- Initial:
max(10k, minCost/2^7)
- Adaptive Scaling:
- 1k hits found → expand by 10x
- 5k hits found → expand by 2x
- High hit density → smaller expansions
- Low hit density → larger expansions
3. User-Configurable Strategies
Allow users to choose expansion strategy based on their data characteristics:
- Fast Expansion: Cost-ordered for dense datasets
- Adaptive Expansion: Hit-density based scaling
- Fixed Expansion: Current 3x approach (default)
- Custom: User-defined expansion factors
Implementation Considerations
- Strategy Selection: Query-level or index-level configuration
- Performance Metrics: Track expansion efficiency for auto-tuning
- Fallback Logic: Default to fixed expansion if adaptive strategies fail
Related component
Search:Performance
Describe alternatives you've considered
No response
Additional context
No response
Metadata
Metadata
Assignees
Labels
Type
Projects
Status
Status