Skip to content

[Approximation Framework] Multifold Improvement in Multi-Clause Boolean Query (Window Scoring Approach) #19045

@sawansri

Description

@sawansri

Is your feature request related to a problem? Please describe

Implement Windowed Bulk Scoring Approach for ApproximateBooleanQuery

Problem Statement

The current ResumableDISI approach for ApproximateBooleanQuery outlined in this RFC has a fundamental issue: BKD trees traverse documents by point values, not docID order. This causes the conjunction to miss lower docIDs that appear later in the traversal sequence, resulting in fewer hits than expected.

Root Cause

  • BKD trees organize leaves by spatial/temporal proximity (point values), not docIDs
  • Lower docIDs can appear later in BKD traversal order
  • When ResumableDISI stops at 10k documents, it may miss lower docIDs that come later
  • ConjunctionDISI positions past these missed docIDs, permanently excluding potential matches

Example Issue

BKD Traversal Order: [737k, 825k, 847k, ..., 698k, 699k, 700k, ...]
                      ↑                        ↑
                   Early stop              Dense cluster (missed!)

ConjunctionDISI Position: 737348 → Can never go back to find 698k cluster

Describe the solution you'd like

Implemented Solution: Windowed Approach

Based on Alternative 1 from the RFC, we implemented a windowed bulk scoring approach:

Key Features

  1. Window Size Heuristic: min(minCost, maxCost/2^7)
  2. Progressive Expansion: 3x growth (10k → 30k → 90k → 270k)
  3. Fresh Iterators: Each window creates completely new iterators (no BKD state)
  4. CancellableBulkScorer Compatible: Returns lower docIDs when window expands

Implementation Details

  • ApproximateBooleanScorerSupplier: Manages windowed expansion logic
  • ApproximatePointRangeQuery: Modified to accept dynamic window sizes via getWithSize()
  • Fresh BKD Traversal: Each window starts from root, avoiding ResumableDISI and BKD state

Future Optimization Opportunity

The current windowed approach has repeated work overhead - each window re-processes documents from the beginning.

Proposed Enhancement

Store a copy of the most recent DISI as an instance variable and use public void visit(DocIdSetIterator iterator) in ApproximatePointRangeQuery to:

  1. Resume from last position: Add the stored DISI to the currently constructed one using BKDState
  2. Eliminate repeated collection: Only collect new documents beyond the previous window
  3. Minimize overhead: Only leapfrogging conjunction work remains, no document re-processing

Benefits

  • Performance: Eliminates redundant document collection across windows
  • Efficiency: Maintains windowed approach benefits while reducing computational overhead
  • Scalability: Better performance for queries requiring large windows

Advanced Expansion Strategies

The current 3x expansion is arbitrary and may not be optimal for all use cases. Future enhancements could include:

1. Cost-Based Fast Expansion

Expand iterators in order of their costs for dense hit scenarios:

  • Example: Iterators with costs (200k, 800k, 10M, 30M)
  • Strategy: Expand by 200k first, then 800k, then 10M, etc.
  • Benefit: Dense datasets find hits quickly with smaller expansions

2. Hit-Based Adaptive Expansion

Scale expansion based on conjunction hit density:

  • Initial: max(10k, minCost/2^7)
  • Adaptive Scaling:
    • 1k hits found → expand by 10x
    • 5k hits found → expand by 2x
    • High hit density → smaller expansions
    • Low hit density → larger expansions

3. User-Configurable Strategies

Allow users to choose expansion strategy based on their data characteristics:

  • Fast Expansion: Cost-ordered for dense datasets
  • Adaptive Expansion: Hit-density based scaling
  • Fixed Expansion: Current 3x approach (default)
  • Custom: User-defined expansion factors

Implementation Considerations

  • Strategy Selection: Query-level or index-level configuration
  • Performance Metrics: Track expansion efficiency for auto-tuning
  • Fallback Logic: Default to fixed expansion if adaptive strategies fail

Related component

Search:Performance

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

Status

🆕 New

Status

In Progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions