Add TandemRequestProvider for combined RequestList and RequestQueue usage #2914

Open
CodeMan62 wants to merge 1 commit into master from feat/tandem-request-provider


CodeMan62

Overview

This PR introduces two new components to improve request management in Crawlee:

  • TandemRequestProvider: A provider that seamlessly combines RequestList and RequestQueue, allowing crawlers to use both sources efficiently.
  • RequestListAdapter: An adapter that makes RequestList compatible with the IRequestProvider interface.
    Fixes #

Problem Solved

When using both RequestList and RequestQueue in a crawler, users often need to implement custom logic to ensure URLs aren't processed twice. This implementation standardizes and simplifies this pattern by:

  1. First processing requests from the RequestList
  2. Automatically transferring these requests to the RequestQueue in the background
  3. Ensuring each URL is processed exactly once
  4. Gracefully transitioning to RequestQueue after the list is exhausted
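
The four steps above can be sketched with a small in-memory model. This is only an illustration, not the code from this PR: `InMemoryList`, `InMemoryQueue`, `TandemSketch`, and `drain` are hypothetical names, the methods are synchronous for brevity (Crawlee's real providers are asynchronous), and the transfer is done lazily on each fetch instead of in the background. Deduplicating by `uniqueKey` and serving list requests ahead of queued ones are assumptions about the intended behavior.

```typescript
// Hypothetical, simplified stand-ins. These are NOT Crawlee's classes.
interface SimpleRequest {
    url: string;
    uniqueKey: string; // assumption: deduplication key, here simply the URL
}

class InMemoryList {
    private index = 0;
    private readonly urls: readonly string[];

    constructor(urls: readonly string[]) {
        this.urls = urls;
    }

    // Returns the next list entry, or null once the list is exhausted.
    fetchNext(): SimpleRequest | null {
        const url = this.urls[this.index];
        if (url === undefined) return null;
        this.index += 1;
        return { url, uniqueKey: url };
    }
}

class InMemoryQueue {
    private readonly seen = new Set<string>();
    private readonly pending: SimpleRequest[] = [];

    // Returns false when a request with the same uniqueKey was already added.
    add(request: SimpleRequest, forefront = false): boolean {
        if (this.seen.has(request.uniqueKey)) return false;
        this.seen.add(request.uniqueKey);
        if (forefront) this.pending.unshift(request);
        else this.pending.push(request);
        return true;
    }

    fetchNext(): SimpleRequest | null {
        return this.pending.shift() ?? null;
    }
}

class TandemSketch {
    private readonly list: InMemoryList;
    private readonly queue: InMemoryQueue;

    constructor(list: InMemoryList, queue: InMemoryQueue) {
        this.list = list;
        this.queue = queue;
    }

    fetchNext(): SimpleRequest | null {
        // Steps 1-2: take requests from the list and hand them to the queue.
        for (let req = this.list.fetchNext(); req !== null; req = this.list.fetchNext()) {
            // Step 3: the queue rejects keys it has already seen, so duplicates are skipped.
            if (this.queue.add(req, true)) return this.queue.fetchNext();
        }
        // Step 4: the list is exhausted, so serve whatever remains in the queue.
        return this.queue.fetchNext();
    }
}

// Pulls requests until the provider is empty and returns the URLs in processing order.
function drain(provider: TandemSketch): string[] {
    const processed: string[] = [];
    for (let req = provider.fetchNext(); req !== null; req = provider.fetchNext()) {
        processed.push(req.url);
    }
    return processed;
}
```

In this model, a URL that appears in both the list and the queue is served a single time, from whichever source registered it first.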

Implementation Details

  • The RequestListAdapter wraps a RequestList instance and adapts its interface to match IRequestProvider.
  • The TandemRequestProvider orchestrates the flow between the list and queue:
    • It implements the IRequestProvider interface
    • Handles request processing through the queue while ensuring list requests are enqueued first
    • Maintains proper request state tracking across both sources
    • Implements a background transfer mechanism to move requests efficiently
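
As a rough illustration of the adapter idea, the sketch below wraps a read-only source behind a provider-style contract. Everything here is hypothetical: `ProviderLike` is a much smaller, synchronous stand-in for `IRequestProvider`, `ReadOnlyList` is not `RequestList`, and rejecting `addRequest` on a list-backed provider is an assumption, not necessarily what the PR's `RequestListAdapter` does.

```typescript
// Hypothetical, simplified provider contract (the real IRequestProvider is asynchronous and larger).
interface ProviderLike {
    fetchNextRequest(): string | null;
    addRequest(url: string): void;
    isFinished(): boolean;
}

// Hypothetical read-only source standing in for a list of start URLs.
class ReadOnlyList {
    private cursor = 0;
    private readonly urls: readonly string[];

    constructor(urls: readonly string[]) {
        this.urls = urls;
    }

    next(): string | null {
        const url = this.urls[this.cursor];
        if (url === undefined) return null;
        this.cursor += 1;
        return url;
    }

    remaining(): number {
        return this.urls.length - this.cursor;
    }
}

// Adapter: exposes the list through the provider contract without changing the list itself.
class ListAdapter implements ProviderLike {
    private readonly list: ReadOnlyList;

    constructor(list: ReadOnlyList) {
        this.list = list;
    }

    fetchNextRequest(): string | null {
        return this.list.next();
    }

    // Assumption: a list-backed provider cannot grow, so adding is rejected.
    addRequest(_url: string): void {
        throw new Error('A read-only list cannot accept new requests');
    }

    isFinished(): boolean {
        return this.list.remaining() === 0;
    }
}
```

Code written against `ProviderLike` can then consume the list and a queue through the same calls, which is what lets a combining provider treat both sources uniformly.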

Integration

The implementation is already integrated into BasicCrawler._initializeRequestProviders(), making it immediately available to all crawler types without requiring additional configuration changes.

Backward Compatibility

This change is fully backward compatible:

  • Existing code using either RequestList or RequestQueue alone will continue to work without changes
  • This introduces an enhancement without modifying existing behavior

Testing

The implementation was tested to ensure it correctly:

  • Transfers requests from list to queue
  • Processes requests in the correct order
  • Handles errors appropriately
  • Maintains proper request state

This feature should simplify crawlers that use both request sources and remove the need for hand-written deduplication logic, a common source of bugs.

@barjin
Contributor

barjin commented Apr 5, 2025

Please merge the latest master into your branch so we only get your changes in the diff. Cheers!

@CodeMan62 force-pushed the feat/tandem-request-provider branch from fbd4a80 to 7b7260e on April 5, 2025 19:42
@CodeMan62
Author

@barjin Everything is done now. You can review the PR, thanks.
