Add TandemRequestProvider for combined RequestList and RequestQueue usage #2914
Overview
This PR introduces two new components to improve request management in Crawlee:
- `TandemRequestProvider`: A provider that seamlessly combines `RequestList` and `RequestQueue`, allowing crawlers to use both sources efficiently.
- `RequestListAdapter`: An adapter that makes `RequestList` compatible with the `IRequestProvider` interface.
Fixes #
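To illustrate the adapter idea, here is a minimal sketch of wrapping a read-only list so it satisfies a provider-style interface. Every name in it (`SimpleRequest`, `SimpleRequestList`, `SimpleRequestProvider`, `SimpleRequestListAdapter`) is a hypothetical stand-in, not Crawlee's actual types, and rejecting writes is an assumption of the sketch rather than the PR's confirmed behavior.

```typescript
// Hypothetical, simplified shapes; not Crawlee's actual type definitions.
interface SimpleRequest {
    url: string;
}

// Stand-in for a static list of requests that can only be read.
class SimpleRequestList {
    private index = 0;

    constructor(private readonly urls: string[]) {}

    async fetchNextRequest(): Promise<SimpleRequest | null> {
        return this.index < this.urls.length ? { url: this.urls[this.index++] } : null;
    }

    async isEmpty(): Promise<boolean> {
        return this.index >= this.urls.length;
    }
}

// Stand-in for the provider interface a crawler would consume.
interface SimpleRequestProvider {
    fetchNextRequest(): Promise<SimpleRequest | null>;
    isEmpty(): Promise<boolean>;
    addRequest(request: SimpleRequest): Promise<void>;
}

// The adapter forwards reads to the wrapped list and rejects writes,
// because a static list cannot accept new requests (assumption of this sketch).
class SimpleRequestListAdapter implements SimpleRequestProvider {
    constructor(private readonly list: SimpleRequestList) {}

    fetchNextRequest(): Promise<SimpleRequest | null> {
        return this.list.fetchNextRequest();
    }

    isEmpty(): Promise<boolean> {
        return this.list.isEmpty();
    }

    async addRequest(_request: SimpleRequest): Promise<void> {
        throw new Error('The wrapped request list is read-only in this sketch');
    }
}
```

Code written against `SimpleRequestProvider` can then consume a list without knowing that it is a list, which is the property the adapter is meant to provide.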
Problem Solved
When using both `RequestList` and `RequestQueue` in a crawler, users often need to implement custom logic to ensure URLs aren't processed twice. This implementation standardizes and simplifies this pattern by:
- [...] `RequestList`
- [...] `RequestQueue` in the background
- [...] `RequestQueue` after the list is exhausted
Implementation Details
`RequestListAdapter` wraps a `RequestList` instance and adapts its interface to match `IRequestProvider`.
`TandemRequestProvider` orchestrates the flow between the list and queue:
- [...] `IRequestProvider` interface
Integration
The implementation is already integrated into `BasicCrawler._initializeRequestProviders()`, making it immediately available to all crawler types without requiring additional configuration changes.
Backward Compatibility
This change is fully backward compatible:
- [...] `RequestList` or `RequestQueue` alone will continue to work without changes
Testing
The implementation was tested to ensure it correctly:
This feature will simplify many crawler implementations and eliminate a common source of bugs when dealing with multiple request sources.
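The list-then-queue flow described under Problem Solved can be sketched roughly as follows. This is a simplified illustration, not the PR's code: `SketchRequest`, `SketchQueue`, `SketchTandemProvider`, and `markSeen` are hypothetical names, deduplication is by URL only, and the background transfer to the queue is not modeled. The list is simply read first while the queue remembers which URLs have been seen.

```typescript
// Hypothetical, simplified types; Crawlee's real classes have richer APIs.
interface SketchRequest {
    url: string;
}

// In-memory queue that remembers every URL it has seen.
class SketchQueue {
    private readonly seen = new Set<string>();
    private readonly pending: SketchRequest[] = [];

    // Records the URL; returns false when it was already known.
    markSeen(url: string): boolean {
        if (this.seen.has(url)) return false;
        this.seen.add(url);
        return true;
    }

    async addRequest(request: SketchRequest): Promise<void> {
        // Only enqueue URLs that have not been seen from either source.
        if (this.markSeen(request.url)) this.pending.push(request);
    }

    async fetchNextRequest(): Promise<SketchRequest | null> {
        return this.pending.shift() ?? null;
    }
}

// Serves the static list first, then falls back to the queue.
class SketchTandemProvider {
    private listIndex = 0;

    constructor(
        private readonly listUrls: string[],
        private readonly queue: SketchQueue,
    ) {}

    // Requests discovered while crawling go to the queue.
    addRequest(request: SketchRequest): Promise<void> {
        return this.queue.addRequest(request);
    }

    async fetchNextRequest(): Promise<SketchRequest | null> {
        // Phase 1: drain the list, skipping URLs the queue already knows.
        while (this.listIndex < this.listUrls.length) {
            const url = this.listUrls[this.listIndex++];
            if (this.queue.markSeen(url)) return { url };
        }
        // Phase 2: the list is exhausted, so serve from the queue.
        return this.queue.fetchNextRequest();
    }
}
```

Because both sources share one record of seen URLs, a URL that appears in the list and is also enqueued during crawling is handed out only once, regardless of which source reaches it first.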