feat: add respectRobotsTxtFile crawler option #2910

Merged · 10 commits into master from robots-file-check · Apr 4, 2025

Conversation

@B4nan (Member) commented Apr 2, 2025

This PR implements automatic skipping of requests based on the robots.txt file. It is controlled by a new boolean flag in the crawler options called respectRobotsTxtFile.

Checks are implemented in BasicCrawler._runTaskFunction(), enqueueLinks (the context helper), and BasicCrawler.addRequests(), so we ensure disallowed requests are neither processed nor added to the queue (the enqueue-time filtering only works with the enqueueLinks context helper). The robots.txt files are cached based on the URL origin, so we should fetch each one only once per domain (up to 1000 domains, backed by an LRU cache). We don't use a proxy to fetch the robots.txt file, which should be fine, since the file should generally be accessible to robots.

We might want to enable this by default in the next major.
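
For illustration, a minimal usage sketch (it assumes a CheerioCrawler; since the checks live in BasicCrawler, other crawler classes built on it accept the option as well):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Skip any request whose URL is disallowed by the target site's robots.txt file.
    respectRobotsTxtFile: true,
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        // Disallowed links are filtered out before they reach the request queue.
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);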

@github-actions bot added this to the "111th sprint - Tooling team" milestone Apr 2, 2025
@github-actions bot added the t-tooling label (Issues with this label are in the ownership of the tooling team.) Apr 2, 2025
@B4nan added the adhoc label (Ad-hoc unplanned task added during the sprint.) Apr 2, 2025
@github-actions bot added the tested label (Temporary label used only programmatically for some analytics.) Apr 2, 2025
@B4nan marked this pull request as ready for review April 2, 2025 13:58
@B4nan requested review from janbuchar and barjin April 2, 2025 14:01
@metalwarrior665 (Member) left a comment

Just a small feature request, otherwise LGTM.

        return robotsFile.isAllowed(request.url);
    });
}

Member commented:

If it is simple, I would also support this filter in crawler.addRequests. I know it is a small wrapper around requestQueue.addRequests, but since it is on the crawler object, users will expect it to respect robots.txt. The disallowed requests would be dropped later when fetched anyway, but polluting and draining the queue is bad for performance.

@B4nan (Member, Author) commented:

Right, I was thinking about that one as well. It should be simple, will do.

@B4nan (Member, Author) commented:

Implemented via e230191.

Btw, this is just a perf optimization; technically, it was already working this way, since we check whether the request is valid inside _runTaskFunction. Now we also skip adding the disallowed ones to the queue via crawler.addRequests, like we do with enqueueLinks.
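
For illustration, a sketch of what the addRequests filtering means in practice (the URLs and the disallowed path are made up for the example):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    respectRobotsTxtFile: true,
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});

// If https://example.com/robots.txt disallows /private/, the second URL is
// dropped here and never reaches the request queue; the first one is enqueued.
await crawler.addRequests([
    'https://example.com/',
    'https://example.com/private/page',
]);

await crawler.run();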

Contributor commented:

like we do with enqueueLinks

Note that for enqueueLinks, we'll only check against the current sitemap (possibly enqueueing forbidden different-domain links), but with addRequests, we'll check the robots.txt files for all of the links separately (possibly downloading many robots.txt files).

It kinda makes sense to me (and as you're saying, it's just a matter of performance), just making sure we all understand this right.

Contributor commented:

(Nvm, the performance difference is just RQ utilization; the requests to robots.txt files will be made either way.)

@B4nan (Member, Author) commented:

Note that for enqueueLinks, we'll only check against the current sitemap

We check the URLs against the robots.txt of the originating request (I guess the "sitemap" is a typo? We don't fetch/check sitemaps here). If there is a link that goes outside of the domain, it will be enqueued as usual (if allowed by the enqueue strategy) and checked again when processed. With addRequests, we don't know where the requests came from, so we need to check them one by one. We have a cache for this, so if they are all from the same domain, we only fetch the robots.txt file once.
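
A rough sketch of the per-origin caching described here (illustrative only: the plain Map, the function name, and the @crawlee/utils import path are assumptions; the actual code uses an LRU cache and lives on the crawler, as the diff excerpts below show):

import { RobotsFile } from '@crawlee/utils';

const robotsFileCache = new Map<string, RobotsFile>();

async function isAllowedByRobotsTxt(url: string): Promise<boolean> {
    const { origin } = new URL(url);

    // Fetch robots.txt at most once per origin; later URLs from the same
    // domain reuse the cached file.
    let robotsFile = robotsFileCache.get(origin);
    if (!robotsFile) {
        robotsFile = await RobotsFile.find(url);
        robotsFileCache.set(origin, robotsFile);
    }

    return robotsFile.isAllowed(url);
}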

@barjin (Contributor) left a comment

LGTM, only a few questions/nits.

Comment on lines 1170 to 1177

    const robotsFile = await RobotsFile.find(url);
    this.robotsFileCache.add(origin, robotsFile);

    return robotsFile;
} catch (e: any) {
    this.log.warning(`Failed to fetch robots.txt for request ${url}`);
    return undefined;
}
Contributor commented:

If the server returns a non-404 error response for /robots.txt (RobotsFile.find() throws an error), we'll keep trying to download it for every single request from that origin (possibly never succeeding)

I'm fine with that (respectRobotsFile is opt-in, the user has to want it), just making sure this is the expected behavior.

@B4nan (Member, Author) commented:

Hmm, I am not really sure this is something to fix; it feels correct to check again in case of failure. Maybe we could skip all 4xx codes, but for 5xx it feels wrong to give up retrying. One way or the other, this feels almost unreachable: if the site cannot serve a proper response for a static file like robots.txt, it's quite likely no requests to it will work at all.

I'd vote to keep this as is and deal with issues when someone reports them, so we know what edge case we are trying to solve.
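
For reference, a rough sketch of the "skip 4xx, keep retrying on 5xx" idea as a variation of the catch block quoted above (hypothetical, not part of this PR; it assumes the thrown error exposes an HTTP statusCode and that the cache accepts undefined entries, neither of which is verified here):

try {
    const robotsFile = await RobotsFile.find(url);
    this.robotsFileCache.add(origin, robotsFile);

    return robotsFile;
} catch (e: any) {
    const statusCode = e?.response?.statusCode; // assumed error shape, illustrative only
    if (statusCode !== undefined && statusCode >= 400 && statusCode < 500) {
        // Client errors: treat as "no robots.txt" and cache the miss so we don't refetch.
        this.robotsFileCache.add(origin, undefined);
    }
    // Server errors (5xx) are not cached, so the next request from this origin retries.
    this.log.warning(`Failed to fetch robots.txt for request ${url}`);
    return undefined;
}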

@jancurn (Member) commented Apr 2, 2025

Cool, well done! Can we please call the option respectRobotsTxtFile instead of respectRobotsFile? That's what the standard is called; there's no such thing as a "robots file". And ideally, rename the other user-facing parameter names similarly, for consistency.

@B4nan (Member, Author) commented Apr 2, 2025

I can rename the options added as part of this PR, sure. I'd leave the renaming of RobotsFile (with an alias so we're not breaking) for a separate PR.

@B4nan changed the title from "feat: add respectRobotsFile crawler option" to "feat: add respectRobotsTxtFile crawler option" Apr 3, 2025
Comment on lines 1051 to 1057

for (const request of requests) {
    const url = typeof request === 'string' ? request : request.url!;

    if (await this.isAllowedBasedOnRobotsTxtFile(url)) {
        allowedRequests.push(request);
    }
}
@barjin (Contributor) commented Apr 3, 2025

Random note: addRequests() is called with the first argument of crawler.run().

If we're planning to make this the default in 4.0, it would mean that the crawler spends the first few seconds loading the robots.txt files and only then starts processing the actual URLs.

Probably no biggie for the majority of users (so IMO no need for action now), but for people crawling loads of URLs from different domains, this might come as a surprise (there's no logging aside from debug, so the crawler might seem "stuck").

@B4nan (Member, Author) commented Apr 3, 2025

I hope it won't be seconds, since the request is fired without a proxy and the file will/should be really small. But we can always adjust; I am sure we'll learn about all those edge cases once we start adopting this in WCC.

Maybe we could print a warning if the file is not fetched within a second or something similar.
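
One possible shape of that warning, as a hedged sketch (the helper name, the one-second threshold, and the @crawlee/utils import path are assumptions; RobotsFile.find is the lookup used in this PR's diff):

import { RobotsFile } from '@crawlee/utils';

const ROBOTS_TXT_WARNING_TIMEOUT_MS = 1_000; // hypothetical threshold

async function findRobotsFileWithWarning(
    url: string,
    log: { warning(message: string): void },
): Promise<RobotsFile> {
    const { origin } = new URL(url);
    // Warn once if the robots.txt fetch is slow, so the crawler doesn't appear "stuck".
    const timer = setTimeout(() => {
        log.warning(`Still waiting for ${origin}/robots.txt after ${ROBOTS_TXT_WARNING_TIMEOUT_MS}ms`);
    }, ROBOTS_TXT_WARNING_TIMEOUT_MS);

    try {
        return await RobotsFile.find(url);
    } finally {
        clearTimeout(timer);
    }
}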

@metalwarrior665 (Member) commented Apr 3, 2025

We could group these by domain (so we don't hit the same file multiple times) and then do this in concurrent batches. Maybe overkill for now, but I would at least add a comment with this idea, because in some cases, like 5000 random domains in startUrls, this might stall for a few minutes (not sure how many URLs addRequests handles in the "sync part").
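
A minimal sketch of the grouping idea (illustrative only, not part of this PR): bucket the incoming URLs by origin so each robots.txt is fetched at most once, and the buckets can then be checked concurrently.

function groupUrlsByOrigin(urls: string[]): Map<string, string[]> {
    const groups = new Map<string, string[]>();

    for (const url of urls) {
        const { origin } = new URL(url);
        const bucket = groups.get(origin) ?? [];
        bucket.push(url);
        groups.set(origin, bucket);
    }

    return groups;
}

Each bucket then needs only a single robots.txt lookup, and the buckets can be processed with whatever concurrency limit makes sense.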

Contributor commented:

group these by domain (so we don't hit the same file multiple times) and then do this in concurrent batches

Alternatively, we can just put them all in the RQ and only skip them once we actually get to them. It might put a bigger load on the RQ, but it will allow us to interleave (possibly) good requests with blocked ones, minimizing the "time to first scraped result" (not having to wait for possibly unrelated robots.txt files to load).

Agreed with overkill for now, agreed with adding a comment with some of those ideas :)

@janbuchar (Contributor) left a comment

Looks pretty good to me, nice one! Please address my comments before merging.

@B4nan (Member, Author) commented Apr 3, 2025

E2E tests with the latest changes are running here.

@B4nan merged commit 0eabed1 into master Apr 4, 2025
12 checks passed
@B4nan deleted the robots-file-check branch April 4, 2025 07:47
B4nan added a commit that referenced this pull request Apr 4, 2025
The old name is still supported as an alias, will be removed in v4.

Related #2910
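
For context, this is roughly what such a deprecated alias tends to look like on an options interface (a hypothetical sketch using the two option names from this thread; the actual crawlee typings may differ):

interface CrawlerOptionsSketch {
    /** Skip requests that are disallowed by the site's robots.txt file. */
    respectRobotsTxtFile?: boolean;

    /** @deprecated Old name kept as an alias; use `respectRobotsTxtFile` instead. To be removed in v4. */
    respectRobotsFile?: boolean;
}
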
@B4nan (Member, Author) commented Apr 4, 2025

Renaming of the RobotsFile class is here: #2913
