feat: add respectRobotsTxtFile crawler option #2910
Conversation
Just a small feature request, otherwise lgtm
        return robotsFile.isAllowed(request.url);
    });
}
If it is simple, I would also support this filter in crawler.addRequests. I know it is a small wrapper around requestQueue.addRequests, but since it is on the crawler object, users will expect it to respect robots.txt. It would drop those requests later when fetching anyway, but polluting and draining the queue is bad for performance.
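(Purely to illustrate the idea, a sketch of such a filter. RobotsFile.find() and isAllowed() are the calls this PR already uses; the function name and the import path are assumptions, not the PR code.)

import { RobotsFile } from '@crawlee/utils';

// Sketch: keep only the requests allowed by robots.txt before handing them
// over to requestQueue.addRequests().
async function filterByRobotsTxt<T extends string | { url: string }>(requests: T[]): Promise<T[]> {
    const allowed: T[] = [];

    for (const request of requests) {
        const url = typeof request === 'string' ? request : request.url;
        const robotsFile = await RobotsFile.find(url);

        if (robotsFile.isAllowed(url)) {
            allowed.push(request);
        }
    }

    return allowed;
}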
Right, I was thinking about that one as well, it should be simple, will do.
Implemented via e230191.
Btw, this is just a perf optimization; technically it was already working this way, since we check if the request is valid inside _runTaskFunction. Now we also skip the disallowed ones from being added to the queue via crawler.addRequests, like we do with enqueueLinks.
like we do with enqueueLinks

Note that for enqueueLinks, we'll only check against the current sitemap (possibly enqueueing forbidden different-domain links), but with addRequests, we'll check the robots.txt files for all of the links separately (possibly downloading many robots.txt files).
It kinda makes sense to me (and as you're saying, it's just a matter of performance), just making sure we all understand this right.
(nvm, the performance difference is just RQ utilization, the requests to robots.txt files will be made either way)
Note that for enqueueLinks, we'll only check against the current sitemap

We check the URLs based on the robots.txt for the originating request (I guess the "sitemap" is a typo? We don't fetch/check sitemaps here). If there is a link that goes outside of the domain, it will be enqueued as usual (if allowed by the enqueue strategy) and checked again when processing. With addRequests, we don't know where they came from, so we need to check them one by one. We have a cache for this, so if they are all from the same domain, we only fetch the robots file once.
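(The diff excerpts below only show the tail end of that logic, so here is a self-contained sketch of the per-origin caching idea. RobotsFile.find() and isAllowed() come from the PR; the plain Map cache, the function names, and the import path are illustrative, and the PR itself uses an LRU cache capped at 1000 origins.)

import { RobotsFile } from '@crawlee/utils';

// Illustrative per-origin cache; the PR uses an LRU cache capped at 1000 origins.
const robotsFileCache = new Map<string, RobotsFile>();

// Hypothetical helper: fetch robots.txt at most once per origin.
async function getRobotsFileForUrl(url: string): Promise<RobotsFile | undefined> {
    const { origin } = new URL(url);
    const cached = robotsFileCache.get(origin);
    if (cached) return cached;

    try {
        const robotsFile = await RobotsFile.find(url); // fetched without a proxy
        robotsFileCache.set(origin, robotsFile);
        return robotsFile;
    } catch {
        console.warn(`Failed to fetch robots.txt for request ${url}`);
        return undefined;
    }
}

// Hypothetical helper: if robots.txt cannot be fetched, let the request through.
async function isAllowedByRobotsTxt(url: string): Promise<boolean> {
    const robotsFile = await getRobotsFileForUrl(url);
    return robotsFile?.isAllowed(url) ?? true;
}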
lgtm, only a few questions / nits
        const robotsFile = await RobotsFile.find(url);
        this.robotsFileCache.add(origin, robotsFile);

        return robotsFile;
    } catch (e: any) {
        this.log.warning(`Failed to fetch robots.txt for request ${url}`);
        return undefined;
    }
If the server returns a non-404 error response for /robots.txt (RobotsFile.find() throws an error), we'll keep trying to download it for every single request from that origin (possibly never succeeding).
I'm fine with that (respectRobotsFile is opt-in, the user has to want it), just making sure this is the expected behavior.
Hmm, I am not really sure if this is something to fix; it feels correct to check again in case of failure. Maybe we could skip all 4xx codes, but for 5xx it feels wrong to stop trying. One way or the other, this feels almost unreachable: if the site cannot serve a proper response for a static file like robots.txt, it's quite likely no requests from it will work anyway.
I'd vote to keep this as is and deal with issues when someone reports them, so we know what edge case we are trying to solve.
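(For reference, one possible direction if this ever does need fixing; purely hypothetical and not part of this PR. It remembers client-error origins while still retrying server errors. The statusCode on the thrown error is an assumption about the error shape.)

import { RobotsFile } from '@crawlee/utils';

// Hypothetical variant: do not retry origins that answered with a 4xx,
// but keep retrying on 5xx and network errors.
const failedOrigins = new Set<string>();

async function findRobotsFileOrGiveUp(url: string): Promise<RobotsFile | undefined> {
    const { origin } = new URL(url);
    if (failedOrigins.has(origin)) return undefined;

    try {
        return await RobotsFile.find(url);
    } catch (e: any) {
        // Assumption: the HTTP error exposes the response status code.
        const status = e?.response?.statusCode ?? e?.statusCode;
        if (status >= 400 && status < 500) failedOrigins.add(origin);
        return undefined;
    }
}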
Cool, well done! Please can we call the option respectRobotsTxtFile?
I can rename the options added as part of this PR, sure. I'd leave the renaming of …
(The PR title was changed from "respectRobotsFile crawler option" to "respectRobotsTxtFile crawler option".)
for (const request of requests) {
    const url = typeof request === 'string' ? request : request.url!;

    if (await this.isAllowedBasedOnRobotsTxtFile(url)) {
        allowedRequests.push(request);
    }
}
Random note: addRequests() is called with the first argument of crawler.run(). If we're planning to make this the default in 4.0, it would mean that the crawler spends the first few seconds loading the robots.txt files and only then starts processing the actual URLs.
Probably no biggie for the majority of users (so IMO no need for action now), but for people crawling loads of URLs from different domains this might cause a surprise (there is no logging aside from debug level, so the crawler might seem "stuck").
I hope it won't be seconds, since the request is fired without a proxy and the file will/should be really small. But we can always adjust, I am sure we'll learn about all those edge cases once we start adopting this in WCC.
Maybe we could print a warning if the file is not fetched within a second or something similar.
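(A minimal sketch of such a warning, not part of the PR; the one-second threshold and the wrapper name are made up.)

import { RobotsFile } from '@crawlee/utils';

// Hypothetical wrapper: log a warning if fetching robots.txt takes longer than
// a second, without cancelling the fetch itself.
async function findRobotsFileWithSlowWarning(url: string): Promise<RobotsFile> {
    const timer = setTimeout(() => {
        console.warn(`Still waiting for the robots.txt of ${new URL(url).origin}, the crawler is not stuck.`);
    }, 1_000);

    try {
        return await RobotsFile.find(url);
    } finally {
        clearTimeout(timer);
    }
}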
We could group these by domain (so we don't hit the same file multiple times) and then do this in concurrent batches. Maybe overkill for now, but I would at least add a comment with this idea, because in some cases, like 5000 random domains in startUrls, this might stall for a few minutes (I'm not sure how many URLs addRequests processes in the "sync part").
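(A sketch of that idea for reference, not implemented in this PR; the batching helper is hypothetical, and in practice the results would land in the same per-origin cache described above.)

import { RobotsFile } from '@crawlee/utils';

// Hypothetical pre-warming step: deduplicate start URLs by origin and fetch the
// distinct robots.txt files in concurrent batches instead of one by one.
async function prefetchRobotsFiles(urls: string[], concurrency = 10): Promise<void> {
    const origins = [...new Set(urls.map((url) => new URL(url).origin))];

    for (let i = 0; i < origins.length; i += concurrency) {
        const batch = origins.slice(i, i + concurrency);
        // allSettled: a single unreachable robots.txt must not fail the whole batch.
        await Promise.allSettled(batch.map((origin) => RobotsFile.find(origin)));
    }
}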
group these by domain (so we don't hit the same file multiple times) and then do this in concurrent batches

Alternatively, we can just put them all in the RQ and only skip them once we actually get to them. It might put a bigger load on the RQ, but it will allow us to interleave (possibly) good requests with blocked ones, minimizing the "time to first scraped result" (not having to wait for possibly unrelated robots.txt files to load).
Agreed with overkill for now, agreed with adding a comment with some of those ideas :)
Looks pretty good to me, nice one! Please address my comments before merging.
e2e tests with the latest changes are running here.
Renaming of the …
The old name is still supported as an alias, will be removed in v4. Related #2910
This PR implements automatic skipping of requests based on the robots.txt file. It works based on a new boolean flag in the crawler options called respectRobotsTxtFile.
Checks are implemented in BasicCrawler._runTaskFunction(), enqueueLinks (the context helper), and BasicCrawler.addRequests(), so we ensure disallowed requests are neither processed nor added to the queue (the latter only works with the enqueueLinks context helper). The robots.txt files are cached based on the URL origin, so we should fetch them only once per domain (up to 1000 domains, using an LRU cache). We don't use a proxy to fetch the robots.txt file, which should be fine, since the file should be accessible to robots in general.
We might want to enable this by default in the next major.
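(For completeness, a minimal usage sketch of the option described above; the URL is a placeholder and the crawler class is just one of the crawlers that accept these options.)

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Opt-in flag added by this PR; requests disallowed by robots.txt are skipped.
    respectRobotsTxtFile: true,
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        // Disallowed links are filtered out before they reach the queue.
        await enqueueLinks();
    },
});

// URLs passed to run() go through crawler.addRequests(), so they are checked too.
await crawler.run(['https://crawlee.dev']);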