The on_should_crawl_callback is useful, but it fires after the page has already been crawled. The issue with this is that spider potentially crawls too many pages.
It would be great to have something that runs at the same time as on_link_find_callback but filters the links found on an already crawled page.
Something like on_should_crawl_link, which returns true or false and filters out links found on an already crawled page. Or maybe change how on_link_find_callback works so that returning an empty URL (or None, but that means the API changes, and I'm not sure how stable spider's API is) means the link is skipped.
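For illustration, here is a minimal standalone sketch of the filtering behaviour the proposed hook would provide. The on_should_crawl_link name, its signature, and the idea of running it at link-discovery time are the proposal here, not existing spider API:

```rust
// Standalone sketch of the predicate the proposed on_should_crawl_link hook
// would receive for every link discovered on a crawled page. The hook name
// and its timing (alongside on_link_find_callback, before the link is
// queued) are the proposal, not existing spider API.
fn should_crawl_link(url: &str) -> bool {
    // Example policy: only queue documentation pages.
    url.starts_with("https://example.com/docs/")
}

fn main() {
    let found = [
        "https://example.com/docs/getting-started",
        "https://example.com/blog/post-1",
        "https://example.com/docs/api",
    ];
    // spider would apply the predicate at link-discovery time, before
    // queueing, instead of crawling first and filtering afterwards the way
    // on_should_crawl_callback does today.
    let queued: Vec<&&str> = found.iter().filter(|u| should_crawl_link(u)).collect();
    println!("{:?}", queued);
}
```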
I guess whitelisting with a regex could do the job for me, but it seems the whitelist check also happens at the website.configuration.delay rate, which isn't ideal.
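For reference, a rough sketch of that workaround. The with_whitelist_url builder and the regex-style pattern are assumptions on my side; the exact method name and matching behaviour may differ between spider versions:

```rust
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Assumption: a with_whitelist_url builder that restricts which
    // discovered links get crawled (pattern matching may require spider's
    // "regex" feature); names and behaviour may vary by version.
    let mut website = Website::new("https://example.com")
        .with_whitelist_url(Some(vec!["https://example.com/docs/"]))
        .build()
        .unwrap();

    // The drawback noted above: non-matching links still appear to be paced
    // by website.configuration.delay instead of being dropped immediately.
    website.crawl().await;
}
```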