
Filter links from a crawled page #283

Open
rumpl opened this issue Apr 3, 2025 · 1 comment
rumpl commented Apr 3, 2025

The on_should_crawl_callback is useful, but it runs only after the page has already been crawled. The issue with this is that spider potentially crawls too many pages.

It would be great to have something that fires at the same time as on_link_find_callback but filters the links found on an already crawled page.

Something like on_should_crawl_link, which returns true or false and filters out links found on an already crawled page. Alternatively, on_link_find_callback could be changed so that returning an empty URL (or None, though that would change the API, and I'm not sure how stable spider's API is) means the link is skipped.
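To illustrate, the semantics of the proposed on_should_crawl_link could look roughly like the sketch below. The names filter_links and the predicate are hypothetical, not part of spider's API; the point is that rejected links are dropped before they are ever queued for crawling.

```rust
// Sketch of the proposed semantics: a user-supplied predicate decides,
// per discovered link, whether it should be queued for crawling at all.
fn filter_links<F>(found: Vec<String>, should_crawl: F) -> Vec<String>
where
    F: Fn(&str) -> bool,
{
    found
        .into_iter()
        .filter(|link| should_crawl(link)) // drop links before they are queued
        .collect()
}

fn main() {
    let links = vec![
        "https://example.com/docs/intro".to_string(),
        "https://example.com/blog/post".to_string(),
        "https://example.com/docs/api".to_string(),
    ];
    // Example predicate: only keep links under the /docs/ path.
    let kept = filter_links(links, |l| l.contains("/docs/"));
    assert_eq!(kept.len(), 2);
    println!("{:?}", kept);
}
```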


rumpl commented Apr 3, 2025

I guess whitelisting with a regex could do the job for me, but it seems the whitelist check also happens at the website.configuration.delay rate, which isn't ideal.
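The workaround being described amounts to something like the following sketch: applying the whitelist as links are discovered, rather than at the delayed crawl step. Plain substring matching stands in for a regex whitelist here to keep the example dependency-free; whitelist_filter is a hypothetical helper, not spider's API.

```rust
// Sketch: filter discovered links against a whitelist at discovery time,
// so non-matching URLs never reach the (rate-limited) crawl queue.
// Substring patterns stand in for regexes to avoid external dependencies.
fn whitelist_filter(links: Vec<String>, patterns: &[&str]) -> Vec<String> {
    links
        .into_iter()
        .filter(|link| patterns.iter().any(|p| link.contains(*p)))
        .collect()
}

fn main() {
    let links = vec![
        "https://example.com/docs/a".to_string(),
        "https://example.com/login".to_string(),
    ];
    let kept = whitelist_filter(links, &["/docs/"]);
    assert_eq!(kept, vec!["https://example.com/docs/a".to_string()]);
    println!("{:?}", kept);
}
```

Because the filter runs on the in-memory link list, it is not subject to the website.configuration.delay pacing the comment complains about.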
