
Filter links from a crawled page #283

Open
rumpl opened this issue Apr 3, 2025 · 1 comment
rumpl commented Apr 3, 2025

The on_should_crawl_callback is useful, but it runs only after the page has already been crawled. The issue with this is that spider potentially crawls too many pages.

It would be great to have something that fires at the same time as on_link_find_callback but filters the links found on an already crawled page.

Something like on_should_crawl_link, which returns true or false and filters out links found on an already crawled page. Alternatively, on_link_find_callback could be changed so that returning an empty URL (or None, though that would change the API, and I'm not sure how stable spider's API is) means the link is skipped.
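To illustrate, the semantics of the proposed on_should_crawl_link could look roughly like the sketch below. The names filter_links and the predicate are hypothetical, not part of spider's API; the point is that rejected links are dropped before they are ever queued for crawling.

```rust
// Sketch of the proposed semantics: a user-supplied predicate decides,
// per discovered link, whether it should be queued for crawling at all.
fn filter_links<F>(found: Vec<String>, should_crawl: F) -> Vec<String>
where
    F: Fn(&str) -> bool,
{
    found
        .into_iter()
        .filter(|link| should_crawl(link)) // drop links before they are queued
        .collect()
}

fn main() {
    let links = vec![
        "https://example.com/docs/intro".to_string(),
        "https://example.com/blog/post".to_string(),
        "https://example.com/docs/api".to_string(),
    ];
    // Example predicate: only keep links under the /docs/ path.
    let kept = filter_links(links, |l| l.contains("/docs/"));
    assert_eq!(kept.len(), 2);
    println!("{:?}", kept);
}
```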


rumpl commented Apr 3, 2025

I guess whitelisting with a regex could do the job for me, but it seems the whitelist check also happens at the website.configuration.delay rate, which isn't ideal.
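The workaround being described amounts to something like the following sketch: applying the whitelist as links are discovered, rather than at the delayed crawl step. Plain substring matching stands in for a regex whitelist here to keep the example dependency-free; whitelist_filter is a hypothetical helper, not spider's API.

```rust
// Sketch: filter discovered links against a whitelist at discovery time,
// so non-matching URLs never reach the (rate-limited) crawl queue.
// Substring patterns stand in for regexes to avoid external dependencies.
fn whitelist_filter(links: Vec<String>, patterns: &[&str]) -> Vec<String> {
    links
        .into_iter()
        .filter(|link| patterns.iter().any(|p| link.contains(*p)))
        .collect()
}

fn main() {
    let links = vec![
        "https://example.com/docs/a".to_string(),
        "https://example.com/login".to_string(),
    ];
    let kept = whitelist_filter(links, &["/docs/"]);
    assert_eq!(kept, vec!["https://example.com/docs/a".to_string()]);
    println!("{:?}", kept);
}
```

Because the filter runs on the in-memory link list, it is not subject to the website.configuration.delay pacing the comment complains about.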
