community: Add recursive sitemap support to GitbookLoader with concurrent processing #30681

Open
wants to merge 2 commits into
base: master
from

Conversation

andrasfe
Contributor

@andrasfe andrasfe commented Apr 4, 2025

Description:

Enhances GitbookLoader to support nested sitemap structures and asynchronous processing. The loader now processes sitemap index files recursively: it follows each link to a child sitemap and collects the URLs of all content pages. Page loading is also done asynchronously to speed up document loading.
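For readers unfamiliar with nested sitemaps, the recursion described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: `collect_page_urls` and the injected `fetch` coroutine are hypothetical names, and the standard-library `xml.etree.ElementTree` parser is used here so the example is self-contained.

```python
import asyncio
import xml.etree.ElementTree as ET
from typing import Awaitable, Callable, List, Optional, Set

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical fetch signature: takes a URL, returns the XML body as text.
Fetch = Callable[[str], Awaitable[str]]


async def collect_page_urls(
    url: str, fetch: Fetch, seen: Optional[Set[str]] = None
) -> List[str]:
    """Return content-page URLs reachable from a sitemap or sitemap index."""
    seen = set() if seen is None else seen
    if url in seen:
        # Skip sitemaps already visited (guards against cycles and repeats).
        return []
    seen.add(url)

    root = ET.fromstring(await fetch(url))
    locs = [
        el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text
    ]

    if root.tag == f"{SITEMAP_NS}sitemapindex":
        # A sitemap index: each <loc> points at a child sitemap, so recurse.
        # Child sitemaps are fetched concurrently; gather keeps their order.
        nested = await asyncio.gather(
            *(collect_page_urls(child, fetch, seen) for child in locs)
        )
        return [page for group in nested for page in group]

    # A plain <urlset>: each <loc> is a content page.
    return locs
```

The `fetch` argument is left to the caller, so the same traversal can be exercised with an in-memory stub in tests or with a real HTTP client in practice.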

Issue:

Fixes #30629 - GitbookLoader fails to process nested sitemaps

Dependencies:

Added lxml package to test_integration dependencies for proper XML parsing in integration tests.

…ded async processing for speeding document loading


@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) labels Apr 4, 2025
@eyurtsev
Collaborator

eyurtsev commented Apr 4, 2025

Could we speed up web loader instead of making changes in gitbook loader?

@eyurtsev eyurtsev self-assigned this Apr 4, 2025
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Apr 5, 2025
@andrasfe andrasfe force-pushed the community-gitbook-recursive-sitemap branch from f29f11c to 10d6ad5 Compare April 5, 2025 13:40
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Apr 5, 2025
@andrasfe
Contributor Author

andrasfe commented Apr 5, 2025

> Could we speed up web loader instead of making changes in gitbook loader?

You are right: any optimizations should be made in WebBaseLoader. However, that is a more involved task because it has lots of dependencies, so perhaps it could be handled in a separate ticket.

Labels
community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) size:XL This PR changes 500-999 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Bug: GitbookLoader fails to process nested sitemaps
2 participants