Broken links in pdf loading documentation #7494

Open
VyoJ opened this issue Apr 2, 2025 · 0 comments
VyoJ commented Apr 2, 2025

Describe the bug

Hi, just a couple of small issues I ran into while reading the docs for loading pdf data:

  1. The "Create a pdf dataset" link points to https://huggingface.co/docs/datasets/main/en/pdf_dataset instead of https://huggingface.co/docs/datasets/main/en/document_dataset, so it returns a 404 error.

  2. The top of the page says that the pdfplumber package must be installed to work with pdf datasets, but the link to its installation guide points to the pytorch/vision installation instructions instead of pdfplumber's guide.

I love the work on enabling pdf dataset support and these small tweaks would help everyone navigate the docs better. Thanks!

Steps to reproduce the bug

The issue is on the Load Document Data page of the datasets docs.

Expected behavior

  1. For the first issue, I went through the .mdx source of the datasets docs and found that the link points to ./pdf_dataset instead of ./document_dataset.

  2. For the second issue, I went through the .mdx source of the datasets docs and found that the link points to the pytorch/vision installation instructions instead of pdfplumber's guide.

Replacing these two links should fix both bugs.
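
In case it helps catch this kind of thing earlier, here is a rough sketch of how relative links like ./pdf_dataset could be checked against the set of page names that actually exist. The function name, the sample text, and the page set are all made up for illustration; this is not how the datasets docs build works, and the regex only covers simple inline Markdown links of the form [label](./page).

```python
import re

# Matches simple inline Markdown links with a relative target,
# e.g. [Create a pdf dataset](./pdf_dataset) or [Load](./loading#pdf).
# Group 1 is the label, group 2 is the "./page" part without any #anchor.
LINK_RE = re.compile(r"\[([^\]]+)\]\((\./[^)#\s]+)(?:#[^)\s]*)?\)")


def find_broken_relative_links(text, existing_pages):
    """Return (label, target) pairs whose ./target is not a known page name."""
    broken = []
    for label, target in LINK_RE.findall(text):
        page = target[2:]  # strip the leading "./"
        if page not in existing_pages:
            broken.append((label, target))
    return broken


# Hypothetical snippet modelled on this report, not copied from the real source.
sample = "See [Create a pdf dataset](./pdf_dataset) and [Load](./loading#pdf)."
# Hypothetical set of page names that exist in the docs folder.
pages = {"document_dataset", "loading"}

print(find_broken_relative_links(sample, pages))
```

With the sample above, only the ./pdf_dataset link is reported, since ./loading resolves to a known page. External links such as the pdfplumber installation guide would need a separate check, which this sketch does not attempt.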

Environment info

datasets v3.5.0 (main at the time of writing)
