Hugging Face Datasets #143

Merged
merged 17 commits into from
Nov 19, 2024

PaulHax
Collaborator

@PaulHax PaulHax commented Nov 12, 2024

Test it out:
nrtk-explorer --dataset cppe-5 beans rafaelpadilla/coco2017 keremberke/german-traffic-sign-detection mrtoy/mobile-ui-design keremberke/construction-safety-object-detection keremberke/table-extraction

Use the --download CLI arg to cache the Hugging Face dataset locally. Dataset streaming is the default.
nrtk-explorer --dataset cppe-5 --download
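
For context, the two modes line up with the `streaming` argument of `datasets.load_dataset`. The following is a minimal sketch of that mapping, not the code in this PR; `loader_kwargs` and `load_split` are hypothetical helper names introduced only for illustration.

```python
def loader_kwargs(download: bool) -> dict:
    """Translate a --download style flag into load_dataset keyword arguments.

    Streaming is treated as the default, so it is only turned off when a
    local download was requested. (Hypothetical helper, not part of nrtk-explorer.)
    """
    return {"streaming": not download}


def load_split(name: str, split: str, download: bool = False):
    """Load one split of a Hugging Face dataset, streamed or cached locally."""
    # Imported lazily so loader_kwargs stays usable without the package installed.
    from datasets import load_dataset

    return load_dataset(name, split=split, **loader_kwargs(download))
```

With streaming enabled, `load_dataset` returns an iterable dataset that fetches rows on demand; with it disabled, the data is downloaded and cached before use.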

Adds a datasets[vision] dependency.

Object Detection task tagged datasets:
https://huggingface.co/datasets?task_categories=task_categories:object-detection
Not all of them are supported =/

@PaulHax PaulHax marked this pull request as draft November 12, 2024 15:41
@PaulHax PaulHax force-pushed the hug-dataset branch 2 times, most recently from fbca179 to e18d769 Compare November 12, 2024 16:21
@PaulHax PaulHax marked this pull request as ready for review November 12, 2024 16:26
@PaulHax PaulHax requested a review from alesgenova November 13, 2024 14:15
for split_name in info.splits:
    streaming_str = "streaming" if streaming else "download"
    expanded_identifiers.append(
        f"{identifier}@{config_name}@{split_name}@{streaming_str}"
Collaborator Author

This is bananas 🍌 Is there a better way to bring some "config" along with the hugging face dataset?
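
One possible alternative, sketched here under assumptions rather than taken from the PR: keep the four fields in a small dataclass and only serialize to the `@`-joined string at the boundary where a plain identifier is required. `HFDatasetSpec` and its methods are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HFDatasetSpec:
    """Hypothetical container for the fields packed into the identifier string."""

    identifier: str
    config_name: str
    split_name: str
    streaming: bool

    def to_identifier(self) -> str:
        """Serialize to the same '@'-joined layout used in the diff above."""
        streaming_str = "streaming" if self.streaming else "download"
        return f"{self.identifier}@{self.config_name}@{self.split_name}@{streaming_str}"

    @classmethod
    def from_identifier(cls, text: str) -> "HFDatasetSpec":
        """Parse an '@'-joined identifier back into its fields."""
        # Split from the right so an '@' inside the dataset name would not
        # shift the last three fields. Too few fields raises ValueError.
        identifier, config_name, split_name, streaming_str = text.rsplit("@", 3)
        if streaming_str not in ("streaming", "download"):
            raise ValueError(f"unknown mode: {streaming_str!r}")
        return cls(identifier, config_name, split_name, streaming_str == "streaming")
```

Because the dataclass is frozen, instances are hashable and can be used as dictionary keys, while the string form remains available wherever a flat identifier is still needed.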

@PaulHax PaulHax force-pushed the hug-dataset branch 3 times, most recently from f5da6b7 to 684e2ee Compare November 14, 2024 00:48
@PaulHax
Collaborator Author

PaulHax commented Nov 14, 2024

@Erotemic Have you ever tried shoehorning a Hugging Face dataset into a COCO shape? Or any other image dataset format for that matter?
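
As a rough illustration of what that shoehorning can involve, here is a minimal sketch using plain dictionaries. It assumes each row carries `image_id`, `width`, `height`, and an `objects` dict with parallel `bbox` (already in `[x, y, width, height]` form) and `category` lists; that layout is an assumption, other datasets use different keys or box conventions, and `rows_to_coco` is a hypothetical helper, not the conversion used in this PR.

```python
def rows_to_coco(rows, category_names):
    """Build a COCO-style dict from rows with the assumed layout:

    {"image_id": int, "width": int, "height": int,
     "objects": {"bbox": [[x, y, w, h], ...], "category": [int, ...]}}
    """
    coco = {
        "images": [],
        "annotations": [],
        # Category ids are taken to be indices into category_names (assumption).
        "categories": [{"id": i, "name": n} for i, n in enumerate(category_names)],
    }
    for row in rows:
        coco["images"].append(
            {"id": row["image_id"], "width": row["width"], "height": row["height"]}
        )
        objects = row["objects"]
        for bbox, category in zip(objects["bbox"], objects["category"]):
            x, y, w, h = bbox
            coco["annotations"].append(
                {
                    "id": len(coco["annotations"]),  # running annotation id
                    "image_id": row["image_id"],
                    "category_id": category,
                    "bbox": [x, y, w, h],
                    "area": w * h,
                    "iscrowd": 0,
                }
            )
    return coco
```

Datasets that store boxes as corner coordinates or normalized values would need a conversion step before this point.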

@Erotemic
Member

@PaulHax other data formats: yes. It is usually possible because the different image / video annotation formats are almost all doing the same thing in different ways. Some of them make more assumptions than others. I think I've done a reasonable job at giving kwcoco the representation power to handle almost everything.

Here are examples of code in kwcoco to handle and help convert to/from other formats: https://gitlab.kitware.com/computer-vision/kwcoco/-/tree/main/kwcoco/formats?ref_type=heads

Namely, I have code for:

On top of this, I've done fairly specific conversions for VIAME; that code is scattered around here: https://github.com/VIAME/bioharn/tree/main/dev/data_tools

There have also been different non-public formats that can be converted.

Member

@alesgenova alesgenova left a comment

I tried this branch out and it works well!
I don't see anything wrong in the code, but I also don't have experience with kwcoco and hf

@PaulHax PaulHax linked an issue Nov 18, 2024 that may be closed by this pull request
@PaulHax PaulHax merged commit fd8d43d into main Nov 19, 2024
12 checks passed
@PaulHax PaulHax deleted the hug-dataset branch November 19, 2024 15:35
Successfully merging this pull request may close these issues.

Embedding opening too many files?
3 participants