Defining a large dataset with several splits under it #199

Open
MaramHasanain opened this issue Sep 4, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@MaramHasanain
Contributor

Several of our datasets follow a recurring layout: one dataset contains multiple splits that differ by language, subtask, train/dev/test partition, etc., but all share the same file structure.
Our current implementation either assumes one dataset class per split (notably because the metadata specifies a single dataset language), or falls back on an ad-hoc approach: reusing the same dataset class and passing different split file names through different assets.

I think we should find a more unified way to handle such cases, e.g., a parent dataset with subsets under it, where the subsets differ only in their metadata.
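
To make the "subsets differ by metadata only" idea concrete, here is a minimal sketch of what such a metadata table could look like. Every name in it (`SubsetMetadata`, `PARENT_SUBSETS`, `subset_metadata`, and the subset keys) is hypothetical and chosen for illustration; none of it is existing project code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubsetMetadata:
    """Fields that vary between subsets of one parent dataset (hypothetical)."""
    language: str
    subtask: str
    splits: tuple  # e.g. ("train", "dev", "test")


# Hypothetical parent dataset: one shared file structure, several subsets.
PARENT_SUBSETS = {
    "checkworthiness-ar": SubsetMetadata("ar", "checkworthiness", ("train", "dev", "test")),
    "checkworthiness-en": SubsetMetadata("en", "checkworthiness", ("train", "dev", "test")),
}


def subset_metadata(name: str) -> SubsetMetadata:
    """Return the metadata of one subset, failing loudly on unknown names."""
    if name not in PARENT_SUBSETS:
        raise ValueError(f"Unknown subset {name!r}; expected one of {sorted(PARENT_SUBSETS)}")
    return PARENT_SUBSETS[name]
```

With a table like this, the loading code can be written once against the shared file structure, and adding a language or subtask only means adding one entry.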

MaramHasanain added the "enhancement" (New feature or request) label Sep 4, 2023
MaramHasanain changed the title from "Handle large dataset with several splits under each" to "Defining a large dataset with several splits under it" Sep 4, 2023
@fdalvi
Collaborator

fdalvi commented Sep 4, 2023

This is a good suggestion. We can do one of several things here:

  1. Have a single ParentDataset class that takes a dataset arg which "sets" a particular lang/split; the metadata in this case would highlight the multilingual nature of the dataset.
  2. Have a single ParentDataset class as above, plus several child classes that inherit from it, each with only a single line of implementation that calls the parent class constructor with a particular language/task set. An example would be a parent CT22Dataset class and a child CT22CheckworthinessDataset class. The child classes in this case would just be convenience/syntactic sugar, but might be useful imo.
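
A rough sketch of both options, to make the comparison easier. The class names CT22Dataset and CT22CheckworthinessDataset come from the comment above; everything else (the `SUBSETS` table, the subset keys, the `metadata()` method and its return shape, the default language) is an assumption made for illustration and does not reflect the project's actual base class or API.

```python
class CT22Dataset:
    """Option 1: one parent class; the `dataset` argument selects a subset."""

    # Hypothetical subset table; real entries would come from the benchmark.
    SUBSETS = {
        "checkworthiness-ar": {"task": "checkworthiness", "language": "ar"},
        "checkworthiness-en": {"task": "checkworthiness", "language": "en"},
    }

    def __init__(self, dataset: str):
        if dataset not in self.SUBSETS:
            raise ValueError(f"Unknown subset {dataset!r}")
        self.dataset = dataset

    def metadata(self) -> dict:
        # Report every language the parent covers, plus the selected subset.
        return {
            "languages": sorted({s["language"] for s in self.SUBSETS.values()}),
            "selected": dict(self.SUBSETS[self.dataset]),
        }


class CT22CheckworthinessDataset(CT22Dataset):
    """Option 2: convenience child that pins the task (assumed default: "ar")."""

    def __init__(self, language: str = "ar"):
        super().__init__(f"checkworthiness-{language}")
```

Usage would then look like `CT22Dataset("checkworthiness-en")` for option 1 and `CT22CheckworthinessDataset(language="en")` for option 2. Both end up with the same selected subset, so the child class adds no behavior and only saves the caller from spelling out the subset key.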
