Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should dataset modules have a common Dataset class? #225

Closed
rabitt opened this issue Apr 8, 2020 · 3 comments
Closed

Should dataset modules have a common Dataset class? #225

rabitt opened this issue Apr 8, 2020 · 3 comments
Labels
question Further information is requested

Comments

@rabitt
Copy link
Collaborator

rabitt commented Apr 8, 2020

Our current API has the following functions required by every dataset module, and all instances are nearly identical:

  • .download()
  • .validate()
  • .track_ids()
  • .load()
  • .cite()

In #219 @lostanlen experimented with adding a Dataset class which implements these methods. In addition, he added the following:

  • .readme (module documentation)
  • .__getitem__
  • dataset_default_path
  • .index
  • .metadata
  • .choice for random iteration (I <3 this idea)

The original decision to not implement a module-level dataset level object was because of usability. In the current API, to, for example, download a dataset:

from mirdata import orchset
orchset.download()

Whereas, if download lived inside a dataset object, the API would be:

from mirdata import orchset
orchset_dataset = orchset.Dataset()
orchset_dataset.download()

The current solution has the following pros and cons.
Pros:

  • the API is intuitive
  • The code is relatively easy to follow

Cons:

  • We have nearly repeated code in every module
  • the inclusion of the 5 methods is only enforced by test_loaders + code review.
  • There is more code to write (well, copy paste...) when contributing.

Opening this issue to restart the discussion of if a Dataset object makes sense, and if so, what the API would look like. I like the idea of generalizing and simplifying the code, as long as it doesn't come at the cost of usability & ease of contribution.

@rabitt rabitt added the question Further information is requested label Apr 8, 2020
@lostanlen
Copy link
Collaborator

lostanlen commented Apr 8, 2020

Thank you for opening this. For the record, #219 was just a wild attempt at unifying module APIs. I didn't see it as a take-it-or-leave-it situation, but more as a conversation starter.

Regarding the example code you wentioned about download, my dream is that the user wouldn't have to import any submodules manually.
So the end goal is not:

from mirdata import orchset
orchset_dataset = orchset.Dataset()
orchset_dataset.download()

but really

orchset = mirdata.download("ORCHSET")

(we can talk about case sensitivity if you want to allow for Orchset and orchset aliases)

Where mirdata.download would be implemented in some core.py Python file.
Here's the (pseudo-)source code to achieve it:

def find_submodule(name, case_sensitive=True):
    if not case_sensitive:
        name = name.lower()
    for submodule_str in mirdata.__all__:
        module_str = "mirdata." + submodule_str
        submodule_name = importlib.import_module(module_str).name
        if not case_sensitive:
            submodule_name = submodule_name.lower()
        if name == submodule.name:
            return submodule
        
def download(
        name,
        data_home=None, 
        force_overwrite=False,
        cleanup=False,
        download_items=None
        case_sensitive=True):
    submodule = find_submodule(name, case_sensitive=case_sensitive)
    dataset = mirdata.dataset.Dataset(
        submodule, data_home=data_home)
    dataset.download(
        force_overwrite=force_overwrite,
        cleanup=cleanup,
        download_items=download_items)
    return dataset

and voilà: the return value of this download function is a fully initialized dataset. Granted, this dataset is an object, but it works exactly as a mirdata module as of v0.2. You can do orchset.cite(), orchset.track_ids(), orchset.validate(), and so on; without having to write these functions in the orchset.py module itself.

@rabitt
Copy link
Collaborator Author

rabitt commented Apr 9, 2020

The usage example is really useful. Actually we considered this direction at the very beginning of this library - whether mirdata should be "dataset-major" (mirdata.orchset.download) or if it should be "function-major" (mirdata.download('orchset'). We ultimately went with dataset-major because we felt it was more readable. But we can revisit this decision. Note that this would be a major breaking change and not easily backwards compatible.

Could you give usage examples for how you envision the following:

  • Accessing one particular Track object by track_id from a particular dataset
  • Calling e.g. salami.load_beats manually for an arbitrary file path.

@rabitt
Copy link
Collaborator Author

rabitt commented Nov 3, 2020

This is implemented in #296 !

@rabitt rabitt closed this as completed Nov 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants