Should dataset modules have a common Dataset class? #225

rabitt · 2020-04-08T18:51:39Z

Our current API has the following functions required by every dataset module, and all instances are nearly identical:

.download()
.validate()
.track_ids()
.load()
.cite()

In #219 @lostanlen experimented with adding a Dataset class which implements these methods. In addition, he added the following:

.readme (module documentation)
.__getitem__
dataset_default_path
.index
.metadata
.choice for random iteration (I <3 this idea)

The original decision to not implement a module-level dataset level object was because of usability. In the current API, to, for example, download a dataset:

from mirdata import orchset
orchset.download()

Whereas, if download lived inside a dataset object, the API would be:

from mirdata import orchset
orchset_dataset = orchset.Dataset()
orchset_dataset.download()

The current solution has the following pros and cons.
Pros:

the API is intuitive
The code is relatively easy to follow

Cons:

We have nearly repeated code in every module
the inclusion of the 5 methods is only enforced by test_loaders + code review.
There is more code to write (well, copy paste...) when contributing.

Opening this issue to restart the discussion of if a Dataset object makes sense, and if so, what the API would look like. I like the idea of generalizing and simplifying the code, as long as it doesn't come at the cost of usability & ease of contribution.

The text was updated successfully, but these errors were encountered:

lostanlen · 2020-04-08T19:51:41Z

Thank you for opening this. For the record, #219 was just a wild attempt at unifying module APIs. I didn't see it as a take-it-or-leave-it situation, but more as a conversation starter.

Regarding the example code you wentioned about download, my dream is that the user wouldn't have to import any submodules manually.
So the end goal is not:

from mirdata import orchset
orchset_dataset = orchset.Dataset()
orchset_dataset.download()

but really

orchset = mirdata.download("ORCHSET")

(we can talk about case sensitivity if you want to allow for Orchset and orchset aliases)

Where mirdata.download would be implemented in some core.py Python file.
Here's the (pseudo-)source code to achieve it:

def find_submodule(name, case_sensitive=True):
    if not case_sensitive:
        name = name.lower()
    for submodule_str in mirdata.__all__:
        module_str = "mirdata." + submodule_str
        submodule_name = importlib.import_module(module_str).name
        if not case_sensitive:
            submodule_name = submodule_name.lower()
        if name == submodule.name:
            return submodule
        
def download(
        name,
        data_home=None, 
        force_overwrite=False,
        cleanup=False,
        download_items=None
        case_sensitive=True):
    submodule = find_submodule(name, case_sensitive=case_sensitive)
    dataset = mirdata.dataset.Dataset(
        submodule, data_home=data_home)
    dataset.download(
        force_overwrite=force_overwrite,
        cleanup=cleanup,
        download_items=download_items)
    return dataset

and voilà: the return value of this download function is a fully initialized dataset. Granted, this dataset is an object, but it works exactly as a mirdata module as of v0.2. You can do orchset.cite(), orchset.track_ids(), orchset.validate(), and so on; without having to write these functions in the orchset.py module itself.

rabitt · 2020-04-09T13:22:37Z

The usage example is really useful. Actually we considered this direction at the very beginning of this library - whether mirdata should be "dataset-major" (mirdata.orchset.download) or if it should be "function-major" (mirdata.download('orchset'). We ultimately went with dataset-major because we felt it was more readable. But we can revisit this decision. Note that this would be a major breaking change and not easily backwards compatible.

Could you give usage examples for how you envision the following:

Accessing one particular Track object by track_id from a particular dataset
Calling e.g. salami.load_beats manually for an arbitrary file path.

rabitt · 2020-11-03T22:14:50Z

This is implemented in #296 !

rabitt added the question Further information is requested label Apr 8, 2020

lostanlen mentioned this issue Apr 8, 2020

What information can be abstracted to the track.Track object? #224

Closed

magdalenafuentes mentioned this issue Oct 8, 2020

Restructure index files #277

Closed

This was referenced Oct 19, 2020

Generalize copy-paste dataset functions to utils? #293

Closed

Dataset object #296

Merged

rabitt closed this as completed Nov 3, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Should dataset modules have a common Dataset class? #225

Should dataset modules have a common Dataset class? #225

rabitt commented Apr 8, 2020

lostanlen commented Apr 8, 2020 •

edited

Loading

rabitt commented Apr 9, 2020

rabitt commented Nov 3, 2020

Should dataset modules have a common Dataset class? #225

Should dataset modules have a common Dataset class? #225

Comments

rabitt commented Apr 8, 2020

lostanlen commented Apr 8, 2020 • edited Loading

rabitt commented Apr 9, 2020

rabitt commented Nov 3, 2020

lostanlen commented Apr 8, 2020 •

edited

Loading