mirdata doesn't work cleanly for datasets not on disk #128

rabitt · 2019-09-13T16:10:25Z

For the use case where the data files live on a remote machine and are accessed only when needed (e.g. for training a model), many of mirdata's assumptions fail:

validation expects files to be local
annotation loaders look for files locally
annotation loading functions are hidden from the user
Track annotation attributes are loaded in the background from disk, expecting files to be present

How can we support this (relatively common) use case cleanly? This will be increasingly important for larger datasets. Some initial ideas:

expose annotation loading functions to the user
create functions for validating a single file within a dataset
better handle errors when accessing Track attributes that load files from disk
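
As a rough illustration of the single-file validation idea above, here is a minimal sketch. It assumes each file's expected checksum is an MD5 hex digest; the names `md5_of_file` and `validate_file` are hypothetical and are not part of mirdata's API.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large audio files are not held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_file(path, expected_md5):
    """Check a single file (hypothetical helper).

    Returns "missing" if the file does not exist, "invalid" if its
    checksum differs from `expected_md5`, and "ok" otherwise.
    """
    if not os.path.exists(path):
        return "missing"
    if md5_of_file(path) != expected_md5:
        return "invalid"
    return "ok"
```

A check like this could run right after one file has been fetched, without requiring the rest of the dataset to be on disk.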

cc @faroit

faroit · 2019-09-13T18:45:58Z

Hi there. I haven't really looked into mirdata yet, but I'm considering adding MUSDB18. Does multichannel audio work in mirdata?

How can we support this (relatively common) use case cleanly?

Can you elaborate on this use case a bit more and give some examples? Are you talking about applications where the data is in key/value databases?

expose annotation loading functions to the user

Yes, I would do both, the audio loader and the annotation loader.

rabitt · 2019-09-13T18:55:18Z

does multichannel audio work in mirdata?

Yes, as of #125. We're just using librosa, which now supports multichannel audio.

Can you elaborate on this use case a bit more and give some examples? Are you talking about applications where the data is in key/value databases?

In my case, I'm thinking about the use case where you want to train using a dataset that is stored remotely. One workflow is to download the audio/annotation files as they're needed and call the loading functions manually. What kind of workflow are you thinking of in the database use case?
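
The download-as-needed workflow described above might look roughly like the sketch below. It uses only the standard library; `fetch_if_missing` and `load_remote` are hypothetical names, and `loader` stands in for whichever loading function gets called manually.

```python
import os
import shutil
import urllib.request


def fetch_if_missing(remote_url, local_path):
    """Download `remote_url` to `local_path` unless it is already there."""
    if not os.path.exists(local_path):
        os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
        # Write to a temporary name first so that an interrupted download
        # is never mistaken for a complete file.
        part_path = local_path + ".part"
        with urllib.request.urlopen(remote_url) as response:
            with open(part_path, "wb") as out:
                shutil.copyfileobj(response, out)
        os.replace(part_path, local_path)
    return local_path


def load_remote(remote_url, local_path, loader):
    """Fetch one file on demand, then pass its local path to `loader`."""
    return loader(fetch_if_missing(remote_url, local_path))
```

Because the loader only ever receives a local path, the same loading function can serve both the local and the remote case.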

rabitt · 2019-10-25T19:25:12Z

@magdalenafuentes and I discussed this a bit more, and here are some things we think would help make this usable for remote datasets:

Add load_audio functions for each loader
Expose annotation and audio paths as attributes in Track objects
Expose all load_* functions, and add documentation for which annotation paths they are built to load
Add an example in the documentation of loading local vs. remote data
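
To make the proposals above concrete, here is a minimal sketch of a track object that exposes its paths and delegates to a public load function. Everything in it is an assumption for illustration: the `Track` class, the `load_beats` function, and the one-column CSV annotation format are hypothetical and do not reflect mirdata's actual implementation.

```python
import csv


def load_beats(beats_path):
    """Load beat times in seconds from a one-column CSV file (assumed format)."""
    with open(beats_path, newline="") as handle:
        return [float(row[0]) for row in csv.reader(handle) if row]


class Track:
    """Hypothetical track that exposes its file paths and loads lazily."""

    def __init__(self, track_id, audio_path, beats_path):
        self.track_id = track_id
        # Paths are plain attributes, so a caller can fetch the files
        # from a remote location before asking for their contents.
        self.audio_path = audio_path
        self.beats_path = beats_path
        self._beats = None

    @property
    def beats(self):
        """Beat times, loaded from disk on first access."""
        if self._beats is None:
            try:
                self._beats = load_beats(self.beats_path)
            except FileNotFoundError as err:
                # Say which track and which path failed, and what to do next.
                raise FileNotFoundError(
                    f"Beat annotations for track {self.track_id!r} not found "
                    f"at {self.beats_path}. Download the file first, or call "
                    "load_beats() directly on a path of your choosing."
                ) from err
        return self._beats
```

With local data, `track.beats` works as before; with remote data, the caller can download `track.beats_path` on demand and either access `track.beats` afterwards or call `load_beats` on the downloaded copy.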

rabitt added the question Further information is requested label Sep 13, 2019

rabitt added this to the 0.1.0 First Stable Release milestone Nov 2, 2019

rabitt self-assigned this Nov 2, 2019

rabitt added enhancement New feature or request and removed question Further information is requested labels Nov 2, 2019

rabitt removed this from the 0.1.0 First Stable Release milestone Nov 4, 2019

rabitt mentioned this issue Nov 6, 2019

How should we handle big datasets? #20

Closed

rabitt added this to the 0.2.0 milestone Nov 22, 2019

rabitt mentioned this issue Mar 6, 2020

better support for remote data #188

Merged

6 tasks

rabitt closed this as completed in #188 Mar 13, 2020

lostanlen mentioned this issue Apr 8, 2020

[RFC] Dataset class. Cross-module download, duration, from_jams, load, to_jams, and validate #219

Closed

17 tasks
