mirdata doesn't work cleanly for datasets not on disk #128

Closed
rabitt opened this issue Sep 13, 2019 · 3 comments · Fixed by #188
@rabitt
Collaborator

rabitt commented Sep 13, 2019

For the use case where the data files live on a remote machine and are accessed only when needed, e.g. for training a model, many of mirdata's assumptions fail:

  • validation expects files to be local
  • annotation loaders look for files locally
  • annotation loading functions are hidden from the user
  • Track annotation attributes are loaded in the background from disk, expecting files to be present

How can we support this (relatively common) use case cleanly? This will be increasingly important for larger datasets. Some initial ideas:

  • expose annotation loading functions to the user
  • create functions for validating a single file within a dataset
  • better handle errors when accessing Track attributes that load files from disk
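
As a rough illustration of the second idea, a single-file validator could look something like the sketch below. It assumes each file has a known MD5 checksum to compare against; `md5_of_file` and `validate_file` are hypothetical names chosen for this example, not existing mirdata functions.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fhandle:
        for chunk in iter(lambda: fhandle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_file(path, expected_md5):
    """Check one file only (hypothetical helper, not mirdata's API).

    Returns "missing" if the file is not on disk, "invalid" if its
    checksum differs from `expected_md5`, and "ok" otherwise.
    """
    if not os.path.exists(path):
        return "missing"
    if md5_of_file(path) != expected_md5:
        return "invalid"
    return "ok"
```

A caller that has just downloaded one file could run `validate_file` on it alone instead of validating the whole dataset.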

cc @faroit

@rabitt rabitt added the question Further information is requested label Sep 13, 2019
@faroit

faroit commented Sep 13, 2019

Hi there. I haven't really looked into mirdata yet, but I'm considering adding MUSDB18... does multichannel audio work in mirdata?

> How can we support this (relatively common) use case cleanly?

Can you elaborate on this use case a bit more and give some examples? Are you talking about applications where the data is in key/value databases?

> expose annotation loading functions to the user

Yes, I would do both, the audio loader and the annotation loader.

@rabitt
Collaborator Author

rabitt commented Sep 13, 2019

> does multichannel audio work in mirdata?

Yes, as of #125. We're just using librosa, which now supports multichannel audio.

> can you elaborate on this use case a bit more and give some examples? are you talking about applications where the data is in key/value databases?

In my case, I'm thinking about the use case where you want to train using a dataset which is stored remotely. One workflow is to download the audio/annotation files as they're needed and call the loading functions manually. What kind of workflow are you thinking of in the database use case?
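
The download-as-needed workflow might be sketched roughly as follows. Everything here is an assumption made for illustration: `load_remote_file` is a hypothetical helper, `fetch` stands in for whatever transfer mechanism is used (scp, an object-store client, HTTP), and `loader` stands in for one of the loading functions that would need to be exposed.

```python
import os


def load_remote_file(remote_path, cache_dir, fetch, loader):
    """Fetch one file into a local cache if needed, then load it.

    `fetch(remote_path, local_path)` copies the remote file to disk and
    `loader(local_path)` parses it. Both are supplied by the caller;
    neither is part of mirdata.
    """
    local_path = os.path.join(cache_dir, os.path.basename(remote_path))
    if not os.path.exists(local_path):
        # Only hit the remote machine the first time a file is requested.
        os.makedirs(cache_dir, exist_ok=True)
        fetch(remote_path, local_path)
    return loader(local_path)
```

Passing `fetch` and `loader` in as arguments keeps the sketch independent of any particular storage backend or annotation format.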

@rabitt
Collaborator Author

rabitt commented Oct 25, 2019

@magdalenafuentes and I discussed this a bit more, and here are some things we think would help make this usable for remote datasets:

  1. Add load_audio functions for each loader
  2. Expose annotation and audio paths as attributes in Track objects
  3. Expose all load_* functions, and add documentation for which annotation paths they are built to load
  4. Add an example in the documentation of loading local vs. remote data
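
A minimal sketch of what items 2 and 3 could look like together is shown below. The class layout, the directory names, the `beats` annotation, and the one-value-per-line file format are all assumptions made up for this example; they do not describe mirdata's actual loaders.

```python
import os


def load_beats(beats_path):
    """Hypothetical public loader: one beat time in seconds per line."""
    if not os.path.exists(beats_path):
        # Fail with a message that names the missing path.
        raise IOError("beats_path {} does not exist".format(beats_path))
    with open(beats_path, "r") as fhandle:
        return [float(line) for line in fhandle if line.strip()]


class Track(object):
    """Hypothetical track object that exposes its file paths."""

    def __init__(self, track_id, data_home):
        self.track_id = track_id
        # Paths are plain attributes, so a caller can fetch the files
        # from a remote machine before anything is read from disk.
        self.audio_path = os.path.join(data_home, "audio", track_id + ".wav")
        self.beats_path = os.path.join(data_home, "annotations", track_id + ".txt")

    @property
    def beats(self):
        # Loaded on access; delegates to the public loader above.
        return load_beats(self.beats_path)
```

With this layout, a user working locally can read `track.beats` directly, while a user with remote data can take `track.beats_path`, download that file wherever they like, and call `load_beats` on the downloaded copy.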

@rabitt rabitt added this to the 0.1.0 First Stable Release milestone Nov 2, 2019
@rabitt rabitt self-assigned this Nov 2, 2019
@rabitt rabitt added enhancement New feature or request and removed question Further information is requested labels Nov 2, 2019
@rabitt rabitt removed this from the 0.1.0 First Stable Release milestone Nov 4, 2019
@rabitt rabitt added this to the 0.2.0 milestone Nov 22, 2019