[RFC] Dataset class. Cross-module download, duration, from_jams, load, to_jams, and validate #219

Closed
wants to merge 44 commits

Conversation


@lostanlen lostanlen commented Apr 8, 2020

Reduce module-level code to the bare essentials (see the sketch after this list):

  1. README
  2. official name
  3. BibTeX reference(s)
  4. download remotes
  5. cached properties
  6. metadata parsing (optional)
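
For illustration, here is roughly what a stripped-down loader module could look like under this proposal. The dataset name, URL, and checksum are placeholders, and the exact constant names are assumptions about the eventual API rather than the code in this PR.

"""Example Dataset Loader.

1. The README goes in the module docstring.
"""
from mirdata import download_utils

NAME = "Example Dataset"  # 2. official name

BIBTEX = """@inproceedings{example2020,
    title = {An Example Dataset},
    author = {Someone, Anne},
    year = {2020}
}"""  # 3. BibTeX reference(s)

REMOTES = {  # 4. download remotes
    "annotation": download_utils.RemoteFileMetadata(
        filename="annotations.zip",
        url="https://example.org/annotations.zip",
        checksum="<md5 checksum here>",
        destination_dir=None,
    ),
}

# 5. cached properties (index, metadata) are exposed by the shared Dataset class.


def _load_metadata(data_home):  # 6. metadata parsing (optional)
    # Parse the dataset-wide metadata file (if any) into a dict keyed by track_id.
    ...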

Closes #196, closes #197, closes #210, closes #217
Offers a development path towards closing #81, #153, #176, #184, and #196

New features:

  • Modules no longer have a to_jams implementation: to_jams is shared across modules and implemented in the parent class.
  • metadata is now a cached property of Dataset. I reused the implementation of LargeData so that there is no loss in performance; Dataset construction is still very fast.
  • The BibTeX citation is not only printable via cite(), but also accessible as a string via dataset.bibtex.
  • The Dataset class overloads __getitem__, so it is still possible to load tracks one by one. I figure this is easier for newcomers than learning about the Track constructor. (A sketch of the proposed Dataset class follows this list.)
  • load_index() now turns the index paths into machine-specific absolute paths. As a result, there is no need to store data_home in the Track object anymore.
  • A new function dataset.choice() picks a Track in the Dataset uniformly at random and loads it.
  • dataset.download now has a verbose flag and an opt-in for remotes. I called the kwarg download_items, in accordance with Download() refactor #216, but the name seems a bit long to me. Perhaps we can find a shorter one? I'll defer to @magdalenafuentes for this decision.
  • validate is now possible both at the Track level and at the Dataset level. Ties in with Check metadata consistency when running validator() #213.
  • A new utility function from_jams converts a JAMS Annotation into mirdata namedtuples: BeatData, ChordData, KeyData, SectionData.
  • A "fail-safe" mode allows loading some metadata of a Dataset before anything is downloaded (not even a dataset-wide CSV annotation file). This is done via a static method named parse_track_id, which I am using in Guitarset (see the sketch after this list). It ties in with mirdata doesn't work cleanly for datasets not on disk #128 and resolves Throw warning when annotation path does not exist #196 quite nicely (we can error / warn / pass if metadata is not found).

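A sketch of the fail-safe parse_track_id idea: a static method that recovers whatever metadata is encoded in the track_id string itself, before any file exists on disk. The track_id format and field names below are illustrative assumptions, not Guitarset's actual schema.

class Track(object):
    @staticmethod
    def parse_track_id(track_id):
        # Hypothetical format: "<player>_<style>-<tempo>-<key>_<mode>",
        # e.g. "00_BN1-129-Eb_comp". No file access is needed.
        player, middle, mode = track_id.split("_")
        style, tempo, key = middle.split("-")
        return {
            "player_id": player,
            "style": style,
            "tempo": float(tempo),
            "key": key,
            "mode": mode,
        }
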
Demo:

import mirdata.dataset
import mirdata.guitarset

gset = mirdata.dataset.Dataset(mirdata.guitarset)

# Fail-safe load. This flag will eventually get a different name; I'll let you decide.
gset.load(flag196="pass")

# We haven't downloaded anything yet, but we already have some metadata, obtained by parsing track_ids from the JSON index.
print(gset.choice())

# Now download the annotation.
gset.download(download_items=["annotation"])

# Now we have beats, keys, and chords. This is done with the `from_jams` utility
print(gset.choice())

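For reference, a minimal sketch of what the beat branch of such a from_jams conversion could look like, assuming BeatData is the (beat_times, beat_positions) namedtuple from mirdata.utils. The function name and details are illustrative, not the implementation in this PR.

import numpy as np

from mirdata import utils


def beats_from_jams(annotation):
    # Convert a jams.Annotation with namespace "beat" into a BeatData namedtuple.
    intervals, values = annotation.to_interval_values()
    beat_times = intervals[:, 0]  # beats are zero-duration events; keep onset times
    beat_positions = np.array([v if v is not None else 0 for v in values])
    return utils.BeatData(beat_times, beat_positions)


# Usage (hypothetical):
# jam = jams.load("track.jams")
# beat_data = beats_from_jams(jam.annotations["beat"][0])
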
It's worth noting that these new features come alongside a roughly 4x reduction in line count (although I haven't removed anything from the previous implementation yet, because I want the existing unit tests to keep working).

For the time being, the new class is called track2.Track2.

Covered loaders:

  • beatles
  • gtzan_genre
  • guitarset
  • ikala
  • medley_solos_db
  • medleydb_melody
  • medleydb_pitch
  • orchset
  • rwc_classical
  • rwc_jazz
  • rwc_popular
  • salami
  • tinysol

(I'm not counting DALI because it is broken at the moment.)

Let me know your thoughts!


codecov bot commented Apr 8, 2020

Codecov Report

Merging #219 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #219   +/-   ##
=======================================
  Coverage   79.19%   79.19%           
=======================================
  Files          21       21           
  Lines        2442     2442           
=======================================
  Hits         1934     1934           
  Misses        508      508           

@lostanlen lostanlen changed the title [WIP] Dataset class. Cross-module download, duration, from_jams, load, to_jams, and validate [RFC] Dataset class. Cross-module download, duration, from_jams, load, to_jams, and validate Apr 9, 2020
@lostanlen lostanlen marked this pull request as draft April 9, 2020 16:54
@lostanlen lostanlen closed this Apr 10, 2020
rabitt pushed a commit that referenced this pull request Oct 20, 2020
@rabitt rabitt mentioned this pull request Oct 20, 2020
rabitt added a commit that referenced this pull request Nov 3, 2020
* Dataset object, heavily inspired by the RFC in #219

* update top-level docs, adapt two loaders

* update dataset api

* update all loaders to fit new API

* remove outdated test

* update tests, inherit dataset-specific load functions, docstring hack, better error handling

* remove data_home from Track docstrings

* normalize dataset_dir to match module name, removes need for DATASET_DIR

* update test_full dataset; fix introduced bug in orchset

* fix bug in orchset download method  #309 

* consolidate track.py and dataset.py into core.py

* create datasets submodule

* fix import bug in tests

* hack around git case sensitiveness

* hack back around git case sensitiveness

* hack around git ignore case changes

* hack back around git ignoring case changes

* fix capitalization in tests paths

* port beatport key to 0.3

Co-authored-by: Rachel Bittner <rachelbittner@spotify.com>
nkundiushuti pushed a commit that referenced this pull request Nov 4, 2020
 
Dataset object (#296)
