[RFC] Dataset class. Cross-module download, duration, from_jams, load, to_jams, and validate #219

Closed
wants to merge 44 commits

Conversation


@lostanlen lostanlen commented Apr 8, 2020

Reduce module-level code to the bare essentials (see the sketch after this list):

  1. README
  2. official name
  3. BibTeX reference(s)
  4. download remotes
  5. cached properties
  6. metadata parsing (optional)
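
For illustration, here is roughly what a stripped-down loader module could look like under this proposal. The dataset name, URL, and checksum are placeholders, and the exact constant names are assumptions about the eventual API rather than the code in this PR.

"""Example Dataset Loader.

1. The README goes in the module docstring.
"""
from mirdata import download_utils

NAME = "Example Dataset"  # 2. official name

BIBTEX = """@inproceedings{example2020,
    title = {An Example Dataset},
    author = {Someone, Anne},
    year = {2020}
}"""  # 3. BibTeX reference(s)

REMOTES = {  # 4. download remotes
    "annotation": download_utils.RemoteFileMetadata(
        filename="annotations.zip",
        url="https://example.org/annotations.zip",
        checksum="<md5 checksum here>",
        destination_dir=None,
    ),
}

# 5. cached properties (index, metadata) are exposed by the shared Dataset class.


def _load_metadata(data_home):  # 6. metadata parsing (optional)
    # Parse the dataset-wide metadata file (if any) into a dict keyed by track_id.
    ...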

Closes #196, closes #197, closes #210, closes #217
Offers a development path towards closing #81, #153, #176, #184, and #196

New features:

  • Modules no longer have a to_jams implementation: to_jams is shared across modules and implemented in the parent class.
  • metadata is now a cached property of Dataset. I reused the implementation of LargeData so that there is no loss in performance; Dataset construction is still very fast.
  • The BibTeX citation is not only printable via cite(), but also accessible as a string via dataset.bibtex.
  • The Dataset class overloads __getitem__, so it is still possible to load tracks one by one. I figure this is easier for newcomers than learning about the Track constructor. (A sketch of the proposed Dataset class follows this list.)
  • load_index() now turns the index paths into machine-specific absolute paths. As a result, there is no need to store data_home in the Track object anymore.
  • A new function dataset.choice() picks a Track in the Dataset uniformly at random and loads it.
  • dataset.download now has a verbose flag and an opt-in for remotes. I called the kwarg download_items, in accordance with Download() refactor #216, but the name seems a bit long to me. Perhaps we can find a shorter one? I'll defer to @magdalenafuentes for this decision.
  • validate is now possible both at the Track level and at the Dataset level. Ties in with Check metadata consistency when running validator() #213.
  • A new utility function from_jams converts a JAMS Annotation into mirdata namedtuples: BeatData, ChordData, KeyData, SectionData.
  • A "fail-safe" mode allows loading some metadata of a Dataset before anything is downloaded (not even a dataset-wide CSV annotation file). This is done via a static method named parse_track_id, which I am using in Guitarset (see the sketch after this list). It ties in with mirdata doesn't work cleanly for datasets not on disk #128 and resolves Throw warning when annotation path does not exist #196 quite nicely (we can error / warn / pass if metadata is not found).

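A sketch of the fail-safe parse_track_id idea: a static method that recovers whatever metadata is encoded in the track_id string itself, before any file exists on disk. The track_id format and field names below are illustrative assumptions, not Guitarset's actual schema.

class Track(object):
    @staticmethod
    def parse_track_id(track_id):
        # Hypothetical format: "<player>_<style>-<tempo>-<key>_<mode>",
        # e.g. "00_BN1-129-Eb_comp". No file access is needed.
        player, middle, mode = track_id.split("_")
        style, tempo, key = middle.split("-")
        return {
            "player_id": player,
            "style": style,
            "tempo": float(tempo),
            "key": key,
            "mode": mode,
        }
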
Demo:

import mirdata.dataset
import mirdata.guitarset

gset = mirdata.dataset.Dataset(mirdata.guitarset)

# Fail-safe load. This flag will eventually get a different name; I'll let you decide.
gset.load(flag196="pass")

# We haven't downloaded anything yet, but we already have some metadata, obtained by parsing track_ids from the JSON index.
print(gset.choice())

# Now download the annotation.
gset.download(download_items=["annotation"])

# Now we have beats, keys, and chords. This is done with the `from_jams` utility
print(gset.choice())

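For reference, a minimal sketch of what the beat branch of such a from_jams conversion could look like, assuming BeatData is the (beat_times, beat_positions) namedtuple from mirdata.utils. The function name and details are illustrative, not the implementation in this PR.

import numpy as np

from mirdata import utils


def beats_from_jams(annotation):
    # Convert a jams.Annotation with namespace "beat" into a BeatData namedtuple.
    intervals, values = annotation.to_interval_values()
    beat_times = intervals[:, 0]  # beats are zero-duration events; keep onset times
    beat_positions = np.array([v if v is not None else 0 for v in values])
    return utils.BeatData(beat_times, beat_positions)


# Usage (hypothetical):
# jam = jams.load("track.jams")
# beat_data = beats_from_jams(jam.annotations["beat"][0])
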
It's worth noting that these new features come alongside a roughly 4x reduction in line count (although I haven't removed anything from the previous implementation yet, because I want the existing unit tests to keep working).

For the time being, the new class is called track2.Track2.

Covered loaders:

  • beatles
  • gtzan_genre
  • guitarset
  • ikala
  • medley_solos_db
  • medleydb_melody
  • medleydb_pitch
  • orchset
  • rwc_classical
  • rwc_jazz
  • rwc_popular
  • salami
  • tinysol

(I'm not counting DALI because it is broken at the moment.)

Let me know your thoughts!


codecov bot commented Apr 8, 2020

Codecov Report

Merging #219 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #219   +/-   ##
=======================================
  Coverage   79.19%   79.19%           
=======================================
  Files          21       21           
  Lines        2442     2442           
=======================================
  Hits         1934     1934           
  Misses        508      508           

@lostanlen lostanlen changed the title [WIP] Dataset class. Cross-module download, duration, from_jams, load, to_jams, and validate [RFC] Dataset class. Cross-module download, duration, from_jams, load, to_jams, and validate Apr 9, 2020
@lostanlen lostanlen marked this pull request as draft April 9, 2020 16:54
@lostanlen lostanlen closed this Apr 10, 2020
rabitt pushed a commit that referenced this pull request Oct 20, 2020
@rabitt rabitt mentioned this pull request Oct 20, 2020
rabitt added a commit that referenced this pull request Nov 3, 2020
* Dataset object, heavily inspired by the RFC in #219

* update top-level docs, adapt two loaders

* update dataset api

* update all loaders to fit new API

* remove outdated test

* update tests, inherit dataset-specific load functions, docstring hack, better error handling

* remove data_home from Track docstrings

* normalize dataset_dir to match module name, removes need for DATASET_DIR

* update test_full dataset; fix introduced bug in orchset

* fix bug in orchset download method  #309 

* consolidate track.py and dataset.py into core.py

* create datasets submodule

* fix import bug in tests

* hack around git case sensitiveness

* hack back around git case sensitiveness

* hack around git ignore case changes

* hack back around git ignoring case changes

* fix capitalization in tests paths

* port beatport key to 0.3

Co-authored-by: Rachel Bittner <rachelbittner@spotify.com>
nkundiushuti pushed a commit that referenced this pull request Nov 4, 2020
 
Dataset object (#296)
