Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add licenses loaders and doc nits #410

Merged
merged 36 commits into from
Jan 11, 2021
Merged

Add licenses loaders and doc nits #410

merged 36 commits into from
Jan 11, 2021

Conversation

magdalenafuentes
Copy link
Collaborator

@magdalenafuentes magdalenafuentes commented Jan 8, 2021

Closes #397, fixes #382, resolves #396, closes #381

  • Add Dataset.license() with disclaimer message
  • Update all loaders with custom licenses
  • Update example Contributing
  • Update tests
  • Fix nits docs
  • Add DOI in README.md and docs
  • Add citing instructions when working with datasets
  • Change table format
  • Add licenses in table for all datasets

@codecov
Copy link

codecov bot commented Jan 8, 2021

Codecov Report

Merging #410 (393231b) into master (359997f) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #410      +/-   ##
==========================================
+ Coverage   97.76%   97.78%   +0.01%     
==========================================
  Files          32       32              
  Lines        3446     3476      +30     
==========================================
+ Hits         3369     3399      +30     
  Misses         77       77              

@magdalenafuentes magdalenafuentes linked an issue Jan 8, 2021 that may be closed by this pull request
Copy link
Collaborator

@rabitt rabitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome ❤️

I made a few tiny comments!

@@ -51,7 +51,7 @@ There are two ways of citing mirdata:

If you are using the library for your work, please cite the version you used as indexed at Zenodo:

DOI
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4355859.svg)](https://doi.org/10.5281/zenodo.4355859)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I click the link it takes me to a page that says "Error: DOI not found"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it says it might not have been activated yet (maybe it's too new!) I'll double check in a little bit

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reserved it but didn't upload anything in Zenodo yet, that's why is showing like that. It says it will be online when we make the release

@@ -166,7 +176,7 @@ def __init__(self, mtrack_id, data_home):
self._data_home = data_home
# these three attributes below must have exactly these names
self.track_ids = [...] # define which track_ids should be part of the multitrack
self.tracks = {t: Track(t, self._data_home) for t in track_ids}
self.tracks = {t: Track(t, self._data_home) for t in self.track_ids}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch 😅

mirdata/core.py Outdated
remotes (dict): data to be downloaded
download_info (str): download instructions or caveats
track (core.Track): an uninstantiated Track object
track_object (mirdata.core.Track or None): an uninstantiated Track object
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

track_object is an argument in init (documented there) but not an attribute of the object (which is hidden _track_object)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^^ i pushed a change that fixes the problem

@@ -190,6 +203,12 @@ def cite(self):
print("========== BibTeX ==========")
print(self.bibtex)

def license(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@@ -383,8 +383,7 @@ def load_extractor(path):

@core.docstring_inherit(core.Dataset)
class Dataset(core.Dataset):
"""The acousticbrainz genre dataset
"""
"""The acousticbrainz genre dataset"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the license info is missing for this dataset

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is! Good catch

Copy link
Collaborator

@PRamoneda PRamoneda Jan 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmmmm This dataset is composed of 4 subdatasets. Three of them are Creative Commons Attribution Non Commercial Share Alike 4.0 International and the other one is non-comercial.

Please include in the justification field your academic affiliation (if you have one) and a brief description of your research topics and why you would like to use this dataset. If you do not include this information we may not approve your request.
The MediaEval AcousticBrainz Genre AllMusic dataset is released for non-commercial scientific research purposes only. Any publication of results based on the data extracts of the AllMusic database must cite AllMusic as the source of the data.
The MediaEval AcousticBrainz Genre AllMusic dataset is offered free of charge for internal non-commercial use only. You may not redistribute, publically communicate or modify it, unless expressly permitted by the Universitat Pompeu Fabra (UPF) or by applicable law. The individual contents of the dataset may be protected by copyright, and a non-transferrable limited license to reproduce and use the same is granted for the indicated purpose only.
The dataset and its contents are made available on an “as is” basis and without warranties of any kind, including without limitation satisfactory quality and conformity, merchantability, fitness for a particular purpose, accuracy or completeness, or absence of errors. Subject to any liability that may not be excluded or limited by law, the UPF is not liable for, and expressly excludes, all liability for loss or damage however and whenever caused to anyone by any use of the MediaEval AcousticBrainz Genre AllMusic dataset or any part of it.

Here the info about this sub-dataset

https://zenodo.org/record/2554044#.X_jiSy3Tqw4

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmmmm.. ok, will try to be more specific. What about giantestps_tempo? I couldn't find the license

Copy link
Collaborator

@genisplaja genisplaja Jan 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked a bit for it and I couldn't find an explicit license as well. Looks like it's Creative Commons Attribution Share Alike 4.0 International like giantsteps_key, as they were compiled and presented together, and data is extracted from the same source.
Here is a link to the ISMIR paper for the two Giantsteps datasets: http://www.cp.jku.at/research/papers/knees_etal_ismir_2015.pdf

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PRamoneda can you take a look at the updated license of acousticbrainz and let me know if you agree?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is perfect for me!

mirdata/datasets/beatles.py Show resolved Hide resolved
@@ -46,6 +46,8 @@
{}
"""

LICENSE_INFO = "http://creativecommons.org/licenses/by-nc-sa/4.0/."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - for the other creative commons licenses we're writing the name of the license in text but this one is a link. Maybe we want them all to have links?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmmm.. I used links for those that only provide links in their websites. I'm a bit afraid of not matching exactly the license if I add links to the other ones. Want me to change this link to the name of the license instead?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't take responsibility for accuracy so it's not horrible if it's inconsistent. I'm fine either way though (to leave as is or to change it)

by the way, there are lists like this with badges - https://gist.github.com/lukas-h/2a5d00690736b4c3a7ba

we'd talked about adding a column to the datasets table with the license badge. is it worth doing here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave it like is then.

Will take a look at the table now

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@genisplaja I see we have a single entry for Saraga. Should we have one for hindustani and another for carnatic?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late reply @magdalenafuentes!! I think we should have two entries because there are some differences between the available audio and annotations. I will push the new rows in this PR if you are OK with it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! Go ahead

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!!

@@ -106,7 +106,7 @@ def test_dataset_attributes(httpserver):
clean_remote_dataset(dataset_name)


def test_cite(httpserver):
def test_cite_license(httpserver):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - test_cite_and_license

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would also be good to add a line to test_dataset_attributes to check that dataset.license is set and not None

mirdata/core.py Outdated
@@ -76,13 +84,15 @@ class Dataset(object):

Attributes:
data_home (str): path where mirdata will look for the dataset
index (dict or None): the dataset's file index
Copy link
Collaborator

@rabitt rabitt Jan 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

index shouldn't be in attributes, since it's not an attribute! It's a parameter and it's documented there already

mirdata/core.py Outdated
bibtex (str or None): dataset citation/s in bibtex format
remotes (dict or None): data to be downloaded
download_info (str or None): download instructions or caveats
license_info (str or None): license of the dataset
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment, since there is no Dataset.license_info attribute, it shouldn't be documented here.

@magdalenafuentes
Copy link
Collaborator Author

@PRamoneda @genisplaja @rabitt let me know what do you think about this format for the table and I’ll complete the rest of the datasets.

a bit of context: adding licenses implies adding long text in the table and formatting it as we we’re doing becomes a nightmare

@@ -0,0 +1,54 @@
.. list-table::
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ this is so much better!!

@rabitt
Copy link
Collaborator

rabitt commented Jan 10, 2021

@magdalenafuentes the new way of writing the table is amazing! It's much easier to add to without breaking the whole table like before! 👍 from me!

@rabitt
Copy link
Collaborator

rabitt commented Jan 10, 2021

one thing, and i dont know exactly why it's happening, but look at the built docs for all the dataset.Dataset objects in the Dataset Loaders documentation. The format is getting messed up. I think @drubinstein had the same issue in his PR and found a workaround.

@@ -383,8 +393,7 @@ def load_extractor(path):

@core.docstring_inherit(core.Dataset)
class Dataset(core.Dataset):
"""The acousticbrainz genre dataset
"""
"""The acousticbrainz genre dataset"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@magdalenafuentes (thanks to @drubinstein for the fix) the change in formatting you made to the docstring -
"""blablabla""""
vs how it was
"""blablabla
"""
is what's causing the formatting problem in the docs build.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one more note from @drubinstein -
black will reformat
"""blablabla
"""
to one line, but it wont reformat:
"""
blablabla
"""
which will build properly too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooops, good to know, so it should be either """blabla""" or
""""blabla
""""?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly.

"""A"""

Will cause the problem

"""A
"""

or

"""
A
"""

are acceptable solutions. However, black will format solution 1 to look like the version that causes the problem therefore solution 2 is recommended.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmm ok, will change things to be that format then

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw thanks @drubinstein for the hint!

@magdalenafuentes
Copy link
Collaborator Author

@genisplaja just a nit: adding multichannel in the second column makes the table quite wide and it doesn't fit in the page anymore (at least for me). Maybe keep it for the dataset info or abbreviate? In any case we should be consistent with other datasets that have multi-channel (either all of them say so or none?)

@genisplaja
Copy link
Collaborator

@genisplaja just a nit: adding multichannel in the second column makes the table quite wide and it doesn't fit in the page anymore (at least for me). Maybe keep it for the dataset info or abbreviate? In any case we should be consistent with other datasets that have multi-channel (either all of them say so or none?)

I see @magdalenafuentes, sorry! In this case, I think that for now, we can remove the multichannel comment here as that's explained in the Saraga Carnatic documentation. We can discuss whether we should add it for all the multi-channel/multi-track datasets when we have more of them. What do you think?

@magdalenafuentes
Copy link
Collaborator Author

@genisplaja just a nit: adding multichannel in the second column makes the table quite wide and it doesn't fit in the page anymore (at least for me). Maybe keep it for the dataset info or abbreviate? In any case we should be consistent with other datasets that have multi-channel (either all of them say so or none?)

I see @magdalenafuentes, sorry! In this case, I think that for now, we can remove the multichannel comment here as that's explained in the Saraga Carnatic documentation. We can discuss whether we should add it for all the multi-channel/multi-track datasets when we have more of them. What do you think?

Perfect! Thanks :)

@rabitt
Copy link
Collaborator

rabitt commented Jan 11, 2021

Merging! great PR :)

@rabitt rabitt merged commit 9e5dd7b into master Jan 11, 2021
@magdalenafuentes magdalenafuentes mentioned this pull request Jan 13, 2021
@rabitt rabitt deleted the license branch February 11, 2021 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants