
Difficult to see what annotations a dataset has in the code, and what to expect from track in some cases #294

Closed
magdalenafuentes opened this issue Oct 19, 2020 · 6 comments
Labels: question (Further information is requested)

Comments

@magdalenafuentes (Collaborator)

I have the feeling we discussed this already, but as we add more datasets and they are so diverse, I'd rather repeat myself.

Right now it is difficult to see what a dataset has in terms of annotations: not only the different types of annotation per class, but especially their format. For instance, some datasets have track.genre as a string, while others return a dictionary with genres and subgenres, where those are lists, and so on.
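
To make the mismatch concrete, here is a purely hypothetical sketch (the values and shapes are made up for illustration, not taken from any specific loader):

    # Hypothetical illustration only: the same conceptual field can come
    # back in different shapes depending on the dataset loader.
    genre_from_dataset_a = "rock"  # one loader: a plain string
    genre_from_dataset_b = {"genres": ["rock"], "sub-genres": ["indie rock"]}  # another: dict of lists

    for genre in (genre_from_dataset_a, genre_from_dataset_b):
        print(type(genre).__name__, genre)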

I think we were leaving this job to the docs, but wouldn't it be nice to have some printing in the code? It might also be quicker.

magdalenafuentes added the question label on Oct 19, 2020
@PRamoneda (Collaborator)

The idea for the future is to load all datasets (that have a specific feature) with one function; then everything should be standardized. A unique genre should just be in a single-item list.

@PRamoneda (Collaborator) commented Oct 22, 2020

On the other hand, I think we should define how to handle the following cases for the same feature:

  • Two sequential annotations (one after the other)

  • Two annotations where one is nested within the other

  • Two annotations that are alternatives (one or the other)

@nkundiushuti (Collaborator)

The attributes of a Track may be printed with vars or dir (see the sketch below). However, I agree that we can organize this a bit better. We could separate the attributes into:

  • hand-annotated attributes
  • automatically annotated attributes
  • metadata
  • other (self.dummy_variables used in the class for whatever purpose)

We can include this in the documentation. In the future we could have self.hand_annotations as a dictionary of annotations, etc., but maybe this is overly complex.
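
A minimal sketch of that inspection, using a dummy class to stand in for a real mirdata Track (the actual attribute names depend on the dataset):

    # DummyTrack stands in for any mirdata Track; real Tracks have
    # dataset-specific attributes, but the builtins work the same way.
    class DummyTrack:
        def __init__(self):
            self.track_id = "example_id"    # metadata
            self.genre = "rock"             # an annotation attribute
            self.audio_path = "/tmp/x.wav"  # a path attribute

    track = DummyTrack()
    print(vars(track))  # the instance's own attributes and their values
    print(dir(track))   # all attribute and method names, including inherited ones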

@rabitt (Collaborator) commented Oct 23, 2020

@magdalenafuentes - what do you mean by printing in the code? So far, you can do dataset.Track? and see the docstring, but do you mean having some explicit logging?

To add a little historical context for @PRamoneda and @nkundiushuti - at the very beginning we discussed fully standardizing the track schema, so that certain attribute keywords would have to follow a standard, e.g. anything called 'sections' would have a particular format. We ended up moving away from this because the datasets are so diverse that it was very hard to anticipate what the one-size-fits-all standard would be, even with the datasets we currently have. We decided to opt for a mostly 'free' Track object that can be adapted to each dataset.

That said, I totally agree that we should go for as much consistency as possible now. The long-term solution is to slowly move towards using jams, which actually solves these problems quite nicely. However, we find the API a bit difficult to learn/use, and we wanted to keep mirdata as accessible as possible, so we haven't used it so far. The plan in #291 is to start standardizing the annotation types a bit in order to set us up for using jams.Annotation objects under the hood (hidden from the user), which will solve a lot of these consistency issues.

@nkundiushuti to add some complication to your proposal, I think even separating annotations that way can get tricky. What if metadata can also be considered a hand-annotated attribute (e.g. tags)? What about semi-automatically annotated attributes? So far in mirdata, every time we made a 'rule', we quickly found an exception! jams actually has a good solution for this problem with the 'annotation metadata' attribute, which we could make use of. Overall, I do think we should move towards jams, but in the most user-friendly way possible :)
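
As a rough sketch of what using that could look like (standard jams API, assuming jams is installed; this is only an illustration, not mirdata code):

    import jams

    # An annotation records how it was produced in its own metadata, which
    # sidesteps the manual / automatic / semi-automatic attribute split.
    ann = jams.Annotation(namespace="tag_open")
    ann.append(time=0.0, duration=30.0, value="rock", confidence=None)
    ann.annotation_metadata = jams.AnnotationMetadata(
        curator=jams.Curator(name="Jane Doe", email="jane@example.org"),
        data_source="human annotation",  # could also be e.g. "algorithmic"
        annotation_tools="manual labelling",
    )
    print(ann.annotation_metadata.data_source)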

@PRamoneda wrote:

The idea for the future is to load all datasets (that have a specific feature) with one function; then everything should be standardized. A unique genre should just be in a single-item list.

The plan for multi-dataset loaders does not necessarily need to rely on Track attribute standardization. Our idea was to start semi-manually, making intentional choices about which tasks/datasets make sense to combine for one task. We wanted to start a bit manual to avoid some common mistakes, e.g. "merge all datasets with a .genre attribute", which may merge datasets with different genre definitions or taxonomies and not be very useful for training/evaluation. I'll make a PR proposal soon to see what a common multi-dataset loader can look like, and we can discuss what we all think! I'm opening a separate issue about a multi-dataset loader now so we can discuss, and I'll copy-paste the relevant part of this discussion!

p.s. jams will help with the cross-taxonomy problem in the future too!

@nkundiushuti (Collaborator)

Yeah, you convinced me! :) There are plenty of exceptions out there. jams seems like a good compromise for standardisation. In fact, we found some exceptions of our own when trying to use jams! :)

@rabitt (Collaborator) commented Oct 26, 2020

@magdalenafuentes opened #300, so closing this discussion for now!

rabitt closed this as completed on Oct 26, 2020