
Difficult to see what annotations a dataset has in the code, and what to expect from track in some cases #294

Closed
magdalenafuentes opened this issue Oct 19, 2020 · 6 comments
Labels: question (Further information is requested)

Comments

@magdalenafuentes (Collaborator)

I have the feeling we discussed this already, but as we add more datasets and they are so diverse, I'd rather repeat myself.

Right now it is difficult to see what a dataset has in terms of annotations: not only the different types of annotation per class, but especially their format. For instance, some datasets have track.genre as a string, while others return a dictionary with genres and subgenres, where those are lists, and so on.
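
To make the mismatch concrete, here is a purely hypothetical sketch (the values and shapes are made up for illustration, not taken from any specific loader):

    # Hypothetical illustration only: the same conceptual field can come
    # back in different shapes depending on the dataset loader.
    genre_from_dataset_a = "rock"  # one loader: a plain string
    genre_from_dataset_b = {"genres": ["rock"], "sub-genres": ["indie rock"]}  # another: dict of lists

    for genre in (genre_from_dataset_a, genre_from_dataset_b):
        print(type(genre).__name__, genre)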

I think we were leaving this job to the docs, but wouldn't it be nice to have some printing in the code? It might also be quicker.

magdalenafuentes added the question label on Oct 19, 2020
@PRamoneda (Collaborator)

The idea for the future is to load all datasets (that have a specific feature) with one function; then everything should be standardized. A unique genre should just be in a single-item list.

@PRamoneda (Collaborator) commented Oct 22, 2020

On the other hand, I think we should define how to handle the following cases for the same feature:

  • Two sequential annotations (one after the other)

  • Two annotations where one is nested within the other

  • Two annotations that are alternatives (one or the other)

@nkundiushuti (Collaborator)

The attributes of a Track may be printed with vars or dir (see the sketch below). However, I agree that we can organize this a bit better. We could separate the attributes into:

  • hand-annotated attributes
  • automatically annotated attributes
  • metadata
  • other (self.dummy_variables used in the class for whatever purpose)

We can include this in the documentation. In the future we could have self.hand_annotations as a dictionary of annotations, etc., but maybe this is overly complex.
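
A minimal sketch of that inspection, using a dummy class to stand in for a real mirdata Track (the actual attribute names depend on the dataset):

    # DummyTrack stands in for any mirdata Track; real Tracks have
    # dataset-specific attributes, but the builtins work the same way.
    class DummyTrack:
        def __init__(self):
            self.track_id = "example_id"    # metadata
            self.genre = "rock"             # an annotation attribute
            self.audio_path = "/tmp/x.wav"  # a path attribute

    track = DummyTrack()
    print(vars(track))  # the instance's own attributes and their values
    print(dir(track))   # all attribute and method names, including inherited ones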

@rabitt (Collaborator) commented Oct 23, 2020

@magdalenafuentes - what do you mean by printing in the code? So far, you can do dataset.Track? and see the docstring, but do you mean having some explicit logging?

To add a little historical context for @PRamoneda and @nkundiushuti - at the very beginning we discussed fully standardizing the track schema, so that certain attribute keywords would have to follow a standard, e.g. anything called 'sections' would have a particular format. We ended up moving away from this because the datasets are so diverse that it was very hard to anticipate what the one-size-fits-all standard would be, even with the datasets we currently have. We decided to opt for a mostly 'free' Track object that can be adapted to each dataset.

That said, I totally agree that we should go for as much consistency as possible now. The long-term solution is to slowly move towards using jams, which actually solves these problems quite nicely. However, we find the API a bit difficult to learn/use, and we wanted to keep mirdata as accessible as possible, so we haven't used it so far. The plan in #291 is to start standardizing the annotation types a bit in order to set us up for using jams.Annotation objects under the hood (hidden from the user), which will solve a lot of these consistency issues.

@nkundiushuti to add some complication to your proposal, I think even separating annotations that way can get tricky. What if metadata can also be considered a hand-annotated attribute (e.g. tags)? What about semi-automatically annotated attributes? So far in mirdata, every time we made a 'rule', we quickly found an exception! jams actually has a good solution for this problem with the 'annotation metadata' attribute, which we could make use of. Overall, I do think we should move towards jams, but in the most user-friendly way possible :)
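
As a rough sketch of what using that could look like (standard jams API, assuming jams is installed; this is only an illustration, not mirdata code):

    import jams

    # An annotation records how it was produced in its own metadata, which
    # sidesteps the manual / automatic / semi-automatic attribute split.
    ann = jams.Annotation(namespace="tag_open")
    ann.append(time=0.0, duration=30.0, value="rock", confidence=None)
    ann.annotation_metadata = jams.AnnotationMetadata(
        curator=jams.Curator(name="Jane Doe", email="jane@example.org"),
        data_source="human annotation",  # could also be e.g. "algorithmic"
        annotation_tools="manual labelling",
    )
    print(ann.annotation_metadata.data_source)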

@PRamoneda wrote:

The idea for the future is to load all datasets (that have a specific feature) with one function; then everything should be standardized. A unique genre should just be in a single-item list.

The plan for multi-dataset loaders does not necessarily need to rely on Track attribute standardization. Our idea was to start semi-manually, making intentional choices about which tasks/datasets make sense to combine for one task. We wanted to start a bit manual to avoid some common mistakes, e.g. "merge all datasets with a .genre attribute", which may merge datasets with different genre definitions or taxonomies and not be very useful for training/evaluation. I'll make a PR proposal soon to see what a common multi-dataset loader can look like, and we can discuss what we all think! I'm opening a separate issue about a multi-dataset loader now so we can discuss, and I'll copy-paste the relevant part of this discussion!

p.s. jams will help with the cross-taxonomy problem in the future too!

@nkundiushuti (Collaborator)

Yeah, you convinced me! :) There are plenty of exceptions out there. jams seems like a good compromise for standardisation. In fact, we found some exceptions of our own when trying to use jams! :)

@rabitt (Collaborator) commented Oct 26, 2020

@magdalenafuentes opened #300, so closing this discussion for now!

rabitt closed this as completed on Oct 26, 2020