Difficult to see what annotations a dataset has in the code, and what to expect from track in some cases #294
Comments
The idea in the future is to load all datasets that have a specific feature with one function. Then, everything should be standardized. A unique genre should be in a single-item list.
On the other hand, I think we should define how to handle the following cases for the same feature:
> the attributes of a Track may be printed with `vars` or `dir`
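To illustrate the suggestion above, here is a minimal sketch of what `vars` and `dir` show for a Track-like object. The `Track` class and its attributes are hypothetical stand-ins, not taken from any real mirdata loader:

```python
# Hypothetical stand-in for a mirdata Track; the attribute names here are
# illustrative, not from any real loader.
class Track:
    def __init__(self):
        self.track_id = "track_001"
        self.genre = "jazz"
        self.audio_path = "audio/track_001.wav"

track = Track()

# vars() returns the instance's attributes as a dict of name -> value.
print(vars(track))

# dir() also lists methods and inherited names; filter out dunders for readability.
print([name for name in dir(track) if not name.startswith("_")])
```

`vars` only shows instance attributes, so it would miss lazily computed properties; `dir` catches those but drops the values.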
@magdalenafuentes - what do you mean by printing in the code? So far, you can do

To add a little historical context for @PRamoneda and @nkundiushuti - at the very beginning we discussed fully standardizing the track schema, so that certain attribute keywords would have to follow a standard, e.g. anything called `sections` would have a particular format. We ended up moving away from this because the datasets are so diverse that it was very hard to anticipate what the one-size-fits-all standard would be, even with the datasets we currently have. We decided to opt for a mostly 'free' Track object that can be adapted to each dataset. That said, I totally agree that we should go for as much consistency as possible now.

The long-term solution is to slowly move towards using jams, which actually solves these problems quite nicely. However, we find the API a bit difficult to learn/use and we wanted to keep mirdata as accessible as possible, so we haven't used it so far. The plan in #291 is to start standardizing the annotation types a bit in order to set us up for using `jams.Annotation` objects under the hood (hidden from the user), which will solve a lot of these consistency issues.

@nkundiushuti - to add some complication to your proposal, I think even separating annotations that way can get tricky. What if metadata can also be considered a hand-annotated attribute (e.g. tags)? What about semi-automatically annotated attributes? So far in mirdata, every time we made a 'rule', we quickly found an exception! Jams actually has a good solution for this problem with the 'annotation metadata' attribute, which we could make use of. Overall, I do think we should move towards jams, but in the most user-friendly way possible :)
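The annotation-metadata idea mentioned above can be sketched in plain Python. This is a hypothetical illustration of the concept, not the actual jams API; all class and field names here are made up:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of jams-style annotation metadata; the names are
# illustrative and do NOT match the real jams API.
@dataclass
class AnnotationMetadata:
    data_source: str = "unknown"   # e.g. "human", "crowdsource", "algorithm"
    annotator: str = ""
    annotation_tools: str = ""

@dataclass
class Annotation:
    namespace: str                 # e.g. "tag_open", "beat", "sections"
    data: list = field(default_factory=list)
    metadata: AnnotationMetadata = field(default_factory=AnnotationMetadata)

# Tags can then be stored as an annotation whose metadata records that they
# were hand-labeled, sidestepping the metadata-vs-annotation dichotomy.
tags = Annotation(
    namespace="tag_open",
    data=["rock", "live"],
    metadata=AnnotationMetadata(data_source="human", annotator="curator_1"),
)
print(tags.metadata.data_source)  # human
```

The point is that provenance (hand-annotated, semi-automatic, algorithmic) lives next to the data rather than being encoded in which attribute the data hangs off.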
The plan for multi-dataset loaders does not necessarily need to rely on Track attribute standardization. Our idea was to start semi-manually, making intentional choices about which tasks/datasets make sense to combine for one task. We wanted to start a bit manual to avoid some common mistakes, e.g. "merge all datasets with the

P.S. jams will help with the cross-taxonomy problem in the future too!
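The semi-manual approach described above could look something like a hand-curated registry mapping tasks to explicitly chosen loaders. This is a sketch of the idea only; the function names and the registry are hypothetical, not mirdata's API:

```python
from typing import Callable, Dict, List

# Placeholder loaders; real ones would return dataset objects. The names
# here are hypothetical examples, not actual mirdata loader functions.
def load_dataset_a():
    return "dataset_a"

def load_dataset_b():
    return "dataset_b"

# Hand-curated: a human decides which datasets make sense for each task,
# instead of auto-merging everything that happens to share an attribute.
TASK_REGISTRY: Dict[str, List[Callable]] = {
    "segmentation": [load_dataset_a, load_dataset_b],
}

def load_datasets_for_task(task: str) -> list:
    """Load every dataset explicitly curated for a given task."""
    return [loader() for loader in TASK_REGISTRY[task]]

print(load_datasets_for_task("segmentation"))  # ['dataset_a', 'dataset_b']
```

An explicit registry trades convenience for safety: nothing gets combined unless someone deliberately put it there.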
Yeah, you convinced me! :) There are plenty of exceptions out there. jams seems a good compromise for standardisation. In fact, we found some exceptions of our own when trying to use jams! :)
@magdalenafuentes opened #300, so closing this discussion for now!
I have the feeling we discussed this already, but since we keep adding datasets and they are so diverse, I'd rather repeat myself.
It is currently difficult to see what a dataset has in terms of annotations - not only the different types of annotations per class, but especially their format. For instance, some datasets have `track.genre` as a string, while others return a dictionary with genres and subgenres, and those are lists, and so on. I think we were leaving this job to the docs, but wouldn't it be nice (and maybe quicker) to have some nice printing in the code?
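The `track.genre` inconsistency described above could be smoothed over with a small normalization helper. This is a hypothetical sketch assuming the two shapes mentioned in the issue (a bare string, or a dict of genre/subgenre lists); it is not part of mirdata:

```python
# Hypothetical helper: flatten the genre attribute to a list of strings,
# whichever shape a given dataset exposes. The dict keys below are
# illustrative assumptions, not a documented schema.
def normalize_genres(genre):
    """Return genres as a flat list of strings."""
    if genre is None:
        return []
    if isinstance(genre, str):
        return [genre]
    if isinstance(genre, dict):
        # e.g. {"genres": ["electronic"], "sub_genres": ["house"]}
        flat = []
        for values in genre.values():
            flat.extend(values if isinstance(values, list) else [values])
        return flat
    return list(genre)

print(normalize_genres("jazz"))  # ['jazz']
print(normalize_genres({"genres": ["electronic"], "sub_genres": ["house"]}))
```

A helper like this only papers over the divergence at read time, of course; the standardization discussed in #291 would fix it at the source.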