
Search datasets by annotation type #176

Open
rabitt opened this issue Mar 5, 2020 · 6 comments
Labels
enhancement New feature or request

Comments

@rabitt
Collaborator

rabitt commented Mar 5, 2020

A feature for the future - add to the docs a way to search for all datasets containing a specific annotation type (e.g. BeatData) as a way to find relevant datasets.

@magdalenafuentes magdalenafuentes added the enhancement New feature or request label Mar 5, 2020
@lostanlen
Collaborator

I started looking into this today, and ... with the current way that mirdata is organized, it's not possible.

At the moment, each mirdata module automatically loads its index during import. If searching for loaders requires importing the modules, then our search function is going to be appallingly slow. Furthermore, when #153 lands, the search function will require downloading loaders from raw.githubusercontent.com, and thus will fail unless the user is connected to the Internet. Clearly, this is not scalable.
There are two ways around it:
(1) mirdata modules no longer load index during import
(2) the search function does not import the module, but just processes them as text files

I personally would be in favor of (1).
In general, running code at import time is frowned upon: https://www.benkuhn.net/importtime/
Also, we can make this change backwards-compatible, because every mirdata module implements a load function. Under the hood, this function just reads from a global variable (utils.LargeData). But we could perhaps construct this variable during load / validate.

The second goal will be to filter the list by return type, e.g. BeatData. Python is a dynamically typed language, so it is not possible to know for sure what the return type of every function is at compile time. A workaround is to check whether the Track object in the module exposes a specific property. The problem is that properties are not unified across modules, so it might be good to audit that (keeping in mind that any unification might break backwards compatibility).

I can also see why we wouldn't want to unify too much. In that case, one thing that we can do is to match properties by string representation.
mirdata.search("melody*") would return

  • orchset.melody
  • medleydb_melody.melody1
  • medleydb_melody.melody2
  • medleydb_melody.melody3

Would that be a good design pattern?
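The glob-style matching proposed above can be sketched with the standard library's fnmatch module. The property registry below is hypothetical (in mirdata it would be discovered from each module's Track class); only the matching logic is the point.

```python
import fnmatch

# Hypothetical mapping from module name to the Track properties it exposes;
# in mirdata this would be discovered from each module's Track class.
_PROPERTIES = {
    "orchset": ["melody", "audio_mono"],
    "medleydb_melody": ["melody1", "melody2", "melody3", "audio"],
    "beatles": ["beats", "chords", "sections"],
}

def search(pattern):
    """Return 'module.property' strings whose property matches the glob."""
    matches = []
    for module, props in _PROPERTIES.items():
        for prop in props:
            if fnmatch.fnmatch(prop, pattern):
                matches.append(module + "." + prop)
    return sorted(matches)
```

Under these assumptions, `search("melody*")` would yield the four `melody` entries listed above, and `search("beats")` only `beatles.beats`.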

@rabitt
Collaborator Author

rabitt commented Mar 10, 2020

At the moment, each mirdata module automatically loads its index during import. If searching for loaders requires importing the modules, then our search function is going to be appallingly slow. Furthermore, when #153 lands, the search function will require downloading loaders from raw.githubusercontent.com, and thus will fail unless the user is connected to the Internet. Clearly, this is not scalable.
There are two ways around it:
(1) mirdata modules no longer load index during import

That's totally feasible - we already do this for the "LargeData" object - we could treat the indexes the same way. Independently of this issue, I think we should do this.

(2) the search function does not import the module, but just processes them as text files

If we go this route, maybe it would make sense just to add some "fancy search" functionality to the docs pages? In some ways I'd prefer this - the searching will be done in a one-off fashion, so it doesn't need to be code-callable.

I'll offer an option (3) - in the nicely formatted table of all datasets we'll create (#169 ), we add a filterable column describing the types of data present in the loader, linked to the attribute containing the data. It's a bit of a "manual" solution, but for the use case, I think I prefer it, and it's simple.

@lostanlen
Collaborator

lostanlen commented Mar 10, 2020

That's totally feasible - we already do this for the "LargeData" object - we could treat the indexes the same way. Independently of this issue, I think we should do this.

Noted. That in itself would warrant its own PR.

If we go this route, maybe it would make sense just to add some "fancy search" functionality to the docs pages? In some ways I'd prefer this - the searching will be done in a one-off fashion, so it doesn't need to be code-callable.

Right. I see both things as complementary, though. For labels like genres or playing techniques, it's important to be able to do text search. But for tasks like beat tracking, it would be really cool if the user could query mirdata directly and download+validate+load all datasets implementing a certain attribute (like BeatData in your example). It would be a huge win for usability.

I'll offer an option (3) - in the nicely formatted table of all datasets we'll create (#169 ), we add a filterable column describing the types of data present in the loader, linked to the attribute containing the data. It's a bit of a "manual" solution, but for the use case, I think I prefer it, and it's simple.

I'm all for manual solutions, as long as we can support them with unit tests to check for potential mismatches between what is reported in the column and what is actually doable with the datasets.
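Such a consistency test could look roughly like the following. Everything here is a hypothetical stand-in: `DOCS_TABLE` mirrors the annotation-type column of the proposed docs table, and `track_attributes()` stands in for inspecting `module.Track.__dict__` in each loader.

```python
# Hypothetical: DOCS_TABLE mirrors the annotation-type column of the docs
# table; track_attributes() stands in for inspecting module.Track.__dict__.
DOCS_TABLE = {
    "beatles": {"beats", "chords", "sections"},
    "orchset": {"melody"},
}

def track_attributes(module_name):
    # stand-in for: {k for k in module.Track.__dict__ if it is an annotation}
    actual = {
        "beatles": {"beats", "chords", "sections"},
        "orchset": {"melody"},
    }
    return actual[module_name]

def table_mismatches():
    """Return the modules whose docs row disagrees with the Track class."""
    return sorted(
        name for name, documented in DOCS_TABLE.items()
        if documented != track_attributes(name)
    )
```

A unit test would then simply assert that `table_mismatches()` is empty, failing whenever the manual table drifts out of sync with the loaders.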

@rabitt
Collaborator Author

rabitt commented Mar 10, 2020

Edit:

(1) mirdata modules no longer load index during import

That's totally feasible - we already do this for the "LargeData" object - we could treat the indexes the same way. Independently of this issue, I think we should do this.

We actually already lazy load all of the indexes for (I think) all of the modules using the LargeData class!

@lostanlen
Collaborator

Brilliant! That's very good. I thought it was just for caching; I understand better now.

Then, do you know what the bottleneck of import time is, if not loading indexes? At the moment, from mirdata import * takes ~3 seconds. That's longer than from librosa import *.

Asking because, right now, a way to implement mirdata.search is to:
(1) loop through mirdata.__all__
(2) import the module
(3) build a list of Track attributes via module.Track.__dict__.keys()
(4) iff the pattern (e.g. key or beats) is in the list, add the module as a match.

but step (2) seems a bit slow right now, and i'm not sure why.
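One way to find out where that time goes is CPython's built-in import profiler (`-X importtime`, Python 3.7+), which writes a per-module timing breakdown to stderr. The sketch below profiles a stdlib import; substituting "import mirdata" for "import json" would reveal the slow submodules (mirdata itself is not assumed to be installed here).

```python
import subprocess
import sys

# Run a child interpreter with -X importtime (Python 3.7+), which writes a
# per-module import-time breakdown to stderr. Substitute the package you
# care about (e.g. mirdata) for json.
result = subprocess.run(
    [sys.executable, "-X", "importtime", "-c", "import json"],
    capture_output=True, text=True,
)
# Each line looks like: "import time:  self_us | cumulative_us | module"
lines = [l for l in result.stderr.splitlines()
         if l.startswith("import time:")]
for line in lines[-5:]:  # the final lines cover the top-level imports
    print(line)
```

The cumulative column makes it easy to spot which submodule dominates the ~3 seconds.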

@lostanlen lostanlen mentioned this issue Apr 5, 2020
@lostanlen
Collaborator

Update: searching datasets by property is now doable, and very fast.

Here's my code:

import importlib

import mirdata

def search(key):
    """Return the names of mirdata submodules whose Track class has `key`."""
    match = []
    for submodule_str in mirdata.__all__:
        module_str = "mirdata." + submodule_str
        submodule = importlib.import_module(module_str)
        if key in submodule.Track.__dict__:
            match.append(submodule_str)
    return match

Demo with %timeit search("beats"):

254 µs ± 110 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
['beatles', 'guitarset', 'rwc_classical', 'rwc_jazz', 'rwc_popular']

A few important things to note:

  • here I am searching by property ("beats"), not type ("BeatData"). Return types are not directly observable from accessing the Track class, because Python is a dynamically typed language. It's also worth noting that there isn't a one-to-one mapping between property name and annotation type name. For example, medleydb_melody returns multiple F0Data properties, because it has various definitions of melody: melody1 and melody2.

  • the function returns a list of strings for module names. It can also be a list of strings for dataset names (by looking up the docstring), or a list of submodules (as Python objects). The former is more print-friendly, the latter is more import-friendly.
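The "list of submodules" variant mentioned above is a small change to the same loop. Since mirdata itself is assumed here, the sketch below is mirdata-agnostic and demonstrated with stdlib module names; in mirdata, `names` would be `mirdata.__all__` and the membership check would be on `Track.__dict__`.

```python
import importlib

def search_modules(names, key):
    """Return imported module objects (not just names) that expose `key`.

    `names` stands in for mirdata.__all__; here it can be any importable
    module names, and the check is on the module's own namespace.
    """
    matches = []
    for name in names:
        module = importlib.import_module(name)
        if key in vars(module):
            matches.append(module)
    return matches
```

Returning the objects themselves lets a caller immediately chain download/validate/load calls on every match, at the cost of being less print-friendly.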

I hope this helps
