
Search datasets by annotation type #176

Open
rabitt opened this issue Mar 5, 2020 · 6 comments
Labels
enhancement New feature or request

Comments

@rabitt
Collaborator

rabitt commented Mar 5, 2020

A feature for the future - add to the docs a way to search for all datasets containing a specific annotation type (e.g. BeatData) as a way to find relevant datasets.

@magdalenafuentes magdalenafuentes added the enhancement New feature or request label Mar 5, 2020
@lostanlen
Collaborator

I started looking into this today, and ... with the current way that mirdata is organized, it's not possible.

At the moment, each mirdata module automatically loads its index during import. If searching for loaders requires importing the modules, then our search function is going to be appallingly slow. Furthermore, when #153 lands, the search function will require downloading loaders from raw.githubusercontent.com, and thus will fail unless the user is connected to the Internet. Clearly, this is not scalable.
There are two ways around it:
(1) mirdata modules no longer load index during import
(2) the search function does not import the module, but just processes them as text files

I personally would be in favor of (1).
In general, running code at import time is frowned upon: https://www.benkuhn.net/importtime/
Also, we can make this change backwards-compatible, because every mirdata module implements a load function. Under the hood, this function just reads from a global variable (utils.LargeData). But we could perhaps construct this variable during load / validate.

The second goal will be to filter the list by return type, e.g. BeatData. Python is a dynamically typed language, so it is not possible to know for sure what the return type of every function is at compile time. A workaround is to check whether the Track object in the module exposes a specific property. The problem is that properties are not unified across modules, so it might be good to audit that (keeping in mind that any unification might break backwards compatibility).

I can also see why we wouldn't want to unify too much. In that case, one thing that we can do is to match properties by string representation.
mirdata.search("melody*") would return

  • orchset.melody
  • medleydb_melody.melody1
  • medleydb_melody.melody2
  • medleydb_melody.melody3

Would that be a good design pattern?
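The glob-style matching proposed above can be sketched with the standard library's fnmatch module. The property registry below is hypothetical (in mirdata it would be discovered from each module's Track class); only the matching logic is the point.

```python
import fnmatch

# Hypothetical mapping from module name to the Track properties it exposes;
# in mirdata this would be discovered from each module's Track class.
_PROPERTIES = {
    "orchset": ["melody", "audio_mono"],
    "medleydb_melody": ["melody1", "melody2", "melody3", "audio"],
    "beatles": ["beats", "chords", "sections"],
}

def search(pattern):
    """Return 'module.property' strings whose property matches the glob."""
    matches = []
    for module, props in _PROPERTIES.items():
        for prop in props:
            if fnmatch.fnmatch(prop, pattern):
                matches.append(module + "." + prop)
    return sorted(matches)
```

Under these assumptions, `search("melody*")` would yield the four `melody` entries listed above, and `search("beats")` only `beatles.beats`.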

@rabitt
Collaborator Author

rabitt commented Mar 10, 2020

At the moment, each mirdata module automatically loads its index during import. If searching for loaders requires importing the modules, then our search function is going to be appallingly slow. Furthermore, when #153 lands, the search function will require downloading loaders from raw.githubusercontent.com, and thus will fail unless the user is connected to the Internet. Clearly, this is not scalable.
There are two ways around it:
(1) mirdata modules no longer load index during import

That's totally feasible - we already do this for the "LargeData" object - we could treat the indexes the same way. Independently of this issue, I think we should do this.

(2) the search function does not import the module, but just processes them as text files

If we go this route, maybe it would make sense just to add some "fancy search" functionality to the docs pages? In some ways I'd prefer this - the searching will be done in a one-off fashion, so it doesn't need to be code-callable.

I'll offer an option (3) - in the nicely formatted table of all datasets we'll create (#169 ), we add a filterable column describing the types of data present in the loader, linked to the attribute containing the data. It's a bit of a "manual" solution, but for the use case, I think I prefer it, and it's simple.

@lostanlen
Collaborator

lostanlen commented Mar 10, 2020

That's totally feasible - we already do this for the "LargeData" object - we could treat the indexes the same way. Independently of this issue, I think we should do this.

Noted. That in itself would warrant its own PR.

If we go this route, maybe it would make sense just to add some "fancy search" functionality to the docs pages? In some ways I'd prefer this - the searching will be done in a one-off fashion, so it doesn't need to be code-callable.

Right. I see both things as complementary, though. For labels like genres or playing techniques, it's important to be able to do text search. But for tasks like beat tracking, it would be really cool if the user could query mirdata directly and download+validate+load all datasets implementing a certain attribute (like BeatData in your example). It would be a huge win for usability.

I'll offer an option (3) - in the nicely formatted table of all datasets we'll create (#169 ), we add a filterable column describing the types of data present in the loader, linked to the attribute containing the data. It's a bit of a "manual" solution, but for the use case, I think I prefer it, and it's simple.

I'm all for manual solutions, as long as we can support them with unit tests to check for potential mismatches between what is reported in the column and what is actually doable with the datasets.
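Such a consistency test could look roughly like the following. Everything here is a hypothetical stand-in: `DOCS_TABLE` mirrors the annotation-type column of the proposed docs table, and `track_attributes()` stands in for inspecting `module.Track.__dict__` in each loader.

```python
# Hypothetical: DOCS_TABLE mirrors the annotation-type column of the docs
# table; track_attributes() stands in for inspecting module.Track.__dict__.
DOCS_TABLE = {
    "beatles": {"beats", "chords", "sections"},
    "orchset": {"melody"},
}

def track_attributes(module_name):
    # stand-in for: {k for k in module.Track.__dict__ if it is an annotation}
    actual = {
        "beatles": {"beats", "chords", "sections"},
        "orchset": {"melody"},
    }
    return actual[module_name]

def table_mismatches():
    """Return the modules whose docs row disagrees with the Track class."""
    return sorted(
        name for name, documented in DOCS_TABLE.items()
        if documented != track_attributes(name)
    )
```

A unit test would then simply assert that `table_mismatches()` is empty, failing whenever the manual table drifts out of sync with the loaders.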

@rabitt
Collaborator Author

rabitt commented Mar 10, 2020

Edit:

(1) mirdata modules no longer load index during import

That's totally feasible - we already do this for the "LargeData" object - we could treat the indexes the same way. Independently of this issue, I think we should do this.

We actually already lazy load all of the indexes for (I think) all of the modules using the LargeData class!

@lostanlen
Collaborator

Brilliant! That's very good. I thought it was just for caching; I understand better now.

Then, do you know what the bottleneck of import time is, if not loading indexes? At the moment, from mirdata import * takes ~3 seconds. That's longer than from librosa import *.

Asking because, right now, a way to implement mirdata.search is to:
(1) loop through mirdata.__all__
(2) import the module
(3) build a list of Track attributes via module.Track.__dict__.keys()
(4) iff the pattern (e.g. key or beats) is in the list, add the module as a match.

but step (2) seems a bit slow right now, and i'm not sure why.
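One way to find out where that time goes is CPython's built-in import profiler (`-X importtime`, Python 3.7+), which writes a per-module timing breakdown to stderr. The sketch below profiles a stdlib import; substituting "import mirdata" for "import json" would reveal the slow submodules (mirdata itself is not assumed to be installed here).

```python
import subprocess
import sys

# Run a child interpreter with -X importtime (Python 3.7+), which writes a
# per-module import-time breakdown to stderr. Substitute the package you
# care about (e.g. mirdata) for json.
result = subprocess.run(
    [sys.executable, "-X", "importtime", "-c", "import json"],
    capture_output=True, text=True,
)
# Each line looks like: "import time:  self_us | cumulative_us | module"
lines = [l for l in result.stderr.splitlines()
         if l.startswith("import time:")]
for line in lines[-5:]:  # the final lines cover the top-level imports
    print(line)
```

The cumulative column makes it easy to spot which submodule dominates the ~3 seconds.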

@lostanlen lostanlen mentioned this issue Apr 5, 2020
@lostanlen
Collaborator

Update: searching datasets by property is now doable, and very fast.

Here's my code:

import importlib

import mirdata

def search(key):
    """Return the names of mirdata submodules whose Track class has `key`."""
    match = []
    for submodule_str in mirdata.__all__:
        module_str = "mirdata." + submodule_str
        submodule = importlib.import_module(module_str)
        if key in submodule.Track.__dict__:
            match.append(submodule_str)
    return match

Demo with %timeit search("beats"):

254 µs ± 110 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
['beatles', 'guitarset', 'rwc_classical', 'rwc_jazz', 'rwc_popular']

A few important things to note:

  • here I am searching by property ("beats"), not type ("BeatData"). Return types are not directly observable from accessing the Track class, because Python is a dynamically typed language. It's also worth noting that there isn't a one-to-one mapping between property name and annotation type name. For example, medleydb_melody returns multiple F0Data properties, because it has various definitions of melody: melody1 and melody2.

  • the function returns a list of strings for module names. It can also be a list of strings for dataset names (by looking up the docstring), or a list of submodules (as Python objects). The former is more print-friendly, the latter is more import-friendly.
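The "list of submodules" variant mentioned above is a small change to the same loop. Since mirdata itself is assumed here, the sketch below is mirdata-agnostic and demonstrated with stdlib module names; in mirdata, `names` would be `mirdata.__all__` and the membership check would be on `Track.__dict__`.

```python
import importlib

def search_modules(names, key):
    """Return imported module objects (not just names) that expose `key`.

    `names` stands in for mirdata.__all__; here it can be any importable
    module names, and the check is on the module's own namespace.
    """
    matches = []
    for name in names:
        module = importlib.import_module(name)
        if key in vars(module):
            matches.append(module)
    return matches
```

Returning the objects themselves lets a caller immediately chain download/validate/load calls on every match, at the cost of being less print-friendly.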

I hope this helps
