Search datasets by annotation type #176
Comments
I started looking into this today, and ... with the way mirdata is currently organized, it's not possible. At the moment, each mirdata module automatically loads its index during import. If searching for loaders requires importing the modules, then our search function is going to be appallingly slow. Furthermore, when #153 lands, the search function will require downloading loaders from
I personally would be in favor of (1).
The second goal will be to filter the list by return type: e.g.
I can also see why we wouldn't want to unify too much. In that case, one thing that we can do is to match properties by string representation.
Would that be a good design pattern? |
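As a rough sketch of what matching properties by string representation could look like: the snippet below compares the name of each property's return annotation against a plain string, so the caller never has to import the annotation type. `ToyTrack`, `BeatData` (a stand-in, not the real mirdata class) and `properties_returning` are hypothetical names, and the sketch assumes properties carry return annotations, which mirdata loaders may not do.

```python
import inspect


class BeatData:
    """Stand-in for an annotation type; not the real mirdata class."""


class ToyTrack:
    """Hypothetical track class with two annotated properties."""

    @property
    def beats(self) -> BeatData:
        return BeatData()

    @property
    def title(self) -> str:
        return "untitled"


def properties_returning(track_class, type_name):
    """List property names whose return annotation is named type_name."""
    matches = []
    members = inspect.getmembers(track_class, lambda m: isinstance(m, property))
    for name, prop in members:
        annotation = prop.fget.__annotations__.get("return")
        # Compare by name (a string), not by identity with an imported type.
        if getattr(annotation, "__name__", str(annotation)) == type_name:
            matches.append(name)
    return matches
```

Calling `properties_returning(ToyTrack, "BeatData")` inspects the class only; no track is instantiated and no index is read.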
That's totally feasible - we already do this for the "LargeData" object - we could treat the indexes the same. Independently of this issue, I think we should do this.
If we go this route, maybe it would make sense just to add some "fancy search" functionality to the docs pages? In some ways I'd prefer this - the searching will be done in a one-off fashion, so it doesn't need to be code-callable. I'll offer an option (3) - in the nicely formatted table of all datasets we'll create (#169), we add a filterable column describing the types of data present in the loader, linked to the attribute containing the data. It's a bit of a "manual" solution, but for the use case, I think I prefer it, and it's simple. |
Noted. That in itself would warrant its own PR.
Right. I see both things as complementary though. For labels like genres or playing techniques, it's important to be able to do text search. But for tasks like beat tracking, it would be really cool if the user could query mirdata directly and download+validate+load all datasets implementing a certain attribute (like
I'm all for manual solutions, as long as we can support them with unit tests to check for potential mismatches between what is reported in the column and what is actually doable with the datasets. |
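A unit test of that kind might look roughly like the sketch below. It assumes the docs table can be reduced to a list of attribute names per dataset; `ToyBeatTrack`, `find_mismatches` and the `declared` list are hypothetical and only illustrate the check.

```python
class ToyBeatTrack:
    """Hypothetical track class exposing beats only."""

    @property
    def beats(self):
        return [0.5, 1.0, 1.5]


def find_mismatches(declared_attributes, track_class):
    """Return the declared attribute names that track_class does not define."""
    return sorted(
        name for name in declared_attributes if not hasattr(track_class, name)
    )


def test_docs_table_matches_loader():
    # What the docs table claims for this (hypothetical) dataset.
    declared = ["beats"]
    assert find_mismatches(declared, ToyBeatTrack) == []
```

A table row that advertised an attribute the loader lacks would make the test fail, which is the mismatch we want to catch.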
Edit:
We actually already lazy load all of the indexes for (I think) all of the modules using the LargeData class! |
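For readers unfamiliar with the pattern, lazy loading here means the index file is only read the first time it is accessed, not at import. The following is a minimal sketch of that idea, not the actual LargeData implementation; `LazyIndex` and `load_count` are hypothetical names used for illustration.

```python
import json
from functools import cached_property


class LazyIndex:
    """Hypothetical wrapper: reads a JSON index only on first access."""

    def __init__(self, path):
        self.path = path
        self.load_count = 0  # illustration only: counts reads from disk

    @cached_property
    def data(self):
        # Runs once; cached_property stores the result on the instance.
        self.load_count += 1
        with open(self.path) as fhandle:
            return json.load(fhandle)
```

Creating a `LazyIndex` costs nothing beyond storing the path, so a module can define one at import time without touching the disk.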
brilliant! that's very good. i thought it was just for caching. i understand better now. Then, do you know what the bottleneck of import time is, if not loading indexes? At the moment,
Asking because, right now, a way to implement
but step (2) seems a bit slow, and i'm not sure why. |
Update: searching datasets by property is now doable, and very fast. Here's my code:
Demo with
A few important things to note:
I hope this helps |
A feature for the future - add to the docs a way to search for all datasets containing a specific annotation type (e.g. BeatData) as a way to find relevant datasets.