Provide indexing feature to allow for fast sort, join, and group-by operations #1346
Comments
Sure, feel free to make a PR if you find a way of integrating indexes into the existing code.
This is an important issue for me as well. I would start with the following observations (please correct me if any of them are wrong; the issue is complex):
All join, group-by, and sort functions could then be specialized to take advantage of those types when they are encountered. This approach works in Julia but is out of reach in languages that lack such a rich type system. Its disadvantage is that indexing on multiple columns would be somewhat slower than the theoretical optimum, but I think it would already give us benefits. Any thoughts on that?
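To make the "specialize on the column type" idea concrete, here is a minimal sketch. It is written in plain Python with `functools.singledispatch` standing in for Julia's multiple dispatch, and it is not DataFrames.jl code. `SortedColumn` and `group_positions` are hypothetical names invented for this illustration.

```python
from functools import singledispatch
from itertools import groupby


class SortedColumn(list):
    """Hypothetical marker type: a list whose values are known to be sorted."""


@singledispatch
def group_positions(col):
    # Generic fallback: hash every value and collect row positions per key.
    groups = {}
    for i, value in enumerate(col):
        groups.setdefault(value, []).append(i)
    return groups


@group_positions.register
def _(col: SortedColumn):
    # Fast path: equal keys are adjacent, so one linear scan finds each run
    # without hashing individual rows.
    groups = {}
    start = 0
    for key, run in groupby(col):
        length = sum(1 for _ in run)
        groups[key] = list(range(start, start + length))
        start += length
    return groups


plain = group_positions([2, 1, 2])                 # uses the generic method
indexed = group_positions(SortedColumn([1, 2, 2]))  # uses the sorted fast path
```

The caller uses the same function name in both cases. Only the column's type decides which method runs, which is the property the comment above relies on.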
They are currently slow when benchmarked against R's data.table; see my Discourse post.
Good idea. In IndexedTables.jl, indexing just means sorting the data by those keys, and data.table takes the same approach. I don't see a better way than sorting, especially if the data is in memory.
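As a rough illustration of "the index is just the sort order", here is a minimal sketch in plain Python. It does not show how IndexedTables.jl or data.table implement this, and `build_index` and `rows_for` are hypothetical helper names. The keys are sorted once and the permutation is kept, so later lookups can use binary search.

```python
from bisect import bisect_left, bisect_right


def build_index(keys):
    """Return (sorted_keys, perm), where perm[i] is the original row of sorted_keys[i]."""
    perm = sorted(range(len(keys)), key=keys.__getitem__)
    return [keys[i] for i in perm], perm


def rows_for(index, key):
    """Binary-search the sorted keys and return the matching original row numbers."""
    sorted_keys, perm = index
    lo = bisect_left(sorted_keys, key)
    hi = bisect_right(sorted_keys, key)
    return perm[lo:hi]


keys = ["b", "a", "c", "a", "b"]
index = build_index(keys)
print(rows_for(index, "a"))  # original rows whose key is "a"
print(rows_for(index, "z"))  # empty list: no such key
```

Building the index costs one sort. After that, each key lookup is two binary searches, and the rows of a group sit next to each other in `perm`, which is what makes group-by and merge-style joins cheap.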
Agreed. What I wanted to say is that we have two separate issues here:
Also an issue here is that we could consider having a fast path in
Would this be similar to pandas'
No - adding an index to a The point here is that
I tested IndexedTables' indexing feature, and its group-by performance is 10x that of DataFrames.jl, so it would be good to introduce indexing into DataFrames.jl.