Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide indexing feature to allow for fast sort, join, and group-by operations #1346

Open
xiaodaigh opened this issue Jan 30, 2018 · 7 comments
Labels
non-breaking The proposed change is not breaking

Comments

@xiaodaigh
Copy link
Contributor

I tested IndexedTables' indexing feature and group-by performance is 10x that of DataFrames.jl, so it would be good to introduce indexing into DataFrames.jl

@nalimilan
Copy link
Member

Sure, feel free to make a PR if you find a way of integrating indexes to the existing code.

@bkamins
Copy link
Member

bkamins commented Mar 5, 2018

This is actually an important issue for me also. I would start with the following observations (if some of them are wrong please correct me - the issue is complex):

  • that I would not expect any of the operations you mention be slow on DataFrame if they are done one-time only (other data structures have to perform the same kind of work anyway - but maybe only one time not every time);
  • the only situation when the above would not be true is because DataFrame is not type stable, but e.g. join uses findrows to overcome this problem; I have not checked all methods in detail if they are careful enough to do it but they seem to do;
  • so we are left with the case of running some operation many times on DataFrame (e.g. joins); as DataFrame is a thin wrapper over AbstractVectors we do not have control if those vectors are updated so there is no way to ensure from DataFrame level that its index is not invalidated by the user
  • some solution (maybe not ideal but could solve some problems) is to add a set of specialized concrete subtypes of something that could be called AbstractIndexedVector{T} that would make sure that when updated with setindex! it would set dirty flag and recalculate its index if needed when lookup is required to the index (the exact API could be discussed). Probably we would need concrete types for indices allowing and disallowing duplicates and special types for categorical vectors. But this should be doable.

Then all join, group-by, sort etc. functions could be specialized to take advantage of those types if they are encountered. This approach is something we can do in Julia, but is out of reach in other languages that do not have such a rich type system (it has the disadvantage that indexing on multiple columns would be a bit slower than the theoretically optimal performance but would already give us benefits I think).

Any thoughts about that?

@xiaodaigh
Copy link
Contributor Author

xiaodaigh commented Mar 6, 2018

that I would not expect any of the operations you mention be slow on DataFrame

They are currently slow if you benchmark to R's data.table, see my discourse post

Then all join, group-by, sort etc. functions could be specialized to take advantage of those types if they are encountered.

Good idea. In IndexedTables.jl the indexing is just sorting the data by those keys. This is the approach in data.table as well. I don't see a better way than sorting it. Especially if it's in memory.

@bkamins
Copy link
Member

bkamins commented Mar 6, 2018

Agreed. What I wanted to say is that we have two separate issues here:

  • making the operation fast "one time" (and this in theory should be as fast on DataFrame as on any other data structure - of course there is room for optimizations, e.g. because Julia does not intern strings then either other string type or categorical vectors should be recommended for speed); @xiaodaigh I believe you have a very promising progress in this area; now even sort! for DataFrame on column of Float64 is much slower than order in data.table
  • making an index type (and sorting is a natural approach) - which ensures that repeated calls are fast (then we need to make sure that sorting is fast);

@bkamins bkamins mentioned this issue Jan 15, 2019
31 tasks
@bkamins bkamins added the non-breaking The proposed change is not breaking label Feb 12, 2020
@bkamins
Copy link
Member

bkamins commented Feb 12, 2020

Also an issue here is that we could consider having a fast path in groupby/by that could take advantage of knowing that the data frame is sorted already.

@jlumpe
Copy link
Contributor

jlumpe commented Feb 13, 2020

Would this be similar to pandas' DataFrame.index (one per table), or would indices apply to individual columns?

@bkamins
Copy link
Member

bkamins commented Feb 13, 2020

No - adding an index to a DataFrame is achieved via running groupby on it.

The point here is that issorted is much (orders of magnitude) faster than by. Therefore we could run issorted test on grouping columns before grouping. If this test passes then we could potentially use a faster path in groupby/by (as we know that groups will be consecutive continuous blocks of rows in the data frame).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
non-breaking The proposed change is not breaking
Projects
None yet
Development

No branches or pull requests

4 participants