Provide indexing feature to allow for fast sort, join, and group-by operations #1346

xiaodaigh · 2018-01-30T02:05:58Z

I tested IndexedTables' indexing feature and group-by performance is 10x that of DataFrames.jl, so it would be good to introduce indexing into DataFrames.jl

nalimilan · 2018-01-30T09:42:58Z

Sure, feel free to make a PR if you find a way of integrating indexes to the existing code.

bkamins · 2018-03-05T23:22:33Z

This is actually an important issue for me also. I would start with the following observations (if some of them are wrong please correct me - the issue is complex):

that I would not expect any of the operations you mention be slow on DataFrame if they are done one-time only (other data structures have to perform the same kind of work anyway - but maybe only one time not every time);
the only situation when the above would not be true is because DataFrame is not type stable, but e.g. join uses findrows to overcome this problem; I have not checked all methods in detail if they are careful enough to do it but they seem to do;
so we are left with the case of running some operation many times on DataFrame (e.g. joins); as DataFrame is a thin wrapper over AbstractVectors we do not have control if those vectors are updated so there is no way to ensure from DataFrame level that its index is not invalidated by the user
some solution (maybe not ideal but could solve some problems) is to add a set of specialized concrete subtypes of something that could be called AbstractIndexedVector{T} that would make sure that when updated with setindex! it would set dirty flag and recalculate its index if needed when lookup is required to the index (the exact API could be discussed). Probably we would need concrete types for indices allowing and disallowing duplicates and special types for categorical vectors. But this should be doable.

Then all join, group-by, sort etc. functions could be specialized to take advantage of those types if they are encountered. This approach is something we can do in Julia, but is out of reach in other languages that do not have such a rich type system (it has the disadvantage that indexing on multiple columns would be a bit slower than the theoretically optimal performance but would already give us benefits I think).

Any thoughts about that?

xiaodaigh · 2018-03-06T05:53:53Z

that I would not expect any of the operations you mention be slow on DataFrame

They are currently slow if you benchmark to R's data.table, see my discourse post

Then all join, group-by, sort etc. functions could be specialized to take advantage of those types if they are encountered.

Good idea. In IndexedTables.jl the indexing is just sorting the data by those keys. This is the approach in data.table as well. I don't see a better way than sorting it. Especially if it's in memory.

bkamins · 2018-03-06T11:19:22Z

Agreed. What I wanted to say is that we have two separate issues here:

making the operation fast "one time" (and this in theory should be as fast on DataFrame as on any other data structure - of course there is room for optimizations, e.g. because Julia does not intern strings then either other string type or categorical vectors should be recommended for speed); @xiaodaigh I believe you have a very promising progress in this area; now even sort! for DataFrame on column of Float64 is much slower than order in data.table
making an index type (and sorting is a natural approach) - which ensures that repeated calls are fast (then we need to make sure that sorting is fast);

bkamins · 2020-02-12T12:59:10Z

Also an issue here is that we could consider having a fast path in groupby/by that could take advantage of knowing that the data frame is sorted already.

jlumpe · 2020-02-13T01:14:53Z

Would this be similar to pandas' DataFrame.index (one per table), or would indices apply to individual columns?

bkamins · 2020-02-13T10:10:56Z

No - adding an index to a DataFrame is achieved via running groupby on it.

The point here is that issorted is much (orders of magnitude) faster than by. Therefore we could run issorted test on grouping columns before grouping. If this test passes then we could potentially use a faster path in groupby/by (as we know that groups will be consecutive continuous blocks of rows in the data frame).

bkamins mentioned this issue Jan 15, 2019

DataFrames.jl roadmap #1678

Closed

31 tasks

bkamins added the non-breaking The proposed change is not breaking label Feb 12, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Provide indexing feature to allow for fast sort, join, and group-by operations #1346

Provide indexing feature to allow for fast sort, join, and group-by operations #1346

xiaodaigh commented Jan 30, 2018

nalimilan commented Jan 30, 2018

bkamins commented Mar 5, 2018

xiaodaigh commented Mar 6, 2018 •

edited

Loading

bkamins commented Mar 6, 2018

bkamins commented Feb 12, 2020

jlumpe commented Feb 13, 2020

bkamins commented Feb 13, 2020

Provide indexing feature to allow for fast sort, join, and group-by operations #1346

Provide indexing feature to allow for fast sort, join, and group-by operations #1346

Comments

xiaodaigh commented Jan 30, 2018

nalimilan commented Jan 30, 2018

bkamins commented Mar 5, 2018

xiaodaigh commented Mar 6, 2018 • edited Loading

bkamins commented Mar 6, 2018

bkamins commented Feb 12, 2020

jlumpe commented Feb 13, 2020

bkamins commented Feb 13, 2020

xiaodaigh commented Mar 6, 2018 •

edited

Loading