-
Notifications
You must be signed in to change notification settings - Fork 368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Base DataFrame on NamedTuple #1335
Comments
Can't you just rely on the fact that |
The way the code is currently written, we sometimes need to convert all indices from either symbols or integers to a common representation since the function accepts both, and we don't want to write it twice. The lookup operation should be fast, so unless |
Wait, are talking about cases where the user is providing both symbols and integers? Why even allow that? It seems totally reasonable to just say, use symbols, or use integers, but you can't mix both. |
No, I mean that functions allow passing either symbols or integer, but internally they work with only one type of index. For example, that's the case of the |
@nalimilan I would help - but because of the size of the PR we have to discuss how to work with it. I would start with the performance - this relates to the question if it is worth to switch to Field access in This means that we get type stability of the return value of the function: at the cost of performance. To give you the scale. Accessing column of If we accept this penalty then we do not need internal |
The interest of using In general, the cost of looking up a column should never be significant in the computing time, or the code is not designed correctly. Indeed for efficient operation you really need to work directly with column vectors. That said, it's always good to make an operation as efficient as possible, so I think we should keep |
OK. I have done some more careful tests - earlier I was mixing different things when timing. Conclusions:
As a comment I think we should target the performance up to 100 000 number of columns, as those are the numbers of explanatory variables you can meet in practice in modeling in analytical data set. But this judgement is conditional on target audience of DataFrames package - in particular how it should be positioned against JuliaDB (for what usage scenarios which package should be recommended). I am not 100% clear about this. Benchmark results below: On 0.6.2:
On 0.7:
but for 20000 columns (which is not unthinkable):
The same on 0.6.2 is probably unacceptable:
|
Don't worry about 0.6.2, we already have enough work on our plate. Also, note that most of the time is only needed the first time the type is created, i.e. the first time a table with specific column names and types is encountered. The second call is very fast. |
I know, but the problem with As a side note - your current proposed implementation is for 0.6.2 and will not work on 0.7. |
Yes. But 3 seconds for 5000 columns isn't too bad, anyway with that many columns many operations are going to be slow. For 500 columns it's already much more reasonable.
The best person is probably Jeff, I can ask him on Slack.
Yes, unfortunately I started working on that before |
BTW, Jameson says we'd better not use
|
@nalimilan Did Jameson say why that would be a better option? Just curios. Or is that discussion somewhere in the open so that I can read up on it? |
Not really |
I do not know Jameson's reason, but it seems to me that working this way would allow to create the required types faster. The question is how Additionally given discussion in #1337 I have (again - sorry) a question for the reasons we need a
Is this the line of thinking along which we want to go? If yes there are in general two possible designs:
I understand that @nalimilan implemented the second option in his proposal. And I think it is better if we are sure that (I have not benchmarked it all):
|
I think whether or not we use Your summary sounds good. The reason why I decided to implement It also means it is cheap to work on the Regarding The case of small Anyway, the decision to base |
Thank you. Your comment actually clarified my thinking how I would approach this change in development process. I would do the following to split the work:
Looking at your branch you have more or less both steps covered. However, at least for me splitting it to two stages would simplify work on it. Additionally - just to clarify - did you plan to expose |
I think it makes sense to do both steps as the same time, because most of the work consists in moving functions from dataframe.jl to typeddataframe.jl, adapting them, and writing a wrapper function in dataframe.jl to call the I think I didn't export |
OK. I start with a detailed look at |
OK. The strategy is open for discussion of course, as this is very exploratory. We can change it if it turns out it doesn't work well. There's also a question to investigate regarding the possible relationship with IndexedTable and it's |
And has the problems I have encountered in my recent investigations on 0.6.2 like:
(probably this will not happen on 0.7 but the current implementation does not support 0.7 yet). |
I don't have a clear idea about the relationship with JuliaDB yet. We should discuss this with JuliaDB people. A first idea which comes to mind would be to have |
I am asking because - at least for me - it is crucial for design decisions (i.e. to concentrate on excelling at the important use-cases of the given type) and to make sure that the efforts are coordinated. My thinking was the following (but I might be missing something important here):
|
We are yet to port IndexedTables to Julia 0.7. I think that would be my first step in debugging JuliaData/IndexedTables.jl#81 ... Switching to Tuples/NamedTuples as a representation of rows will have this problem at least on 0.6. But I think since datasets only have a handful of distinct types it is possible to create large tuple types that have some kind of "compressed" type representation.
This is not necessarily expensive since a table is a cheap wrapper around data. The data will only be copied if you mutate / add / remove an indexed column. |
@shashi Thank you for the response. The "compressed" type representation seems very promising the challenge is probably to ensure return type inference when allowing indexing both by column number or column name. Regarding table mutation performance - the question is what would be time complexity of operation like:
My fear was that this will be O(n^2). Look at a typical scenario (during last 20 years I have been in hundreds of projects where something like this was required, especially in exploratory data analysis, where you do not know upfront which variables and what transformations you want):
And now timing (I use
another example with column renaming (a bit less extreme and artificial but rooted in the same problems):
and timing:
This time complexity is linear but simply the operation is much more expensive. In consequence there are the following possibilities:
What I want to avoid is the pain currently people have with plotting times (and for plots I believe that this is solvable with precompilation, but here I am not 100% sure it is possible as each If we want Julia ecosystem to be appealing to a casual user (which I believe is crucial for large-scale adoption) we have to make operations as above fast in interactive use. |
Bummer. Most slowdown in those examples seems to come from compilation. There's a type in JuliaDB named julia> function test_coldict(lead)
tbl = table([1], names=[Symbol(lead*"1")])
cdict = ColDict(tbl)
for i in 2:100
cdict[Symbol(lead*"$i")] = [i]
end
cdict[]
end
test_coldict (generic function with 1 method)
julia> @time test_coldict("b");
0.022794 seconds (3.46 k allocations: 336.899 KiB) julia> function n_coldict()
t = table([0.01, 0.05], names=[:t1]); d = ColDict(t);
for i in 1:100
old = Symbol("t"*"$i")
new = Symbol("t"*"$(i+1)")
IndexedTables.rename!(d, old, new)
end
d[]
end
n_coldict (generic function with 1 method)
julia> @time n_coldict()
0.016808 seconds (3.51 k allocations: 178.416 KiB)
It would be nice if one day Edit: update timing to be of first run |
Excellent! I was not aware of The issues are (but they are solvable I suppose):
The only difference is that @nalimilan - any thoughts on this? |
Updated the timings to reflect compilation overhead. I think it would be good to document the |
I have discussed this with @quinnj once. One path for merging the JuliaDB and DataFrames effort is this:
|
It's great to have this discussion. A few comments @shashi's points:
Regarding the performance of adding/renaming columns, in think we should benchmark the different possible solutions in 0.7 (no need to care about 0.6 at this point) to see whether creating the underlying |
Columns is an iterable of named tuples but internally holds a named tuple of columns. It's at best 100 lines of wrapper. I think the main benefit is being able implement operations using AbstractArray interface. e.g. iteration,
It would, for now the point is keeping the kernels in a common place that other packages can use. Jacob has been prototyping this in DataStreams as well -- although I think it could be cleanly moved to a different package which depends on DataStreams.
Not much of a difference except
I think |
That's what TypedTables.jl does. It works ok. |
Yes, the challenge is syntax, as you have to use |
@vtjnash Why is this?
Maybe constant propagation would make that unnecessary? |
Continuing from #1643: we can do either what you propose:
or exactly the opposite - do not store it as a part of the type (this is what I was thinking of). Then we can freely add/remove/change type of columns in the |
@nalimilan I have investigated the current status of
It seems we are hitting some wall around 2800 columns in this case. Not sure why it starts exploding. My simple conclusion from this (maybe oversimplified) is the following (given our recent discussions):
In this way we would be only one-step (one barier function) away from a type stable |
Interesting. But even for 100 columns, isn't the overhead already quite large? Some operations are really fast currently, adding a half a second delay doesn't sound great. |
I agree. But I assume that we would not be materializing
Alternatively we could use some threshold value to decide when to do automatic |
Yeah, it would be interesting to experiment with adding a type parameter which would default to |
Actually I would not touch these standard "cheap" functions at all, as the effect of having
The only significant problem with this approach I see is that |
Actually what I suggested was that by storing column types in a type parameter, we could make a function such as |
My point was that this type stability is not something we will get easily anyway:
From my experience it is only possible to get if what you index with is a constant that is known at compilation time, so you need a function barier with a dynamic dispatch anyway (unless I do not understand something here). My idea was rather to have |
My understanding is that constant propagation might make julia> const nt = (a=1,b=1.0,c=true,d="1",e='1')
(a = 1, b = 1.0, c = true, d = "1", e = '1')
julia> f(nt, symb) = nt[symb]
f (generic function with 1 method)
julia> g() = f(nt, :b)
g (generic function with 1 method)
julia> @code_warntype g()
Body::Float64
1 1 ─ return 1.0 │ I wish there were clear guidelines about when it can be relied upon. |
Yes - and this is exactly what I want to say. More generally:
In both cases you have one dynamic dispatch and might hit a precompilation lag.
From my experience the two behaviors above work reliably (although @nalimilan noted that sometimes, when inlining happens, constant propagation might not be performed - but in my experience I did not have a situation when compiler performed inlining in such cases). |
What's the current status of this project? |
That's on hold, at least until we have tagged 1.0. |
@nalimilan - do you believe there will be one day when we go back to this discussion? If not I would close it. I think that the current consensus is that DataFrames.jl will just stay type unstable. |
Yes we're not going to do that in the 1.x series anyway. |
Now that
NamedTuple
is in Julia Base (on 0.7), we should use it in DataFrames. A possible approach which has been discussed on Slack would be to define a newTypedDataFrame{C<:NamedTuple} <: AbstractDataFrame
type, which would just wrap aNamedTuple
, and include information regarding the column names and types in its type parameterC
. ThenDataFrame
would be a convenience type wrapping aTypedDataFrame
object, with the advantages that 1) columns can be added/removed/renamed, and 2) functions do not need to be compiled for each combination of column types when it doesn't improve performance.TypedDataFrame
would be useful to improve performance where the user provides a custom function operating on each row (notablygroupby
), as specialization is essential for performance in that case.I have experimented this approach in the nl/typed branch, using the NamedTuples.jl package (since I used Julia 0.6 at first). It's mostly working (and passes some tests), but lots of fixes are needed to pass all tests. In particular, I originally removed the
Index
type since I thoughtNamedTuple
could replace it (hence thesym_colinds
andint_colinds
attempts), but I'm not sure that's possible since we sometimes need to convert symbols to integer indices and vice-versa (cf. #1305). Handling of duplicate column names need to be improved (rebased on top of #1308).@bkamins This touches areas you improved recently, and I'm not sure I'll be able to find the time to finish this right now. To avoid stepping on each other's toes (since this is required to implement the
getindex
improvements you've mentioned), if you want to take this up just let me know.The text was updated successfully, but these errors were encountered: