-
Notifications
You must be signed in to change notification settings - Fork 368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Why isn't df.col .= v
in-place?
#3200
Comments
There are three parts of the explanation of the current behavior (unfortunately this is a bit complicated). Part 1. As opposed to other containers in Julia DataFrames.jl assumes it gets ownership of the columns stored in it. For example if you store a vector in a In general if you say that
This is unfortunate, and sorry for this. As you noticed we have had this deprecation for a long time and in previous release announced that this change would be made. However, this means that your code was unsafe. It does not say that it is not allowed to write unsafe code, but it is recommended not to write it unless you have some performance critical parts. The only scenario when this change would break code is when you store column Part 2. The docs say "Since
Each of them requires a bit different treatment.
It would be near impossible to implement it correctly + it would be inconsistent with the whole design of Julia that requires explicit broadcasting. Part 3. Now let me discuss broadcasted assignment solely. We need two behaviors. One that updates the column in-place. It is currently:
and the other that allocates new column. This is currently:
If we applied your recommendation there would be no way to do the second operation (all three operations would have the same effect. However, users need both. I.e. the second behavior (allocating new column) is a must have and quite often needed. See e.g.:
Users just want such operations to work. And we had to give them a way to do it. Also, note (you probably saw this example already) that sometimes instead of an error you get just an unexpected result:
(the second behavior is what people typically expect) Now what is the difference at its core. In general if you write So in short, from the other perspective: now Now you might ask why it was implemented only in DataFrames.jl 1.4. The fact is that we wanted to have these rules already in DataFrames.jl 1.0 release, but it was not possible, because of limitations of Base Julia. Making this change was only possible after Julia 1.7 was released. That is why we had a long deprecation period. We announced that this change would be made in earlier releases, e.g. if you check DataFrames.jl 1.3 release notes you can read:
|
Thanks a bunch for writing out this explanation for me @bkamins. Really appreciate it.
No argument there. Using DataFrames in the first place inherently type-unstable.
Is that a challenge? ;)
Maybe the "whole design of Julia" is subjective? In my mental model of Julia it's clearly breaking the "whole design of Julia" to deviate from how all other containers work and let I also recognize that there are individual differenes in mental models of what a language or library represents, and that mine is not necessarily a typical (or even "correct") one.
Fair enough. Personally I would expect DataFrame to behave like other containers in Julia, and columns in a DataFrame to behave likte Vectors (or AbstractVectors in general) behave in Julia. But I guess Thanks again! DataFrames.jl is a great library that I use daily :) |
The issue is not type stability, but object ownership. Julia does not have a well defined concept of ownership of a value (like Rust does), so it is more of a convention. Let me start with explanation of the origin of the ownership concept related to a most common bug in code using DataFrames.jl. Since you can extract a column as a vector from a data frame you can resize it. If you resize such a vector you invalidate a data frame (as every column of a data frame must have the same number of elements; note that this is a key difference from other containers - e.g. It might seem an innocent issue and not very problematic, but in practice it is the opposite. Users, in the past, tended e.g. to create one data frame from another data frame without copying columns and then they were surprised that when they mutated one then the other was mutated. The most problematic case is when rows were added/removed as it invalidated the code. This was the most common problem in the past. To resolve it we introduced a rule of column ownership and However, we allow for Similarly Now, why all this is done? Most of the DataFrames.jl users are not Computer Science experts. As you have said - many of them are entry level programmers, not necessarily even understanding all the details how things work underneath. Therefore we design DataFrames.jl to be as simple to use as possible by them. For example the operation
To some extent you are right.
I think this is a good summary of the problem here. As I have said above, we need to make DataFrames.jl relatively friendly to users that are not used to strongly-typed languages (and this is the majority - as you have commented - coming from R/Python and wanting things to mostly "just work"). In fact initially |
By the way - in some style guides of corporate users there is a rule is that using |
Finally - your feedback is much appreciated. I am planning to write a blog post on this change (as it was the most painful decisions we needed to make in DataFrames.jl design in general) and the discussion with you is an important in the preparation process. Also note that historically |
Forgive me if I digress at times, but I find the conversation interesting:
Right. The issue I encountered arose from getting a different type under the new behaviour but the issue we're focusing on here is rather one of ownership. I also think object ownership is a larger issue in Julia, not specific to DataFrames. Akin to how
I think this was a great decision. I appreciate a simple and clear option to be able to not get a copy when I don't want one.
I'm personally an advocate for only using DataFrames while manipulating data and then to stop using the DataFrame as soon as you can. I.e to not let the DataFrame be some generic container with unknown columns that you pass around. This is related to the lack of type stability / lack of schema, not necessarily for performance but for correctness. (note I don't think DataFrames need a schema). Which brings me to:
Apart from the simple case of
This let's me work with data from the table in a dynamic way while writing performant code. And I can compute gradients with Zygote or ForwardDiff. Exceptional? Performance-critical? Maybe both. But it can also be used to effectively ensure a table schema for the purpose of correctness rather than performance.
Makes sense to me. That would be the expected behaviour of in-place assignment to something that doesn't exist. Like elementwise assignment to a Matrix outside of its dimensions (and I gather that consistency with matrices is expected and wanted?)
I think users would also accept if this created a new/replaced column and filled it with the scalar (per the pseudo-broadcasting logic) instead of the error message even if they didn't use broadcast assignment. In much the same way the constructor accepts scalars (i.e
Yes! That's really awesome.
Those were the days. I might still be stuck in some ancient mindset. Looking forward to that blog post! |
There is a predefined function to do this: |
I haven't been following this change very closely, so take what I say with a grain of salt. However, I find this design choice a little befuddling.
Exactly. This was a design choice. It's quite natural to have variables aliased with x = [1, 2];
df = DataFrame(x=x; copycols=false) Now I think the fact that a change to Base Julia was needed in order to implement this design is a hint that this might not have been the best design choice. |
To elaborate on this, |
Here's one final variation of my argument against this. (Sorry.) The I admit it's not a perfect line of logic. The expression |
We have thought about all these concerns. However, if we followed them the following:
would have to error as well as:
would have to error. Now there was a design choice:
Having this choice we opted for the second variant and all other are a consequence. Note that this does not apply only to broadcasting, it also applies to non-broadcasted assignment. Now why we have not opted for the first variant? The reason is that In summary - the current design is that The alternative would be to have three sets of rules: for normal row indexing, Indeed the choice breaks some assumptions advanced users might make about how
and everyone coming from R or Python will assume it should work. As a side note, for completeness of the discussion (and further reference). Probably advanced users are surprised to see this:
and would argue that column Here I do not want to justify breaking the rule second time because we already did this some time ago. Rather I want to emphasize that the design choices we make are geared towards novice users. It is much easier to learn for an expert one exception than to teach a novice that something does not work as this novice would expect given their previous background in R/Python. |
I disagree. Conversion upon assignment is not controversial. A range could become a vector if that's what the lefthand side can hold.
Something not iterable or of length length 1 could be "fill"ed. Conversion of the left hand side upon broadcasted assignment is controversial though. Wouldn't a novice be more likely to try I'd argue that the current approach is to explain to them to use a new advanced syntax instead of something straightforward and simple that would align well with the rest of the language.
|
Sorry for being obtuse as you tried to explained this already. But I don't follow to how that design choice is related to this thread.
|
Upon broadcasted assignment it is also controversial to create a new copy of the left hand side Like |
This is the point. They could be used interchangeably and this was a design choice. An alternative would be to make the behavior of
This is intentional and a design choice. Actually it is quite important that it behaves this way.
then you see that The contract of broadcasted assignment for
The point is that n general column can be any vector type (not only
Indeed this is what user would try to do. And then - as you pointed would be instructed to do |
Actually, that does not work in pandas: In [1]: import pandas
In [2]: pandas.__version__
Out[2]: '1.4.2'
In [3]: df = pandas.DataFrame({"a": [1, 2]})
In [4]: df.b = [3, 4]
<ipython-input-4-ee8d286e409b>:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
df.b = [3, 4]
In [5]: df
Out[5]:
a
0 1
1 2 It does not bother me that This is my understanding of what happened:
I think the old saying "two wrongs do not make a right" applies here. If the Julia language had a way to enforce public versus private API, then these sorts of shenanigans would not even be possible. In my opinion, if you want to change the meaning of a well established syntax, then you should use a macro. A better way of expressing the concept "set a column to a new vector containing a repeated value, whether or not that column already exists" would be something like this: @fill df.c 1 which of course would evaluate to this: df.c = fill(1, nrow(df)) Anyhow, that's probably all I have to say about this. It is what it is, and I doubt that this design will be reverted. *One point of clarification. My comment that "It should throw an error," is based on the fact that |
@gustafsson + @CameronBieganek : it is truly a pity we did not have this discussion 2 years ago. The clean up of indexing, that was done around 2019 and 2020 (if I recall correctly) was mainly to simplify rules and make them consistent (what we had previously was very complex and full of special cases). And making I am not sure where we would end up with the decision, but you raised issues that were clearly previously not considered enough. As I have written earlier - while this syntax is debatable - in my experience (fortunately) the change we make should not affect much of the ueser's code (however, indeed it will affect some unfortunately). What I can do is:
Side note.
but
which is different from what we do in Julia (this is a comment showing how complex corner cases of things are). |
Perhaps you could open a Julia issue requesting that If the request was approved and added to a future Julia version, technically you should probably lower bound DataFrames to the Julia version that releases that new API. In theory older versions of Julia could have new patch releases that change the behavior of There seems to be a fair number of packages that have overloaded |
Let's not make the issue look more serious than it is. There's no reason to think that Anyway, this behavior has been released, so removing it now would be breaking. The most radical change we could do is deprecate it so that it can be removed in 2.0 (which isn't planned at all at this point). However it would seem like a good idea to document this in Julia so that it's clearly considered as part of the public API. |
That's scary. 😂
I agree that breakage is not likely, but with the current situation there is still a serious issue with logical consistency. As I mentioned above, one simply cannot deduce from the defined Julia interfaces that |
Maybe they will be intrigued the first time they see it, but I don't see a particular risk of confusion leading to bugs, which is what we generally try to avoid when designing the API. |
I for one was very confused and couldn't get my head around what to expect from the behaviour of I'll try to summarize my own understanding of the question that started this thread:
Base Julia states that broadcast! should "store the result in the dest array". And although undocumented Why?It's a design choice to appease newcomers. The reasoning (from what I gather) goes something like this: Newcomers would want A new user probably hasn't seen In essence: I note that If it's not in-place, what is it then?
How can I do what
|
This issue is a question on the most recent release notes
I expected
df.col .= v
to broadcast and do in-place assignment. But I see that's no longer the case in Dataframes.jl 1.4.The recent release broke some code of mine (I must have missed any deprecation warnings). A simple workaround was
coltemp = df.col; coltemp .= v
but I don't understand the reason for the new behaviour. To me this seems to make DataFrame inconsistent with other containers in Julia and left me wondering why this inconsistency would be a wanted one.This issue equally applies to
df[!, :x] .= v
.Compare:
Whereas
As advertised
df.x .= 1.5
does not work in-place but replaces the column, even with a new type.If I put the vector in any other container, say, a
Dict
,NamedTuple
orstruct
They all behave the same. But a DataFrame behaves differently. Why is that?
The docs state "Since
df[!, :col]
does not make a copy" which to me makes it unexpected that it would create a new column rather than modifying the existing one.For the use case of "create/replace column" we have
df.x = v
(akin tos.x = v
ordict[:x] = v
). Would there be any adverse side-effects of letting=
broadcast scalars into new/replaced columns?I understand there was a decision a year ago (#2804) to make
df.x .= v
work liked[!,:x] .= v
but wouldn't a change to instead makedf[!,:x] .= v
work likedf.x .= v
have been more consistent with how containers in Julia typically work?The text was updated successfully, but these errors were encountered: