DEPR: is_copy #18801
Comments
What is the alternative for power users in library code if they want to avoid an extra unnecessary copy? (I never use it myself, as I just do |
- Renamed 'is_copy' attribute to '_is_copy' for internal use - Setup getter and setter for 'is_copy' - Added tests for deprecation warning
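A minimal sketch of the rename-plus-deprecation pattern described in that change, for readers who want to see its shape. The class `Frame`, the helper `check_deprecation`, the warning text, and the choice of `FutureWarning` are all illustrative assumptions, not the code that was merged into pandas.

```python
import warnings


class Frame:
    """Hypothetical stand-in for a pandas object; not pandas' actual code."""

    # Assumed wording; the real message in pandas may differ.
    _MSG = "Attribute 'is_copy' is deprecated and will be removed in a future version."

    def __init__(self):
        # Internal attribute that takes over from the old public name.
        self._is_copy = None

    @property
    def is_copy(self):
        # Reading the old name still works, but warns the caller.
        warnings.warn(self._MSG, FutureWarning, stacklevel=2)
        return self._is_copy

    @is_copy.setter
    def is_copy(self, value):
        # Writing the old name also warns, then forwards to the internal one.
        warnings.warn(self._MSG, FutureWarning, stacklevel=2)
        self._is_copy = value


def check_deprecation():
    """Set and read the deprecated attribute, recording emitted warnings."""
    frame = Frame()
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        frame.is_copy = "marker"
        value = frame.is_copy
    return value, [w.category for w in caught]
```

Internal code would read and write `_is_copy` directly and so stay silent; only access through the old public name warns.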
So the suggested solution is to use |
@amueller Not sure there is anything to add to: http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy You are chain indexing, which violates view semantics. The point is that it may work, but there are cases where it won't. Without copy-on-write, you must copy. |
Ok maybe I just really don't understand the documentation, which is entirely possible. My reading of the warning is that we are returning a copy here, which is the intent. Are you saying it might sometimes return a view instead? I don't want to use view semantics, and it tells me I got a copy. I'm very happy I got a copy, it's what I wanted. If I got a view instead, I would need to copy. But I thought the warning said I got a copy, not a view. |
Exactly, you have understood the point. You don't know whether it is a copy or a view on the original. That is the problem. You are doing chained operations and we can't be sure, so you get the warning. It is up to you to: 1) not chain operations, or 2) defensively copy. |
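For illustration, option 2 (defensively copying after a filter) might look like the sketch below. The frame, its column names, and the values are made up for the example.

```python
import pandas as pd

# Hypothetical mixed-dtype frame, the case where the warning typically appears.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Filter, then copy explicitly so that `subset` owns its own data.
subset = df[df["a"] > 1].copy()

# Adding a column to the explicit copy leaves the original frame untouched.
subset["c"] = subset["a"] * 10
```

The cost is one extra copy of the filtered rows, which is the trade-off being debated in this thread.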
So would the warning also be thrown if it is a view? |
If it's also thrown if it's a view, then the warning is misleading, it says "A value is trying to be set on a copy of a slice from a DataFrame". If it's not thrown on a view, then it seems like I can distinguish between view and copy, and then I should only copy if I got a view. |
No, if you only have a single-dtype dataframe you won't get this. It only occurs when you filter and then add a column on a dataframe with multiple dtypes. |
The question is: can I not find out at runtime if I got a copy or a view and only copy if I got a view? |
You can try, by introspecting the underlying arrays (not .values). |
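One way to approximate such a check, offered only as a heuristic sketch: compare per-column arrays with NumPy's `np.shares_memory`. The helper `shares_column_memory` is hypothetical, and it assumes plain numeric columns for which `to_numpy()` does not itself make a copy; for other dtypes it can report no sharing even when the frames are related.

```python
import numpy as np
import pandas as pd


def shares_column_memory(left, right):
    """Return True if any same-named column of the two frames overlaps in memory.

    Heuristic only: assumes to_numpy() returns a view of the stored data,
    which holds for plain numeric columns but not for every dtype.
    """
    common = left.columns.intersection(right.columns)
    return any(
        np.shares_memory(left[col].to_numpy(), right[col].to_numpy())
        for col in common
    )


df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# An explicit copy after filtering never shares memory with the original.
filtered_copy = df[df["a"] > 1].copy()
```

Here `shares_column_memory(df, filtered_copy)` is false, while comparing `df` with itself is true. Whether an un-copied selection shares memory depends on the pandas version and the indexing operation used, which is exactly the uncertainty discussed above.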
ok. Does that mean that the warning might have been raised even though there is memory sharing? |
Sorry if that question was answered by
but I don't know how that relates to what happens to the memory. I assume it was meant as a reply to #18801 (comment) but I don't understand how it relates to it. |
Because someone could have used chained indexing, and we don't know whether views were created. It's not trivial, and these are mostly edge cases, but if you are seeing the warning then you have incorrect code; use it at your own risk. You should copy after filtering. |
Alright. I feel the warning is pretty confusing since it seems to imply that we made a copy, but it only implies that there is some part of the dataframe that was copied, and we don't actually know whether we made a copy or not.
Maybe the section in the docs that discusses this warning should say that? I don't think it says that now. |
To repeat myself from the issue: I think @amueller use case is valid one that we should try to support. If not through In case of sklearn's
Explicitly taking a copy is not mentioned in those docs, so could certainly be added. |
Although plotnine uses |
Until copy-on-write, this is simply not possible in pandas in a reliable way. We don't have full control over memory allocations or when views are actually made. |
@has2k1 I see for |
It is an okay stopgap measure until copy-on-write is available, but as it implicitly assumes user cognisance it is not a good long-term solution. Also, since the package aims to be extensible in many ways, the effects of a context manager may extend to other packages. On the other hand, 'is_copy' was explicit: it forced the user to acknowledge the potential problem at every instance, and I think it was better in an open source environment. |
I have a different reason to want this: I'm working on a data pipeline with large enough datasets that I'm worried about the performance hit from repeated copies. An easy way to try to control that would be something like |
Yet another feasible use case can be when trying to do multi-processing where portions of a DataFrame are processed in different processes. I was under the assumption that if I take a view, when a process is spawned, only the view will be copied over taking 2X memory. In contrast, if I make a copy, then essentially the original process now has two full copies and each process will also have the partial copy so we will end up with 3X memory requirement... |
This has always been an internal attribute. We can simply replace it by ._is_copy and provide a deprecation warning on the property.