Deprecate Series / DataFrame.append #35407
Comments
+1 from me (though I will usually be +1 on deprecating things generally), these are a footgun
+1, it's better to have one method, which is `concat`
Strong +1 from me! Just look at all the (bad) answers to this StackOverflow question:
We should deprecate expansion indexing as well (which is an implicit append)
+1 from me
How do you expand a dataframe by a single row without having to create a whole dataframe then?
I'd recommend thinking about why you need to expand by a single row. Can those updates be batched before adding to the DataFrame? If you know the label you want to set it at, then you can use `df.loc`.
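A minimal sketch of the advice above (the variable names and the incoming records are illustrative assumptions, not from the thread): batch updates in a plain Python list and build the DataFrame once, or set a single row by label with `.loc`.

```python
import pandas as pd

# Hypothetical stream of incoming records (in practice they might arrive one at a time).
incoming = [{"A": 1, "B": 2}, {"A": 3, "B": 4}, {"A": 5, "B": 6}]

# Batch the updates in a plain Python list first...
rows = []
for record in incoming:
    rows.append(record)

# ...then build the DataFrame once.
df = pd.DataFrame(rows)

# If you already know the label you want to set, a single row can be
# written with .loc, which enlarges the frame in one step.
df.loc[3] = [7, 8]
```

The list accumulation is cheap because `list.append` is amortized O(1); only the final constructor call pays the cost of building the frame.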
Disagree. Appending a single row is useful functionality and very common. Yes, we understand it's inefficient; but as TomAugspurger himself said, this is the 10th most commonly referenced page on the help, so clearly lots of people have this use case of adding a single row to the end. We can tell ourselves we're removing the method to "encourage good design", but people still want this functionality, so they'll just use the workaround of creating a new DataFrame with a single row and concat'ing. That just requires the user to write even more code to still get the exact same performance hit, so how have we made anyone's life better?
Not being able to add rows to a data structure makes no sense. It's one thing not to add the inplace argument, but deprecating the feature is nuts.
@TomAugspurger using `df.loc[]` requires me to know the length of the dataframe, and creates code like this: `df.loc[len(df)] = <new row>`. This feels like overly complex syntax for an API that makes data operations simple. Internally, why not take a page from lists, the way `list.append` over-allocates?
You could perhaps suggest that to NumPy. I don't think it would work in practice given the NumPy data model.
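The point about the NumPy data model can be demonstrated directly: a NumPy array is a fixed-size contiguous buffer, so `np.append` can never grow it in place; it always allocates a new array and copies.

```python
import numpy as np

a = np.arange(3)
b = np.append(a, 4)  # always allocates a brand-new array

assert b.tolist() == [0, 1, 2, 4]
assert a.tolist() == [0, 1, 2]        # the original is untouched
assert not np.shares_memory(a, b)     # no in-place growth happened
```

This is why a list-style over-allocating `append` does not map onto pandas, whose columns are backed by NumPy arrays.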
Is NumPy deprecating `np.append`? NumPy doc: https://numpy.org/doc/stable/reference/generated/numpy.append.html
Shall we make this happen and get a deprecation warning in for 1.4 so these can be removed in 2.0? If there are no objections, I'll make a PR later (or anyone following along can, that's probably the fastest way to move the conversation forward)
@MarcoGorelli my question still stands, why is this being done?
Yes, why are we doing this? It seems like we're removing a VERY popular feature (the 10th most visited help page, according to the OP) just because that feature is slow. But if we remove the feature, people will still want this functionality, so they'll just end up implementing it manually anyway; how are we improving anything by removing this?
There is a ton of discussion, please read it in full. This has long been planned, as inplace operations make the code base inordinately complex and offer very little benefit.
@jreback I don't see tons of discussion in this issue; please point me to the discussion so that I might be better informed. What I see is a community asking you not to do this.
There's a long discussion here on deprecating inplace: #16529
I'd argue that this is still an improvement, because then it would be clearer to users that this is a slow feature. With the status quo, people are likely to think it's analogous to `list.append`. What's your use-case for `append`?
take
This is exactly the reason append is super problematic, and you have proved the point of why append is a terrible idea: it's not about readability, but how easy it is to fall into traps that are non-obvious at first glance.
If only there were a well-known algorithm which was not an exponential copy.
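The trap being alluded to can be made concrete. A sketch (small sizes chosen for illustration): growing a frame one row at a time re-copies everything accumulated so far on each step, roughly O(N²) element copies in total, while accumulating pieces in a list and concatenating once copies each element only once.

```python
import pandas as pd

rows = [{"x": i} for i in range(100)]

# Anti-pattern: each concat copies everything accumulated so far,
# so building N rows costs on the order of N^2 element copies.
df_slow = pd.DataFrame([rows[0]])
for r in rows[1:]:
    df_slow = pd.concat([df_slow, pd.DataFrame([r])], ignore_index=True)

# Recommended: accumulate the pieces in a list and copy once at the end.
df_fast = pd.concat([pd.DataFrame([r]) for r in rows], ignore_index=True)

assert df_slow["x"].tolist() == df_fast["x"].tolist() == list(range(100))
```

Both produce identical results; only the copy count differs, which is the non-obvious part at first glance.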
You did not read, or did not understand, what I was saying. The version with append is the one that WORKS; the one with concat at the end runs into memory issues. And yes, you can be smarter, for example …
Thanks @behrenhoff, that's a nice example. Though can't you still batch the concats? Say, read 10 files at a time, concat them, drop duplicates, repeat... This seems like a perfect summary of the issue anyway:
At some point we should lock the issue; this is taking a lot of attention away from a lot of people, there have been off-topic comments, and no compelling use-case for keeping `append` has been presented.
Yes, that would work. So would a million other solutions. In practice, I could even exploit more about the date ordering inside the files (all files here have a rather long overlapping history, but newer files can overwrite (fix) data in older files, so it is of course a drop_duplicates with a subset and keep='last'). My point is: this is a non-issue, because the operation is done once per 6 months or so; the daily operation just adds exactly one file. There is no point in optimizing this further as long as it works. That is the whole point I was trying to make: you force people to optimize / change code where the old code just works and there is no need to modify it. And the real gains in this example are not in append vs concat but in exploiting knowledge of the input files and reading them in a different order or in groups. Note that I am not saying this is a use case that can only be done with `append`. Anyway, end of discussion for me. I already did the work and got rid of all my appends. I just fear that many people will not upgrade if their code breaks. You are also making it harder for new users.
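The batched approach discussed above can be sketched as follows. The per-file DataFrames and the "newer files overwrite older ones" rule are simulated in memory (a real script would build `frames` with `pd.read_csv` or similar; all names here are illustrative assumptions).

```python
import pandas as pd

# Stand-ins for per-file DataFrames: each "file" covers two keys, and
# later files re-state (fix) a key from the previous one.
frames = [pd.DataFrame({"key": [i, i + 1], "val": [i, i]}) for i in range(10)]

batch_size = 4
result = None
for start in range(0, len(frames), batch_size):
    chunk = pd.concat(frames[start:start + batch_size], ignore_index=True)
    result = chunk if result is None else pd.concat([result, chunk], ignore_index=True)
    # keep='last' means newer files win, mirroring the overwrite rule above.
    result = result.drop_duplicates(subset="key", keep="last")
```

Deduplicating after each batch keeps the working set small, which addresses the memory-spike concern without a row-by-row append.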
Hi, minimal reproducer that was totally broken:

Before:

```python
a = pd.DataFrame({"A": 1, "B": 2}, index=[0])
b = pd.DataFrame({"A": 3}, index=[0])
for rowIndex, row in b.iterrows():
    print(a.append(row))
# Output:
#    A    B
# 0  1  2.0
# 0  3  NaN
```

After:

```python
a = pd.DataFrame({"A": 1, "B": 2}, index=[0])
b = pd.DataFrame({"A": 3}, index=[0])
for rowIndex, row in b.iterrows():
    print(pd.concat([a, row]))
# Output:
#      A    B    0
# 0  1.0  2.0  NaN
# A  NaN  NaN  3.0
```

Also, please note that if you add a deprecation warning to such a popular method, one that is used widely and called many times per second, the message will be spammed a lot, leading to much bigger overhead than you have with allocations and memory copying. So it is beneficial to print such a message only on the first call.
What are you trying to do? It would be way more efficient to call …
Edit: Or was it on purpose to put …?
I know, this is just an illustration. I was iterating over rows and, if a row is OK, adding it to another table. I believe that there are much better ways via masking and concatenation taking such masks into account, but I wanted to keep the code as simple as possible.
Thanks for your response. It is important for us to see use cases that cannot be done more efficiently in another way. You are right, checking data can be done way more efficiently via masking and then concatenating the result.
How can I concat such a row?
With `pd.concat([a, row.to_frame().T], ignore_index=True)`
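A runnable sketch of that suggestion (the frame and row contents are made up for illustration): a bare Series is column-shaped, so `concat` would add it as a new column; `.to_frame().T` reshapes it into a one-row DataFrame first.

```python
import pandas as pd

a = pd.DataFrame({"A": [1], "B": [2]})
row = pd.Series({"A": 3, "B": 4})

# .to_frame() turns the Series into a one-column frame; .T transposes it
# into a one-row frame, so concat stacks it as a row, not a column.
out = pd.concat([a, row.to_frame().T], ignore_index=True)

assert out.shape == (2, 2)
assert out.loc[1, "A"] == 3 and out.loc[1, "B"] == 4
```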
You can simply do:
Just change the `> 3` to a condition that suits your needs. This avoids the iterating-over-the-rows step. If you have to iterate for some reason, you can use the example from @MarcoGorelli
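The masking approach described in this exchange might look like the following sketch (the column name and the `> 3` threshold are taken from the surrounding comments; the data is invented for illustration).

```python
import pandas as pd

df = pd.DataFrame({"value": [1, 5, 2, 7, 3]})

# Instead of iterating over rows and appending the ones that pass a check,
# express the check as a boolean mask and select in a single vectorised step.
kept = df[df["value"] > 3].reset_index(drop=True)

assert kept["value"].tolist() == [5, 7]
```

This does one allocation for the result instead of one per accepted row.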
Not every condition and not every piece of logic is readable as such a single-line expression. For people who, like me, just want to get rid of the warnings:

```python
import pandas as pd

def pandas_append(df, row, ignore_index=False):
    if isinstance(row, pd.DataFrame):
        result = pd.concat([df, row], ignore_index=ignore_index)
    elif isinstance(row, pd.Series):
        result = pd.concat([df, row.to_frame().T], ignore_index=ignore_index)
    elif isinstance(row, dict):
        result = pd.concat([df, pd.DataFrame(row, index=[0], columns=df.columns)])
    else:
        raise RuntimeError("pandas_append: unsupported row type - {}".format(type(row)))
    return result
```
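The three branches of that helper correspond to three direct `pd.concat` spellings, which can be checked against each other; a minimal sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({"A": [1], "B": [2]})

# DataFrame input: concat directly.
out1 = pd.concat([df, pd.DataFrame({"A": [3], "B": [4]})], ignore_index=True)

# Series input: transpose into a one-row frame first.
out2 = pd.concat([df, pd.Series({"A": 3, "B": 4}).to_frame().T], ignore_index=True)

# dict input: wrap in a one-row DataFrame.
out3 = pd.concat([df, pd.DataFrame({"A": 3, "B": 4}, index=[0])], ignore_index=True)

assert out1.equals(out2) and out2.equals(out3)
```

All three produce the same two-row result, which is why a single helper can hide the shape differences.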
Here is a use case for `append`. I have a data frame with numeric values, such as

```python
df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
```

and I append a single row with all the column sums:

```python
totals = df.sum()
totals.name = 'totals'
df_append = df.append(totals)
```

Simple enough. Now, using concat instead:

```python
df_concat_bad = pd.concat([df, totals])
```

which produces something different. Apparently, with `concat`, the Series is treated as a new column rather than a new row. Fortunately, in a comment above, the implementation of `append` in terms of `concat` was given:

```python
df_concat_good = pd.concat([df, totals.to_frame().T])
```

which yields the desired result.
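Putting that use case together as one runnable sketch (the `.loc` enlargement at the end is an additional alternative, not something proposed in the comment above):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
totals = df.sum()
totals.name = 'totals'

# Transposing the sums into a one-row frame makes concat append a row
# labelled 'totals' (the Series name becomes the row label).
df_good = pd.concat([df, totals.to_frame().T])

assert df_good.loc['totals', 'A'] == 4
assert df_good.loc['totals', 'B'] == 6

# An alternative that avoids concat entirely: enlargement via .loc
df2 = df.copy()
df2.loc['totals'] = df2.sum()
assert df2.loc['totals'].tolist() == [4, 6]
```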
I think users need to be aware of such subtleties. I also posted this on StackOverflow.
This was brought up in #35407 (comment) , and some other comments in this thread, and would/should be part of the transition docs (see #46825) |
Worst idea I've seen. Why complicate something so easy? I think it's better to have more options/ways to do something than just one strict way. DataFrame.append() was a very easy way for newbies to add data to a dataframe.
"[...] around the 10th most visited page in our API docs" and they go ahead and deprecate it. |
This seems to be decided, but in the future I would argue against doing these sorts of things to improve users' code (and requesting proof of why they can't use pd.concat when they disagree). If it improves maintainability, or makes things easier for devs, go for it. But if something is popular and not "correct", let people do what they want to do. The only valid point I've seen here is for removing the 'inplace' argument; everything else resembles nannying.
Thanks all for your comments. This is becoming draining: some comments are off-topic, no new arguments are being presented, and some are not particularly respectful. Locking for now, then. If anyone has any new arguments and wants to make them in a respectful manner, there are no objections to opening a new issue.

It's understandable that some people are unhappy with this decision and have to rewrite some code, but for newbies, getting them to write their code in a better way to begin with will be better for them in the long run. If the docs on how to use concat are unclear, pull requests are welcome.
I think that we should deprecate `Series.append` and `DataFrame.append`. They're making an analogy to `list.append`, but it's a poor analogy since the behavior isn't (and can't be) in place. The data for the index and values needs to be copied to create the result.

These are also apparently popular methods. `DataFrame.append` is around the 10th most visited page in our API docs.

Unless I'm mistaken, users are always better off building up a list of values and passing them to the constructor, or building up a list of NDFrames followed by a single `concat`.
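The two recommended patterns from the paragraph above can be sketched as follows (the records and column names are invented for illustration):

```python
import pandas as pd

# Pattern 1: build up a list of plain records, then one constructor call.
records = [{"A": i, "B": i * 2} for i in range(3)]
df = pd.DataFrame(records)

# Pattern 2: build up a list of NDFrames, then a single concat.
pieces = [pd.DataFrame({"A": [i], "B": [i * 2]}) for i in range(3)]
df2 = pd.concat(pieces, ignore_index=True)

assert df.equals(df2)
```

Either way, the copy into the final frame happens exactly once, which is the efficiency argument behind the deprecation.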