-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLN: dont consolidate in reshape.concat #34683
CLN: dont consolidate in reshape.concat #34683
Conversation
If you want to run benchmarks, I think you will need to run a dedicated benchmark where you create some non-consolidated data, as an impacted case is not necessarily covered by ASV. |
if we don’t have asvs or tests then there is nothing to do and the change is fine. If you want to create and push an asv great but we cannot benchmark to some hypothetical asv. |
Sorry, but I completely disagree with that statement. Regularly on PRs that might impact performance, we ask PR authors to ensure there is a benchmark for the case they are changing. If there isn't yet one, we add one. We very well know that are benchmark coverage is not complete. For a general check, we can of course only rely on existing benchmarks, but for a PR that targets a very specific use case, we can test that specifically or add a benchmark. |
my point which you missed that we already have tons of asvs for functions which hit this but you don’t want to merge because we might impact some hypothetical which we can’t even right down? |
I am not sure we have any asv's specifically for this (it might be, but can you then point to one?). I don't follow which hypothetical case you are talking about. You mean a non-consolidated dataframe? That's not hypothetical (eg add a column to a DataFrame, and you have a not-yet-consolidated dataframe). Why would we otherwise have those Maybe in my original comment the "non-consolidated data" was a bit confusing. I am not talking about a hypothetical non-consolidating BlockManager, but about an actual, current DataFrame that is not consolidated at the moment when calling |
Something like In [21]: df = pd.DataFrame(index=list(range(100)))
In [22]: df1 = pd.DataFrame(index=list(range(100)))
In [23]: df2 = pd.DataFrame(index=list(range(100)))
In [24]: for i in range(10):
...: df1[i] = np.random.randn(len(df))
...: df2[i] = np.random.randn(len(df))
...:
In [25]: pd.concat([df1, df2]) Do we currently consolidate always in |
Indeed, something like that. BTW, with my original comment, I didn't mean to ask for much. Just a small example like the one Tom showed (but maybe a bit larger data), and do a And quite probably, such a timing would show there isn't much impact (or maybe even faster without consolidating since it's a copy less?), but if we don't check that, we don't know. Such a small check is IMO a minimum for PRs like this. |
|
…n-consolidate-concat
hmm the 933x one makes zero sense, since it doesn't go through the affected code path. Checking with timeit shows much smaller change, so writing that asv result off as incorrect |
Using the example Tom suggested, this is appreciably slower:
I expected the result in [22], am surprised by [23], will have to take a look at what is driving the results. We need to decide how much (if any) perf penalty we're willing to accept to avoid side-effects. |
Looks like the call to
PR:
master
|
@jorisvandenbossche do you have a workaround in mind for how the all-1D case would avoid this perf hit? if so, could that workaround be applicable here? |
@jbrockmendel just to get a sense for order of magnitudes here: is the change in workloads roughly equivalent to the following? # consolidated: one (100, 10) array -> (200, 10) array
In [15]: a = np.ones((100, 10))
In [16]: %timeit np.concatenate([a, a])
1.86 µs ± 23.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# non-consolidated: ten (100,) arrays -> ten (200,) arrays
In [17]: bs = [np.ones(100,) for _ in range(10)]
In [18]: %timeit [np.concatenate([b, b]) for b in bs]
15.9 µs ± 183 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) |
@TomAugspurger not sure which measurement your example is supposed to be analogous to. The slowdown (8.5x) is pretty similar to the slowdown mentioned above inside |
@TomAugspurger if you use slightly larger data, the python overhead (of the for loop + multiple function calls) decreases quickly:
|
Just comparing the general idea of "concat this 2D array" vs. "concat these many 1D arrays" to get a sense for how things perform. The raw number of rows and the relative number of rows to columns does matter to an extent. import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from timeit import default_timer as tic
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
n_rows = [10, 100, 1000, 10_000, 100_000]
n_cols = [10, 100, 1000]
timings = []
for n in n_rows:
for c in n_cols:
for i in range(3):
a = np.ones((n, c))
bs = [np.ones(n,) for _ in range(c)]
t0 = tic()
np.concatenate([a, a])
t1 = tic()
[np.concatenate([b, b]) for b in bs]
t2 = tic()
timings.append((n, c, i, 'consolidated', t1 - t0))
timings.append((n, c, i, 'split', t2 - t1))
df = pd.DataFrame(timings, columns=['n_rows', 'n_cols', 'trial', 'policy', 'time'])
df
g = sns.FacetGrid(df, col="n_cols", hue="policy", )
g.map(
sns.lineplot, 'n_rows', "time",
)
g.set(xscale="log", yscale="log")
g.add_legend() |
So I definitely need to add an asv for this (based on #34683 (comment)). Aside from that, we need to do one of three things:
|
@jorisvandenbossche in the ArrayManager PR you said that the code snippet here performed better than in master. Any insight into improving the performance here? |
…n-consolidate-concat
Looks like in concatenate_block_managers it is going through concat_compat and making more copies Edit: nope, that doesnt explain it... OK: within concatenate_block_managers we do two things per-block: JoinUnit.is_na, and concat_compat. It isn't clear to me why the ArrayManager version would be able to avoid these. |
…n-consolidate-concat
how's the asv's on this now? |
…n-consolidate-concat
Just pushed, this is now slightly faster than master on the benchmark above #34683 (comment) |
…n-consolidate-concat
…n-consolidate-concat
…n-consolidate-concat
…n-consolidate-concat
@cache_readonly | ||
def _get_concat_axis(self) -> Index: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is property, then maybe _concat_axis
, without get
would be better?
…n-consolidate-concat
…n-consolidate-concat
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
…n-consolidate-concat
rebased+green |
…n-consolidate-concat
pandas/core/internals/blocks.py
Outdated
|
||
cls: Type[Block] | ||
|
||
if is_sparse(dtype): | ||
# Need this first(ish) so that Sparse[datetime] is sparse | ||
cls = ExtensionBlock | ||
elif is_categorical_dtype(values.dtype): | ||
elif dtype.name == "category": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not is Categorical
? e.g. since we are removing comparison vs 'category' generally (in your other PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could do isinstance(dtype, CategoricalDtype)
. either way is fine by me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think prefer that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated+green
…n-consolidate-concat
…n-consolidate-concat
thanks |
Looks like this consolidation was added here 3b1c5b7 in 2012, no clear reason why it is needed. About to start an asv run.