Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

De-duplicate dispatch code, remove unreachable branches #22068

Merged
merged 10 commits into from
Aug 8, 2018

Conversation

jbrockmendel
Copy link
Member

There are a couple of DataFrame methods that operate column-wise, dispatching to Series implementations. This de-duplicates that code by implementing ops.dispatch_to_series. Importantly, this function is going to be used a few more times as we move towards getting rid of BlockManager.eval and Block.eval, so putting it in place now makes for a more focused diff later.

Also removes a couple of no-longer-reachable cases in comparison ops.

  • passes git diff upstream/master -u -- "*.py" | flake8 --diff

@codecov
Copy link

codecov bot commented Jul 26, 2018

Codecov Report

Merging #22068 into master will increase coverage by <.01%.
The diff coverage is 95%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22068      +/-   ##
==========================================
+ Coverage   92.06%   92.06%   +<.01%     
==========================================
  Files         170      170              
  Lines       50720    50715       -5     
==========================================
- Hits        46694    46691       -3     
+ Misses       4026     4024       -2
Flag Coverage Δ
#multiple 90.47% <95%> (ø) ⬆️
#single 42.3% <20%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/frame.py 97.24% <100%> (-0.02%) ⬇️
pandas/core/ops.py 96.46% <92.3%> (+0.31%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 647f3f0...c36e672. Read the comment docs.

@gfyoung gfyoung added Refactor Internal refactoring of code Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Jul 27, 2018
@gfyoung gfyoung requested a review from jreback July 27, 2018 04:35
@@ -1114,6 +1114,7 @@ def na_op(x, y):
result[mask] = op(x[mask], com.values_from_object(y[mask]))
else:
assert isinstance(x, np.ndarray)
assert lib.is_scalar(y)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prob should just import at the top

not is_scalar(other))):
(is_extension_array_dtype(other) and not is_scalar(other))):
# Note: the `not is_scalar(other)` condition rules out
# e.g. other == "category"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the issue here is we need to be able to distinguish between an actual dtype comparison and a real comparison, e.g.
df.dtypes == 'category' (or df.dtypes == 'Int8' .

this is pretty thorny, e.g. how do you know when to convert a scalar string in a comparison op to an actual dtype for comparisons

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yah. A while ago there was a check is_categorical_dtype(y) and not is_scalar(y) and it took me a while to figure out that the is_scalar part was specifically to avoid letting "category" through, so I've gotten in the habit of adding this comment for future readers.

@jreback
Copy link
Contributor

jreback commented Jul 28, 2018

pls rebase

@jreback
Copy link
Contributor

jreback commented Aug 1, 2018

looks fine. can you run an asv dataframe ops (or a subset of them) to ensure that perf is not materially different. and rebase on master (if its not already)

@jbrockmendel
Copy link
Member Author

and rebase on master (if its not already)

Sure

can you run an asv dataframe ops (or a subset of them) to ensure that perf is not materially different

Will run, but there's nothing here that should affect anything.

@jbrockmendel
Copy link
Member Author

Looks like frame-frame comparisons are non-trivially slower due to numexpr being used less effectively. Specifically ATM dispatch-to-Series is done inside numexpr, while in the PR the dispatch is done outside numexpr.

asv continuous -f 1.1 -E virtualenv master HEAD -b dataframe -b DataFrame -b ops
[...]
       before           after         ratio
     [9c118668]       [94f168a8]
+      5.69±0.1ms         153±20ms    26.86  binary_ops.Ops.time_frame_comparison(True, 'default')
+      4.03±0.2ms         97.5±3ms    24.19  binary_ops.Ops.time_frame_comparison(True, 1)
+      17.8±0.2ms       27.3±0.3ms     1.54  join_merge.Join.time_join_dataframe_index_multi(False)
+      14.5±0.1ms         19.5±6ms     1.34  join_merge.Merge.time_merge_dataframe_integer_2key(True)
-        70.7±2ms         54.3±4ms     0.77  binary_ops.Ops.time_frame_comparison(False, 1)
-        72.7±4ms         51.0±2ms     0.70  binary_ops.Ops.time_frame_comparison(False, 'default')

For the time being I'll revert that part of the PR.

@jbrockmendel
Copy link
Member Author

After reverting the numexpr thing:

taskset 5 asv continuous -f 1.1 -E virtualenv master HEAD -b time_frame_comparison
[...]
       before           after         ratio
     [9c118668]       [c36e6729]
-        66.6±3ms         48.7±2ms     0.73  binary_ops.Ops.time_frame_comparison(False, 'default')
-        71.5±5ms         49.1±2ms     0.69  binary_ops.Ops.time_frame_comparison(False, 1)

taskset 5 asv continuous -f 1.1 -E virtualenv master HEAD -b time_frame_comparison
[...]
       before           after         ratio
     [9c118668]       [c36e6729]
-        66.1±2ms         49.9±2ms     0.76  binary_ops.Ops.time_frame_comparison(False, 'default')

taskset 5 asv continuous -f 1.1 -E virtualenv master HEAD -b time_frame_comparison
[...]
       before           after         ratio
     [9c118668]       [c36e6729]
-        70.8±3ms         48.0±2ms     0.68  binary_ops.Ops.time_frame_comparison(False, 'default')
-        77.5±6ms         50.4±4ms     0.65  binary_ops.Ops.time_frame_comparison(False, 1)

@jreback
Copy link
Contributor

jreback commented Aug 8, 2018

see good thing we look at perf even when it shouldn't affect things :-D

@jreback jreback merged commit 81f386c into pandas-dev:master Aug 8, 2018
@jreback
Copy link
Contributor

jreback commented Aug 8, 2018

thanks.

@jbrockmendel jbrockmendel deleted the dispatch branch August 8, 2018 15:50
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants