BENCH: add some cases for join and merge ops from pandas #5021

Conversation
@YarShev Having the rest of the benchmarks is ok for me.
@@ -0,0 +1,261 @@
import string
We already have join, merge, and concat benchmarks in place. Let's try to add at most 1-2 runs in each case. @YarShev
@anmyachev I think it's your call :)
I am in favor of leaving things as is; the functions you have named are among the most important. A little redundancy can be dealt with later.
Now the main thing is to integrate these benchmarks into the Modin system. We need to do a couple of things for this:
- add test runs of these benchmarks to CI
- add runs of these benchmarks to our main benchmarking configuration
- use the functionality from TEST-#5014: Simplify adding new ASV benchmarks #5015 so that the code can work in two modes
- so far there is no automatic mechanism that triggers all computations so that we get correct timings, which means this will have to be done manually (`BenchmarkMode` does not work for OmniSci and, besides, it can slow down the `setup` function)
- in addition, we need to settle the dataset dimensions. With dimensions that are too small, where pandas can execute a function in a few tens of milliseconds, Modin is unlikely to overtake pandas. However, if you increase the dataset, you may run out of memory on the machine that runs these validation tests in CI. Because of this, the already added benchmarks have built-in functionality that lets you change the data size depending on the situation, but it is still manual: we need to pick the sizes for these two cases by hand (a rough sketch of such a size-configurable benchmark follows below).
I am in favor of leaving things as is; the functions you have named are among the most important. A little redundancy can be dealt with later.
By "leaving as is" do you mean the current asv testing or the contents of this PR? :) I agree that these functions are very important, but it could be too heavy for our CI to run all the cases from this PR.
I mean the content of this PR.
Okay, let's try it and see how it goes, but I'm still not sure we want (at least for now) the `# outer join of non-unique` case and all the cases from `Merge` (e.g. empty and cross).
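For reference, a rough standalone illustration of the kinds of cases in question (an outer join on a non-unique index and a cross merge); the data here is purely illustrative:

```python
import pandas as pd

# Outer join on non-unique (repeated) index labels: matching labels
# expand the result, which makes this case comparatively heavy.
left = pd.DataFrame({"a": range(6)}, index=[0, 0, 1, 1, 2, 2])  # non-unique index
right = pd.DataFrame({"b": range(3)}, index=[0, 1, 3])
outer_joined = left.join(right, how="outer")

# Cross merge: every row of `left` is paired with every row of `right`.
crossed = left.reset_index(drop=True).merge(
    right.reset_index(drop=True), how="cross"
)
```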
Now the main thing is to integrate these benchmarks into the Modin system
@anmyachev, could you help @jbrockmendel start doing this?
@YarShev ok I'll take care of it
        self.temp = Series(1.0, index)[self.fracofday.index]

    def time_join_non_unique_equal(self):
        self.fracofday * self.temp
What is tested here?
Alignment and multiplication.
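A small standalone illustration (not the benchmark's actual data) of why that line exercises alignment: when two Series carry different indexes, pandas must align the labels before the element-wise multiplication runs:

```python
import pandas as pd

# The indexes only partially overlap, so the product is computed on the
# union of labels, with NaN wherever one side has no value.
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0], index=["b", "d"])
result = s1 * s2  # index: ['a', 'b', 'c', 'd']; only 'b' yields a real product
```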
I suggest starting integration of the test suite into our asv benchmark system (link).
2ba8fef to a843699
Codecov Report
@@            Coverage Diff             @@
##           master    #5021      +/-   ##
==========================================
- Coverage   84.57%   84.52%    -0.05%
==========================================
  Files         256      257        +1
  Lines       19347    19629      +282
==========================================
+ Hits        16362    16592      +230
- Misses       2985     3037       +52
I don't know how to deal with the license header that we add to each of our files, since this code is taken directly from pandas without modification.
Good question, let's find out.
I suggest renaming the folder from …
Do we want to keep both our current benchmarks and the ported pandas ones? Maybe we should keep only one?
@@ -0,0 +1,165 @@
import numpy as np
Let's integrate these benchmarks into our system so we can see the difference and the newly added cases.
as for concat
ci / lint (pydocstyle) failed, please take a look.
@YarShev I started creating separate pull requests for each case added.
Please ping me when the changes are ready for review.
@YarShev ready for review
still failing
@YarShev ready for review
…andas (Signed-off-by: Myachev <anatoly.myachev@intel.com>)
join and merge ops from pandas (Signed-off-by: Myachev <anatoly.myachev@intel.com>)
@YarShev ready for review
@@ -127,12 +128,56 @@ def time_join(self, shapes, how, sort):
        execute(self.df1.join(self.df2, how=how, lsuffix="left_", sort=sort))


class TimeJoinStringIndex:
    param_names = ["shapes", "sort"]
We likely want to benchmark `left` and `inner` values for the `how` parameter, don't we?
We already have benchmarks for that parameter. I don't think we need another benchmark for this, as it looks like a duplication to me.
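For completeness, a hypothetical sketch of what parametrizing `how` in this benchmark could look like (the class name, shapes, and helper-free setup below are assumptions for illustration, not the suite's actual code):

```python
import numpy as np
import pandas as pd


class TimeJoinStringIndexWithHow:
    # Hypothetical variant of the class in the diff above, with `how`
    # added to the ASV parameter grid.
    param_names = ["shapes", "how", "sort"]
    params = [
        [(10_000, 10)],        # illustrative shape only
        ["left", "inner"],
        [False, True],
    ]

    def setup(self, shapes, how, sort):
        rows, cols = shapes
        index = pd.Index([f"i-{i}" for i in range(rows)], dtype=object)
        self.df = pd.DataFrame(
            np.random.randn(rows, cols),
            index=index,
            columns=[f"col{i}" for i in range(cols)],
        )
        self.other = pd.DataFrame({"value": np.random.randn(rows)}, index=index)

    def time_join_string_index(self, shapes, how, sort):
        self.df.join(self.other, how=how, lsuffix="left_", sort=sort)
```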
Signed-off-by: Myachev <anatoly.myachev@intel.com>
Broken off from #4988
Closes #5111