PERF: changed default value of cache parameter to True in to_datetime function #26043

Merged: 22 commits into pandas-dev:master from the to_datetime_cache_true branch, Jul 4, 2019

anmyachev
Contributor

@anmyachev anmyachev commented Apr 10, 2019

  • closes #N/A
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@codecov

codecov bot commented Apr 10, 2019

Codecov Report

Merging #26043 into master will decrease coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #26043      +/-   ##
==========================================
- Coverage    91.9%   91.89%   -0.01%     
==========================================
  Files         175      175              
  Lines       52485    52485              
==========================================
- Hits        48235    48230       -5     
- Misses       4250     4255       +5
| Flag | Coverage Δ |
|---|---|
| #multiple | 90.45% <ø> (ø) ⬆️ |
| #single | 40.78% <ø> (-0.1%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/core/tools/datetimes.py | 84.59% <ø> (ø) ⬆️ |
| pandas/io/gbq.py | 75% <0%> (-12.5%) ⬇️ |
| pandas/core/frame.py | 96.79% <0%> (-0.12%) ⬇️ |
| pandas/util/testing.py | 90.62% <0%> (-0.11%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6d9b702...317267c.

@codecov

codecov bot commented Apr 10, 2019

Codecov Report

Merging #26043 into master will increase coverage by 0.07%.
The diff coverage is 95.45%.

@@            Coverage Diff             @@
##           master   #26043      +/-   ##
==========================================
+ Coverage   91.79%   91.86%   +0.07%     
==========================================
  Files         180      179       -1     
  Lines       50934    50728     -206     
==========================================
- Hits        46753    46600     -153     
+ Misses       4181     4128      -53
| Flag | Coverage Δ |
|---|---|
| #multiple | 90.45% <95.45%> (-0.03%) ⬇️ |
| #single | 41.1% <50%> (-0.94%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/core/tools/datetimes.py | 85.67% <95.45%> (+0.61%) ⬆️ |
| pandas/tseries/converter.py | 0% <0%> (-100%) ⬇️ |
| pandas/core/groupby/base.py | 91.83% <0%> (-8.17%) ⬇️ |
| pandas/plotting/_misc.py | 59.49% <0%> (-5.38%) ⬇️ |
| pandas/plotting/_matplotlib/converter.py | 58.43% <0%> (-5.24%) ⬇️ |
| pandas/io/excel/_openpyxl.py | 84.71% <0%> (-3.23%) ⬇️ |
| pandas/core/config_init.py | 92.91% <0%> (-3.17%) ⬇️ |
| pandas/io/formats/printing.py | 85.56% <0%> (-1.65%) ⬇️ |
| pandas/core/internals/managers.py | 95.21% <0%> (-0.94%) ⬇️ |
| pandas/core/internals/construction.py | 95.93% <0%> (-0.82%) ⬇️ |

... and 80 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7ab9ff5...2b1fa85.

@anmyachev
Contributor Author

@jreback apparently I misunderstood the cause of the CI build failures last time (#25990 (comment)); the problem was a different one, possibly in the master commit I branched from.

ASV results will follow later.

@mroeschke
Member

One downside of changing the default to True is that this edge-case bug will be more visible to users: #22305

Additionally, can you show ASVs/performance benchmarks for converting a small number of arguments? I suspect there will be a performance hit with a small number of arguments and cache=True. I'm curious whether there is an impact in this case, and how significant it is.

@gfyoung added the API Design, Performance, and Datetime labels on Apr 11, 2019
@anmyachev
Contributor Author

For an overall picture of the situation:
asv continuous -f 1.05 origin/master to_datetime_cache_true -a warmup_time=1 -a sample_time=1:

| master | patch | ratio | test_name |
|---|---|---|---|
| 1.81±0.01s | failed | n/a | timeseries.ToDatetimeNONISO8601.time_different_offset |
| 489±2μs | 2.33±0.01ms | 4.77 | timeseries.TimeDatetimeConverter.time_convert |
| 4.52±0.02ms | 9.22±0.2ms | 2.04 | reshape.Cut.time_cut_datetime(4) |
| 5.16±0.03ms | 10.0±0.07ms | 1.94 | reshape.Cut.time_cut_datetime(10) |
| 18.9±0.1ms | 31.6±0.4ms | 1.67 | plotting.TimeseriesPlotting.time_plot_regular_compat |
| 19.8±0.1ms | 32.7±0.1ms | 1.65 | plotting.TimeseriesPlotting.time_plot_irregular |
| 2.23±0.02ms | 3.46±0.02ms | 1.55 | timeseries.ToDatetimeISO8601.time_iso8601 |
| 2.25±0.02ms | 3.47±0.02ms | 1.54 | timeseries.ToDatetimeISO8601.time_iso8601_format |
| 2.22±0.01ms | 3.42±0.02ms | 1.54 | timeseries.ToDatetimeISO8601.time_iso8601_nosep |
| 2.26±0.01ms | 3.48±0.02ms | 1.54 | timeseries.ToDatetimeISO8601.time_iso8601_format_no_sep |
| 1.58±0.01ms | 2.34±0.01ms | 1.49 | timeseries.ResampleDatetetime64.time_resample |
| 11.3±0.09ms | 16.5±0.6ms | 1.46 | reshape.Cut.time_qcut_datetime(10) |
| 10.6±0.05ms | 15.4±0.2ms | 1.46 | reshape.Cut.time_qcut_datetime(4) |
| 1.47±0.01ms | 1.98±0.01ms | 1.35 | timeseries.ResampleDataFrame.time_method('min') |
| 1.46±0.01ms | 1.96±0ms | 1.34 | timeseries.ResampleDataFrame.time_method('max') |
| 1.44±0.01ms | 1.90±0.02ms | 1.32 | io.csv.ReadCSVParseDates.time_baseline |
| 1.73±0.01ms | 2.24±0.02ms | 1.30 | io.csv.ReadCSVParseDates.time_multiple_date |
| 24.6±0.2ms | 29.9±0.2ms | 1.21 | reshape.Cut.time_cut_datetime(1000) |
| 947±4μs | 1.10±0ms | 1.16 | timeseries.ResampleDataFrame.time_method('mean') |
| 1.78±0.01ms | 2.03±0.02ms | 1.15 | groupby.Datelike.time_sum('date_range') |
| 3.59±0.02ms | 4.08±0.04ms | 1.14 | timeseries.ToDatetimeYYYYMMDD.time_format_YYYYMMDD |
| 43.3±0.5ms | 48.5±0.4ms | 1.12 | reshape.Cut.time_qcut_datetime(1000) |
| 1.34±0.02ms | 1.46±0.01ms | 1.09 | io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(False, 'ymd') |
| 9.02±0.1ms | 9.79±0.07ms | 1.09 | io.sql.ReadSQLTable.time_read_sql_table_parse_dates |
| 1.41±0.03ms | 1.53±0ms | 1.08 | io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(False, 'iso8601') |
| 1.73±0.02ms | 1.88±0.01ms | 1.08 | io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(True, 'ymd') |
| 1.68±0.01ms | 1.82±0.01ms | 1.08 | io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(True, 'iso8601') |
| 2.34±0.02ms | 2.51±0.05ms | 1.08 | rolling.VariableWindowMethods.time_rolling('DataFrame', '1h', 'float', 'kurt') |
| 1.44±0ms | 1.54±0.01ms | 1.07 | algorithms.Hashing.time_series_int |
| 92.4±2ms | 98.8±0.6ms | 1.07 | binary_ops.Ops.time_frame_multi_and(False, 1) |
| 456±3μs | 484±4μs | 1.06 | frame_methods.Isnull.time_isnull_floats_no_null |
| 1.44±0.01ms | 1.53±0ms | 1.06 | algorithms.Hashing.time_series_timedeltas |
| 719±7ns | 761±10ns | 1.06 | period.PeriodProperties.time_property('min', 'hour') |
| 5.85±0.2ms | 6.17±0.07ms | 1.06 | frame_methods.Equals.time_frame_float_unequal |
| 1.45±0ms | 1.52±0.01ms | 1.05 | algorithms.Hashing.time_series_float |
| 3.13±0.04ms | 3.30±0.02ms | 1.05 | rolling.VariableWindowMethods.time_rolling('Series', '1h', 'float', 'min') |
| 713±4ns | 749±20ns | 1.05 | period.PeriodProperties.time_property('min', 'quarter') |
| 141±0.6ms | 134±2ms | 0.95 | strings.Methods.time_rpartition |
| 105±0.7μs | 99.8±0.6μs | 0.95 | indexing.NonNumericSeriesIndexing.time_getitem_pos_slice('datetime', 'unique_monotonic_inc') |
| 3.23±0.05μs | 3.04±0.04μs | 0.94 | offset.OnOffset.time_on_offset(<YearBegin: month=1>) |
| 1.74±0.01ms | 1.64±0.01ms | 0.94 | rolling.Methods.time_rolling('DataFrame', 1000, 'float', 'skew') |
| 3.59±0.02μs | 3.37±0.1μs | 0.94 | timedelta.TimedeltaConstructor.time_from_np_timedelta |
| 231±6ns | 216±2ns | 0.93 | timestamp.TimestampProperties.time_dayofweek(tzutc(), None) |
| 24.0±2ms | 20.7±0.2ms | 0.86 | indexing.NonNumericSeriesIndexing.time_getitem_list_like('string', 'nonunique_monotonic_inc') |
| 318±5ms | 187±4ms | 0.59 | io.stata.StataMissing.time_read_stata('tw') |
| 283±5ms | 162±3ms | 0.57 | io.stata.StataMissing.time_write_stata('tw') |
| 326±6ms | 184±4ms | 0.57 | io.stata.StataMissing.time_read_stata('ty') |
| 338±5ms | 190±6ms | 0.56 | io.stata.StataMissing.time_read_stata('th') |
| 337±5ms | 187±5ms | 0.55 | io.stata.StataMissing.time_read_stata('tm') |
| 331±8ms | 181±5ms | 0.55 | io.stata.StataMissing.time_read_stata('tq') |
| 264±10ms | 129±3ms | 0.49 | io.stata.Stata.time_write_stata('tw') |
| 220±5ms | 97.0±2ms | 0.44 | io.stata.Stata.time_read_stata('tw') |
| 242±5ms | 99.4±3ms | 0.41 | io.stata.Stata.time_read_stata('th') |
| 242±1ms | 98.1±3ms | 0.41 | io.stata.Stata.time_read_stata('tq') |
| 238±4ms | 96.0±2ms | 0.40 | io.stata.Stata.time_read_stata('tm') |
| 239±6ms | 91.4±2ms | 0.38 | io.stata.Stata.time_read_stata('ty') |
| 336±10ms | 17.0±0.1ms | 0.05 | timeseries.ToDatetimeFormat.time_exact |
| 324±3ms | 16.0±0.3ms | 0.05 | timeseries.ToDatetimeFormat.time_no_exact |
| 111±4ms | 3.60±0.03ms | 0.03 | timeseries.ToDatetimeFormatQuarters.time_infer_quarter |
| 931±4ms | 2.56±0.01ms | 0.00 | timeseries.ToDatetimeNONISO8601.time_same_offset |

@mroeschke I'll see what can be done with #22305.

It seems I ran into a different kind of error. Has it been reported before?

By a small number of arguments, do you mean 10 to 100?

@mroeschke
Member

What other error did you run into? And sure, around 50 arguments with no duplicates to parse.

@anmyachev
Contributor Author

I pushed the wrong branch, so I will force-push (push --force). Sorry.

@anmyachev anmyachev force-pushed the to_datetime_cache_true branch from 29530f4 to 317267c Compare April 13, 2019 16:04
@anmyachev
Contributor Author

@mroeschke I have provided a workaround for #22305 in PR #26078. Could you take a look?

@anmyachev
Contributor Author

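First of all, the asv benchmark:

from pandas import date_range, to_datetime


class ToDatetimeCacheSmallCount(object):

    # asv runs the time_* method once for each value in params.
    params = [True, False]
    param_names = ['cache']

    def setup(self, cache):
        # 50 unique date strings: no duplicates, so a cache cannot save any parsing work.
        N = 50
        rng = date_range(start='1/1/2000', periods=N)
        self.unique_date_strings = rng.strftime('%Y-%m-%d').tolist()

    def time_unique_date_strings(self, cache):
        to_datetime(self.unique_date_strings, cache=cache)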

asv run -E existing -b ^timeseries.ToDatetimeCacheSmallCount -a warmup_time=1 -a sample_time=3:

| cache | test_time |
|---|---|
| True | 501±20μs |
| False | 335±0.9μs |

I also ran the test with 5000 elements, to be more confident in the numbers. Slowdown with cache=True relative to cache=False:

| size | time increase |
|---|---|
| 50 | 50% |
| 5000 | 48% |
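To illustrate why all-unique input gets slower with caching, here is a minimal standard-library sketch. It assumes the cache works by parsing each distinct string once and mapping the results back, which is a simplification and not pandas' actual implementation. The helper names `parse_all` and `parse_with_cache` are hypothetical.

```python
from datetime import datetime

FMT = "%Y-%m-%d"


def parse_all(strings):
    """Parse every element independently (loosely analogous to cache=False)."""
    return [datetime.strptime(s, FMT) for s in strings]


def parse_with_cache(strings):
    """Parse each distinct string once, then look results up (loosely analogous to cache=True)."""
    cache = {s: datetime.strptime(s, FMT) for s in set(strings)}
    return [cache[s] for s in strings]


unique_input = [f"2000-01-{day:02d}" for day in range(1, 29)]
repeated_input = ["2000-01-01", "2000-01-02"] * 14

# Both strategies give the same answer; only the amount of parsing differs.
assert parse_with_cache(unique_input) == parse_all(unique_input)
assert parse_with_cache(repeated_input) == parse_all(repeated_input)

# Number of real parses the cached version performs for 28 inputs each.
print(len(set(unique_input)), len(set(repeated_input)))  # 28 2
```

With no repeated values, the cached path still parses every string and additionally pays for building the mapping and the lookups, which is consistent with the overhead measured above.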

@anmyachev
Contributor Author

I do not yet know what the error is; I want to rebase on master first.

@anmyachev anmyachev force-pushed the to_datetime_cache_true branch from 317267c to d45c434 Compare April 15, 2019 09:56
@anmyachev
Contributor Author

When I run timeseries.ToDatetimeNONISO8601.time_different_offset, the following error appears:
ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True
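The benchmark name suggests that its input strings carry different UTC offsets; that is an assumption, since the benchmark source is not shown here. Under that assumption, this standard-library sketch (no pandas involved, sample strings invented for illustration) shows what such input looks like and what normalising it to UTC means:

```python
from datetime import datetime, timezone

# Hypothetical sample input: the same wall-clock time with two different UTC offsets.
raw = ["2019-04-15 10:00:00+0100", "2019-04-15 10:00:00-0300"]
parsed = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S%z") for s in raw]

# The values are tz-aware and do not share one offset.
offsets = {d.utcoffset() for d in parsed}
print(len(offsets))  # 2

# Converting everything to UTC yields a single, uniform timezone.
as_utc = [d.astimezone(timezone.utc) for d in parsed]
print([d.isoformat() for d in as_utc])
# ['2019-04-15T09:00:00+00:00', '2019-04-15T13:00:00+00:00']
```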

@jreback
Contributor

jreback commented May 12, 2019

can you merge master and update

@anmyachev anmyachev force-pushed the to_datetime_cache_true branch from d45c434 to 12eac47 Compare May 12, 2019 22:02
@jorisvandenbossche
Member

What is the performance impact for the (rather typical I think) case with all unique datetimes?

@anmyachev can you provide some timings for that? (or point to the benchmark result in one of your previous comments that represent that case)

@jorisvandenbossche
Member

See also comment of @TomAugspurger on the read_csv PR: #25990 (comment)

@anmyachev
Contributor Author

> What is the performance impact for the (rather typical I think) case with all unique datetimes?
>
> @anmyachev can you provide some timings for that? (or point to the benchmark result in one of your previous comments that represent that case)

@jorisvandenbossche results for the case with all unique datetimes (roughly a 2x slowdown with cache=True):

asv run -E existing -b ToDatetimeCacheSmallCount -a warmup_time=0.5 -a sample_time=1:

| cache | count=50 | count=500 | count=5000 | count=100000 |
|---|---|---|---|---|
| True | 574±3μs | 705±10μs | 1.74±0.01ms | 35.0±0.9ms |
| False | 319±2μs | 391±8μs | 977±0μs | 15.4±0.08ms |

Benchmark for asv:

from pandas import date_range, to_datetime


class ToDatetimeCacheSmallCount(object):

    # asv runs the benchmark for every (cache, count) combination.
    params = ([True, False], [50, 500, 5000, 100000])
    param_names = ['cache', 'count']

    def setup(self, cache, count):
        # `count` consecutive days, so every date string is unique.
        rng = date_range(start='1/1/1971', periods=count)
        self.unique_date_strings = rng.strftime('%Y-%m-%d').tolist()

    def time_unique_date_strings(self, cache, count):
        to_datetime(self.unique_date_strings, cache=cache)
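For a rough comparison without asv, the same measurement can be approximated with `timeit`. This is only a sketch: `time_to_datetime` is a hypothetical helper, it needs pandas installed, and the absolute numbers will vary with the machine and the pandas version.

```python
import timeit

import pandas as pd


def time_to_datetime(count, cache, repeat=3, number=5):
    """Best-of-`repeat` seconds per to_datetime call on `count` unique date strings."""
    rng = pd.date_range(start="1/1/1971", periods=count)
    strings = rng.strftime("%Y-%m-%d").tolist()
    timings = timeit.repeat(
        lambda: pd.to_datetime(strings, cache=cache), repeat=repeat, number=number
    )
    return min(timings) / number


for count in (50, 500, 5000):
    cached = time_to_datetime(count, cache=True)
    uncached = time_to_datetime(count, cache=False)
    print(f"count={count}: cache=True {cached:.6f}s, cache=False {uncached:.6f}s")

# Both settings should return identical results; only the speed differs.
sample = pd.date_range(start="1/1/1971", periods=100).strftime("%Y-%m-%d").tolist()
assert pd.to_datetime(sample, cache=True).equals(pd.to_datetime(sample, cache=False))
```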

Contributor

@jreback jreback left a comment

need a whatsnew note in the performance section

pandas/core/tools/datetimes.py (resolved)
@jreback
Contributor

jreback commented May 15, 2019

is this orthogonal to #26097

@vnlitvinov
Contributor

> is this orthogonal to #26097

Not sure... that PR is trying to address a bug that manifests when cache=True, so if default is changed then exposure for that bug would be higher. But still these could potentially be applied independently.

@anmyachev anmyachev force-pushed the to_datetime_cache_true branch from 12eac47 to 81f54c0 Compare May 17, 2019 12:43
@jreback jreback added this to the 0.25.0 milestone May 19, 2019
@jreback
Contributor

jreback commented May 19, 2019

small comment, pls merge master and ping on green.

pandas/core/tools/datetimes.py (outdated, resolved)
@anmyachev anmyachev force-pushed the to_datetime_cache_true branch from 5210268 to 98e18a8 Compare July 3, 2019 19:19
@jreback
Contributor

jreback commented Jul 3, 2019

lgtm. ping on green.

@jorisvandenbossche jorisvandenbossche mentioned this pull request Jul 3, 2019
@pep8speaks

Hello @anmyachev! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 332:80: E501 line too long (83 > 79 characters)

kwds['cache_dates'] = do_cache
read_csv(self.data(self.StringIO_input), header=None,
parse_dates=[0], **kwds)
try:
Contributor

@TomAugspurger is this an OK way of handling it?

Contributor

Seems fine.

Contributor

Although... I worry it would incorrectly catch a TypeError in the function? The other way might be to check pandas.__version__?
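One way to avoid swallowing an unrelated TypeError, besides the version check mentioned above, is to inspect the function signature before passing the keyword. This is a different technique from either option discussed in the thread, shown here only as a sketch. `old_reader`, `new_reader`, `accepts_keyword` and `call_reader` are hypothetical stand-ins, not pandas functions.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` declares a parameter called `name` or takes **kwargs."""
    params = inspect.signature(func).parameters
    if name in params:
        return True
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-ins for an older and a newer version of a reader function.
def old_reader(data, parse_dates=None):
    return data


def new_reader(data, parse_dates=None, cache_dates=True):
    if not isinstance(cache_dates, bool):
        # A genuine internal TypeError that a blanket `except TypeError` would hide.
        raise TypeError("cache_dates must be a bool")
    return data


def call_reader(reader, data, do_cache):
    """Pass cache_dates only when the reader supports it; real errors still propagate."""
    kwds = {}
    if accepts_keyword(reader, "cache_dates"):
        kwds["cache_dates"] = do_cache
    return reader(data, parse_dates=[0], **kwds)


print(call_reader(old_reader, "a,b", True))  # a,b
print(call_reader(new_reader, "a,b", True))  # a,b
```

Because the keyword is only added when the signature supports it, no `try`/`except TypeError` is needed around the call, so a TypeError raised inside the function is not mistaken for a missing parameter.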

Contributor

hmm, let me see what i can do

@jreback
Contributor

jreback commented Jul 3, 2019

patched an edge case that was showing up in the asvs

@jreback jreback merged commit ce567de into pandas-dev:master Jul 4, 2019
@jreback
Contributor

jreback commented Jul 4, 2019

thanks @anmyachev

Labels
API Design Datetime Datetime data dtype Performance Memory or execution speed performance
8 participants