Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor timezones functions out of tslib #17274

Closed
wants to merge 16 commits into from

Conversation

jbrockmendel
Copy link
Member

_libs.tslib is over 5k lines and is imported by a bunch of other modules including _libs.lib. It looks like it was pasted together from a bunch of older files. There are a handful of areas where significant chunks can be refactored out in complexity-reducing (and testability-increasing) ways. This is the first one: timezones. (Next up: parsing (and gathering parsing code dispersed across pandas))

timezones.pyx has no other dependencies within pandas, helping to de-tangle some of the _libs modules

Code in timezones is used in both _libs.tslib and _libs.period, and a bit in _libs.index.

This is one of several steps in making _libs.period not need to import from non-cython code and
ideally not need to import tslib (though NaT makes that tough). See existing comments
in _libs.__init__ on why this is desireable.

This is the first of several independent pieces to be split off of tslib.

There are also several notes on functions that appear to be unused and may be ready for removal.

Removes datetime_helper dependency from most of _libs, as it is somehow slower than a plain cython version. In cases where C can be replaced by idiomatic cython without hurting performance, I'm calling that a win.

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

timezones.pyx has no other dependencies within pandas, helping to de-tangle some of the _libs modules

Code im timezones is used in both tslib and period, and a bit in index.pyx

This is one of several steps in making _libs.period not need to import from non-cython code and
ideally not need to import tslib (though NaT makes that tough).  See existing comments
in _libs.__init__ on why this is desireable.

This is the first of several independent pieces to be split off of tslib.

Cleanup commented-out code
@pep8speaks
Copy link

pep8speaks commented Aug 17, 2017

Hello @jbrockmendel! Thanks for updating the PR.

  • In the file setup.py, following are the PEP8 issues :

Line 487:50: E502 the backslash is redundant between brackets
Line 493:51: E502 the backslash is redundant between brackets
Line 502:52: E502 the backslash is redundant between brackets

Comment last updated on August 28, 2017 at 17:29 Hours UTC

from libc.stdlib cimport free

from pandas import compat
from pandas.compat import PY2
from pandas.core.dtypes.generic import ABCDateOffset
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An upcoming PR will make a cython version of dtypes.generic. The isinstance checks are about 2x faster that way. It's a small difference, but nice to avoid calling into python for these things.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's keep perf changes separate. I would much prefer PR's that don't mix multiple changes all at once.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I just pushed a commit that reverted from ABCDateOffset back to offsets.DateOffset.

if isinstance(other, (timedelta, np.timedelta64,
offsets.Tick, offsets.DateOffset,
Timedelta)):
if isinstance(other, (timedelta, np.timedelta64, ABCDateOffset)):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timedelta subclasses timedelta and Tick subclasses DateOffset, so we can remove them from these checks.

@@ -3960,7 +3889,7 @@ for _maybe_method_name in dir(NaTType):
def f(*args, **kwargs):
raise ValueError("NaTType does not support " + func_name)
f.__name__ = func_name
f.__doc__ = _get_docstring(_method_name)
f.__doc__ = _get_docstring(func_name)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an unrelated bug. This block of code exists to make NaT raise ValueError for each of a bunch of datetime methods. This is supposed to attach the original docstring to the new function f. The typo _method_name instead of func_name means that a bunch of the docstrings are currently wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then pls don't change here. rather in another PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change. Apologies.

@gfyoung gfyoung added Clean Internals Related to non-user accessible pandas implementation labels Aug 17, 2017
@gfyoung
Copy link
Member

gfyoung commented Aug 17, 2017

Removes datetime_helper dependency from most of _libs, as it is somehow slower than a plain cython version.

Do you have any (empirical) timing information to support this statement?

@jbrockmendel
Copy link
Member Author

Do you have any (empirical) timing information to support this statement?

Brief discussion of profiling results here.

@jorisvandenbossche
Copy link
Member

Regarding the total_seonds in datetime_helper.h, a comment mentions there it is for 2.6 compat, so it can maybe be removed altogether? (without a cython replacement)

@jbrockmendel
Copy link
Member Author

[...] so it can maybe be removed altogether?

That would be excellent. If there's a concensus, I'll amend the PR.

@jorisvandenbossche
Copy link
Member

That would be excellent. If there's a concensus, I'll amend the PR.

You will need to check that it is actually possible, but I suppose it will. timedelta.total_seconds() was only added in python 2.7, but since we dropped 2.6 support long ago, we can just use the total_seconds method instead of our custom function.

@jbrockmendel
Copy link
Member Author

but since we dropped 2.6 support long ago, we can just use the total_seconds method instead of our custom function.

@jorisvandenbossche I'll give that a try and see how it goes. Is it possible/likely that this would also explain why _libs/src/datetime/np_datetime.c duplicates a bunch of stuff from numpy 1.7? i.e. at a time when pre-1.7 versions of numpy were supported?

@jbrockmendel
Copy link
Member Author

Getting rid of total_seconds (and tot_seconds) in all of the pyx files works. There is one call from objToJSON.c that someone else should take a look at.

@@ -0,0 +1,3 @@
#!/usr/bin/env python
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you adding this? this is NOT a package by definition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that it can be imported python-style. Or if you're referring to the shebang that's force of habit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes pls don't add shebangs anywhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally you should be adding __init__.pyx for cython packages (IIRC you can then still import)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally you should be adding init.pyx for cython packages (IIRC you can then still import)

Just renaming led to ImportError. Does something need to go in setup.py for this to work?

# -*- coding: utf-8 -*-
# cython: profile=False

try: string_types = basestring
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what the heck is this?

this is the point of pandas.compat

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to avoid non-cython imports. Will change.

tzstr as _dateutil_tzstr)

import sys
if sys.platform == 'win32' or sys.platform == 'cygwin':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and why are you not using the idiomatic

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was copied verbatim from pd.compat, so I'm not sure what idiom it misses. But as above the idea was to avoid a non-cython import. Will change.

cdef int64_t NPY_NAT = np.datetime64('nat').astype(np.int64)


# TODO: Does this belong somewhere else?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR predates #17289. Specifying C dependencies in setup.py is a PITA and I wanted this module to be portable. So I copied total_seconds instead of getting it from src/datetime_helper.h.

Regardless of whether #17289 is merged, this function can be removed. Coming right up.

@jreback
Copy link
Contributor

jreback commented Aug 24, 2017

pls pls pls limit the changes to the minimal set that are required. mixing changes makes this much harder to review.

@codecov
Copy link

codecov bot commented Aug 24, 2017

Codecov Report

Merging #17274 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17274      +/-   ##
==========================================
- Coverage   91.01%   90.99%   -0.02%     
==========================================
  Files         162      162              
  Lines       49558    49558              
==========================================
- Hits        45105    45096       -9     
- Misses       4453     4462       +9
Flag Coverage Δ
#multiple 88.77% <ø> (ø) ⬆️
#single 40.25% <ø> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.72% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2bec750...1a62298. Read the comment docs.

@codecov
Copy link

codecov bot commented Aug 24, 2017

Codecov Report

Merging #17274 into master will decrease coverage by 0.04%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17274      +/-   ##
==========================================
- Coverage   91.03%   90.99%   -0.05%     
==========================================
  Files         162      162              
  Lines       49567    49567              
==========================================
- Hits        45125    45104      -21     
- Misses       4442     4463      +21
Flag Coverage Δ
#multiple 88.77% <ø> (-0.03%) ⬇️
#single 40.24% <ø> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/plotting/_converter.py 63.23% <0%> (-1.82%) ⬇️
pandas/core/frame.py 97.72% <0%> (-0.1%) ⬇️
pandas/core/indexing.py 93.94% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 473a7f3...0540e77. Read the comment docs.

@jbrockmendel
Copy link
Member Author

Gonna push on this a little bit. While #17363 and #17342 are "just" about making code more self-contained and maintainable, this PR is a blocker for some more performance-relevant fixes.

In particular, the end goal is to move tseries.offsets.DateOffset and tseries.frequencies.to_offset into cython, avoid the runtime import of to_offset, and make DateOffset immutable. Being able to import timezone functions without importing all of tslib makes it possible to move large chunks of frequencies and offsets into cython.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to see an asv run of all timezone & time realted benchmarks. (IOW show the diffs)

@@ -4081,6 +4013,7 @@ def pydt_to_i8(object pydt):
return ts.value


# TODO: Never used?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can take this out (this function)

@@ -0,0 +1,3 @@
#!/usr/bin/env python
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes pls don't add shebangs anywhere.

@@ -0,0 +1,3 @@
#!/usr/bin/env python
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally you should be adding __init__.pyx for cython packages (IIRC you can then still import)

bint PyArray_IsIntegerScalar(object)

cdef inline bint is_integer_object(object obj):
# Ported from util (which gets it from numpy_helper) to avoid
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a useful comment. Why are you not simply importing from numpy_helper?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change.

The thought was that the marginal flexibility of going from 1 intra-pandas dependency to 0 was big enough to merit the redundant one-liner. Plus avoids having to tinker with setup.py and debug the inevitable segfaults that go with it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

repeating low-level code is much worse

return _get_zone(tz)


cdef bint _is_utc(object tz):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these can all be inline (though if they give warning then prob not)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will change.

@jbrockmendel
Copy link
Member Author

I want to see an asv run of all timezone & time realted benchmarks. (IOW show the diffs)

asv is complaining about build errors, so I'll put together a handful of measurements just using %timeit. There isn't a Timestamp/timezone-specific benchmark file, though I think there is one added in #17331

@jreback
Copy link
Contributor

jreback commented Sep 1, 2017

asv is complaining about build errors, so I'll put together a handful of measurements just using %timeit. There isn't a Timestamp/timezone-specific benchmark file, though I think there is one added in #17331

pls don't. we use asv for a reason to avoid ad-hoc things like this.

@jbrockmendel
Copy link
Member Author

pls don't. we use asv for a reason to avoid ad-hoc things like this.

OK... what's the difference between this and the other times you've told me to use %timeit?

asv continuous -f 1.1 -E virtualenv HEAD -b timeseries.py
· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt
·· Installing into virtualenv-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt..
·· Error running ~/pd/tslibs-timezones/pandas/asv_bench/env/dbb8113eac34359a9ef26b93e54a1d03/bin/python ~/.local/lib/python2.7/site-packages/asv/benchmark.py discover ~/pd/tslibs-timezones/pandas/asv_bench/benchmarks /tmp/tmpUDGky2
             STDOUT -------->
             
             STDERR -------->
             Traceback (most recent call last):
               File "~/.local/lib/python2.7/site-packages/asv/benchmark.py", line 867, in <module>
                 commands[mode](args)
               File "~/.local/lib/python2.7/site-packages/asv/benchmark.py", line 812, in main_discover
                 list_benchmarks(benchmark_dir, fp)
               File "~/.local/lib/python2.7/site-packages/asv/benchmark.py", line 797, in list_benchmarks
                 for benchmark in disc_benchmarks(root):
               File "~/.local/lib/python2.7/site-packages/asv/benchmark.py", line 696, in disc_benchmarks
                 for module in disc_files(root, os.path.basename(root) + '.'):
               File "~/.local/lib/python2.7/site-packages/asv/benchmark.py", line 648, in disc_files
                 module = import_module(package + filename)
               File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
                 __import__(name)
               File "~/pd/tslibs-timezones/pandas/asv_bench/benchmarks/index_object.py", line 1, in <module>
                 from .pandas_vb_common import *
               File "~/pd/tslibs-timezones/pandas/asv_bench/benchmarks/pandas_vb_common.py", line 1, in <module>
                 from pandas import *
               File "~/pd/tslibs-timezones/pandas/asv_bench/env/dbb8113eac34359a9ef26b93e54a1d03/local/lib/python2.7/site-packages/pandas/__init__.py", line 35, in <module>
                 "the C extensions first.".format(module))
             ImportError: C extension: No module named tslibs.timezones not built. If you want to import pandas from the source directory, you may need to run 'python setup.py build_ext --inplace --force' to build the C extensions first.

Same deal with py3.5 except it's complaining about period.pquarter

@jreback
Copy link
Contributor

jreback commented Sep 1, 2017

this ian changing way more not-easy-to-inspect code differences. it's very easy to lose perf without constant benchmarking

if you are going to change significant code then get asv to work
it should be quite straightforward

i have never used this with virtual env can't say if it actually works properly (use conda)

@jbrockmendel
Copy link
Member Author

jbrockmendel commented Sep 1, 2017

The same error is occuring with conda. The same "ImportError: C extension: No module named tslibs.timezones not built." is occuring even if I just run asv continuous -f 1.1 master -b ^timeseries where there should be no tslibs.timezones. It's 100° out and I'm not inclined to spend any more time wrestling with a tool that I'm yet to be convinced gives meaningful measurements in the first place.

I'll make a series of PRs that move these functions over one at a time.

@jreback
Copy link
Contributor

jreback commented Sep 11, 2017

closing this as its replaced by other PR's. let's close older / extraneous PR's.

@jreback jreback closed this Sep 11, 2017
@jbrockmendel jbrockmendel deleted the tslibs-timezones branch October 30, 2017 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants