Refactor timezones functions out of tslib #17274
timezones.pyx has no other dependencies within pandas, helping to de-tangle some of the _libs modules.
Code in timezones is used in both tslib and period, and a bit in index.pyx.
This is one of several steps in making _libs.period not need to import from non-cython code and ideally not need to import tslib (though NaT makes that tough). See existing comments in _libs.__init__ on why this is desirable.
This is the first of several independent pieces to be split off of tslib.
Cleanup commented-out code
Hello @jbrockmendel! Thanks for updating the PR.
Comment last updated on August 28, 2017 at 17:29 Hours UTC |
pandas/_libs/period.pyx
Outdated
from libc.stdlib cimport free

from pandas import compat
from pandas.compat import PY2
from pandas.core.dtypes.generic import ABCDateOffset
An upcoming PR will make a cython version of dtypes.generic. The isinstance checks are about 2x faster that way. It's a small difference, but nice to avoid calling into python for these things.
let's keep perf changes separate. I would much prefer PR's that don't mix multiple changes all at once.
OK. I just pushed a commit that reverted from ABCDateOffset back to offsets.DateOffset.
pandas/_libs/period.pyx
Outdated
- if isinstance(other, (timedelta, np.timedelta64,
-                      offsets.Tick, offsets.DateOffset,
-                      Timedelta)):
+ if isinstance(other, (timedelta, np.timedelta64, ABCDateOffset)):
Timedelta subclasses timedelta and Tick subclasses DateOffset, so we can remove them from these checks.
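To illustrate the point about redundant entries: isinstance already accepts instances of subclasses, so listing a subclass next to its parent adds nothing. A minimal sketch, using hypothetical stand-in classes rather than the real pandas types (and leaving out np.timedelta64 to stay within the standard library):

```python
from datetime import timedelta


class Timedelta(timedelta):
    """Hypothetical stand-in for pandas' Timedelta (a timedelta subclass)."""


class DateOffset:
    """Hypothetical stand-in for offsets.DateOffset."""


class Tick(DateOffset):
    """Hypothetical stand-in for offsets.Tick (a DateOffset subclass)."""


def is_delta_like_verbose(other):
    # Longer form: lists subclasses alongside their parents.
    return isinstance(other, (timedelta, Tick, DateOffset, Timedelta))


def is_delta_like(other):
    # Shorter form: isinstance already matches subclasses of each entry.
    return isinstance(other, (timedelta, DateOffset))


samples = [timedelta(days=1), Timedelta(hours=2), Tick(), DateOffset(), 5, "1D"]

# Both forms agree on every sample, so the extra entries are redundant.
assert all(is_delta_like(s) == is_delta_like_verbose(s) for s in samples)
```

The two functions only diverge if one of the listed classes stops being a subclass of its assumed parent, which is the case worth checking before dropping entries.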
pandas/_libs/tslib.pyx
Outdated
@@ -3960,7 +3889,7 @@ for _maybe_method_name in dir(NaTType):
 def f(*args, **kwargs):
     raise ValueError("NaTType does not support " + func_name)
 f.__name__ = func_name
-f.__doc__ = _get_docstring(_method_name)
+f.__doc__ = _get_docstring(func_name)
This is an unrelated bug. This block of code exists to make NaT raise ValueError for each of a bunch of datetime methods. This is supposed to attach the original docstring to the new function f. The typo _method_name instead of func_name means that a bunch of the docstrings are currently wrong.
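A minimal sketch of the pattern described here, assuming a stand-in class NaTLike and a hypothetical helper _make_error_func (neither appears in the PR). Each generated method raises ValueError and takes both its name and its docstring from the same func_name, which is what the one-line fix is meant to restore:

```python
from datetime import datetime


class NaTLike:
    """Hypothetical stand-in for NaTType."""


def _make_error_func(func_name):
    # Build a method that always raises, but keeps the name and
    # docstring of the datetime method it shadows.
    def f(*args, **kwargs):
        raise ValueError("NaTLike does not support " + func_name)

    f.__name__ = func_name
    # Look the docstring up with the same name used for __name__, so the
    # two cannot drift apart the way they did with the typo.
    f.__doc__ = getattr(datetime, func_name).__doc__
    return f


for name in ("ctime", "strftime"):
    setattr(NaTLike, name, _make_error_func(name))
```

Using a factory function binds func_name once per generated method, so a stale loop variable cannot leak into the docstring lookup.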
then pls don't change here. rather in another PR.
Will change. Apologies.
Do you have any (empirical) timing information to support this statement?
Brief discussion of profiling results here.
Regarding the
That would be excellent. If there's a consensus, I'll amend the PR.
You will need to check that it is actually possible, but I suppose it will.
@jorisvandenbossche I'll give that a try and see how it goes. Is it possible/likely that this would also explain why _libs/src/datetime/np_datetime.c duplicates a bunch of stuff from numpy 1.7? i.e. at a time when pre-1.7 versions of numpy were supported?
Getting rid of
…ndas into tslibs-timezones
Wild guess that it might be causing the build fail on appveyor
@@ -0,0 +1,3 @@
#!/usr/bin/env python
why are you adding this? this is NOT a package by definition.
So that it can be imported python-style. Or if you're referring to the shebang that's force of habit.
yes pls don't add shebangs anywhere.
generally you should be adding __init__.pyx for cython packages (IIRC you can then still import)
Will change.
"generally you should be adding __init__.pyx for cython packages (IIRC you can then still import)"
Just renaming led to ImportError. Does something need to go in setup.py for this to work?
pandas/_libs/tslibs/timezones.pyx
Outdated
# -*- coding: utf-8 -*-
# cython: profile=False

try: string_types = basestring
what the heck is this?
this is the point of pandas.compat
Trying to avoid non-cython imports. Will change.
pandas/_libs/tslibs/timezones.pyx
Outdated
tzstr as _dateutil_tzstr)

import sys
if sys.platform == 'win32' or sys.platform == 'cygwin':
and why are you not using the idiomatic
That was copied verbatim from pd.compat, so I'm not sure what idiom it misses. But as above the idea was to avoid a non-cython import. Will change.
pandas/_libs/tslibs/timezones.pyx
Outdated
cdef int64_t NPY_NAT = np.datetime64('nat').astype(np.int64)


# TODO: Does this belong somewhere else?
why is this here?
pls pls pls limit the changes to the minimal set that are required. mixing changes makes this much harder to review.
Codecov Report
@@ Coverage Diff @@
## master #17274 +/- ##
==========================================
- Coverage 91.01% 90.99% -0.02%
==========================================
Files 162 162
Lines 49558 49558
==========================================
- Hits 45105 45096 -9
- Misses 4453 4462 +9
Codecov Report
@@ Coverage Diff @@
## master #17274 +/- ##
==========================================
- Coverage 91.03% 90.99% -0.05%
==========================================
Files 162 162
Lines 49567 49567
==========================================
- Hits 45125 45104 -21
- Misses 4442 4463 +21
Gonna push on this a little bit. While #17363 and #17342 are "just" about making code more self-contained and maintainable, this PR is a blocker for some more performance-relevant fixes. In particular, the end goal is to move
I want to see an asv run of all timezone & time related benchmarks. (IOW show the diffs)
@@ -4081,6 +4013,7 @@ def pydt_to_i8(object pydt):
    return ts.value


# TODO: Never used?
I think you can take this out (this function)
bint PyArray_IsIntegerScalar(object)

cdef inline bint is_integer_object(object obj):
    # Ported from util (which gets it from numpy_helper) to avoid
not a useful comment. Why are you not simply importing from numpy_helper?
Will change.
The thought was that the marginal flexibility of going from 1 intra-pandas dependency to 0 was big enough to merit the redundant one-liner. Plus avoids having to tinker with setup.py and debug the inevitable segfaults that go with it.
repeating low-level code is much worse
return _get_zone(tz)


cdef bint _is_utc(object tz):
I think these can all be inline (though if they give warning then prob not)
will change.
asv is complaining about build errors, so I'll put together a handful of measurements just using %timeit. There isn't a Timestamp/timezone-specific benchmark file, though I think there is one added in #17331
pls don't. we use asv for a reason to avoid ad-hoc things like this.
OK... what's the difference between this and the other times you've told me to use %timeit?
Same deal with py3.5 except it's complaining about
this is changing way more not-easy-to-inspect code differences. it's very easy to lose perf without constant benchmarking. if you are going to change significant code then get asv to work. i have never used this with virtual env, can't say if it actually works properly (use conda)
The same error is occurring with conda. The same "ImportError: C extension: No module named tslibs.timezones not built." is occurring even if I just run
I'll make a series of PRs that move these functions over one at a time.
closing this as it's replaced by other PRs. let's close older / extraneous PRs.
_libs.tslib is over 5k lines and is imported by a bunch of other modules including _libs.lib. It looks like it was pasted together from a bunch of older files. There are a handful of areas where significant chunks can be refactored out in complexity-reducing (and testability-increasing) ways. This is the first one: timezones. (Next up: parsing (and gathering parsing code dispersed across pandas))
timezones.pyx has no other dependencies within pandas, helping to de-tangle some of the _libs modules.
Code in timezones is used in both _libs.tslib and _libs.period, and a bit in _libs.index.
This is one of several steps in making _libs.period not need to import from non-cython code and ideally not need to import tslib (though NaT makes that tough). See existing comments in _libs.__init__ on why this is desirable.
This is the first of several independent pieces to be split off of tslib.
There are also several notes on functions that appear to be unused and may be ready for removal.
Removes datetime_helper dependency from most of _libs, as it is somehow slower than a plain cython version. In cases where C can be replaced by idiomatic cython without hurting performance, I'm calling that a win.
git diff upstream/master -u -- "*.py" | flake8 --diff