
BUG: Don't error in pd.to_timedelta when errors=ignore #13832

Closed

Conversation

@gfyoung (Member) commented Jul 29, 2016

Title is self-explanatory. Closes #13613.

@jreback added the Bug and Timedelta labels Jul 29, 2016
try:
    return Series(values, index=arg.index,
                  name=arg.name, dtype='m8[ns]')
except ValueError:
Contributor:

This try/except needs to be inside _convert_listlike. If you have an invalid list/Index, this will still not work (you are handling the scalar/Series case only).

Contributor:

Alternatively, you could leave _convert_listlike as-is and wrap it in another function if that is simpler / easier to follow.
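For illustration, such a wrapper might look roughly like this (the function name and signature are hypothetical, not code from this PR):

def _wrap_series_result(values, arg, errors):
    # Hypothetical wrapper around the Series construction shown above:
    # try the m8[ns] cast and, when errors='ignore' left the values
    # unconverted, hand back the original values instead of raising.
    from pandas import Series
    try:
        return Series(values, index=arg.index, name=arg.name, dtype='m8[ns]')
    except ValueError:
        if errors == 'ignore':
            return Series(values, index=arg.index, name=arg.name)
        raise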

Member Author:

I have no idea what you said in the first comment. Please provide an example.

Contributor:

I don't like this try/except at all. values should already be dtyped (it could of course be m8[ns] or an Index or whatever). This makes it very confusing. The fact that you had to write a comment is a bad code smell here.

@gfyoung (Member Author), Aug 10, 2016:

The comment was to clarify that the dtype of the returned array may not be convertible to m8[ns] if, for example, errors='ignore' and there is an error in the parsing. What about it is confusing? I was initially confused (before this PR) about why we assumed we could cast to m8[ns] like this.

Contributor:

That's my point: we don't do this in to_datetime, and it's wildly inconsistent to do it here when they have the same API.

The return value needs to be validated before Series construction. That is the issue here; this is exactly what array_to_datetime does, and I think it is now inconsistent with array_to_timedelta.

No reason for them to be different in how they act.

@gfyoung (Member Author), Aug 11, 2016:

@jreback: IMHO your attempt to keep everything consistent is somewhat in vain. Notice that in to_datetime we don't specify dtype, whereas in to_timedelta we do. This means that regardless of what _convert_listlike returns in to_datetime, we can still return a Series. The same cannot be said for to_timedelta, where we also stipulate dtype=m8[ns].

@jreback (Contributor), Aug 11, 2016:

@gfyoung if you have to do this, there is an embedded bug in the Series constructor. Please don't catch errors just because it's easy; it's not correct.

_convert_listlike MUST return a valid Index/scalar (if box=True), or a valid array/list/scalar if not; apparently it does not in some cases.

@gfyoung (Member Author), Aug 11, 2016:

@jreback : What do you mean "valid"? It's no different from _convert_listlike in to_datetime. There is no bug in the Series constructor. It's just that the case of no conversion being done was not considered. I struggle to understand what you are disagreeing with here.

Member Author:

@jreback : If you remove the dtype=m8[ns], no try-except will be needed. It is the dtype specification alone that causes us to do this. There is nothing "invalid" about what _convert_listlike returns. The behaviour is exactly the same for to_datetime.

@codecov-io commented Jul 29, 2016

Current coverage is 85.28% (diff: 100%)

Merging #13832 into master will increase coverage by <.01%

@@             master     #13832   diff @@
==========================================
  Files           139        139          
  Lines         50211      50230    +19   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42819      42837    +18   
- Misses         7392       7393     +1   
  Partials          0          0          

Powered by Codecov. Last update 5f49638...dc39205

@jreback (Contributor) commented Jul 29, 2016

You need to add the list/Index tests as I indicated.

Further, the problem is the separation of concerns here: the errors need to be handled entirely in Cython and NOT at the Python level, because otherwise you have the handling in two places, which is problematic.

@gfyoung (Member Author) commented Jul 29, 2016

What are you talking about? Please read my comments first. I moved all of the error handling out of Cython for reasons that I outlined above.

@jreback (Contributor) commented Jul 29, 2016

that's not how Timestamp works - it needs to back

@gfyoung (Member Author) commented Jul 29, 2016

Again, I have no idea what you are saying.

@gfyoung (Member Author) commented Aug 5, 2016

@jreback : Any updates or replies to my comments?

@jreback (Contributor) commented Aug 8, 2016

@gfyoung so my issue with this is the following.

In the current code, all handling of errors= for both Timedelta and Timestamp parsing (e.g. things like array_to_datetime) is almost exclusively handled in the Cython path.

The main reason, of course, is that by definition errors='coerce' needs this (we are in a Cython loop and need to continue while processing), while errors='ignore' is generally a try/except around this code (it still tries to process but now returns an object array).

So while your approach seems ok, it changes the way a reader has to understand what is going on (in timedelta vs. timestamp parsing). I am open to changing things; I just want to make them consistent.

Also, I think it's much easier to follow if it's all in Cython. No real reason to move it.
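Roughly, the two patterns being contrasted look like this (a simplified Python sketch with made-up helper names, not the actual tslib code):

import numpy as np

def convert_coerce(values):
    # errors='coerce' has to live inside the loop: each bad element is
    # replaced with NaT while processing continues.
    result = np.empty(len(values), dtype='m8[ns]')
    for i, val in enumerate(values):
        try:
            result[i] = np.timedelta64(int(val), 'ns')
        except (TypeError, ValueError):
            result[i] = np.timedelta64('NaT')
    return result

def convert_ignore(values):
    # errors='ignore' is a try/except around the whole conversion: on any
    # failure the original values come back as an object array.
    try:
        return np.array([np.timedelta64(int(v), 'ns') for v in values])
    except (TypeError, ValueError):
        return np.asarray(values, dtype=object)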

@gfyoung (Member Author) commented Aug 8, 2016

@jreback : The major issue with the current implementation was that there was significant coupling between parsing and error behaviour. For example, take a look at this line here. Why should Timestamp initialisation care about error handling?

Secondly, there was no real error handling at the Cython level (this bug attests to that), and IMO that shouldn't be black-boxed into Cython but rather exposed at the Python level, like we do with to_numeric, for clarity. The Cython level should handle the actual conversion process (the meat of the function), while the Python level should handle the behaviour of the conversions.

@jreback (Contributor) commented Aug 8, 2016

@gfyoung please re-read what I wrote. You are changing code that is very similar in structure to array_to_datetime / convert_to_tsobject. That is the problem. I am happy to have you fix it, but you need to keep to the same pattern or change both (which I don't think I am happy with).

Secondly, there was no real error handling at the Cython level (this bug attests to that), and IMO that shouldn't be black-boxed into Cython but rather exposed at the Python level, like we do with to_numeric, for clarity. The Cython level should handle the actual conversion process (the meat of the function), while the Python level should handle the behaviour of the conversions.

Not sure what you mean by this.

I am happy to have this work, but it has to be consistent code.

@gfyoung (Member Author) commented Aug 8, 2016

@jreback : I'm not sure why it matters so much that the format be the same here. What's the justification for keeping it? Here is my justification for writing it the way I did.

My point is that the conversion process should be separated into two parts: 1) the actual conversion, and 2) how we handle errors. The current implementation of to_datetime couples them together, forcing us to make clumsy calls (i.e. here, as I mentioned above) where we have to specify error handling for no good reason at all.

In addition, the conversion process becomes littered with try-except blocks that arguably obfuscate what is happening at the Cython level.

The framework I have proposed here is exactly what happens in to_numeric, and it makes it nice and clear at the Python level what is happening when we do the conversion. That same clarity is lacking in to_datetime and to_timedelta, so yes, my PR does change the way a reader has to understand what is going on in to_datetime, but IMO it is for the better.
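A minimal sketch of that framework, mirroring the to_numeric layout (all names here are hypothetical; in the real code the conversion core is the Cython routine):

import numpy as np

def _convert_core(values):
    # The "meat": only the conversion; raises on anything it cannot parse.
    return np.array([np.timedelta64(int(v), 'ns') for v in values])

def to_timedelta_sketch(values, errors='raise'):
    # The error *behaviour* is decided at the Python level.
    try:
        return _convert_core(values)
    except (TypeError, ValueError):
        if errors == 'ignore':
            return values   # hand back the input untouched
        if errors == 'coerce':
            # crude stand-in: the real coerce replaces only the failing
            # elements, which is why it sits inside the conversion loop
            return np.full(len(values), np.timedelta64('NaT'), dtype='m8[ns]')
        raise               # errors == 'raise'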

@jreback (Contributor) commented Aug 8, 2016

@gfyoung as I said, I am happy to change, BUT it needs to be for BOTH to_timedelta and to_datetime. Changing the code structure in only one is a no-go.

@gfyoung (Member Author) commented Aug 8, 2016

@jreback : Again, I don't understand why this is such a big deal. But if you insist...

@jreback (Contributor) commented Aug 8, 2016

@gfyoung I am not saying the code in array_to_datetime is great. But consistency matters a great deal here.

@gfyoung (Member Author) commented Aug 8, 2016

@jreback : Fair enough, but for organisational purposes, can the refactoring of to_datetime be done in a follow-up?

@jreback (Contributor) commented Aug 8, 2016

sure, but do it on top of this one.

@gfyoung (Member Author) commented Aug 8, 2016

@jreback : Fair enough. In that case, this should be good to merge if there are no other concerns.

@jreback (Contributor) commented Aug 8, 2016

@gfyoung right, but the point is I'd like to see the combined changes.

Don't get me wrong, I am all for stripping things out of tslib.pyx and moving them to Python where appropriate.

Run an asv just to be sure, in any event.

@gfyoung (Member Author) commented Aug 8, 2016

@jreback : Another PR that is rebased on this one...fair enough.

@jreback (Contributor) commented Aug 8, 2016

@gfyoung yes, exactly. These are nice for a discrete series of changes: a bit more work, but they allow one to work pretty easily.

@gfyoung (Member Author) commented Aug 9, 2016

@jreback : Trying to push error handling onto the Python level for to_datetime in the same way as I did for to_timedelta is very difficult. The main reason is that the Cython methods used in to_datetime are also exposed as Python methods (cpdef), unlike those in to_timedelta. Thus, moving the error handling out of the Cython level would be an API change.

@gfyoung (Member Author) commented Aug 9, 2016

Also, there already is error handling at the Python level for to_datetime (here and here). I don't believe then that your argument about consistency of handling errors at the Cython level is valid, and given my arguments above, any sort of refactoring of to_datetime is not necessary for this PR to be merged.

        value = TimedeltaIndex(value, unit='ns', name=name)
        return value

    if errors not in ('ignore', 'raise', 'coerce'):
        raise ValueError("errors must be one of 'ignore', "
Contributor:

is this tested?

Member Author:

Absolutely. That was one of the first tests that I wrote.

@jreback (Contributor) commented Aug 9, 2016

OK, added some comments.

Please run an asv as well (not really sure we have enough asv coverage, though).

@gfyoung (Member Author) commented Aug 10, 2016

@jreback : Ran asv - I didn't see any noticeable perf drops.

@jorisvandenbossche added this to the 0.19.0 milestone Aug 10, 2016
@jorisvandenbossche (Member):

Looks good to me.

Regarding the asv benchmarks, I don't think there are currently benchmarks that cover the coerce/ignore behaviour, so it would be nice to add some.

    cdef:
        Py_ssize_t i, n
        ndarray[int64_t] iresult
        bint is_raise=errors=='raise', is_ignore=errors=='ignore', is_coerce=errors=='coerce'

    assert is_raise or is_ignore or is_coerce
Contributor:

you need to validate the errors kw here I think (with an assert is fine)

Member:

the error keyword is validated in to_timedelta itself I think

Member Author:

Exactly. That's why I put the check in the first place.

Contributor:

Sure, but this still could use an assert; it helps future readers know the possibilities and avoid missteps.

Member Author:

Fair enough. Added a check in array_to_timedelta64.

@jreback (Contributor) commented Aug 10, 2016

Please add some asvs, mainly for arrays of timedelta string parsing. This was previously in-line code, so I'd like to see what happened.

@gfyoung (Member Author) commented Aug 11, 2016

@jreback : There are already benchmarks for arrays of timedelta string parsing. However, I added one for bad parsing to exercise the errors parameter.
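For reference, an asv benchmark along these lines might look roughly like the following (class name and data are illustrative; the benchmark actually added in the PR may differ):

import pandas as pd

class TimedeltaConvertErrors(object):
    # Parse a mix of good and bad timedelta strings to exercise errors=.
    def setup(self):
        self.strings = ['1 days', 'apple'] * 5000

    def time_convert_coerce(self):
        pd.to_timedelta(self.strings, errors='coerce')

    def time_convert_ignore(self):
        pd.to_timedelta(self.strings, errors='ignore')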

@jreback removed this from the 0.19.0 milestone Aug 11, 2016
@jorisvandenbossche (Member):

@jreback This is OK to merge for me. If you still have specific objections, can you reiterate them?

@gfyoung Did you see any changes in performance for the asv benchmarks you added?

@jorisvandenbossche added this to the 0.19.0 milestone Aug 15, 2016
@jreback (Contributor) commented Aug 15, 2016

@jorisvandenbossche

The wrapping of the Series via try-except is incorrect, as I have noted several times. This is a bug somewhere internally in construction.

So I am -1 on merging until this is addressed.

@jreback removed this from the 0.19.0 milestone Aug 15, 2016
@jorisvandenbossche (Member) commented Aug 15, 2016

@jreback Can you explain what you think the bug is in the Series constructor?

The try/except block is there to catch unconverted values. E.g. if you have pd.to_timedelta(pd.Series(['apple']), errors='ignore') (the original issue; this raises on master while it should return the original Series), those values will be fed into the Series call, like:

In [71]: Series(['apple'], dtype='m8[ns]')
...
ValueError: invalid literal for int() with base 10: 'apple'

Is it here that you see a bug in the Series constructor? The above seems correct to me.
If this fails (like above, which means no conversion was made), the try-except block ensures the original values are returned.
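In other words, the behaviour being targeted is roughly the following (output abbreviated, written from memory rather than copied from a live session):

In [1]: s = pd.Series(['apple'])

In [2]: pd.to_timedelta(s, errors='ignore')   # should return the input unchanged
Out[2]:
0    apple
dtype: object

In [3]: pd.to_timedelta(s, errors='coerce')   # unparseable values become NaT
Out[3]:
0   NaT
dtype: timedelta64[ns]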

@jreback (Contributor) commented Aug 15, 2016

@jorisvandenbossche I have already explained this multiple times. This is not following the guarantees: _convert_listlike MUST return a valid Index. The fact that a try/except is needed for a specific case is the problem.

@jorisvandenbossche (Member):

This is not following the guarantees: _convert_listlike MUST return a valid Index.

Well, so it is at least not a bug in the Series constructor :-) (that's what I understood you said, https://github.com/pydata/pandas/pull/13832/files#r74401562)

But, e.g. in the to_datetime implementation, _convert_listlike does not always return a valid DatetimeIndex. If you pass it invalid strings with errors='ignore', it also returns those original strings. So there is no difference here AFAIK.

@gfyoung Is the dtype='m8[ns]' in the Series call actually needed? I would suppose the return value of _convert_listlike is either 'm8[ns]' or unconvertible. So just wrapping it in a Series (without the dtype and try-except) would maybe also work?

@gfyoung (Member Author) commented Aug 15, 2016

@jorisvandenbossche : That would be an option to explore. I have been busy as of late but can take a look later and see if it can be removed.

But I should say that I agree 100% with what you are saying. These were points I was trying to make to @jreback earlier in our discussion about the wrapping.

Also, the asv didn't show any drastic changes in performance AFAICT.

@gfyoung (Member Author) commented Aug 15, 2016

@jorisvandenbossche , @jreback : It appears you can remove dtype=m8[ns] (tests pass locally), which obviates the need for the try-except block. Let's see what Travis has to say about that.

@gfyoung (Member Author) commented Aug 15, 2016

@jorisvandenbossche , @jreback : Travis agrees that the dtype=m8[ns] can be removed. As that wrapping was the major blocker for this PR and it has now been removed, this should be ready to merge.
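With that change, the Series wrapping shown at the top of this discussion reduces to roughly the following (the exact final code is in the merged commit):

# values is either m8[ns] (converted) or, with errors='ignore', the original
# unconverted objects; either way Series can infer the dtype itself.
return Series(values, index=arg.index, name=arg.name)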

@jreback closed this in 8b50d8c Aug 15, 2016
@jreback (Contributor) commented Aug 15, 2016

thanks @gfyoung
