
BUG: Don't error in pd.to_timedelta when errors=ignore #13832

Closed

Conversation

@gfyoung (Member) commented Jul 29, 2016

Title is self-explanatory. Closes #13613.

@jreback added the Bug and Timedelta labels Jul 29, 2016
try:
    return Series(values, index=arg.index,
                  name=arg.name, dtype='m8[ns]')
except ValueError:
Contributor:

This try/except needs to be inside _convert_listlike. If you have an invalid list/Index, this will still not work (you are handling the scalar/Series case only).

Contributor:

Alternatively, you could leave _convert_listlike as-is and wrap it in another function if that is simpler / easier to follow.
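For illustration, such a wrapper might look roughly like this (the function name and signature are hypothetical, not code from this PR):

def _wrap_series_result(values, arg, errors):
    # Hypothetical wrapper around the Series construction shown above:
    # try the m8[ns] cast and, when errors='ignore' left the values
    # unconverted, hand back the original values instead of raising.
    from pandas import Series
    try:
        return Series(values, index=arg.index, name=arg.name, dtype='m8[ns]')
    except ValueError:
        if errors == 'ignore':
            return Series(values, index=arg.index, name=arg.name)
        raise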

Member Author:

I have no idea what you said in the first comment. Please provide an example.

Contributor:

I don't like this try/except at all. values should already be dtyped (it could of course be m8[ns] or an Index or whatever). This makes it very confusing. The fact that you had to write a comment is a bad code smell here.

@gfyoung (Member Author), Aug 10, 2016:

The comment was to clarify that the dtype of the returned array may not be convertible to m8[ns] if, for example, errors='ignore' and there is an error in the parsing. What about it is confusing? I was initially confused (before this PR) about why we assumed we could cast to m8[ns] like this.

Contributor:

That's my point: we don't do this in to_datetime, and it's wildly inconsistent to do it here when they have the same API.

The return value needs to be validated before Series construction. That is the issue here; this is exactly what array_to_datetime does, and I think it is now inconsistent with array_to_timedelta.

No reason for them to be different in how they act.

@gfyoung (Member Author), Aug 11, 2016:

@jreback: IMHO your attempt to keep everything consistent is somewhat in vain. Notice that in to_datetime we don't specify dtype, whereas in to_timedelta we do. This means that regardless of what _convert_listlike returns in to_datetime, we can still return a Series. The same cannot be said for to_timedelta, where we also stipulate dtype=m8[ns].

@jreback (Contributor), Aug 11, 2016:

@gfyoung if you have to do this, there is an embedded bug in the Series constructor. Please don't catch errors just because it's easy; it's not correct.

_convert_listlike MUST return a valid Index/scalar (if box=True), or a valid array/list/scalar if not; apparently it does not in some cases.

@gfyoung (Member Author), Aug 11, 2016:

@jreback : What do you mean "valid"? It's no different from _convert_listlike in to_datetime. There is no bug in the Series constructor. It's just that the case of no conversion being done was not considered. I struggle to understand what you are disagreeing with here.

Member Author:

@jreback : If you remove the dtype=m8[ns], no try-except will be needed. It is the dtype specification alone that causes us to do this. There is nothing "invalid" about what _convert_listlike returns. The behaviour is exactly the same for to_datetime.

@codecov-io commented Jul 29, 2016

Current coverage is 85.28% (diff: 100%)

Merging #13832 into master will increase coverage by <.01%

@@             master     #13832   diff @@
==========================================
  Files           139        139          
  Lines         50211      50230    +19   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42819      42837    +18   
- Misses         7392       7393     +1   
  Partials          0          0          

Powered by Codecov. Last update 5f49638...dc39205

@jreback (Contributor) commented Jul 29, 2016

You need to add the list/Index tests as I indicated.

Further, the problem is the separation of concerns here: the errors need to be handled entirely in Cython and NOT at the Python level, because otherwise you have the handling in two places, which is problematic.

@gfyoung (Member Author) commented Jul 29, 2016

What are you talking about? Please read my comments first. I moved all of the error handling out of Cython for reasons that I outlined above.

@jreback (Contributor) commented Jul 29, 2016

that's not how Timestamp works - it needs to back

@gfyoung (Member Author) commented Jul 29, 2016

Again, I have no idea what you are saying.

@gfyoung (Member Author) commented Aug 5, 2016

@jreback : Any updates or replies to my comments?

@jreback (Contributor) commented Aug 8, 2016

@gfyoung so my issue with this is the following.

In the current code, all handling of errors= for both Timedelta and Timestamp parsing (e.g. things like array_to_datetime) is almost exclusively handled in the Cython path.

The main reason, of course, is that by definition errors='coerce' needs this (we are in a Cython loop and need to continue while processing), while errors='ignore' is generally a try/except around this code (it still tries to process but now returns an object array).

So while your approach seems ok, it changes the way a reader has to understand what is going on (in timedelta vs. timestamp parsing). I am open to changing things; I just want to make them consistent.

Also, I think it's much easier to follow if it's all in Cython. No real reason to move it.
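Roughly, the two patterns being contrasted look like this (a simplified Python sketch with made-up helper names, not the actual tslib code):

import numpy as np

def convert_coerce(values):
    # errors='coerce' has to live inside the loop: each bad element is
    # replaced with NaT while processing continues.
    result = np.empty(len(values), dtype='m8[ns]')
    for i, val in enumerate(values):
        try:
            result[i] = np.timedelta64(int(val), 'ns')
        except (TypeError, ValueError):
            result[i] = np.timedelta64('NaT')
    return result

def convert_ignore(values):
    # errors='ignore' is a try/except around the whole conversion: on any
    # failure the original values come back as an object array.
    try:
        return np.array([np.timedelta64(int(v), 'ns') for v in values])
    except (TypeError, ValueError):
        return np.asarray(values, dtype=object)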

@gfyoung (Member Author) commented Aug 8, 2016

@jreback : The major issue with the current implementation was that there was significant coupling between parsing and error behaviour. For example, take a look at this line here. Why should Timestamp initialisation care about error handling?

Secondly, there was no real error handling at the Cython level (this bug attests to that), and IMO that shouldn't be black-boxed into Cython but rather exposed at the Python level, like we do with to_numeric, for clarity. The Cython level should handle the actual conversion process (the meat of the function), while the Python level should handle the behaviour of the conversions.

@jreback (Contributor) commented Aug 8, 2016

@gfyoung please re-read what I wrote. You are changing code that is very similar in structure to array_to_datetime / convert_to_tsobject. That is the problem. I am happy to have you fix it, but you need to keep to the same pattern or change both (which I don't think I am happy with).

Secondly, there was no real error handling at the Cython level (this bug attests to that), and IMO that shouldn't be black-boxed into Cython but rather exposed at the Python level, like we do with to_numeric, for clarity. The Cython level should handle the actual conversion process (the meat of the function), while the Python level should handle the behaviour of the conversions.

Not sure what you mean by this.

I am happy to have this work, but it has to be consistent code.

@gfyoung (Member Author) commented Aug 8, 2016

@jreback : I'm not sure why it matters so much that the format be the same here. What's the justification for keeping it? Here is my justification for writing it the way I did.

My point is that the conversion process should be separated into two parts: 1) the actual conversion, and 2) how we handle errors. The current implementation of to_datetime couples them together, forcing us to make clumsy calls (i.e. here, as I mentioned above) where we have to specify error handling for no good reason at all.

In addition, the conversion process becomes littered with try-except blocks that arguably obfuscate what is happening at the Cython level.

The framework I have proposed here is exactly what happens in to_numeric, and it makes it nice and clear at the Python level what is happening when we do the conversion. That same clarity is lacking in to_datetime and to_timedelta, so yes, my PR does change the way a reader has to understand what is going on in to_datetime, but IMO it is for the better.
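A minimal sketch of that framework, mirroring the to_numeric layout (all names here are hypothetical; in the real code the conversion core is the Cython routine):

import numpy as np

def _convert_core(values):
    # The "meat": only the conversion; raises on anything it cannot parse.
    return np.array([np.timedelta64(int(v), 'ns') for v in values])

def to_timedelta_sketch(values, errors='raise'):
    # The error *behaviour* is decided at the Python level.
    try:
        return _convert_core(values)
    except (TypeError, ValueError):
        if errors == 'ignore':
            return values   # hand back the input untouched
        if errors == 'coerce':
            # crude stand-in: the real coerce replaces only the failing
            # elements, which is why it sits inside the conversion loop
            return np.full(len(values), np.timedelta64('NaT'), dtype='m8[ns]')
        raise               # errors == 'raise'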

@jreback (Contributor) commented Aug 8, 2016

@gfyoung as I said, I am happy to change, BUT it needs to be for BOTH to_timedelta and to_datetime. Changing the code structure in only one is a no-go.

@gfyoung (Member Author) commented Aug 8, 2016

@jreback : Again, I don't understand why this is such a big deal. But if you insist...

@jreback (Contributor) commented Aug 8, 2016

@gfyoung I am not saying the code in array_to_datetime is great. But consistency matters a great deal here.

@gfyoung (Member Author) commented Aug 8, 2016

@jreback : Fair enough, but for organisational purposes, can the refactoring of to_datetime be done in a follow-up?

@jreback (Contributor) commented Aug 8, 2016

sure, but do it on top of this one.

@gfyoung (Member Author) commented Aug 8, 2016

@jreback : Fair enough. In that case, this should be good to merge if there are no other concerns.

@jreback (Contributor) commented Aug 8, 2016

@gfyoung right, but the point is I'd like to see the combined changes.

Don't get me wrong, I am all for stripping things out of tslib.pyx and moving them to Python where appropriate.

Run an asv just to be sure, in any event.

@gfyoung (Member Author) commented Aug 8, 2016

@jreback : Another PR that is rebased on this one...fair enough.

@jreback (Contributor) commented Aug 8, 2016

@gfyoung yes, exactly. These are nice for a discrete series of changes: a bit more work, but they allow one to work pretty easily.

@gfyoung (Member Author) commented Aug 9, 2016

@jreback : Trying to push error handling onto the Python level for to_datetime in the same way as I did for to_timedelta is very difficult. The main reason is that the Cython methods used in to_datetime are also exposed as Python methods (cpdef), unlike those in to_timedelta. Thus, moving the error handling out of the Cython level would be an API change.

@gfyoung (Member Author) commented Aug 9, 2016

Also, there already is error handling at the Python level for to_datetime (here and here). I don't believe then that your argument about consistency of handling errors at the Cython level is valid, and given my arguments above, any sort of refactoring of to_datetime is not necessary for this PR to be merged.

        value = TimedeltaIndex(value, unit='ns', name=name)
        return value

    if errors not in ('ignore', 'raise', 'coerce'):
        raise ValueError("errors must be one of 'ignore', "
Contributor:

is this tested?

Member Author:

Absolutely. That was one of the first tests that I wrote.

@jreback (Contributor) commented Aug 9, 2016

OK, added some comments.

Please run an asv as well (not really sure we have enough asv coverage, though).

@gfyoung (Member Author) commented Aug 10, 2016

@jreback : Ran asv - I didn't see any noticeable perf drops.

@jorisvandenbossche added this to the 0.19.0 milestone Aug 10, 2016
@jorisvandenbossche (Member):

Looks good to me.

Regarding the asv benchmarks, I don't think there are currently benchmarks that cover the coerce/ignore behaviour, so it would be nice to add some.

    cdef:
        Py_ssize_t i, n
        ndarray[int64_t] iresult
        bint is_raise=errors=='raise', is_ignore=errors=='ignore', is_coerce=errors=='coerce'

    assert is_raise or is_ignore or is_coerce
Contributor:

you need to validate the errors kw here I think (with an assert is fine)

Member:

the error keyword is validated in to_timedelta itself I think

Member Author:

Exactly. That's why I put the check in the first place.

Contributor:

Sure, but this still could use an assert; it helps future readers know the possibilities and avoid missteps.

Member Author:

Fair enough. Added a check in array_to_timedelta64.

@jreback (Contributor) commented Aug 10, 2016

Please add some asvs, mainly for arrays of timedelta string parsing. This was previously in-line code, so I'd like to see what happened.

@gfyoung (Member Author) commented Aug 11, 2016

@jreback : There are already benchmarks for arrays of timedelta string parsing. However, I added one for bad parsing to exercise the errors parameter.
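For reference, an asv benchmark along these lines might look roughly like the following (class name and data are illustrative; the benchmark actually added in the PR may differ):

import pandas as pd

class TimedeltaConvertErrors(object):
    # Parse a mix of good and bad timedelta strings to exercise errors=.
    def setup(self):
        self.strings = ['1 days', 'apple'] * 5000

    def time_convert_coerce(self):
        pd.to_timedelta(self.strings, errors='coerce')

    def time_convert_ignore(self):
        pd.to_timedelta(self.strings, errors='ignore')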

@jreback removed this from the 0.19.0 milestone Aug 11, 2016
@jorisvandenbossche (Member):

@jreback This is OK to merge for me. If you still have specific objections, can you reiterate them?

@gfyoung Did you see any changes in performance for the asv benchmarks you added?

@jorisvandenbossche added this to the 0.19.0 milestone Aug 15, 2016
@jreback (Contributor) commented Aug 15, 2016

@jorisvandenbossche

The wrapping of the Series via try-except is incorrect, as I have noted several times. This is a bug somewhere internally in construction.

So I am -1 on merging until this is addressed.

@jreback removed this from the 0.19.0 milestone Aug 15, 2016
@jorisvandenbossche (Member) commented Aug 15, 2016

@jreback Can you explain what you think the bug is in the Series constructor?

The try/except block is there to catch unconverted values. E.g. if you have pd.to_timedelta(pd.Series(['apple']), errors='ignore') (the original issue; this raises on master while it should return the original Series), those values will be fed into the Series call, like:

In [71]: Series(['apple'], dtype='m8[ns]')
...
ValueError: invalid literal for int() with base 10: 'apple'

Is it here that you see a bug in the Series constructor? The above seems correct to me.
If this fails (like above, which means no conversion was made), the try-except block ensures the original values are returned.
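In other words, the behaviour being targeted is roughly the following (output abbreviated, written from memory rather than copied from a live session):

In [1]: s = pd.Series(['apple'])

In [2]: pd.to_timedelta(s, errors='ignore')   # should return the input unchanged
Out[2]:
0    apple
dtype: object

In [3]: pd.to_timedelta(s, errors='coerce')   # unparseable values become NaT
Out[3]:
0   NaT
dtype: timedelta64[ns]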

@jreback (Contributor) commented Aug 15, 2016

@jorisvandenbossche I have already explained this multiple times. This is not following the guarantees: _convert_listlike MUST return a valid Index. The fact that a try/except is needed for a specific case is the problem.

@jorisvandenbossche (Member):

This is not following the guarantees: _convert_listlike MUST return a valid Index.

Well, so it is at least not a bug in the Series constructor :-) (that's what I understood you said, https://github.com/pydata/pandas/pull/13832/files#r74401562)

But, e.g. in the to_datetime implementation, _convert_listlike does not always return a valid DatetimeIndex. If you pass it invalid strings with errors='ignore', it also returns those original strings. So there is no difference here AFAIK.

@gfyoung Is the dtype='m8[ns]' in the Series call actually needed? I would suppose the return value of _convert_listlike is either 'm8[ns]' or unconvertible. So just wrapping it in a Series (without the dtype and try-except) would maybe also work?

@gfyoung (Member Author) commented Aug 15, 2016

@jorisvandenbossche : That would be an option to explore. I have been busy as of late but can take a look later and see if it can be removed.

But I should say that I agree 100% with what you are saying. These were points I was trying to make to @jreback earlier in our discussion about the wrapping.

Also, the asv didn't show any drastic changes in performance AFAICT.

@gfyoung (Member Author) commented Aug 15, 2016

@jorisvandenbossche , @jreback : It appears you can remove dtype=m8[ns] (tests pass locally), which obviates the need for the try-except block. Let's see what Travis has to say about that.

@gfyoung (Member Author) commented Aug 15, 2016

@jorisvandenbossche , @jreback : Travis agrees that the dtype=m8[ns] can be removed. As that wrapping was the major blocker for this PR and it has now been removed, this should be ready to merge.
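With that change, the Series wrapping shown at the top of this discussion reduces to roughly the following (the exact final code is in the merged commit):

# values is either m8[ns] (converted) or, with errors='ignore', the original
# unconverted objects; either way Series can infer the dtype itself.
return Series(values, index=arg.index, name=arg.name)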

@jreback closed this in 8b50d8c Aug 15, 2016
@jreback (Contributor) commented Aug 15, 2016

thanks @gfyoung
