[WIP] Imprecise indexer #22043

Closed
wants to merge 5 commits into from


WeatherGod

This is a work in progress to add a tolerance attribute to the Index class, and to plumb its use throughout the Index machinery. My immediate goal at this point is to not break anything. There is still a lot more work to do before this is ready for prime time, but hopefully I can get some input on best practices for mucking about in such deep internals of pandas.

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@pep8speaks

pep8speaks commented Jul 24, 2018

Hello @WeatherGod! Thanks for updating the PR.

Line 1267:9: E265 block comment should start with '# '
Line 1411:5: E265 block comment should start with '# '

Comment last updated on July 25, 2018 at 21:19 Hours UTC

@WillAyd
Member

WillAyd commented Jul 24, 2018

Is this in reference to any particular issue or discussion?

@jreback
Contributor

jreback commented Jul 24, 2018

you can already do this with reindex

what is the usecase?
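
For context, a minimal sketch of the existing `reindex` route mentioned above. The sample values and the `1e-9` tolerance are illustrative assumptions, not taken from the PR; `method="nearest"` and `tolerance` are existing `Series.reindex` parameters.

```python
import pandas as pd

# Series keyed by floats that carry rounding error (0.1 + 0.2 != 0.3).
s = pd.Series([10.0, 20.0, 30.0], index=[0.1, 0.2, 0.1 + 0.2])

# Exact reindexing misses the key 0.3 and fills it with NaN ...
exact = s.reindex([0.1, 0.2, 0.3])

# ... while nearest-match reindexing with a tolerance recovers the value.
fuzzy = s.reindex([0.1, 0.2, 0.3], method="nearest", tolerance=1e-9)

print(exact)
print(fuzzy)
```

Note that the tolerance has to be passed explicitly on every call here; it is not remembered by the index.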

@shoyer
Member

shoyer commented Jul 24, 2018

The most relevant issues are probably #9530 and #9817, as well as pydata/xarray#2217 downstream.

The use case here is the ability to make indexes that always do alignment using a tolerance. Pandas' current automatic alignment is not so useful when using floating point indexes, because that alignment is done without any consideration of near matches.
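
A small sketch of the alignment problem described above, using made-up values: two Series whose float labels differ only by rounding error do not line up under automatic alignment.

```python
import pandas as pd

# The second label of `a` is 0.30000000000000004, not 0.3.
a = pd.Series([1.0, 2.0], index=[0.2, 0.1 + 0.2])
b = pd.Series([10.0, 20.0], index=[0.2, 0.3])

# Arithmetic aligns on exact label equality, so the two "0.3" labels
# are treated as different keys and each produces NaN.
total = a + b
print(total)
```

The result has three labels instead of two, and only the `0.2` entry holds a sum.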

@WeatherGod
Author

Right, to summarize a bit, quite often with float64 indexes, you have two indexes which logically have similar keys, but because they were computed slightly differently, or came from different sources, they aren't binary identical.

I first tried implementing this within Float64Index alone, but quickly ran into issues that needed support in the base class. Once that happened, a lot of this had to be implemented in the other Index classes as well.

The basic premise of the design is that explicit always overrides implicit (the implicit value being the tolerance attribute), which is why tolerance was added as an argument to many of the set operations. Also, any index resulting from these operations has its own tolerance attribute set to the tolerance that was actually used.
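
The premise can be sketched with a toy class. This is not the PR's code and not pandas API: `TolerantKeys`, its `union` signature, and the choice to fall back to the left operand's tolerance are all hypothetical, chosen only to show "explicit overrides implicit" and the result carrying the tolerance that was used.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TolerantKeys:
    """Toy stand-in for an index with a default tolerance (hypothetical)."""

    keys: Tuple[float, ...]
    tolerance: float = 0.0

    def union(self, other: "TolerantKeys",
              tolerance: Optional[float] = None) -> "TolerantKeys":
        # An explicit argument always wins over the implicit attribute.
        tol = self.tolerance if tolerance is None else tolerance
        merged = list(self.keys)
        for key in other.keys:
            # A key counts as present if it lies within `tol` of a kept key.
            if not any(abs(key - seen) <= tol for seen in merged):
                merged.append(key)
        # The result remembers the tolerance that was actually used.
        return TolerantKeys(tuple(sorted(merged)), tol)


a = TolerantKeys((0.1, 0.2, 0.1 + 0.2), tolerance=1e-9)
b = TolerantKeys((0.3, 0.4))

implicit = a.union(b)                 # falls back to a.tolerance
explicit = a.union(b, tolerance=0.0)  # explicit 0.0 overrides the attribute
```

With the implicit tolerance the two near-identical `0.3` keys collapse into one; with the explicit `0.0` they stay distinct, and each result records the tolerance it was built with.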

* took care of wrappers in datetimes and interval
* fix tolerance handling in extended dtype index construction
* fix unpickling of old pickles and a bug in numeric index unpickling
* fix tolerance for constructor delegation in `__new__`.

@gfyoung added the Enhancement and Indexing (Related to indexing on series/frames, not to indexes themselves) labels on Jul 25, 2018

@WeatherGod
Author

My employer has changed priorities for me, so I have been unable to pursue this work any further, and I don't foresee having any free time to spend on this. I hope someone else can take this work further, even if it is just going through and adding documentation.

The other major effort needed in this PR is to update the Cython helpers for tolerance support, and to add unit tests.

@jreback
Contributor

jreback commented Nov 23, 2018

nice idea. PR is stale. if you'd like to continue, pls ping.

Labels
Enhancement Indexing Related to indexing on series/frames, not to indexes themselves
6 participants