Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indexing categories #387

Closed
henryiii opened this issue Jul 2, 2020 · 8 comments · Fixed by #689
Closed

Indexing categories #387

henryiii opened this issue Jul 2, 2020 · 8 comments · Fixed by #689
Milestone

Comments

@henryiii
Copy link
Member

henryiii commented Jul 2, 2020

When filling, an int or str category will put values not in the axes either in a new bin (growth), a flow bin, or ignore it if overflow is off. However, the Python index method does the same thing - for example:

h = bh.Histogram(bh.axis.StrCategory(["a", "b"]))
h.fill(["c"])
print(h[bh.loc("d")]) # Returns 1.0

Also, if growth was on, this will not grow the axes, though at least one user expected that in the past.
Should Python indexes into non-existing string or int categories throw a TypeError or KeyError? I think this would be less error-prone. You could still use bh.overflow to access the flow bin explicitly.

Unlike axes with both an underflow and overflow, the categories represent a set of items, and Python tools for this sort of thing (like a dict) generally throw errors when you try to access a non-existent item (but not when setting it, so fill is still consistent).

@HDembinski ?

@HDembinski
Copy link
Member

Throwing a ValueError or KeyError may help.

@HDembinski
Copy link
Member

HDembinski commented Jul 3, 2020

I am not sure which one. A histogram with a category axis is not a dict, and we should not stress that analogy, so I am rather leaning toward ValueError.

@henryiii
Copy link
Member Author

henryiii commented Jul 5, 2020

As you correctly inferred, I meant ValueError above, not TypeError. :)

Let me check a couple of other packages, specifically Pandas, to see what they throw (when I'm not on iOS, Pythonista doesn't include Pandas, sadly).

@HDembinski
Copy link
Member

Good idea.

@henryiii
Copy link
Member Author

henryiii commented Jul 7, 2020

Pandas does seem to throw a KeyError, both for a DataFrame:

d = pd.DataFrame({"col_one": [1]})
d["col_two"] # throws a key error

Or for a Series (which is arguably closer to what we have):

s = pd.Series([1,2,3], ["a", "b", "c"])
s["d"] # Throws key error

@henryiii
Copy link
Member Author

henryiii commented Jul 9, 2020

Xarray (from the SciPy tutorial today) also uses KeyError (though it's using Pandas Indexers in the backend, so not shocking that it matches Pandas). Even for floating point coordinates. Xarray is very close to what we do. I would vote for KeyError.

@HDembinski
Copy link
Member

Ok.

@henryiii henryiii added this to the 1.0.0 milestone Jul 11, 2020
@henryiii henryiii modified the milestones: 1.0.0, 0.13.0, 1.1.0 Feb 9, 2021
@nsmith-
Copy link

nsmith- commented Feb 24, 2021

Just leaving a comment to say I ran into this (unexpected to me) behavior and look forward to the KeyError.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants