API: support "unique=True" in MultiIndex.get_level_values() #17896

Closed
toobaz opened this issue Oct 16, 2017 · 7 comments · Fixed by #17897

Comments

@toobaz
Member

toobaz commented Oct 16, 2017

Code Sample, a copy-pastable example if possible

I often find myself doing:

In [1]: import pandas as pd

In [2]: df = pd.Series(index=pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']]))

In [3]: df.index.get_level_values(0).unique()
Out[3]: Index(['A', 'B'], dtype='object')

Problem description

The above is very inefficient: first a full-length copy of the entire level is built, with one value per row (possibly using far more memory than the index itself), and only then are the duplicates stripped. Other people on Stack Overflow have faced the same problem, and this is also blocking a fix I wrote for #17845.

I'm pushing a simple PR in seconds.

Expected Output

Same as above, but in an efficient way.
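As an illustration of what "the same, but efficient" could mean, here is a minimal sketch that deduplicates the small integer codes first and only then looks up the matching level values. It is not taken from the PR mentioned above; the variable names are hypothetical, and it assumes a pandas version that exposes `MultiIndex.codes` (the attribute was called `labels` in older releases).

```python
import pandas as pd

mi = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']])

# Baseline from the report: expand level 0 to one value per row, then dedupe.
expanded = mi.get_level_values(0).unique()

# Sketch: dedupe the integer codes first, then look up only the used values.
codes = mi.codes[0]        # integer positions into mi.levels[0], one per row
used = pd.unique(codes)    # distinct codes, in order of first appearance
used = used[used >= 0]     # -1 marks a missing value; dropped here for simplicity
cheap = mi.levels[0].take(used)

print(list(expanded))  # ['A', 'B']
print(list(cheap))     # ['A', 'B']
```

Note that dropping the `-1` code means missing values are left out, whereas `get_level_values(0).unique()` would report them; a real implementation would have to decide how to handle that case.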

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-3-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.21.0rc1+19.gb15d92d14
pytest: 3.0.6
pip: 9.0.1
setuptools: None
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 5.1.0.dev
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: None
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.6
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1

@toobaz
Member Author

toobaz commented Oct 16, 2017

By the way: this is in principle related to #2770, which however is being tackled in a different and complementary way.

@jreback
Contributor

jreback commented Oct 16, 2017

how is this not just .levels ?

@jreback
Contributor

jreback commented Oct 16, 2017

#2770 is handled by remove_unused_levels()

@toobaz
Member Author

toobaz commented Oct 16, 2017

> how is this not just .levels ?

.levels includes unused labels (which is why users are often confused by it)
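A small sketch of the difference being described, assuming a current pandas install: after slicing, `.levels` still lists values that no row uses, while the values actually present (and the result of `remove_unused_levels()`, mentioned earlier in the thread) do not. The variable names are only for illustration.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']])

# Keep only the first two rows, ('A', 'a') and ('A', 'b'): 'B' is no longer used.
sub = mi[:2]

all_levels = list(sub.levels[0])                           # level storage is unchanged
present = list(sub.get_level_values(0).unique())           # values that actually occur
pruned = list(sub.remove_unused_levels().levels[0])        # levels after pruning

print(all_levels)  # ['A', 'B']
print(present)     # ['A']
print(pruned)      # ['A']
```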

@jreback
Contributor

jreback commented Oct 16, 2017

ok, you are adding it there, ok!

I am not sure unique is the right word here.

@jreback
Contributor

jreback commented Oct 16, 2017

.get_level_values(level, used=False), though I am not sure I like this either.

@jorisvandenbossche
Member

I agree it would be nice to have a clean way to get those unique values, but IMO it does not belong in get_level_values. That method returns the actual values of the Index level, with a length equal to the length of the Index, and IMO we should stick to that contract. Having such a keyword would completely alter the return type of this method.

(I don't have a good idea for an alternative right away, though.)
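The contract described in this comment can be checked directly. The sketch below also shows `unique(level=...)`, which recent pandas releases provide as a separate way to get the distinct values of one level; this thread itself does not say whether that is the form the fix in #17897 took, so treat the connection as an assumption.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']])

# get_level_values returns one value per row of the index.
values = mi.get_level_values(0)
print(len(values) == len(mi))  # True

# A separate method leaves that contract alone and returns only distinct values.
uniq = mi.unique(level=0)
print(list(uniq))  # ['A', 'B']
```

Keeping the deduplicating behaviour in its own method means the length and return shape of `get_level_values` never depend on a keyword argument.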

toobaz added 11 commits to toobaz/pandas that referenced this issue between Oct 17 and Nov 14, 2017
@jreback jreback added this to the 0.22.0 milestone Nov 15, 2017
toobaz added 5 commits to toobaz/pandas that referenced this issue on Nov 18 and Nov 19, 2017