iloc indexing ~10 times slower than direct column indexing +no documention of it #29316

GregoryMorse · 2019-11-01T01:13:54Z

Code Sample, a copy-pastable example if possible

Updated without yfinance as it was a lousy choice of example - requires only standard numpy and datetime. A DateTimeIndex exacerbates the time slowness.

import timeit
timeit.timeit(stmt='for x in range(len(hist)): _ = hist.iloc[x].Col1', setup='import numpy as np; import pandas as pd; col = np.random.uniform(low=-1, high=1, size=(10000)); hist = pd.DataFrame({"Col1":col,"Col2":col,"Col3":col,"Col4":col,"Col5":col})', number=3)
#3.553613900000073
timeit.timeit(stmt='for x in range(len(hist)): _ = hist.Col1[x]', setup='import numpy as np; import pandas as pd; col = np.random.uniform(low=-1, high=1, size=(10000)); hist = pd.DataFrame({"Col1":col,"Col2":col,"Col3":col,"Col4":col,"Col5":col})', number=3)
#0.5215671999999358
timeit.timeit(stmt='for x in range(len(hist)): _ = hist.iloc[x].Col1', setup='import datetime; import numpy as np; import pandas as pd; col = np.random.uniform(low=-1, high=1, size=(10000)); hist = pd.DataFrame({"Col1":col,"Col2":col,"Col3":col,"Col4":col,"Col5":col}, [datetime.datetime.now() + datetime.timedelta(seconds=x) for x in range(10000)])', number=3)
#4.485623600000508
timeit.timeit(stmt='for x in range(len(hist)): _ = hist.Col1[x]', setup='import datetime; import numpy as np; import pandas as pd; col = np.random.uniform(low=-1, high=1, size=(10000)); hist = pd.DataFrame({"Col1":col,"Col2":col,"Col3":col,"Col4":col,"Col5":col}, [datetime.datetime.now() + datetime.timedelta(seconds=x) for x in range(10000)])', number=3)
#0.7185356999998476

~~Old code example:~~
~~#requires pip install yfinance for this example - could substitute with any data set~~
~~timeit.timeit(stmt='for x in range(len(hist)): _ = hist.iloc[x].Close', setup='import yfinance as yf; hist = yf.Ticker("MSFT").history(period="max")', number=3)~~
~~#5.178573199999846~~
~~timeit.timeit(stmt='for x in range(len(hist)): _ = hist.iloc[x].Close', setup='import yfinance as yf; hist = yf.Ticker("MSFT").history(period="max")', number=3)~~
~~#0.5886300999998184~~

Problem description

One would naturally expect iloc to have close to identical efficiency as column indexing. This is a dramatic blunder in a lot of code more than likely if it is not well documented and known. I would expect some strange inefficiency is present as logically it could be slightly slower to return a row rather than a data point. But 10 times slower is so dramatic that this function should be avoided.

Expected Output

Less than double the speed of single column value access. Or at least thorough documentation of this limitation.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : None.None

pandas : 0.25.1
numpy : 1.16.5
pytz : 2019.3
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.4.0
Cython : 0.29.13
pytest : 5.2.1
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.9
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.1

The text was updated successfully, but these errors were encountered:

jreback · 2019-11-01T01:30:28Z

what does this have to do with pandas? what is yfinance? show a self contained example

GregoryMorse · 2019-11-01T01:54:40Z

@jreback this has been done. Thanks for the advice.

GregoryMorse · 2019-11-01T02:10:29Z

It is even stranger than I first noticed. Apparently using iloc on the column is almost twice as fast than just using standard Python list indexing with brackets:

timeit.timeit(stmt='for x in range(len(hist)): _ = hist.Col1.iloc[x]', setup='import datetime; import numpy as np; import pandas as pd; col = np.random.uniform(low=-1, high=1, size=(10000)); hist = pd.DataFrame({"Col1":col,"Col2":col,"Col3":col,"Col4":col,"Col5":col}, [datetime.datetime.now() + datetime.timedelta(seconds=x) for x in range(10000)])', number=3)
0.424690400000145

This is too counter-intuitive to not be thoroughly documented at minimum.

As far as I can tell - iloc is optimized for integers so it makes sense that no data inspection is needed (since standard indexing checks if the type is same as the index data type). Still the time increase is too dramatic. Since a major goal of this library is for fast processing of data sets.

jreback · 2019-11-01T11:20:19Z

In [20]: %timeit hist.Col1[1]                                                                                                                            
21.9 µs ± 367 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [21]: %timeit hist.iloc[1].Col1                                                                                                                       
164 µs ± 2.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [22]: %timeit hist.iat[1, 0]                                                                                                                          
7.97 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

individual scalar selections are rarely important when dealing with vectorized data.

jreback · 2019-11-01T11:20:25Z

closing as a user question.

jreback closed this as completed Nov 1, 2019

jreback added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Nov 1, 2019

jreback added this to the No action milestone Nov 1, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

iloc indexing ~10 times slower than direct column indexing +no documention of it #29316

iloc indexing ~10 times slower than direct column indexing +no documention of it #29316

GregoryMorse commented Nov 1, 2019 •

edited

Loading

INSTALLED VERSIONS

jreback commented Nov 1, 2019

GregoryMorse commented Nov 1, 2019

GregoryMorse commented Nov 1, 2019 •

edited

Loading

jreback commented Nov 1, 2019

jreback commented Nov 1, 2019

iloc indexing ~10 times slower than direct column indexing +no documention of it #29316

iloc indexing ~10 times slower than direct column indexing +no documention of it #29316

Comments

GregoryMorse commented Nov 1, 2019 • edited Loading

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

jreback commented Nov 1, 2019

GregoryMorse commented Nov 1, 2019

GregoryMorse commented Nov 1, 2019 • edited Loading

jreback commented Nov 1, 2019

jreback commented Nov 1, 2019

GregoryMorse commented Nov 1, 2019 •

edited

Loading

Output of `pd.show_versions()`

GregoryMorse commented Nov 1, 2019 •

edited

Loading