iloc indexing ~10 times slower than direct column indexing + no documentation of it #29316
Comments
What does this have to do with pandas? What is yfinance? Show a self-contained example.
@jreback this has been done. Thanks for the advice.
It is even stranger than I first noticed. Apparently using iloc on the column is almost twice as fast as using standard Python-style indexing with brackets:
This is too counter-intuitive to go undocumented. As far as I can tell, iloc is optimized for integer positions, so it makes sense that no data inspection is needed (standard indexing has to check whether the key's type matches the index's data type). Still, the slowdown is too dramatic, since a major goal of this library is fast processing of data sets.
Individual scalar selections are rarely important when dealing with vectorized data.
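To illustrate the point about vectorized data, here is a minimal sketch comparing a scalar-at-a-time loop with a single vectorized call. The tiny frame and the variable names are assumptions made up for the illustration; only the `Close` column name comes from the issue.

```python
import numpy as np
import pandas as pd

# Hypothetical five-row frame; "Close" mirrors the column used in the issue.
df = pd.DataFrame({"Close": np.arange(5, dtype="float64")})

# Scalar-at-a-time: every iteration goes through the pandas indexing machinery.
total_loop = 0.0
for x in range(len(df)):
    total_loop += df["Close"].iloc[x]

# Vectorized: one call that operates on the whole column at once.
total_vec = df["Close"].sum()

# If a per-element loop is unavoidable, extracting the NumPy array once
# keeps the pandas indexing overhead out of the loop body.
values = df["Close"].to_numpy()
total_np = 0.0
for v in values:
    total_np += v

print(total_loop, total_vec, total_np)
```

All three totals agree; the difference is only in how much per-element overhead is paid.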
Closing as a user question.
Code Sample, a copy-pastable example if possible
Updated without yfinance, which was a poor choice of example; it now requires only standard numpy and datetime. A DatetimeIndex exacerbates the slowdown.
Old code example:

# requires pip install yfinance for this example - could substitute any data set
import timeit
timeit.timeit(stmt='for x in range(len(hist)): _ = hist.iloc[x].Close', setup='import yfinance as yf; hist = yf.Ticker("MSFT").history(period="max")', number=3)
# 5.178573199999846
timeit.timeit(stmt='for x in range(len(hist)): _ = hist.iloc[x].Close', setup='import yfinance as yf; hist = yf.Ticker("MSFT").history(period="max")', number=3)
# 0.5886300999998184

Problem description
One would naturally expect iloc to be about as efficient as column indexing. If this is not well documented and widely known, it is probably a costly mistake in a lot of code. I would expect some overhead, since returning a whole row is logically more work than returning a single data point. But 10 times slower is so dramatic that this function should be avoided.
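A self-contained way to compare the access patterns under discussion might look like the sketch below. The synthetic frame (size, dates, random values, the extra `Open` column) is an assumption standing in for the price history, and the helper function names are hypothetical; timings will vary by machine and pandas version, so none are quoted.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the price history: only the "Close" column name
# comes from the issue; size, dates, and values are arbitrary assumptions.
n = 2_000
hist = pd.DataFrame(
    {"Open": np.random.rand(n), "Close": np.random.rand(n)},
    index=pd.date_range("2000-01-01", periods=n, freq="D"),
)


def row_then_column():
    # Builds a full row Series on every iteration, then picks one field.
    return [hist.iloc[x].Close for x in range(len(hist))]


def column_then_row():
    # Selects the column, then takes one positional scalar from it.
    return [hist.Close.iloc[x] for x in range(len(hist))]


def scalar_accessor():
    # .iat is the positional accessor intended for single-value lookups.
    col = hist.columns.get_loc("Close")
    return [hist.iat[x, col] for x in range(len(hist))]


# All three patterns return the same values; only the cost differs.
assert row_then_column() == column_then_row() == scalar_accessor()

for fn in (row_then_column, column_then_row, scalar_accessor):
    print(fn.__name__, timeit.timeit(fn, number=3))
```

The row-first pattern has to materialize a new Series for each row before the attribute lookup, which is one plausible source of the extra cost reported here.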
Expected Output
Less than double the time of single column value access, or at least thorough documentation of this limitation.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : None.None
pandas : 0.25.1
numpy : 1.16.5
pytz : 2019.3
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.4.0
Cython : 0.29.13
pytest : 5.2.1
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.9
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.1