read_csv() is 3.5X Slower in Pandas 0.23.4 on Python 3.7.1 vs Pandas 0.22.0 on Python 3.5.2 #23516
Comments
Can you benchmark just the change in python, and just the change in pandas separately? |
Yes, are there older builds of Pandas on Python 3.7.1? I suppose I can try newer pandas version on old Python. |
I think 0.23.2 is the first version of pandas to support 3.7
|
I ran the test on an older Python 3.5 stack with the latest Pandas version 0.23.4 but with a lot of older versions of other modules, and it looks to be running faster on Python 3.5. Now I'm not quite sure if it's pandas directly on Python 3.7.1 or one of its dependencies. Does the parser's [...]
Installed Python 3.5 stack with latest pandas 0.23.4:
INSTALLED VERSIONS
commit: None
pandas: 0.23.4
(remaining pd.show_versions() output collapsed) |
Interestingly, if I specify [...]:
Python 3.7.1:
Python 3.5.2:
|
Adding another data point: if I specify the 'python' engine, it looks like [...] on Python 3.7.1. Could this be due to the Cython version? (See the engine-comparison sketch after this comment.)
Python 3.7.1:
Python 3.5.2:
|
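For reference, a minimal sketch of the engine comparison being described, assuming a large numeric CSV named 'out.csv' like the one generated later in this thread:
import time
import pandas as pd

# Time the same file with the C parser and the pure-Python parser.
for engine in ("c", "python"):
    start = time.perf_counter()
    pd.read_csv("out.csv", engine=engine)
    print(engine, "engine:", round(time.perf_counter() - start, 2), "s")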
These last numbers are with what pandas version? |
They are both Pandas 0.23.4 |
I tried building the latest Pandas version from source on Python 3.7.1 and still got the same slower performance. Are there any build/compile/cython flags I can set to optimize the parser? |
the entire perf issue is simply the precision flag: you can choose higher precision, but it takes more time; this is rarely useful though |
I tried all three float_precision options. I also tried specifying a [...]. Can you reproduce in Python 3.6? I should reiterate that this perf difference is on the same version of Pandas, 0.23.4, just a different version of Python. |
is there a way to specify 'xstrtod', or is that specified by float_precision=None? I see no performance changes between 'high' and None. |
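A small sketch of how the settings can be compared, again assuming a placeholder 'out.csv' file; as I understand it, in the C parser float_precision=None selects the fast xstrtod path, 'high' selects precise_xstrtod, and 'round_trip' falls back to Python's own string-to-double conversion:
import time
import pandas as pd

# Time read_csv under each float_precision setting of the C engine.
for precision in (None, "high", "round_trip"):
    start = time.perf_counter()
    pd.read_csv("out.csv", float_precision=precision)
    print(precision, round(time.perf_counter() - start, 2), "s")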
Is anyone able to reproduce this on Python 3.7.1? I tested the code above on Python 3.7.0 using the Python.org interactive interpreter and it seemed to run faster than in my local 3.7.1 install. |
Something is definitely up. I did a side-by-side comparison reading the same CSV file from the same SSD: Python 3.5 reads at 111 MB/sec while Python 3.7 reads at only 28 MB/sec, both running Pandas 0.23.4. Could Python 3.7 have changed something in its I/O system? (See the throughput sketch after this comment.)
Python 3.5.2 & Pandas 0.23.4:
Python 3.7.1 & Pandas 0.23.4:
|
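A sketch of how per-file throughput figures like these can be measured; the file name is a placeholder for any large numeric CSV:
import os
import time
import pandas as pd

path = "out.csv"                       # placeholder for the test file
size_mb = os.path.getsize(path) / 1e6  # file size in MB
start = time.perf_counter()
pd.read_csv(path)
elapsed = time.perf_counter() - start
print("{:.1f} MB/sec".format(size_mb / elapsed))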
I don't see the difference you're seeing
3.7
Both of those are using Anaconda's packages. |
Tom, thanks for running this benchmark. Can you post your pd.show_versions()? I want to re-create your stack exactly to do some more testing. |
3.5
3.7:
|
I tried several different fresh Python installs on Windows. Every Python 3.7 install, 32- or 64-bit, with Pandas 0.23.4 pip-installed results in the slower CSV parsing speed. For fun I tried a fresh Python 3.6.7 install, and it again parses the same CSV 3X faster. Is there anyone who could test this on Windows 10 and Python 3.7.1? 😕 |
cc @chris-b1 in case you can test on Windows |
Indeed, I can confirm that there is a 3.5X slowdown when using Python 3.7.1 on Windows 10. When I use Python 3.5.6, the performance is unchanged from [...]. These observations are consistent with what @dragoljub was observing and appear to suggest that this is a Cython / Python issue and not [...] |
On Windows 10, between Python 3.6 and Python 3.7, I note a noticeable slowdown as well.
python 3.6:
(py36) PS C:\Users\ttttt> ipython
Python 3.6.4 | packaged by conda-forge | (default, Dec 24 2017, 10:11:43) [MSC v.1900 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import pandas as pd
In [2]: %time _ = pd.read_csv('out.csv', float_precision='high')
Wall time: 7.03 s
In [3]: %time _ = pd.read_csv('out.csv')
Wall time: 7.04 s
python 3.7:
(py37) PS C:\Users\ttttt> ipython
Python 3.7.1 (default, Oct 28 2018, 08:39:03) [MSC v.1912 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame(np.random.randn(1000000, 10), columns=('COL{}'.format(i) for i in range(10)))
In [6]: df.to_csv('out.csv')
In [7]: %time _ = pd.read_csv('out.csv', float_precision='high')
Wall time: 29.4 s
In [8]: %time _ = pd.read_csv('out.csv')
Wall time: 31.3 s |
For people on windows, how are you installing pandas? From source, wheels, or conda packages? And if conda, from defaults or from conda-forge? |
Here conda-forge:
PS C:\Users\ttttt> activate py37
(py37) PS C:\Users\ttttt> conda install ipython pandas
Solving environment: done
## Package Plan ##
environment location: C:\Miniconda\envs\py37
added / updated specs:
- ipython
- pandas
The following packages will be downloaded:
package | build
---------------------------|-----------------
ipython-7.1.1 |py37h39e3cac_1000 1.1 MB conda-forge
wcwidth-0.1.7 | py_1 17 KB conda-forge
six-1.11.0 | py37_1001 21 KB conda-forge
pytz-2018.7 | py_0 226 KB conda-forge
icc_rt-2017.0.4 | h97af966_0 8.0 MB
pygments-2.2.0 | py_1 622 KB conda-forge
pickleshare-0.7.5 | py37_1000 12 KB conda-forge
certifi-2018.10.15 | py37_1000 137 KB conda-forge
backcall-0.1.0 | py_0 13 KB conda-forge
mkl_random-1.0.1 | py37h77b88f5_1 267 KB
decorator-4.3.0 | py_0 10 KB conda-forge
numpy-1.15.4 | py37ha559c80_0 36 KB
mkl-2019.0 | 118 178.1 MB
pandas-0.23.4 |py37h830ac7b_1000 8.7 MB conda-forge
prompt_toolkit-2.0.7 | py_0 218 KB conda-forge
python-dateutil-2.7.5 | py_0 218 KB conda-forge
colorama-0.4.0 | py_0 15 KB conda-forge
mkl_fft-1.0.6 | py37hdbbee80_0 120 KB
jedi-0.13.1 | py37_1000 228 KB conda-forge
intel-openmp-2019.0 | 118 1.7 MB
parso-0.3.1 | py_0 59 KB conda-forge
traitlets-4.3.2 | py37_1000 130 KB conda-forge
ipython_genutils-0.2.0 | py_1 21 KB conda-forge
numpy-base-1.15.4 | py37h8128ebf_0 3.9 MB
blas-1.0 | mkl 6 KB
------------------------------------------------------------
Total: 203.7 MB |
Thanks @toniatop. Can you create a couple environments with just defaults to see if it's an issue with how it was compiled for conda-forge? |
I redid everything forcing --channel anaconda; same results. |
cc @jjhelmus any thoughts on
|
Good info. I'm just surprised that people do not see this on Linux. I'll try OSX next. |
FYI my timings above were on OSX (no slowdown)
…On Fri, Nov 9, 2018 at 2:16 PM Gagi ***@***.***> wrote:
I'd be surprised that change matters, but I'm at a loss here, so maybe!
Another possibility is that cython made some tweaks to threading logic for
python 3.7 compat - again, wouldn't think that's the issue here, but
possible some kind of bad interaction.
cython/cython#1978 <cython/cython#1978>
Good info. I'm just surprised that people do not see this on Linux. I'll
try OSX next.
|
Looks like the slowdown first shows up in Python 3.7.0a4.
|
Very interesting. I'll try Py 3.7.0a3 to confirm this on my systems. Is the diff between those two alpha releases available anywhere? |
https://docs.python.org/3.7/whatsnew/changelog.html#python-3-7-0-alpha-4
Maybe [...]. See also python/cpython@v3.7.0a3...v3.7.0a4
I can also confirm that the change from Python 3.7.0a3 to 3.7.0a4 shows the slowdown on my Win10 test system. Thanks for finding when the slowdown occurred.
Python 3.7.0a3 -- Fast Parse
%prun df2 = pd.read_csv(csv)
5781 function calls (5743 primitive calls) in 3.062 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 2.953 2.953 2.955 2.955 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
1 0.063 0.063 0.063 0.063 internals.py:5017(_stack_arrays)
1 0.016 0.016 3.052 3.052 parsers.py:414(_read)
1 0.009 0.009 3.062 3.062 <string>:1(<module>)
1 0.009 0.009 0.009 0.009 parsers.py:1685(__init__)
32 0.004 0.000 0.004 0.000 {built-in method nt.stat}
1 0.001 0.001 0.001 0.001 {method 'close' of 'pandas._libs.parsers.TextReader' objects}
321 0.001 0.000 0.002 0.000 common.py:811(is_integer_dtype)
516 0.001 0.000 0.001 0.000 common.py:1835(_get_dtype_type)
7 0.001 0.000 0.001 0.000 {built-in method numpy.core.multiarray.empty}
32 0.000 0.000 0.005 0.000 <frozen importlib._bootstrap_external>:1235(find_spec)
988 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
163 0.000 0.000 0.000 0.000 common.py:1527(is_float_dtype)
718 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}
192 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:59(<listcomp>)
192 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:57(_path_join)
8 0.000 0.000 0.005 0.001 <frozen importlib._bootstrap_external>:1119(_get_spec)
133 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
68 0.000 0.000 0.000 0.000 generic.py:7(_check)
Python 3.7.0a4 -- Slow Parse
%prun df2 = pd.read_csv(csv)
8007 function calls (7219 primitive calls) in 14.192 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 14.092 14.092 14.094 14.094 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
1 0.061 0.061 0.062 0.062 internals.py:5017(_stack_arrays)
1 0.016 0.016 14.192 14.192 parsers.py:414(_read)
1 0.008 0.008 0.008 0.008 parsers.py:1685(__init__)
32 0.004 0.000 0.004 0.000 {built-in method nt.stat}
1 0.001 0.001 0.001 0.001 {method 'close' of 'pandas._libs.parsers.TextReader' objects}
321 0.001 0.000 0.002 0.000 common.py:811(is_integer_dtype)
516 0.001 0.000 0.001 0.000 common.py:1835(_get_dtype_type)
7 0.001 0.000 0.001 0.000 {built-in method numpy.core.multiarray.empty}
115/4 0.000 0.000 0.001 0.000 abc.py:194(__subclasscheck__)
32 0.000 0.000 0.005 0.000 <frozen importlib._bootstrap_external>:1322(find_spec)
1324/988 0.000 0.000 0.002 0.000 {built-in method builtins.isinstance}
937/725 0.000 0.000 0.002 0.000 {built-in method builtins.issubclass}
163 0.000 0.000 0.000 0.000 common.py:1527(is_float_dtype)
192 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:57(_path_join)
192 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:59(<listcomp>)
8 0.000 0.000 0.005 0.001 <frozen importlib._bootstrap_external>:1206(_get_spec)
89/78 0.000 0.000 0.000 0.000 {built-in method builtins.len}
192 0.000 0.000 0.000 0.000 {built-in method builtins.getattr} |
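For anyone reproducing these profiles outside IPython, a sketch of a cProfile equivalent of the %prun calls above (run as a script so pd resolves in __main__; 'out.csv' again stands in for the test CSV):
import cProfile
import pandas as pd

# As in the %prun output above, nearly all of the time is attributed to
# pandas._libs.parsers.TextReader.read on the slow builds.
cProfile.run("pd.read_csv('out.csv')", sort="tottime")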
I tried playing around with the UTF-8 mode settings with ENV variables and cmd line args on Windows and was not able to get faster parsing speed on Python 3.7.0a4. https://www.python.org/dev/peps/pep-0540/#proposal
So is it possible that somewhere in the C parser extension we could just set the locale to UTF-8 and this issue would go away on Windows? I was hoping the ENV variable settings would fix the issue, but they did not make a difference in my testing. |
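For reference, a quick sketch for checking whether PEP 540 UTF-8 mode actually took effect after setting PYTHONUTF8=1 or passing -X utf8 (sys.flags.utf8_mode is available from Python 3.7):
import sys
import locale

print("utf8_mode:", sys.flags.utf8_mode)                         # 1 when UTF-8 mode is enabled
print("preferred encoding:", locale.getpreferredencoding(False))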
I compared the statement
|
Great work debugging this. I would guess any other code paths calling isdigit would also be slowed down on Windows. |
Just a note for people looking at |
I may have found a pure-Python example that seems to show a similar but smaller 2.5X slowdown. Also note the variability is 15X higher for the 3.7.1 run, possibly indicating that the locale argument is passed/used in some calls but not others. Can someone test this on Linux and see if you see a difference?
Python 3.7.1:
digits = ''.join([str(i) for i in range(10)]*10000000)
%timeit digits.isdigit()  # --> 2.5X slower on python 3.7.1
537 ms ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Python 3.5.2:
digits = ''.join([str(i) for i in range(10)]*10000000)
%timeit digits.isdigit()
215 ms ± 986 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Based on comments from: https://bugs.python.org/msg329789
@cgohlke has posted a nice minimal example showing the slowdown: https://bugs.python.org/msg329790 Thanks! 👍 |
Thanks for the investigation @cgohlke - for 0.24 I suppose we should just shim in an ASCII (MUSL, MIT licensed) isdigit |
@chris-b1 I was thinking the same thing, since it's quite a simple function; however, then changing the locale would be limited. I wonder how the Windows isdigit function ends up calling the locale version. I don't think that source is available. |
The source code for the Windows UCRT is available with recent Windows SDKs. It is usually installed under [...]. The relevant definitions:
extern "C" extern __inline int (__cdecl isdigit)(int const c)
{
    return __acrt_locale_changed()
        ? (_isdigit_l)(c, nullptr)
        : fast_check(c, _DIGIT);
}

extern "C" extern __inline int (__cdecl _isdigit_l)(int const c, _locale_t const locale)
{
    _LocaleUpdate locale_update(locale);
    return _isdigit_l(c, locale_update.GetLocaleT());
}
The following comment is from the [...]:
// If no call has been made to setlocale to change locale from "C" locale
// to some other locale, we keep locale_changed = 0. Other functions that
// depend on locale use this variable to optimize performance for C locale
// which is normally the case in applications. |
So, if I'm understanding it correctly: even if we set the locale in Python to "C", the Windows isdigit function would still resort to calling the locale-aware isdigit version, slowing down parsing, because the locale has 'changed'. Is that the case in Python 3.7.0a3? Does setting the locale to "C" slow parsing down there? |
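One way to probe that question from Python (a sketch; 'out.csv' is a placeholder for the test CSV, and whether this changes anything depends on whether the CRT has already marked the locale as changed, as in the UCRT comment above):
import locale
import time
import pandas as pd

print(locale.setlocale(locale.LC_ALL, None))   # report the current locale without changing it
locale.setlocale(locale.LC_ALL, "C")           # explicitly request the "C" locale
start = time.perf_counter()
pd.read_csv("out.csv")
print(round(time.perf_counter() - start, 2), "s")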
@jreback @TomAugspurger Do you think a simple shim of the isdigit function in the C parser code would be a fix we could entertain? It would assume ASCII-compatible encoding for numeric columns, which I think should cover all/most CSV file encodings for digits.
int isdigit(int c)
{
    return (unsigned)c-'0' < 10;
} |
Yeah, if you want to submit a PR, ping me; if not, I'll try to get to it soon |
@chris-b1 Go for it! 😄 |
Code Sample, a copy-pastable example if possible
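A reconstruction of the kind of benchmark used throughout this thread (the exact original sample is not shown here; this follows the DataFrame/CSV setup from the later comments, with 'out.csv' as a placeholder path):
import time
import numpy as np
import pandas as pd

# Build a large all-float CSV (1,000,000 rows x 10 columns) and time reading it back.
df = pd.DataFrame(np.random.randn(1000000, 10),
                  columns=['COL{}'.format(i) for i in range(10)])
df.to_csv('out.csv')

start = time.perf_counter()
df2 = pd.read_csv('out.csv')
print('read_csv took', round(time.perf_counter() - start, 2), 's')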
Problem description
pd.read_csv(), using the _libs.parsers.TextReader read() method, is 3.5X slower on Pandas 0.23.4 on Python 3.7.1 compared to Pandas 0.22.0 on Python 3.5.2.
Expected Output
Output of pd.show_versions() -- Latest Python 3.7.1 Pandas 0.23.4 : Slow Read CSV
INSTALLED VERSIONS
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 2008ServerR2
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.23.4
pytest: 3.9.2
pip: 18.1
setuptools: 40.4.3
Cython: 0.29
numpy: 1.15.3
scipy: 1.1.0
pyarrow: 0.11.0
xarray: 0.10.9
IPython: 7.0.1
sphinx: 1.8.1
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: 1.6.1
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.12
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None
Output of pd.show_versions() -- Older Python 3.5.2 Pandas 0.22.0 : Fast Read CSV
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.22.0
pytest: 3.5.0
pip: 9.0.3
setuptools: 20.10.1
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.1
pyarrow: 0.9.0
xarray: 0.10.2
IPython: 6.3.0
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.1
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.6
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None