Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_csv() is 3.5X Slower in Pandas 0.23.4 on Python 3.7.1 vs Pandas 0.22.0 on Python 3.5.2 #23516

Closed
dragoljub opened this issue Nov 5, 2018 · 56 comments · Fixed by #23981
Closed
Labels
IO CSV read_csv, to_csv Performance Memory or execution speed performance
Milestone

Comments

@dragoljub
Copy link

Code Sample, a copy-pastable example if possible

import io
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000000, 10), columns=('COL{}'.format(i) for i in range(10)))
csv = io.StringIO(df.to_csv(index=False))
df2 = pd.read_csv(csv)

Problem description

pd.read_csv() using _libs.parsers.TextReader read() method is 3.5X slower on Pandas 0.23.4 on Python 3.7.1 compared to Pandas 0.22.0 on Python 3.5.2.

 4244 function calls (4210 primitive calls) in 10.273 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   10.202   10.202   10.204   10.204 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
        1    0.039    0.039    0.039    0.039 internals.py:5017(_stack_arrays)
        1    0.011    0.011   10.262   10.262 parsers.py:414(_read)
        1    0.011    0.011   10.273   10.273 <string>:1(<module>)
        1    0.004    0.004    0.004    0.004 parsers.py:1685(__init__)
      321    0.001    0.000    0.002    0.000 common.py:811(is_integer_dtype)

Expected Output

3229 function calls (3222 primitive calls) in 2.944 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.881    2.881    2.882    2.882 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
        1    0.045    0.045    0.045    0.045 internals.py:4801(_stack_arrays)
        1    0.010    0.010    2.944    2.944 parsers.py:423(_read)
        1    0.004    0.004    0.004    0.004 parsers.py:1677(__init__)
      320    0.001    0.000    0.001    0.000 common.py:777(is_integer_dtype)
        1    0.001    0.001    0.001    0.001 {method 'close' of 'pandas._libs.p

Output of pd.show_versions() -- Latst Python 3.7.1 Pandas 0.23.4 : Slow Read CSV

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 2008ServerR2
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.9.2
pip: 18.1
setuptools: 40.4.3
Cython: 0.29
numpy: 1.15.3
scipy: 1.1.0
pyarrow: 0.11.0
xarray: 0.10.9
IPython: 7.0.1
sphinx: 1.8.1
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: 1.6.1
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.12
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None

Output of pd.show_versions() -- Older Python 3.5.2 Pandas 0.22.0 : Fast Read CSV

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: 3.5.0
pip: 9.0.3
setuptools: 20.10.1
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.1
pyarrow: 0.9.0
xarray: 0.10.2
IPython: 6.3.0
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.1
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.6
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Copy link
Contributor

Can you benchmark just the change in python, and just the change in pandas separately?

@dragoljub
Copy link
Author

Can you benchmark just the change in python, and just the change in pandas separately?

Yes, are there older builds of Pandas on Python 3.7.1? I suppose I can try newer pandas version on old Python.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 5, 2018 via email

@gfyoung gfyoung added Performance Memory or execution speed performance IO CSV read_csv, to_csv Needs Info Clarification about behavior needed to assess issue labels Nov 5, 2018
@dragoljub
Copy link
Author

I ran the test on an older Python 3.5 stack with the latest Pandas version 0.23.4 but with a lot of older versions of other modules and it looks to be running faster on Python 3.5. Now, I'm not quite sure if its pandas directly on python 3.7.1 or one of its dependencies.

Does the parser's _read() method rely on some other library that may be the culprit?

 %prun df2 = pd.read_csv(csv)
         5154 function calls (5041 primitive calls) in 2.004 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.960    1.960    1.962    1.962 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
        1    0.030    0.030    0.030    0.030 internals.py:5017(_stack_arrays)
        1    0.006    0.006    2.004    2.004 parsers.py:414(_read)
        1    0.003    0.003    0.003    0.003 parsers.py:1685(__init__)
      321    0.001    0.000    0.001    0.000 common.py:811(is_integer_dtype)
      518    0.000    0.000    0.000    0.000 common.py:1835(_get_dtype_type)

Installed Python 3.5 Stack With Latest pandas 0.23.4:

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.5.0
pip: 18.0
setuptools: 38.6.0
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.1
pyarrow: 0.10.0
xarray: 0.10.2
IPython: 6.3.0
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: 1.5.1
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.1
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.6
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None

<\details>

@dragoljub
Copy link
Author

Interestingly if I specify float_precision='round_trip' I get similar parsing speeds. If I specify 'high' or None then its back to the same 3.5x difference.

Python 3.7.1

 %prun df2 = pd.read_csv(csv, float_precision='round_trip')
         4320 function calls (4286 primitive calls) in 4.074 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    3.862    3.862    3.864    3.864 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
        1    0.168    0.168    4.074    4.074 <string>:1(<module>)
        1    0.030    0.030    0.030    0.030 internals.py:5017(_stack_arrays)
        1    0.006    0.006    3.906    3.906 parsers.py:414(_read)
        1    0.003    0.003    0.003    0.003 parsers.py:1685(__init__)
      321    0.001    0.000    0.002    0.000 common.py:811(is_integer_dtype)
      516    0.000    0.000    0.001    0.000 common.py:1835(_get_dtype_type)
      952    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}

Python 3.5.2

 %prun df2 = pd.read_csv(csv, float_precision='round_trip')
         4582 function calls (4545 primitive calls) in 3.716 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    3.665    3.665    3.667    3.667 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
        1    0.031    0.031    0.031    0.031 internals.py:5017(_stack_arrays)
        1    0.006    0.006    3.710    3.710 parsers.py:414(_read)
        1    0.006    0.006    3.716    3.716 <string>:1(<module>)
        1    0.003    0.003    0.003    0.003 parsers.py:1685(__init__)
      321    0.001    0.000    0.001    0.000 common.py:811(is_integer_dtype)
      518    0.000    0.000    0.001    0.000 common.py:1835(_get_dtype_type)

@dragoljub
Copy link
Author

Adding another data point. If I specify the 'python' engine. It looks like on Python 3.7.1 pandas._libs.lib.maybe_convert_numeric is 3X slower than on Python on 3.5.2

Could this be due to the cython version?

Python 3.7.1

%prun df2 = pd.read_csv(csv, engine='python')
         7003613 function calls (7003575 primitive calls) in 14.411 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       10    9.698    0.970    9.698    0.970 {pandas._libs.lib.maybe_convert_numeric}
  1000004    3.221    0.000    3.221    0.000 {built-in method builtins.next}
        1    0.386    0.386    4.066    4.066 parsers.py:2926(_get_lines)
        1    0.263    0.263   14.399   14.399 parsers.py:1029(read)
        4    0.154    0.038    0.247    0.062 parsers.py:2738(_remove_empty_lines)
  1000002    0.138    0.000    3.359    0.000 parsers.py:2681(_next_iter_line)
  2000072    0.125    0.000    0.125    0.000 {method 'append' of 'list' objects}
        1    0.116    0.116    0.116    0.116 {pandas._libs.lib.to_object_array}
  1000001    0.103    0.000    0.138    0.000 parsers.py:2869(<genexpr>)
2000130/2000117    0.069    0.000    0.069    0.000 {built-in method builtins.len}
       14    0.067    0.005    0.204    0.015 {built-in method builtins.max}

Python 3.5.2

 %prun df2 = pd.read_csv(csv, engine='python')
         7004040 function calls (7004000 primitive calls) in 8.411 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1000003    3.662    0.000    3.662    0.000 {built-in method builtins.next}
       10    3.270    0.327    3.270    0.327 {pandas._libs.lib.maybe_convert_numeric}
        1    0.378    0.378    4.492    4.492 parsers.py:2926(_get_lines)
        1    0.263    0.263    8.398    8.398 parsers.py:1029(read)
        4    0.167    0.042    0.245    0.061 parsers.py:2738(_remove_empty_lines)
  1000002    0.141    0.000    3.803    0.000 parsers.py:2681(_next_iter_line)
        1    0.128    0.128    0.128    0.128 {pandas._libs.lib.to_object_array}
  2000067    0.109    0.000    0.109    0.000 {method 'append' of 'list' objects}
  1000001    0.108    0.000    0.133    0.000 parsers.py:2869(<genexpr>)
       14    0.060    0.004    0.193    0.014 {built-in method builtins.max}

@jbrockmendel
Copy link
Member

These last numbers are with what pandas version?

@dragoljub
Copy link
Author

These last numbers are with what pandas version?

They are both Pandas 0.23.4

@dragoljub
Copy link
Author

I tried building the latest Pandas version from source on Python 3.7.1 and still got the same slower performance. Are there any build/compile/cython flags I can set to optimize the parser?

@jreback
Copy link
Contributor

jreback commented Nov 6, 2018

the entire perf issue is simply the precision flag

you can choose higher precision but it takes more time; this is rarely useful though

@dragoljub
Copy link
Author

dragoljub commented Nov 6, 2018

the entire perf issue is simply the precision flag

you can choose higher precision but it takes more time; this is rarely useful though

I tried all three different float_precision= flags and for 'high' and None the 3.5x slowdown was still present in Python 3.7.1 vs python 2.5.2.

I also tried specifying a float_format= in the pd.to_csv() and I still see the same consistent 3.5x gap.

Can you reproduce in Python 3.6?

I should reiterate this perf difference is on the same version of Pandas 0.23.4 just different version of Python.

@dragoljub
Copy link
Author

is there a way to specify 'xstrtod', or is that specified by float_precision=None?

I see no performance changes between 'high' and None.

@dragoljub
Copy link
Author

Is anyone able to reproduce this on Python 3.7.1? I tested the code above on Python 3.7.0 using the Python.org interactive interpreter and it seemed to run faster than in my local 3.7.1 install.

@dragoljub
Copy link
Author

dragoljub commented Nov 7, 2018

Something is definitely up. I did a side by side compare reading the same CSV file on disk. Python 3.5 reads at 111 MB/sec and Python 3.7 reads at only 28 MB/Sec from the same SSD. Both running Pandas 0.23.4.

Could Python 3.7 have changed something in their I/O system?

Python 3.5.2 & Pandas 0.23.4

In [38]: %timeit pd.read_csv(r'out.csv', float_precision='high')
1.86 s ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Python 3.7.1 & Pandas 0.23.4

In [17]: %timeit pd.read_csv(r'out.csv', float_precision='high')
7.97 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

image

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 7, 2018

I don't see the difference you're seeing

Python 3.5.6 |Anaconda, Inc.| (default, Aug 26 2018, 16:30:03)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.23.4'

In [3]: %time _ = pd.read_csv('out.csv', float_precision='high')
CPU times: user 2.59 s, sys: 214 ms, total: 2.81 s
Wall time: 2.73 s

3.7

Python 3.7.1 (default, Oct 23 2018, 14:07:42)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.23.4'

In [3]: %time _ = pd.read_csv('out.csv', float_precision='high')
CPU times: user 2.61 s, sys: 211 ms, total: 2.82 s
Wall time: 2.74 s

Both of those are using Anaconda's packages.

@dragoljub
Copy link
Author

Tom, thanks for running this benchmark. Can you post your pd.show_versions() I want to re-create your stack exactly to do some more testing.

@TomAugspurger
Copy link
Contributor

3.5

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.6.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 10.0.1
setuptools: 40.2.0
Cython: None
numpy: 1.15.2
scipy: None
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

3.7:

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.5.0
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@dragoljub
Copy link
Author

I tried several different fresh python installs on windows. Every Python 3.7 install 32 or 64 bit with Pandas 0.23.4 pip installed results in the slower CSV parsing speed. For fun I tried Installing a fresh Python 3.6.7 install and it again parses the same CSV 3X faster.

Is there anyone that could test this on Windows 10 and Python 3.7.1? 😕

@jorisvandenbossche
Copy link
Member

cc @chris-b1 in case you can test on Windows

@gfyoung gfyoung removed the Needs Info Clarification about behavior needed to assess issue label Nov 8, 2018
@gfyoung
Copy link
Member

gfyoung commented Nov 8, 2018

Indeed, I can confirm that there is a 3.5X slowdown when using Python 3.7.1 on Windows 10.

When I use Python 3.5.6, the performance is unchanged from 0.22.0 to 0.23.4.

These observations are consistent with what @dragoljub was observing and appears to suggest that this is a Cython / Python suggest and not pandas.

@toniatop
Copy link

toniatop commented Nov 8, 2018

On windows 10, python 3.6 and python 3.7 I note a noticeable slowdown as well.

(py36) PS C:\Users\ttttt> ipython
Python 3.6.4 | packaged by conda-forge | (default, Dec 24 2017, 10:11:43) [MSC v.1900 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [2]: %time _ = pd.read_csv('out.csv', float_precision='high')
Wall time: 7.03 s

In [3]: %time _ = pd.read_csv('out.csv')
Wall time: 7.04 s

python 3.7

(py37) PS C:\Users\ttttt> ipython
Python 3.7.1 (default, Oct 28 2018, 08:39:03) [MSC v.1912 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
In [2]: df = pd.DataFrame(np.random.randn(1000000, 10), columns=('COL{}'.format(i) for i in range(10)))
In [6]: df.to_csv('out.csv')
In [7]: %time _ = pd.read_csv('out.csv', float_precision='high')
Wall time: 29.4 s

In [8]: %time _ = pd.read_csv('out.csv')
Wall time: 31.3 s

@TomAugspurger
Copy link
Contributor

For people on windows, how are you installing pandas? From source, wheels, or conda packages? And if conda, from defaults or from conda-forge?

@toniatop
Copy link

toniatop commented Nov 8, 2018

Here conda-forge:

PS C:\Users\ttttt> activate py37
(py37) PS C:\Users\ttttt> conda install ipython pandas
Solving environment: done
## Package Plan ##
  environment location: C:\Miniconda\envs\py37
  added / updated specs:
    - ipython
    - pandas
The following packages will be downloaded:
    package                    |            build
    ---------------------------|-----------------
    ipython-7.1.1              |py37h39e3cac_1000         1.1 MB  conda-forge
    wcwidth-0.1.7              |             py_1          17 KB  conda-forge
    six-1.11.0                 |        py37_1001          21 KB  conda-forge
    pytz-2018.7                |             py_0         226 KB  conda-forge
    icc_rt-2017.0.4            |       h97af966_0         8.0 MB
    pygments-2.2.0             |             py_1         622 KB  conda-forge
    pickleshare-0.7.5          |        py37_1000          12 KB  conda-forge
    certifi-2018.10.15         |        py37_1000         137 KB  conda-forge
    backcall-0.1.0             |             py_0          13 KB  conda-forge
    mkl_random-1.0.1           |   py37h77b88f5_1         267 KB
    decorator-4.3.0            |             py_0          10 KB  conda-forge
    numpy-1.15.4               |   py37ha559c80_0          36 KB
    mkl-2019.0                 |              118       178.1 MB
    pandas-0.23.4              |py37h830ac7b_1000         8.7 MB  conda-forge
    prompt_toolkit-2.0.7       |             py_0         218 KB  conda-forge
    python-dateutil-2.7.5      |             py_0         218 KB  conda-forge
    colorama-0.4.0             |             py_0          15 KB  conda-forge
    mkl_fft-1.0.6              |   py37hdbbee80_0         120 KB
    jedi-0.13.1                |        py37_1000         228 KB  conda-forge
    intel-openmp-2019.0        |              118         1.7 MB
    parso-0.3.1                |             py_0          59 KB  conda-forge
    traitlets-4.3.2            |        py37_1000         130 KB  conda-forge
    ipython_genutils-0.2.0     |             py_1          21 KB  conda-forge
    numpy-base-1.15.4          |   py37h8128ebf_0         3.9 MB
    blas-1.0                   |              mkl           6 KB
    ------------------------------------------------------------
                                           Total:       203.7 MB

@TomAugspurger
Copy link
Contributor

Thanks @toniatop. Can you create a couple environments with just defaults to see if it's an issue with how it was compiled for conda-forge?

@toniatop
Copy link

toniatop commented Nov 8, 2018

Redone everything forcing --channel anaconda, same results.

@TomAugspurger
Copy link
Contributor

cc @jjhelmus any thoughts on
#23516 (comment)? The tldr is that

  • pd.read_csv is 3-4x slower on python 3.7 vs. python 3.6
  • the slowdown is (apparently) windows only
  • observed on packages from conda-forge and defaults

@dragoljub
Copy link
Author

I'd be surprised that change matters, but I'm at a loss here, so maybe! Another possibility is that cython made some tweaks to threading logic for python 3.7 compat - again, wouldn't think that's the issue here, but possible some kind of bad interaction.
cython/cython#1978

Good info. I'm just surprised that people do not see this on Linux. I'll try OSX next.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 9, 2018 via email

@cgohlke
Copy link
Contributor

cgohlke commented Nov 9, 2018

Looks like the slowdown first shows up in Python 3.7.0a4:

>C:\python-3.7.0a3-amd64\python.exe -m cProfile -s tottime pandascsv.py
         235992 function calls (229477 primitive calls) in 21.525 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      200   11.316    0.057   11.316    0.057 {method 'astype' of 'numpy.ndarray' objects}
      100    6.596    0.066    6.596    0.066 {pandas._libs.writers.write_csv_rows}
        1    2.111    2.111    2.112    2.112 {method 'read' of 'pandas._libs.parsers.TextReader' objects}       
>C:\python-3.7.0a4-amd64\python.exe -m cProfile -s tottime pandascsv.py
         236639 function calls (230127 primitive calls) in 26.550 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    9.849    9.849    9.850    9.850 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
      200    8.766    0.044    8.766    0.044 {method 'astype' of 'numpy.ndarray' objects}
      100    6.469    0.065    6.469    0.065 {pandas._libs.writers.write_csv_rows}      

@dragoljub
Copy link
Author

Very Interesting. I'll try Py 3.7.0a3 to confirm this on my systems. Is the diff between 0a3 to 0a4 easy to find from Pythons release notes?

@cgohlke
Copy link
Contributor

cgohlke commented Nov 9, 2018

easy to find from Pythons release notes?

https://docs.python.org/3.7/whatsnew/changelog.html#python-3-7-0-alpha-4

Maybe bpo-29240: Add a new UTF-8 mode: implementation of the PEP 540?

See also python/cpython@v3.7.0a3...v3.7.0a4

@dragoljub
Copy link
Author

I can also confirm the changes from Python 3.7.0a3 to 3.7.0a4 show the slowdown on my Win10 test system. Thanks for finding when the slowdown occurred.

Python 3.7.0a3 -- Fast Parse

 %prun df2 = pd.read_csv(csv)
         5781 function calls (5743 primitive calls) in 3.062 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.953    2.953    2.955    2.955 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
        1    0.063    0.063    0.063    0.063 internals.py:5017(_stack_arrays)
        1    0.016    0.016    3.052    3.052 parsers.py:414(_read)
        1    0.009    0.009    3.062    3.062 <string>:1(<module>)
        1    0.009    0.009    0.009    0.009 parsers.py:1685(__init__)
       32    0.004    0.000    0.004    0.000 {built-in method nt.stat}
        1    0.001    0.001    0.001    0.001 {method 'close' of 'pandas._libs.parsers.TextReader' objects}
      321    0.001    0.000    0.002    0.000 common.py:811(is_integer_dtype)
      516    0.001    0.000    0.001    0.000 common.py:1835(_get_dtype_type)
        7    0.001    0.000    0.001    0.000 {built-in method numpy.core.multiarray.empty}
       32    0.000    0.000    0.005    0.000 <frozen importlib._bootstrap_external>:1235(find_spec)
      988    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
      163    0.000    0.000    0.000    0.000 common.py:1527(is_float_dtype)
      718    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
      192    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:59(<listcomp>)
      192    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:57(_path_join)
        8    0.000    0.000    0.005    0.001 <frozen importlib._bootstrap_external>:1119(_get_spec)
      133    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
       68    0.000    0.000    0.000    0.000 generic.py:7(_check)

Python 3.7.0a4 -- Slow Parse

 %prun df2 = pd.read_csv(csv)
         8007 function calls (7219 primitive calls) in 14.192 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   14.092   14.092   14.094   14.094 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
        1    0.061    0.061    0.062    0.062 internals.py:5017(_stack_arrays)
        1    0.016    0.016   14.192   14.192 parsers.py:414(_read)
        1    0.008    0.008    0.008    0.008 parsers.py:1685(__init__)
       32    0.004    0.000    0.004    0.000 {built-in method nt.stat}
        1    0.001    0.001    0.001    0.001 {method 'close' of 'pandas._libs.parsers.TextReader' objects}
      321    0.001    0.000    0.002    0.000 common.py:811(is_integer_dtype)
      516    0.001    0.000    0.001    0.000 common.py:1835(_get_dtype_type)
        7    0.001    0.000    0.001    0.000 {built-in method numpy.core.multiarray.empty}
    115/4    0.000    0.000    0.001    0.000 abc.py:194(__subclasscheck__)
       32    0.000    0.000    0.005    0.000 <frozen importlib._bootstrap_external>:1322(find_spec)
 1324/988    0.000    0.000    0.002    0.000 {built-in method builtins.isinstance}
  937/725    0.000    0.000    0.002    0.000 {built-in method builtins.issubclass}
      163    0.000    0.000    0.000    0.000 common.py:1527(is_float_dtype)
      192    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:57(_path_join)
      192    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:59(<listcomp>)
        8    0.000    0.000    0.005    0.001 <frozen importlib._bootstrap_external>:1206(_get_spec)
    89/78    0.000    0.000    0.000    0.000 {built-in method builtins.len}
      192    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}

@dragoljub
Copy link
Author

I tried playing around with the UTF-8 mode settings with ENV variables and cmd line args on Windows and was not able to get faster parsing speed on Python 3.7.0a4.

https://www.python.org/dev/peps/pep-0540/#proposal

The benefit of the locale coercion approach is that it helps ensure that encoding handling in binary extension modules and child processes is consistent with Python's encoding handling. The upside of the UTF-8 Mode approach is that it allows an embedding application to change the interpreter's behaviour without having to change the process global locale settings.

So is it possible that somewhere the C parser extension we can just set the locale to UTF-8 and this issue would go away on Windows? I was hoping the ENV variable settings would fix the issue but it did not make a difference in my testing.

@cgohlke
Copy link
Contributor

cgohlke commented Nov 10, 2018

I compared the statement df2 = pd.read_csv(csv) on Python 3.7.0a3 and a4 in the Visual Studio profiler. The culprit is the isdigit function called in the parsers extension module. On 3.7.0a3 the function is fast at ~8% of samples. On 3.7.0a4 the function is slow at ~64% samples because it calls the _isdigit_l function, which seems to update and restore the locale in the current thread every time...

3.7.0a3:
Function Name	Inclusive Samples	Exclusive Samples	Inclusive Samples %	Exclusive Samples %	Module Name
 + [parsers.cp37-win_amd64.pyd]	705	347	28.52%	14.04%	parsers.cp37-win_amd64.pyd
   isdigit	207	207	8.37%	8.37%	ucrtbase.dll
 - _errno	105	39	4.25%	1.58%	ucrtbase.dll
   toupper	24	24	0.97%	0.97%	ucrtbase.dll
   isspace	21	21	0.85%	0.85%	ucrtbase.dll
   [python37.dll]	1	1	0.04%	0.04%	python37.dll
3.7.0a4:
Function Name	Inclusive Samples	Exclusive Samples	Inclusive Samples %	Exclusive Samples %	Module Name
 + [parsers.cp37-win_amd64.pyd]	8,613	478	83.04%	4.61%	parsers.cp37-win_amd64.pyd
 + isdigit	6,642	208	64.04%	2.01%	ucrtbase.dll
 + _isdigit_l	6,434	245	62.03%	2.36%	ucrtbase.dll
 + _LocaleUpdate::_LocaleUpdate	5,806	947	55.98%	9.13%	ucrtbase.dll
 + __acrt_getptd	2,121	1,031	20.45%	9.94%	ucrtbase.dll
   FlsGetValue	647	647	6.24%	6.24%	KernelBase.dll
 - RtlSetLastWin32Error	296	235	2.85%	2.27%	ntdll.dll
   _guard_dispatch_icall_nop	101	101	0.97%	0.97%	ucrtbase.dll
   GetLastError	46	46	0.44%	0.44%	KernelBase.dll
 + __acrt_update_multibyte_info	1,475	246	14.22%	2.37%	ucrtbase.dll
 - __crt_state_management::get_current_state_index	1,229	513	11.85%	4.95%	ucrtbase.dll
 + __acrt_update_locale_info	1,263	235	12.18%	2.27%	ucrtbase.dll
 - __crt_state_management::get_current_state_index	1,028	429	9.91%	4.14%	ucrtbase.dll
   _ischartype_l	383	383	3.69%	3.69%	ucrtbase.dll

@dragoljub
Copy link
Author

Great work debugging this. I would guess any other code paths calling isdigit would also be slowed down on windows.

@jbrockmendel
Copy link
Member

Just a note for people looking at xstrtod (and thanks for doing so BTW, this looks like a really tough issue): there are two of them (#19361). Off the top of my head I'm not sure which is used in what context.

@dragoljub
Copy link
Author

dragoljub commented Nov 13, 2018

I may have found a pure python example that seems to show a similar but smaller 2.5X slowdown. Also note the variability is 15X higher for the 3.7.1 code. Possibly indicating that locale argument is passed/used in some calls but not others.

Can someone test this on linux and see if you see a difference?

Python 3.7.1

digits = ''.join([str(i) for i in range(10)]*10000000)
%timeit digits.isdigit() # --> 2.5X slower on python 3.7.1
537 ms ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Python 3.5.2

digits = ''.join([str(i) for i in range(10)]*10000000)
%timeit digits.isdigit() # --> 2.5X slower on python 3.7.1
215 ms ± 986 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

--> Based on comments from: https://bugs.python.org/msg329789
it appears this is a pure Unicode test. So it may be unrelated.

@cgohlke has posed a nice minimal example showing the slowdown: https://bugs.python.org/msg329790 Thanks! 👍

@chris-b1
Copy link
Contributor

Thanks for the investigation @cgohlke - for 0.24 I suppose we should just shim in an ASCII isdigit
function?

(MUSL, MIT licensed)
https://github.com/esmil/musl/blob/master/src/ctype/isdigit.c#L5

@chris-b1 chris-b1 added this to the 0.24.0 milestone Nov 18, 2018
@dragoljub
Copy link
Author

@chris-b1 I was thinking the same thing since its quite a simple function, however than changing locale would be limited. I wonder how the windows isdigit function ends up calling the locale version. I don't think that source is available.

@cgohlke
Copy link
Contributor

cgohlke commented Nov 19, 2018

I wonder how the windows isdigit function ends up calling the locale version. I don't think that source is available.

The source code for the Windows UCRT is available with recent Windows SDK. It is usually installed under C:\Program Files (x86)\Windows Kits\10\Source.

The isdigit and _isdigit_l functions are defined in ucrt\convert\_ctype.cpp:

extern "C" extern __inline int (__cdecl isdigit)(int const c)
{
    return __acrt_locale_changed()
        ? (_isdigit_l)(c, nullptr)
        : fast_check(c, _DIGIT);
}

extern "C" extern __inline int (__cdecl _isdigit_l)(int const c, _locale_t const locale)
{
    _LocaleUpdate locale_update(locale);

    return _isdigit_l(c, locale_update.GetLocaleT());
}

The following comment is from the _wsetlocale function:

// If no call has been made to setlocale to change locale from "C" locale
// to some other locale, we keep locale_changed = 0. Other functions that
// depend on locale use this variable to optimize performance for C locale
// which is normally the case in applications.

@dragoljub
Copy link
Author

So if I’m understanding it correctly. Even if we set the locale in Python to “C” the windows isdigit function would still resort to calling the locale isdigit version slowing down parsing because the local has ‘changed’.

Is that the case in python. 3.7.0a3? Setting locale to “C” slows parsing down?

@jreback jreback modified the milestones: 0.24.0, Contributions Welcome Nov 25, 2018
@dragoljub
Copy link
Author

@jreback @TomAugspurger Do you think a simple shim of the isdigit function in the C parser code would be a fix we could entertain?

It would assume 'ASCII' compatible encoding for numeric columns which I think should cover all/most csv file encodings for digits.

int isdigit(int c)
{
	return (unsigned)c-'0' < 10;
}

@chris-b1
Copy link
Contributor

Yeah, if you want to submit PR, ping me, or if not, I'll try to get to it soon

@dragoljub
Copy link
Author

@chris-b1 Go for it! 😄

@jreback jreback modified the milestones: Contributions Welcome, 0.24.0 Nov 28, 2018
@dragoljub
Copy link
Author

@chris-b1 @jreback Thanks for getting this done! Really appreciate it! 👍 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants