
read_csv dumps core with python 2.7.10 and pandas 0.17.1 #11716

Closed
jdfekete opened this issue Nov 28, 2015 · 9 comments

Labels
IO CSV read_csv, to_csv

Comments

@jdfekete
I am reading a very large csv file (the NYC taxi dataset at https://storage.googleapis.com/tlc-trip-data/2015/), keeping only two columns:
index_col=False, skipinitialspace=True, usecols=['pickup_longitude', 'pickup_latitude'], chunksize=...
I load it progressively in varying-size chunks, using 2 threads to do the progressive loading.
After reading about 10M lines (the number varies from one run to the next), it dumps core.
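For context, here is a self-contained sketch of a chunked reader set up with the options quoted above. A tiny in-memory CSV stands in for the multi-gigabyte taxi files, and chunksize=2 is an illustrative value, not the one used in the report:

```python
import io

import pandas as pd

# Stand-in for the NYC taxi CSV: note the space after each comma,
# which is why skipinitialspace=True is passed.
csv_text = (
    "pickup_longitude, pickup_latitude, fare\n"
    "-73.99, 40.75, 12.5\n"
    "-73.98, 40.74, 8.0\n"
    "-73.97, 40.76, 9.5\n"
)

reader = pd.read_csv(
    io.StringIO(csv_text),
    index_col=False,
    skipinitialspace=True,  # strip the blank after each delimiter
    usecols=["pickup_longitude", "pickup_latitude"],
    chunksize=2,  # illustrative; the report varies the chunk size per step
)

# Iterating the reader yields DataFrames of up to `chunksize` rows each.
chunks = list(reader)
```

Single-threaded, this pattern is routine; the crash below only appears when two threads pull chunks from one shared reader.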
Here is what GDB finds out:

Fatal Python error: GC object already tracked
Fatal Python error: GC object already tracked

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffddd98700 (LWP 10284)]
0x00007ffff782dcc9 in __GI_raise (sig=sig@entry=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56  ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) where
#0  0x00007ffff782dcc9 in __GI_raise (sig=sig@entry=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff78310d8 in __GI_abort () at abort.c:89
#2  0x000000000045a4f2 in Py_FatalError ()
#3  0x000000000052b5ec in PyTuple_New ()
#4  0x000000000050c73d in ?? ()
#5  0x000000000050d3f6 in Py_BuildValue ()
#6  0x00007fffec3d01d8 in buffer_rd_bytes (source=0x7fffd8006650,
    nbytes=<optimized out>, bytes_read=0x7fffddd96d08, status=0x7fffddd96d04)
    at pandas/src/parser/io.c:123
#7  0x00007fffec3cf065 in parser_buffer_bytes (nbytes=<optimized out>,
    self=0x7fffd8003480) at pandas/src/parser/tokenizer.c:610
#8  _tokenize_helper (self=0x7fffd8003480, nrows=nrows@entry=3186,
    all=all@entry=0) at pandas/src/parser/tokenizer.c:1872
#9  0x00007fffec3cf3e7 in tokenize_nrows (self=<optimized out>,
    nrows=nrows@entry=3186) at pandas/src/parser/tokenizer.c:1905
#10 0x00007fffec39a3c4 in __pyx_f_6pandas_6parser_10TextReader__tokenize_rows (
    __pyx_v_self=0x7fffdddd5050, __pyx_v_nrows=3186) at pandas/parser.c:8745
#11 0x00007fffec3a21a2 in __pyx_f_6pandas_6parser_10TextReader__read_rows (
    __pyx_v_self=0x7fffdddd5050, __pyx_v_rows=0x7fffd8249a88, __pyx_v_trim=0)
    at pandas/parser.c:8970
#12 0x00007fffec393f0c in __pyx_f_6pandas_6parser_10TextReader__read_low_memory
    (__pyx_v_self=0x7fffdddd5050, __pyx_v_rows=0x7fffcb815948)
@jreback
Contributor

jreback commented Nov 28, 2015

pls show the exact code you are using

@jdfekete
Author

My code is in https://github.com/jdfekete/progressivis, specifically in this file:
https://github.com/jdfekete/progressivis/blob/master/progressivis/io/csv_loader.py

The method is the following; see the last line for the call and the checks before it. Running it with pandas 0.16.2 works without dumping core. The crash might be GIL-related, since this code runs in a second thread.

def run_step(self, run_number, step_size, howlong):
    if step_size == 0:  # bug
        logger.error('Received a step_size of 0')
        return self._return_run_step(self.state_ready, steps_run=0)
    status = self.validate_parser(run_number)
    if status == self.state_terminated:
        raise StopIteration('no more filenames')
    elif status == self.state_blocked:
        return self._return_run_step(status, steps_run=0, creates=0)
    elif status != self.state_ready:
        logger.error('Invalid state returned by validate_parser: %d', status)
        raise StopIteration('Unexpected situation')
    logger.info('loading %d lines', step_size)
    try:
        df = self.parser.read(step_size)  # raises StopIteration at EOF
    except StopIteration:

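Not part of progressivis, but one way to probe the thread-safety hypothesis is to serialize every read() call on the shared reader behind a lock. In this sketch the lock, the worker function, and the in-memory CSV are all hypothetical illustrations, not code from the project:

```python
import io
import threading

import pandas as pd

# 100-row in-memory CSV standing in for the real file.
csv_text = "a,b\n" + "\n".join("%d,%d" % (i, i * 2) for i in range(100)) + "\n"
parser = pd.read_csv(io.StringIO(csv_text), chunksize=10)

parser_lock = threading.Lock()  # hypothetical guard, not in progressivis
results = []

def read_step(step_size):
    # Serialize access to the shared reader: the C tokenizer can release
    # the GIL while reading, so concurrent read() calls on one reader
    # need external locking.
    with parser_lock:
        try:
            df = parser.read(step_size)
        except StopIteration:
            return  # EOF reached by another thread
    results.append(len(df))

threads = [threading.Thread(target=read_step, args=(10,)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock in place each thread reads exactly one 10-row step; without it, two threads can enter the tokenizer simultaneously, which matches the "GC object already tracked" abort in the backtrace.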
@jreback
Contributor

jreback commented Nov 28, 2015

pls just show a short reproducible example

@jreback
Contributor

jreback commented Nov 29, 2015

This is almost certainly a problem with thread-safeness in how you are calling it. A reproducible example would help. Pls reopen when you post that.

@jreback jreback closed this as completed Nov 29, 2015
@jreback jreback added Can't Repro IO CSV read_csv, to_csv labels Nov 29, 2015
@jreback
Contributor

jreback commented Dec 7, 2015

xref #11786

@jstray

jstray commented Feb 7, 2017

I am also seeing this error, intermittently, during read_csv. It's not even a particularly large file:

table = pd.read_csv(io.StringIO(csvres.text))
=>
Fatal Python error: GC object already tracked

where the text is the contents of the file http://jonathanstray.com/papers/titanic.csv

I'm not explicitly using threads in my app, though I am on Django channels.
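The pattern in that comment, reading CSV text already held in memory through io.StringIO, looks like the following sketch; a tiny inline sample stands in for the downloaded titanic.csv contents:

```python
import io

import pandas as pd

# Stand-in for csvres.text (the body of an HTTP response holding a CSV).
csv_text = "name,survived\nAllen,1\nBraund,0\n"

# Wrapping the string in StringIO gives read_csv a file-like object.
table = pd.read_csv(io.StringIO(csv_text))
```

On Python 2, read_csv expected the buffer contents to be unicode when using io.StringIO, which is one more variable in play on 0.17.1.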

@jreback
Contributor

jreback commented Feb 7, 2017

you should try a more modern version of pandas; lots of things have been fixed since 0.17.1

@jstray

jstray commented Feb 7, 2017

Indeed I am on 0.17.1. FWIW that's the version that shipped with Anaconda, though now I can't recall when I installed it.

@jreback
Contributor

jreback commented Feb 7, 2017

conda update pandas

works wonders
