load big files in chunks #66
Conversation
# collect the dtype of each field of the structured dtype
dtypes = [dtype[i] for i, d in enumerate(dtype.descr)]
return dtypes
Would it make sense for this function to live in dtypes.py?
Yes, although this is where those file_type values are defined. I don't mind either way.
Is there a downside to loading in chunks? Should we always be doing it? What about a default chunksize?
I didn't see any. If no chunksize is defined, the default works as fast as the previous version. Using a default chunksize might work as well, but it's a tricky number. Today I uploaded a 25M, 4-column file in one single shot, but a 130M, 12-column file in 2-million-row chunks. For FITS you can get the number of rows and columns beforehand and make a guess; for CSV it is not straightforward. It also depends on available memory. I'd bet 10M is a reasonable number to start with.
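For reference, a minimal sketch of the chunked-reading pattern being discussed, assuming a pandas-based CSV loader; the function name iter_csv_chunks and the per-chunk ingestion step are illustrative assumptions, not the code in this PR:

```python
import pandas as pd

def iter_csv_chunks(path, chunksize=None):
    """Yield DataFrames from `path`; chunksize=None reads the whole file in one shot."""
    if chunksize is None:
        # same behaviour as the previous single-shot read
        yield pd.read_csv(path)
    else:
        # pandas returns an iterator of DataFrames, each with at most `chunksize` rows
        for chunk in pd.read_csv(path, chunksize=chunksize):
            yield chunk

# e.g. process a big file in 2-million-row chunks instead of all at once:
# for chunk in iter_csv_chunks("big_table.csv", chunksize=2_000_000):
#     ingest(chunk)  # hypothetical per-chunk ingestion step
```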
It seems like it would be better to have the …
I'm not sure there is a very efficient way without reading the file, and even the shell command wc -l is slow for very big files. You might get a guess by reading the first line to get the number of columns and possible data types, and based on the size of the file estimate the number of rows, or directly …
I like that idea. What about reading the first ~100 (1000?) lines and guessing at the size per line, then dividing the file into chunks based on the total file size? I really do think that chunking in units of MB rather than rows makes a lot more sense.
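A rough sketch of that suggestion, assuming a plain-text CSV; the 1000-line sample and the 100 MB target are placeholder numbers, not project defaults:

```python
import os

def estimate_chunksize(path, target_mb=100, sample_lines=1000):
    """Estimate a row chunksize so that each chunk is roughly `target_mb` MB."""
    n_sampled = 0
    sampled_bytes = 0
    with open(path, "rb") as fh:
        for line in fh:
            n_sampled += 1
            sampled_bytes += len(line)
            if n_sampled >= sample_lines:
                break
    if n_sampled == 0:
        return None  # empty file: fall back to a single-shot read
    bytes_per_row = sampled_bytes / n_sampled
    # divide the target chunk size (in bytes) by the estimated bytes per row
    return max(int(target_mb * 1024 * 1024 / bytes_per_row), 1)
```

Sizing chunks by a byte target rather than a fixed row count keeps memory use roughly constant regardless of how wide the table is.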
Any more thoughts about this? If I get time I could make a pass at writing the code...