
load big files in chunks #66

Merged 9 commits into master on Apr 27, 2016
Conversation

@mgckind (Owner) commented Apr 27, 2016

No description provided.

@mgckind mgckind merged commit af6c4c0 into master Apr 27, 2016
The review comment below refers to this snippet from the diff:

    dtypes = [dtype[i] for i, d in enumerate(dtype.descr)]
    return dtypes


Collaborator:

Would it make sense for this function to live in dtypes.py?

Owner (Author):

Yes, although this is where those file_types are defined. I don't mind either way.
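For reference, a minimal sketch of what the reviewed snippet does, assuming it is handed a numpy structured dtype; the column names and types below are made up for illustration:

```python
import numpy as np

# Hypothetical structured dtype, like one produced when reading a table file.
dtype = np.dtype([("RA", "f8"), ("DEC", "f8"), ("MAG", "f4")])

# The reviewed snippet: collect one numpy dtype per column.
dtypes = [dtype[i] for i, d in enumerate(dtype.descr)]
print(dtypes)  # [dtype('float64'), dtype('float64'), dtype('float32')]
```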

@kadrlica (Collaborator):
Is there a downside to loading by chunks? Should we always be doing it? What about a default chunksize set in the ea_config.py (maybe set to a fairly large value)?

@mgckind (Owner, Author) commented Apr 28, 2016

I didn't see any. If no chunksize is defined, the default works as fast as the previous version. Using a default chunksize might work as well, but it's a tricky number. Today I uploaded a 25M-row, 4-column file in a single shot, but a 130M-row, 12-column file in 2-million-row chunks. For FITS you can get the number of rows and columns beforehand and make a guess; for CSV it is not straightforward. Also, it depends on memory resources. I'd bet 10M is a reasonable number to start with.
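For context, a minimal sketch of the two upload paths being discussed, assuming pandas for the CSV case and fitsio for the FITS case; the file names, chunk size, and upload() stand-in are placeholders, not the project's actual code:

```python
import fitsio
import pandas as pd

CHUNKSIZE = 2_000_000  # rows per chunk; placeholder value, tune to available memory


def upload(data):
    """Stand-in for the real per-chunk upload step."""
    print(f"uploading {len(data)} rows")


# CSV: pandas can iterate over the file in row chunks without loading it all.
for chunk in pd.read_csv("big_table.csv", chunksize=CHUNKSIZE):
    upload(chunk)

# FITS: the row count is available from the table header before reading any
# data, so chunk boundaries can be chosen up front.
fits = fitsio.FITS("big_table.fits")
nrows = fits[1].get_nrows()
for start in range(0, nrows, CHUNKSIZE):
    upload(fits[1][start:min(start + CHUNKSIZE, nrows)])
fits.close()
```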

@kadrlica (Collaborator):

It seems like it would be better for the chunksize constraint to be in MB rather than rows (something like upload_max_mb, similar to outfile_max_mb). Is there any memory-safe way to get the number of rows in a CSV file? If so, we could do something like chunksize = (upload_max_mb / filesize) * nrows.

@mgckind (Owner, Author) commented Apr 28, 2016

I'm not sure there is an efficient way without reading the file; even the shell command wc -l is slow for very big files. You might get a guess by reading the first line to get the number of columns and the likely data types, and then, based on the size of the file, estimate the number of rows or the chunksize directly.

@kadrlica (Collaborator):

I like that idea. What about reading the first ~100 (1000?) lines and guessing the size per line, then dividing the file into chunks based on the total file size? I really do think that chunking in units of MB rather than rows makes a lot more sense.
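A sketch of that estimate, assuming a hypothetical upload_max_mb setting: sample the first ~1000 lines, compute an average bytes-per-row, and convert the MB limit into a row chunksize.

```python
def estimate_chunksize(csv_path, upload_max_mb, sample_lines=1000):
    """Estimate a row chunksize so each chunk is roughly upload_max_mb.

    Samples the first few lines to get an average bytes-per-row, then
    converts the MB limit into a row count. Purely an estimate: quoted
    fields, unusually long rows, etc. will shift the real chunk sizes.
    """
    with open(csv_path, "rb") as f:
        f.readline()  # skip the header row
        sample = [f.readline() for _ in range(sample_lines)]
    sample = [line for line in sample if line]  # drop empty reads at EOF
    if not sample:
        return None  # tiny file: no chunking needed
    bytes_per_row = sum(len(line) for line in sample) / len(sample)
    return max(1, int(upload_max_mb * 1024**2 / bytes_per_row))


# e.g. aim for ~500 MB chunks:
# chunksize = estimate_chunksize("big_table.csv", upload_max_mb=500)
# for chunk in pd.read_csv("big_table.csv", chunksize=chunksize): ...
```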

@kadrlica (Collaborator) commented May 9, 2016

Any more thoughts about this? If I get time I could make a pass at writing the code...
