
load big files in chunks #66

Merged 9 commits into master on Apr 27, 2016
Conversation

@mgckind (Owner) commented Apr 27, 2016

No description provided.

@mgckind mgckind merged commit af6c4c0 into master Apr 27, 2016
The review comment below refers to this snippet from the diff:

    dtypes = [dtype[i] for i, d in enumerate(dtype.descr)]
    return dtypes


Collaborator:

Would it make sense for this function to live in dtypes.py?

Owner (Author):

Yes, although this is where those file_types are defined. I don't mind either way.
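For reference, a minimal sketch of what the reviewed snippet does, assuming it is handed a numpy structured dtype; the column names and types below are made up for illustration:

```python
import numpy as np

# Hypothetical structured dtype, like one produced when reading a table file.
dtype = np.dtype([("RA", "f8"), ("DEC", "f8"), ("MAG", "f4")])

# The reviewed snippet: collect one numpy dtype per column.
dtypes = [dtype[i] for i, d in enumerate(dtype.descr)]
print(dtypes)  # [dtype('float64'), dtype('float64'), dtype('float32')]
```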

@kadrlica (Collaborator):
Is there a downside to loading by chunks? Should we always be doing it? What about a default chunksize set in the ea_config.py (maybe set to a fairly large value)?

@mgckind (Owner, Author) commented Apr 28, 2016

I didn't see any. If no chunksize is defined, the default works as fast as the previous version. Using a default chunksize might work as well, but it's a tricky number. Today I uploaded a 25M-row, 4-column file in a single shot, but a 130M-row, 12-column file in 2-million-row chunks. For FITS you can get the number of rows and columns beforehand and make a guess; for CSV it is not straightforward. Also, it depends on memory resources. I'd bet 10M is a reasonable number to start with.
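For context, a minimal sketch of the two upload paths being discussed, assuming pandas for the CSV case and fitsio for the FITS case; the file names, chunk size, and upload() stand-in are placeholders, not the project's actual code:

```python
import fitsio
import pandas as pd

CHUNKSIZE = 2_000_000  # rows per chunk; placeholder value, tune to available memory


def upload(data):
    """Stand-in for the real per-chunk upload step."""
    print(f"uploading {len(data)} rows")


# CSV: pandas can iterate over the file in row chunks without loading it all.
for chunk in pd.read_csv("big_table.csv", chunksize=CHUNKSIZE):
    upload(chunk)

# FITS: the row count is available from the table header before reading any
# data, so chunk boundaries can be chosen up front.
fits = fitsio.FITS("big_table.fits")
nrows = fits[1].get_nrows()
for start in range(0, nrows, CHUNKSIZE):
    upload(fits[1][start:min(start + CHUNKSIZE, nrows)])
fits.close()
```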

@kadrlica (Collaborator):

It seems like it would be better for the chunksize constraint to be in MB rather than rows (something like upload_max_mb, similar to outfile_max_mb). Is there any memory-safe way to get the number of rows in a CSV file? If so, we could do something like chunksize = (upload_max_mb / filesize) * nrows.

@mgckind (Owner, Author) commented Apr 28, 2016

I'm not sure there is an efficient way without reading the file; even the shell command wc -l is slow for very big files. You might get a guess by reading the first line to get the number of columns and the likely data types, and then, based on the size of the file, estimate the number of rows or the chunksize directly.

@kadrlica (Collaborator):

I like that idea. What about reading the first ~100 (1000?) lines and guessing the size per line, then dividing the file into chunks based on the total file size? I really do think that chunking in units of MB rather than rows makes a lot more sense.
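A sketch of that estimate, assuming a hypothetical upload_max_mb setting: sample the first ~1000 lines, compute an average bytes-per-row, and convert the MB limit into a row chunksize.

```python
def estimate_chunksize(csv_path, upload_max_mb, sample_lines=1000):
    """Estimate a row chunksize so each chunk is roughly upload_max_mb.

    Samples the first few lines to get an average bytes-per-row, then
    converts the MB limit into a row count. Purely an estimate: quoted
    fields, unusually long rows, etc. will shift the real chunk sizes.
    """
    with open(csv_path, "rb") as f:
        f.readline()  # skip the header row
        sample = [f.readline() for _ in range(sample_lines)]
    sample = [line for line in sample if line]  # drop empty reads at EOF
    if not sample:
        return None  # tiny file: no chunking needed
    bytes_per_row = sum(len(line) for line in sample) / len(sample)
    return max(1, int(upload_max_mb * 1024**2 / bytes_per_row))


# e.g. aim for ~500 MB chunks:
# chunksize = estimate_chunksize("big_table.csv", upload_max_mb=500)
# for chunk in pd.read_csv("big_table.csv", chunksize=chunksize): ...
```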

@kadrlica (Collaborator) commented May 9, 2016

Any more thoughts about this? If I get time I could make a pass at writing the code...
