
The `map` Function and disk.frame's usage of memory


In disk.frame, each chunk of a data set is stored as an fst file. The most common way of working with a disk.frame is the `map` function, which takes a disk.frame and a function `f` as arguments, e.g. `map(df, f)`. The idea is that `f` is applied to each chunk of the disk.frame `df`. By default the application is lazy, meaning that the results are not computed or returned immediately.
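The following is a minimal sketch of this lazy `map` pattern. The disk.frame built from `mtcars`, the chunk count, and the derived column are illustrative assumptions, not part of the package's documentation.

```r
library(disk.frame)
setup_disk.frame()  # start the default number of background workers

# Illustrative only: split mtcars into 2 chunks to make a small disk.frame
df <- as.disk.frame(mtcars, nchunks = 2)

# f is applied to each chunk; by default this is lazy, so nothing is
# computed or returned yet
lazy_result <- map(df, function(chunk) {
  chunk$kpl <- chunk$mpg * 0.425144  # miles per gallon -> km per litre
  chunk
})
```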

If run lazily (the default), the user has to call `collect` on the resulting "disk.frame". The quotes are there because the user can choose not to return a data.frame from `f`, in which case the collected result is a list. Alternatively, one may wish to write the results out as another disk.frame; in that case use the `write_disk.frame` function.
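Continuing the sketch above, the lazy result can either be collected into memory or persisted as a new disk.frame; the output directory name below is an assumption for illustration.

```r
# Materialise the lazy chunks and combine them into one in-memory data.frame
res_in_memory <- collect(lazy_result)

# Alternatively, write the result out as a new disk.frame on disk
res_on_disk <- write_disk.frame(lazy_result, outdir = "mtcars_kpl.df")
```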

Normally, `map` performs its processing in parallel. The number of parallel processes defaults to the number of physical CPU cores (so hyperthreading on Intel CPUs does not double the number of processes). A process is also called a worker. Hence, the amount of memory required should be sufficient to carry N simultaneous evaluations of `f(CHUNK)`, where N is the number of workers.
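As a rough, back-of-the-envelope illustration of that memory requirement (the numbers below are assumptions, not measurements):

```r
workers      <- 6    # number of parallel workers
chunk_mem_gb <- 0.5  # assumed size of one chunk once loaded into R, in GB
f_overhead   <- 2    # assumed multiplier for intermediate objects created by f

# approximate peak memory (GB) needed for N simultaneous f(CHUNK) evaluations
workers * chunk_mem_gb * f_overhead
#> [1] 6
```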

Also, remember that each worker is independent and does not have access to variables created by any other worker. However, every worker does have access to those variables in the global environment that it can detect as being used. So when can a global variable be detected? Generally, it can be detected if it is used in open code, but not if it only appears inside a string, e.g. `glue::glue("bar {vars1}")` is not going to work.
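A hedged sketch of the detection behaviour described above; `df` is the disk.frame from the earlier example, and `cutoff` and `label` are hypothetical globals used purely for illustration.

```r
cutoff <- 20
label  <- "efficient"

# `cutoff` appears in open code inside f, so it is detected and shipped
# to every worker automatically
ok <- map(df, function(chunk) chunk[chunk$mpg > cutoff, ])

# `label` only appears inside a string template, so the static detection
# cannot see it and the workers may fail with "object 'label' not found"
risky <- map(df, function(chunk) glue::glue("rows: {nrow(chunk)}, label: {label}"))
```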

To change the number of workers, run `setup_disk.frame(workers = N)`, for example if you are running out of memory or think that disk.frame is not using enough workers.
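For example (the worker count here is arbitrary):

```r
library(disk.frame)
setup_disk.frame(workers = 2)  # fewer workers to reduce peak memory usage
```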
