spark.read.json(PATH)   #takes time to load because Spark scans the data to infer the schema
spark.read.text(PATH)   #returns immediately as no schema inference is needed (the data is read as a single string column)
Whatever approach you use to read the data, DATA DOES NOT RESIDE IN MEMORY. Data gets loaded into memory only when it is being processed and is then flushed out.
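A minimal runnable sketch of the difference, assuming a local SparkSession and hypothetical input files /tmp/people.json and /tmp/notes.txt:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("read_demo").getOrCreate()
json_df = spark.read.json("/tmp/people.json")   #Spark scans the file to infer column names and types, so this takes time
json_df.printSchema()                           #shows the inferred schema
text_df = spark.read.text("/tmp/notes.txt")     #returns immediately: the schema is always a single string column called "value"
text_df.printSchema()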
Serialization is converting objects into a byte stream (for example, to save them to a file).
Deserialization is converting a byte stream back into objects; each record gets converted into an object.
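To make the two terms concrete, here is a plain-Python sketch using the pickle library (the same library PySpark uses for Python objects):
import pickle
record = {"id": 1, "name": "spark"}
byte_stream = pickle.dumps(record)    #serialization: object -> byte stream
restored = pickle.loads(byte_stream)  #deserialization: byte stream -> object
assert restored == record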
BUT, sometimes we might want to retain a dataset in memory when it needs to be processed multiple times as part of the same job.
from pyspark.storagelevel import StorageLevel
DF/rdd.persist(StorageLevel.MEMORY_ONLY)   #now go to the Spark web UI and check the Storage tab. Use this if you want to process the same data again and again.
DF/rdd.persist(StorageLevel(True, True, False, False, 1))   #custom level: useDisk, useMemory, useOffHeap, deserialized, replication -- this one is equivalent to MEMORY_AND_DISK
To remove a dataset from the cache, use the RDD.unpersist() / DataFrame.unpersist() method.
Note: In Python, stored objects will always be serialized with the Pickle library, so it does not matter whether you choose a serialized level.
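A self-contained sketch of persist and unpersist on an RDD, assuming a hypothetical text file at /tmp/notes.txt:
from pyspark.sql import SparkSession
from pyspark.storagelevel import StorageLevel
spark = SparkSession.builder.appName("persist_demo").getOrCreate()
rdd = spark.sparkContext.textFile("/tmp/notes.txt")   #hypothetical input path
rdd.persist(StorageLevel.MEMORY_ONLY)   #marks the RDD for caching; nothing is stored yet
rdd.count()                             #first action materializes the cache; check the Storage tab in the Spark web UI
print(rdd.getStorageLevel())            #the records are pickled before being stored
rdd.unpersist()                         #releases the cached data when it is no longer needed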
Now let us see how we can persist data frames.
By default, data is streamed to the executor tasks as it is being processed.
Here is what happens when data is read into an executor task while it is being processed:
Each record is deserialized into an object
The objects are streamed into memory
The executor task processes the data by applying the logic
The deserialized objects are flushed from memory as the executor tasks terminate
Sometimes we might have to read the same data multiple times for processing within the same job. By default, the data needs to be deserialized and submitted to the executor tasks every time.
To avoid deserializing the same data again and again when it has to be read multiple times, we can leverage caching.
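A sketch contrasting the two cases; the input path and the status column are placeholders for illustration:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("recompute_vs_cache").getOrCreate()
events = spark.read.json("/tmp/events.json")
#without caching: each action below re-reads and re-deserializes the source data
events.count()
events.filter(events["status"] == "ERROR").count()
#with caching: the data is deserialized once and reused by the second action
events.cache()
events.count()                                      #first action populates the cache
events.filter(events["status"] == "ERROR").count()  #served from the cached data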
There are 2 methods: persist and cache. From Spark 2 onwards, Data Frames are cached at the MEMORY_AND_DISK level by default.
cache is a shorthand method for persist at MEMORY_AND_DISK.
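A small sketch of the equivalence (the exact default level reported can vary between Spark versions; MEMORY_AND_DISK is the Spark 2 behaviour described above):
from pyspark.sql import SparkSession
from pyspark.storagelevel import StorageLevel
spark = SparkSession.builder.appName("cache_vs_persist").getOrCreate()
df = spark.range(1000)                    #simple example DataFrame
df.cache()                                #no arguments: uses the default storage level
print(df.storageLevel)
df.unpersist()
df.persist(StorageLevel.MEMORY_AND_DISK)  #persist lets you name the level explicitly
print(df.storageLevel)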
This is what happens when we cache a Data Frame (see the sketch after this list):
Caching is done only when the data is read at least once for processing
Each record is deserialized into an object
The deserialized objects are cached in memory as long as they fit
If they do not fit, the deserialized objects are spilled to disk
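A sketch showing that the cache is only populated on the first action; the input path is a placeholder:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("lazy_cache_demo").getOrCreate()
orders = spark.read.json("/tmp/orders.json")
orders.cache()
print(orders.is_cached)   #True -- the DataFrame is marked for caching, but nothing is stored yet
orders.count()            #first action: records are deserialized and cached, spilling to disk if they do not fit in memory
#check the Storage tab of the Spark web UI to see how much is held in memory vs on disk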
You can get details about the different persistence levels in the Spark documentation (RDD Persistence section).