
SoS memory usage for large workflows #1213

Closed · gaow opened this issue Feb 16, 2019 · 71 comments

gaow (Member) commented Feb 16, 2019

When I submitted a sos command to a compute node to submit jobs, I was surprised by how much memory it used. I asked for 16GB of memory, but I got this:

Job 57608706 exceeded memory limit (17451008 > 16384000)

I never realized it was so resource-consuming. @BoPeng do you see this on your end? Should I try to create some large test jobs, or do you have some jobs you could use to test?

BoPeng (Contributor) commented Feb 16, 2019

What job are you running? sos usually takes a few MB of RAM. It could use a few hundred MB if you have a complex workflow, but it should not need more than that.

gaow (Member, Author) commented Feb 16, 2019

Okay, it looks like this should not happen, then. I'm looking into it. The workflow is generated automatically with DSC -- I will create a smaller example that removes the heavy computations but keeps the structure of the workflow, verify the memory usage on my own desktop (using this script), and post it for you to reproduce.

gaow (Member, Author) commented Feb 17, 2019

@BoPeng Attached is an example. I removed the core computations so there should not be any issues due to computations in each step. Here is what I found:

$ python monitor.py sos run main.sos
time elapsed: 5.55s
peak first occurred: 3.02s
peak last occurred: 4.03s
max vms_memory: 21.34GB
max rss_memory: 0.98GB
memory check interval: 1s
return code: 0

As you can see, the actual (RSS) memory usage is 0.98GB, but the allocated (virtual) memory is very high. The example is bundled and attached:

mem_test.tar.gz
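
For anyone reading along, here is a minimal sketch of what a monitor script like this could look like; it assumes psutil and is not the actual monitor.py used above, whose output format and options may differ.

```python
# Minimal sketch of a process-tree memory monitor (assumes psutil; the real
# monitor.py used in this thread may differ). It polls RSS/VMS of the command
# and all of its children every MEM_CHECK_INTERVAL seconds and reports peaks.
import os
import subprocess
import sys
import time

import psutil


def monitor(cmd):
    interval = float(os.environ.get("MEM_CHECK_INTERVAL", 1))
    start = time.time()
    proc = subprocess.Popen(cmd)
    parent = psutil.Process(proc.pid)
    max_rss = max_vms = 0
    while proc.poll() is None:
        rss = vms = 0
        try:
            procs = [parent] + parent.children(recursive=True)
        except psutil.NoSuchProcess:
            break
        for p in procs:
            try:
                mem = p.memory_info()
                rss, vms = rss + mem.rss, vms + mem.vms
            except psutil.NoSuchProcess:
                continue
        max_rss, max_vms = max(max_rss, rss), max(max_vms, vms)
        time.sleep(interval)
    print(f"time elapsed: {time.time() - start:.2f}s")
    print(f"max vms_memory: {max_vms / 1024**3:.2f}GB")
    print(f"max rss_memory: {max_rss / 1024**3:.2f}GB")
    print(f"memory check interval: {interval}s")
    print(f"return code: {proc.returncode}")
    return proc.returncode


if __name__ == "__main__":
    sys.exit(monitor(sys.argv[1:]))  # e.g. python monitor.py sos run main.sos
```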

BoPeng (Contributor) commented Feb 17, 2019

I don't know how much virtual memory matters, as I always check real (RSS) memory. Are you sure the job was killed because of virtual memory?

VIRT stands for the virtual size of a process, which is the sum of memory it is actually using, memory it has mapped into itself (for instance the video card’s RAM for the X server), files on disk that have been mapped into it (most notably shared libraries), and memory shared with other processes. VIRT represents how much memory the program is able to access at the present moment.
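
If it helps to see the two numbers side by side, psutil reports both for any process (a quick check, nothing sos-specific):

```python
# Compare RSS (resident) and VMS (virtual) memory for the current process.
import os

import psutil

mem = psutil.Process(os.getpid()).memory_info()
print(f"rss: {mem.rss / 1024**2:.1f} MB, vms: {mem.vms / 1024**2:.1f} MB")
```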

gaow (Member, Author) commented Feb 17, 2019

Perhaps the example above was overly minimal ... but for the attached example, even RSS memory is 5.39GB. You can test the one below:

test.tar.gz

time elapsed: 58.87s
peak first occurred: 29.13s
peak last occurred: 29.13s
max vms_memory: 64.96GB
max rss_memory: 5.39GB
memory check interval: 1s
return code: 0

BoPeng (Contributor) commented Feb 18, 2019

(sos) [bpeng1@BCBMF5KWK0F8F69:~/Downloads]$ python monitor.py sos run main -v1 -s force
[###########################################################] 59 steps processed (1945 jobs completed)

time elapsed: 182.99s
peak first occurred: 75.03s
peak last occurred: 166.26s
max vms_memory: 148.45GB
max rss_memory: 1.67GB
memory check interval: 1s
return code: 0
(sos) [bpeng1@BCBMF5KWK0F8F69:~/Downloads]$ python monitor.py sos run main -v1
[###########################################################] 59 steps processed (30 jobs completed, 1915 jobs ignored)

time elapsed: 12.67s
peak first occurred: 6.08s
peak last occurred: 11.15s
max vms_memory: 144.72GB
max rss_memory: 1.36GB
memory check interval: 1s
return code: 0

I see huge VMS memory but relatively small RSS memory.

gaow (Member, Author) commented Feb 18, 2019

Hmm, this is interesting. It looks like that is a Linux machine? Mine is Linux (Debian 9).

Currently the check interval is 1s. It can be modified via an environment variable:

export MEM_CHECK_INTERVAL=0.1

In any case, what do you think is happening? Also, do you think 1.36GB is on the high end? My actual application is bigger than this.

BoPeng (Contributor) commented Feb 18, 2019

I do not know. In general, Python is not good at memory management: large objects (nested dictionaries, large arrays) take a lot of RAM, and Python's garbage collection means that peak memory usage is unpredictable; the peak can involve both old (deleted but not yet released to the system) and new variables.

All these make it pretty difficult to check what is going on.
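
As a rough illustration of the "deleted but not released" point (CPython- and allocator-dependent; the exact numbers will vary by platform):

```python
# Rough illustration: freeing a large object in CPython does not necessarily
# return memory to the OS immediately, so RSS can stay elevated after `del`.
# Behavior depends on the allocator and platform; treat this as a demo only.
import gc
import os

import psutil

proc = psutil.Process(os.getpid())
print(f"baseline rss: {proc.memory_info().rss / 1024**2:.0f} MB")

big = [b"x" * 1024 for _ in range(1_000_000)]  # roughly 1GB of small objects
print(f"after alloc:  {proc.memory_info().rss / 1024**2:.0f} MB")

del big
gc.collect()
print(f"after del:    {proc.memory_info().rss / 1024**2:.0f} MB")  # often still above baseline
```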

BoPeng (Contributor) commented Feb 18, 2019

On a Ubuntu VM,

$ python monitor.py sos run main.sos -v1
[###########################################################] 59 steps processed (1945 jobs completed)

time elapsed: 328.00s
peak first occurred: 224.06s
peak last occurred: 293.01s
max vms_memory: 10.19GB
max rss_memory: 1.17GB
memory check interval: 1s
return code: 0

gaow (Member, Author) commented Feb 18, 2019

What if you set

export MEM_CHECK_INTERVAL=0.1

and run again? Actually, I've tried two systems: on the cluster (Scientific Linux 7, which is essentially Red Hat 7) it leads directly to disconnection from the cluster because the run exceeds the 2GB memory limit. On my desktop I was able to reproduce the 5.39GB memory footprint.

BoPeng (Contributor) commented Feb 18, 2019

(base) bpeng1@ubuntu:~/Downloads$ python monitor.py sos run main.sos -v1
[###########################################################] 59 steps processed (30 jobs completed, 1915 jobs ignored)

time elapsed: 16.80s
peak first occurred: 4.03s
peak last occurred: 16.63s
max vms_memory: 8.23GB
max rss_memory: 0.85GB
memory check interval: 0.1s
return code: 0

BoPeng (Contributor) commented Feb 18, 2019

Could you update to the latest master and try again? I am not sure if it is related, but I am using master.

gaow (Member, Author) commented Feb 18, 2019

Sure, this is what I got on the latest master:

time elapsed: 58.97s
peak first occurred: 32.33s
peak last occurred: 32.33s
max vms_memory: 62.61GB
max rss_memory: 4.78GB
memory check interval: 0.1s
return code: 0

...

BoPeng (Contributor) commented Feb 18, 2019

The second example from this post?

gaow (Member, Author) commented Feb 18, 2019

Yes, that post has only one example, right? The complete output is:

$ export MEM_CHECK_INTERVAL=0.1
$ python monitor.py sos run main.sos -v1 -s force
[###########################################################] 59 steps processed (1945 jobs completed)

time elapsed: 58.97s
peak first occurred: 32.33s
peak last occurred: 32.33s
max vms_memory: 62.61GB
max rss_memory: 4.78GB
memory check interval: 0.1s
return code: 0

My Python is:

$ python --version
Python 3.6.1 :: Continuum Analytics, Inc.

Maybe I should get Python 3.7???

BoPeng (Contributor) commented Feb 18, 2019

I cannot imagine such a huge difference; my earlier report was with Python 3.6.7 on a Mac.

gaow (Member, Author) commented Feb 18, 2019

You are right, Python 3.7.1 does not make it any better ...

BoPeng (Contributor) commented Feb 18, 2019

Here is the DAG of the workflow. Basically, there are a large number of nested workflows...

[image: DAG of the workflow]

I like the umbrella shape.

gaow (Member, Author) commented Feb 18, 2019

It is both beautiful and ugly :P

BoPeng (Contributor) commented Feb 18, 2019

In theory nested workflows could grow out of the step that generates them, but that is too much work...

gaow (Member, Author) commented Feb 18, 2019

Sorry I do not understand. Would you elaborate?

The nested workflow "design" is how I generate lots of benchmarks automatically via the DSC syntax. It is a stress test / challenge for SoS, but we are going to use this type of design a lot for benchmarking. Perhaps we can revisit this together later when we have more resources; any improvement at this point would be extremely helpful, though!

BoPeng (Contributor) commented Feb 18, 2019

I was just describing how the nested workflows are placed on the animated image. These subworkflows have their own DAGs (in your case a single step each), but in theory they are generated by a step of the mother workflow, so we could use some sort of dashed line to connect each subworkflow to the step on the master that generated it. To do that, though, I would have to merge graphs in the dot file, which can be quite troublesome.
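
For what it's worth, here is a toy sketch of that idea using the graphviz Python package (hypothetical node names, not SoS's actual DAG code): the nested workflow goes into a cluster subgraph, and a dashed edge links it back to the generating step.

```python
# Toy sketch: draw a nested workflow as a cluster inside the master DAG and
# connect it to its generating step with a dashed edge. Node names are made up;
# this is not how sos currently builds its dot files.
import graphviz

g = graphviz.Digraph("master")
g.edge("step_1", "step_2")
with g.subgraph(name="cluster_nested") as sub:  # "cluster" prefix draws a box
    sub.attr(label="nested workflow", style="dashed")
    sub.node("nested_step_1")
g.edge("step_2", "nested_step_1", style="dashed")  # dashed link from parent step
g.edge("nested_step_1", "step_3")
print(g.source)  # dot source that could be merged into the existing output
```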

gaow (Member, Author) commented Feb 18, 2019

I got a bigger job running, and it was killed on a compute node even with 32GB of memory requested:

exceeded memory limit (41185432 > 32768000), being killed

BoPeng (Contributor) commented Feb 18, 2019

What is your -j setting? The waiting processes do use memory though.

gaow (Member, Author) commented Feb 18, 2019

Good point -- I did not set -j, so the default is half of the CPU threads available on that machine, which is 14, I believe ...

BoPeng (Contributor) commented Feb 18, 2019

Your monitor script could write a more detailed log file with the command and usage of each process, so that we know whether the memory is used by tasks (for which sos should have less overhead) or by the main step, and whether it is spread evenly or concentrated in one process.
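
Something along these lines could be bolted onto the monitor loop sketched earlier (hypothetical log format, assumes psutil):

```python
# Possible extension of the monitor: one line per process per polling interval,
# so memory can be attributed to the sos master, step workers, or tasks.
# Hypothetical log format; not part of the actual monitor.py.
import time

import psutil


def log_process_tree(parent: psutil.Process, logfile: str) -> None:
    with open(logfile, "a") as log:
        for p in [parent] + parent.children(recursive=True):
            try:
                rss_mb = p.memory_info().rss / 1024**2
                cmd = " ".join(p.cmdline()) or p.name()
                log.write(f"{time.time():.0f}\t{p.pid}\t{rss_mb:.1f}MB\t{cmd}\n")
            except psutil.NoSuchProcess:
                continue
```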

BoPeng (Contributor) commented Apr 10, 2019

Sounds good. The number of processes is supposed to be j+1 plus the number of non-terminal nested workflows, which is a feature we allow but strongly discourage.

gaow (Member, Author) commented Apr 10, 2019

BTW, can we make a release for this (so I can ask other people to simply upgrade SoS instead of adding -c 4), if that is reasonable? Thanks!

BoPeng (Contributor) commented Apr 10, 2019

OK, there is only one real fix (#1246), but I have just released 0.19.4 since all tests pass.

BoPeng (Contributor) commented Apr 10, 2019

Hold on, my patch does not look right:

default=min(max(os.cpu_count() // 2, 8), 1),

BoPeng (Contributor) commented Apr 10, 2019

It should be

default=min(max(os.cpu_count() // 2, 1), 8)
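
To see the difference concretely (a quick check, nothing SoS-specific):

```python
import os

cpus = os.cpu_count() or 1          # e.g. 16 on a 16-thread machine
wrong = min(max(cpus // 2, 8), 1)   # the outer min(..., 1) caps the result at 1, always
right = min(max(cpus // 2, 1), 8)   # clamps cpus // 2 into the range [1, 8]
print(wrong, right)                 # 16 threads -> 1 vs 8
```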

BoPeng (Contributor) commented Apr 10, 2019

Also, is 8 a reasonable default? Since sos itself takes one process, and 4 processes seems like a reasonable size for a "multi-processing" job, it looks to me that

default=min(max(os.cpu_count() // 2, 1), 3)

could be a better default value.

gaow (Member, Author) commented Apr 10, 2019

Well, if most SoS users run pipelines as remote tasks, then possibly that is a better default. But for running everything locally, I feel min(CPU/2, 8) is a good default, at least for a desktop workstation.

gaow (Member, Author) commented Apr 10, 2019

And oops, sorry, indeed that earlier patch did not look right ...

BoPeng pushed commits that referenced this issue on Apr 10, 2019
BoPeng (Contributor) commented Apr 10, 2019

Fixed and made a new release.

gaow (Member, Author) commented Apr 10, 2019

Great, thank you!

gaow (Member, Author) commented Apr 10, 2019

Unfortunately, 4 threads did not help:

time elapsed: 698.82s
peak first occurred: 477.08s
peak last occurred: 477.08s
max vms_memory: 24.02GB
max rss_memory: 20.56GB

and I see the hanging behavior has come back. Using the ps command I see defunct processes:

16829 pts/178  00:01:06 sos <defunct>
16977 pts/178  00:01:06 sos <defunct>
16979 pts/178  00:01:06 sos <defunct>
19202 pts/178  00:03:35 sos
49618 pts/178  00:02:19 sos

and after I `ctrl-C`, I get:

16829 ? 00:01:06 sos
16977 ? 00:01:06 sos
16979 ? 00:01:06 sos
19202 ? 00:03:36 sos
49618 ? 00:02:19 sos


That is, everything is now `defunct` when `sos` should have just quit. The only thing that looks correct is the count: 5 processes matches `-j4` + 1.

I'm trying to put together a minimal working example for this.

BoPeng (Contributor) commented Apr 10, 2019

Do you mean it worked locally but failed on a compute node with -j 4?

gaow (Member, Author) commented Apr 10, 2019

This is what I'm trying to find out now. In principle, -j 4 should use the same amount of memory for task (remote) and non-task (local) runs, right? I have never had any memory issues for local runs, but I assume that is because those workflows were not large enough. I'm wrapping up something right now, but I should be able to provide an MWE offline soon.

gaow (Member, Author) commented Apr 11, 2019

Now I get:

[MW] ps -u gaow | grep sos
 5628 pts/46   00:03:16 sos
32614 pts/46   00:16:50 sos
34731 pts/46   00:04:37 sos
43916 pts/46   00:00:02 sos
51186 pts/46   00:03:16 sos <defunct>
51202 pts/46   00:03:17 sos <defunct>

When I kill it with ctrl-c this time, all of the processes are killed. The memory usage is 18GB.

I did not finish the whole run, but I think this is a good enough test? I've sent you a DM.

BoPeng (Contributor) commented Apr 11, 2019

For a real-world example, at some point I see a lot of messages like

2019-04-11 00:11:39,463 - DEBUG - WORKER 57007 - 15060.3Mb - SUBSTEP REQUESTED: 4 workers (of which 0 is blocking), 14159 requested, 7509 processed

While the last number increases slowly, the "requested" number increases rapidly. That means a large number of substeps are queued in the controller, waiting to be processed. The substep messages can be large if they contain large variables.

A likely solution is for step workers to stop sending so many substeps all at once.

gaow (Member, Author) commented Apr 11, 2019

> The substep messages can be large if they contain large variables.

Each of them contains a copy of the global variables, right? And a copy of step_input and step_output?

BoPeng (Contributor) commented Apr 11, 2019

Yes. Unlike concurrent=False, where all substeps are executed in the step worker and share the same global variables, concurrent substeps are sent to the master (and then dispatched to substep workers), each with its own copy of the global variables. So when a step worker sends tens of thousands of substeps to the master all at once, the master queues tens of thousands of copies of the global variables.
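
For a rough sense of scale (hypothetical numbers, not measurements from this workflow): if the pickled global dictionary attached to each substep message is around 10MB and roughly 2,000 substeps are queued at once, the master alone would hold on the order of 20GB of queued messages, which is the same order of magnitude as the RSS peaks reported earlier in this thread.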

BoPeng (Contributor) commented Apr 11, 2019

A persistent queue on the master side sounds like a more robust solution:

https://pypi.org/project/persist-queue/

although we should still use an in-memory queue for small workflows.
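
A minimal sketch of what using persist-queue could look like (illustrative only; this is not what SoS adopted, and the directory name and message shape are made up):

```python
# Minimal sketch of a disk-backed FIFO with persist-queue. Items are pickled
# and stored under the given directory, so queued substeps would not sit in RAM.
import persistqueue

q = persistqueue.Queue(".sos/substep_queue")
q.put({"step": "normal_1", "index": 0, "vars": {}})  # enqueue a substep message
msg = q.get()                                        # dequeue it later
q.task_done()                                        # mark it as processed
```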

BoPeng (Contributor) commented Apr 11, 2019

Just for reference, the following seems to work and confirms that the substep queue is the source of the problem:

diff --git a/src/sos/workers.py b/src/sos/workers.py
index 52eaf10..eba85d3 100755
--- a/src/sos/workers.py
+++ b/src/sos/workers.py
@@ -5,9 +5,11 @@
 
 import multiprocessing as mp
 import os
+import pickle
 import signal
 import time
 from typing import Any, Dict, Optional
+from queuelib import FifoDiskQueue
 
 import zmq
 
@@ -318,7 +320,7 @@ class WorkerManager(object):
         self._worker_alive_time = time.time()
         self._last_pending_time = {}
 
-        self._substep_requests = []
+        self._substep_requests = FifoDiskQueue('testfilefile')
         self._step_requests = {}
 
         self._worker_backend_socket = backend_socket
@@ -341,7 +343,7 @@ class WorkerManager(object):
     def add_request(self, msg_type, msg):
         self._n_requested += 1
         if msg_type == 'substep':
-            self._substep_requests.insert(0, msg)
+            self._substep_requests.push(pickle.dumps(msg))
             self.report(f'Substep requested')
         else:
             port = msg['config']['sockets']['master_port']
@@ -414,9 +416,9 @@ class WorkerManager(object):
             self._worker_backend_socket.send_pyobj(None)
             self._num_workers -= 1
             self.report(f'Blocking worker {ports} killed')
-        elif self._substep_requests:
+        elif len(self._substep_requests) > 0:
             # port is not claimed, free to use for substep worker
-            msg = self._substep_requests.pop()
+            msg = pickle.loads(self._substep_requests.pop())
             self._worker_backend_socket.send_pyobj(msg)
             self._n_processed += 1
             self.report(f'Substep processed with {ports[0]}')

although it is a bad idea to use a persistent queue all the time.
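
One way to reconcile the two (a sketch under my own assumptions, not what SoS ended up doing): keep a small in-memory buffer and only spill to a FifoDiskQueue once it fills up.

```python
# Sketch of a hybrid substep queue: in-memory below a threshold, spilling to a
# queuelib FifoDiskQueue above it. The threshold and path are made up; strict
# FIFO order across the two tiers is not preserved, which is acceptable for
# independent substeps but would matter for ordered work.
import pickle
from collections import deque

from queuelib import FifoDiskQueue


class HybridQueue:
    def __init__(self, path: str, max_in_memory: int = 1000) -> None:
        self._mem = deque()
        self._disk = FifoDiskQueue(path)
        self._max = max_in_memory

    def push(self, msg) -> None:
        if len(self._mem) < self._max:
            self._mem.append(msg)
        else:
            self._disk.push(pickle.dumps(msg))  # spill large backlogs to disk

    def pop(self):
        if self._mem:
            return self._mem.popleft()
        data = self._disk.pop()
        return None if data is None else pickle.loads(data)

    def __len__(self) -> int:
        return len(self._mem) + len(self._disk)

    def close(self) -> None:
        self._disk.close()
```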

gaow (Member, Author) commented Apr 11, 2019

So the idea is to use a file-based rather than a memory-based queue? More fundamentally, I think it might be worth trying to reduce the size of the queue itself, because 20GB on disk is also a lot of resources, not to mention the potential I/O bottleneck.

BoPeng (Contributor) commented Apr 11, 2019

This is basically some work to confirm the source of the problem. It is hard to know the performance penalty of a disk-based queue, but after an -s build run through the workflow:

  1. The on-disk queue is big:

[MW] ls -lh ~/tmp/10-Apr-2019/testfilefile/q00000
-rw-rw-r-- 1 gaow gaow 45G Apr 11 09:37 /home/gaow/tmp/10-Apr-2019/testfilefile/q00000

I did not check the source, but I believe this simple implementation just stores all processed and unprocessed messages on disk and uses a pointer to skip popped messages.

  2. The maximum memory usage of the master process is 2.5G, in contrast to >10G before.

I did not check the size of each substep, but an on-disk (or database) queue is unavoidable for large workflows. We just need to pick a solution with a reasonable compromise between performance and memory/disk usage.

BoPeng (Contributor) commented Apr 11, 2019

Note that I also had a patch that prevents step workers from submitting too many substeps to the master. Basically, the step worker knows how many substeps are being processed and can wait for results before submitting new ones. This could be a better approach because there is no point in sending a large number of substeps if we know the system is busy, but:

  1. When trunk_size is involved, say trunk_size=1000, we have to wait for at least several "batches" because step workers return results in batches. We also need to avoid a deadlock condition where the worker waits for part of a "batch" that has not been submitted.

  2. We need to consider the number of workers as well when deciding whether to send new substeps.

  3. In theory, if there are many step workers, the master will still be swamped with substeps even if each step worker contributes only, say, 1000 substeps.

I believe a proper solution should combine a client-side (step worker) throttle with a server-side on-disk cache.

Here is an untested patch for the step worker:

diff --git a/src/sos/step_executor.py b/src/sos/step_executor.py
index 3367cca8..4fe2780e 100755
--- a/src/sos/step_executor.py
+++ b/src/sos/step_executor.py
@@ -684,8 +684,15 @@ class Base_Step_Executor:
     def process_returned_substep_result(self, till=None, wait=True):
         while True:
             if not wait:
-                if not self.result_pull_socket.poll(0):
-                    return
+                # 1213
+                cur_index = env.sos_dict['_index']
+                trunk_size = env.sos_dict['_runtime']['trunk_size'] if 'trunk_size' in env.sos_dict['_runtime'] and isinstance(env.sos_dict['_runtime']['trunk_size'], int) else 1
+                pending_trunks = (cur_index - self._completed_concurrent_substeps) // trunk_size
+                if pending_trunks < 100:
+                    # if there are more than 100 pending trunks (e.g. 1000 substeps if trunk_size is 10)
+                    # we wait indefinitely for the results
+                    if not self.result_pull_socket.poll(0):
+                        return
             elif self._completed_concurrent_substeps == till:
                 return
             yield self.result_pull_socket

BoPeng (Contributor) commented Apr 11, 2019

I was glad to realize that trunk_size does not matter, so the problem was easy to fix. I just submitted a patch that makes sure there are at most 10 substeps waiting on the master side for each step worker. That means there will be at most 100 pending substeps for -j10, which is certainly tolerable and does not need to be persisted to disk.

Let me know if the patch works.

gaow (Member, Author) commented Apr 11, 2019

Great, yes it works -- with -j4 I now get 2.13GB RSS for what used to cost me >20G. Looks like we can close this issue, if I'm not being too optimistic!
