
SoS memory usage for large workflows #1213

Closed · gaow opened this issue Feb 16, 2019 · 71 comments

gaow (Member) commented Feb 16, 2019

When I submitted a sos command to a compute node to submit jobs, I was surprised by how much memory it used. I asked for 16GB of memory, but I got this:

Job 57608706 exceeded memory limit (17451008 > 16384000)

I never realized it was so resource-consuming. @BoPeng do you see this on your end? Should I try to create some large test jobs, or do you have some jobs you could use to test?

BoPeng (Contributor) commented Feb 16, 2019

What job are you running? sos usually takes a few MB of RAM. It could use a few hundred MB if you have a complex workflow, but it should not need more than that.

gaow (Member, Author) commented Feb 16, 2019

Okay, it looks like this should not happen, then. I'm looking into it. The workflow is generated automatically with DSC -- I will create a smaller example that removes the heavy computations but keeps the structure of the workflow, verify the memory usage on my own desktop (using this script), and post it for you to reproduce.

gaow (Member, Author) commented Feb 17, 2019

@BoPeng Attached is an example. I removed the core computations so there should not be any issues due to computations in each step. Here is what I found:

$ python monitor.py sos run main.sos
time elapsed: 5.55s
peak first occurred: 3.02s
peak last occurred: 4.03s
max vms_memory: 21.34GB
max rss_memory: 0.98GB
memory check interval: 1s
return code: 0

As you can see, the actual (RSS) memory usage is 0.98GB, but the allocated (virtual) memory is very high. The example is bundled and attached:

mem_test.tar.gz
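
For anyone reading along, here is a minimal sketch of what a monitor script like this could look like; it assumes psutil and is not the actual monitor.py used above, whose output format and options may differ.

```python
# Minimal sketch of a process-tree memory monitor (assumes psutil; the real
# monitor.py used in this thread may differ). It polls RSS/VMS of the command
# and all of its children every MEM_CHECK_INTERVAL seconds and reports peaks.
import os
import subprocess
import sys
import time

import psutil


def monitor(cmd):
    interval = float(os.environ.get("MEM_CHECK_INTERVAL", 1))
    start = time.time()
    proc = subprocess.Popen(cmd)
    parent = psutil.Process(proc.pid)
    max_rss = max_vms = 0
    while proc.poll() is None:
        rss = vms = 0
        try:
            procs = [parent] + parent.children(recursive=True)
        except psutil.NoSuchProcess:
            break
        for p in procs:
            try:
                mem = p.memory_info()
                rss, vms = rss + mem.rss, vms + mem.vms
            except psutil.NoSuchProcess:
                continue
        max_rss, max_vms = max(max_rss, rss), max(max_vms, vms)
        time.sleep(interval)
    print(f"time elapsed: {time.time() - start:.2f}s")
    print(f"max vms_memory: {max_vms / 1024**3:.2f}GB")
    print(f"max rss_memory: {max_rss / 1024**3:.2f}GB")
    print(f"memory check interval: {interval}s")
    print(f"return code: {proc.returncode}")
    return proc.returncode


if __name__ == "__main__":
    sys.exit(monitor(sys.argv[1:]))  # e.g. python monitor.py sos run main.sos
```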

BoPeng (Contributor) commented Feb 17, 2019

I don't know how much virtual memory matters, as I always check real (RSS) memory. Are you sure the job was killed because of virtual memory?

VIRT stands for the virtual size of a process, which is the sum of memory it is actually using, memory it has mapped into itself (for instance the video card’s RAM for the X server), files on disk that have been mapped into it (most notably shared libraries), and memory shared with other processes. VIRT represents how much memory the program is able to access at the present moment.
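
If it helps to see the two numbers side by side, psutil reports both for any process (a quick check, nothing sos-specific):

```python
# Compare RSS (resident) and VMS (virtual) memory for the current process.
import os

import psutil

mem = psutil.Process(os.getpid()).memory_info()
print(f"rss: {mem.rss / 1024**2:.1f} MB, vms: {mem.vms / 1024**2:.1f} MB")
```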

gaow (Member, Author) commented Feb 17, 2019

Perhaps the example above was overly minimal ... but for the attached example, even RSS memory is 5.39GB. You can test the one below:

test.tar.gz

time elapsed: 58.87s
peak first occurred: 29.13s
peak last occurred: 29.13s
max vms_memory: 64.96GB
max rss_memory: 5.39GB
memory check interval: 1s
return code: 0

BoPeng (Contributor) commented Feb 18, 2019

(sos) [bpeng1@BCBMF5KWK0F8F69:~/Downloads]$ python monitor.py sos run main -v1 -s force
[###########################################################] 59 steps processed (1945 jobs completed)

time elapsed: 182.99s
peak first occurred: 75.03s
peak last occurred: 166.26s
max vms_memory: 148.45GB
max rss_memory: 1.67GB
memory check interval: 1s
return code: 0
(sos) [bpeng1@BCBMF5KWK0F8F69:~/Downloads]$ python monitor.py sos run main -v1
[###########################################################] 59 steps processed (30 jobs completed, 1915 jobs ignored)

time elapsed: 12.67s
peak first occurred: 6.08s
peak last occurred: 11.15s
max vms_memory: 144.72GB
max rss_memory: 1.36GB
memory check interval: 1s
return code: 0

I see huge VMS memory but relatively small RSS memory.

gaow (Member, Author) commented Feb 18, 2019

Hmm, this is interesting. It looks like that is a Linux machine? Mine is Linux (Debian 9).

Currently the check interval is 1s. It can be modified via an environment variable:

export MEM_CHECK_INTERVAL=0.1

In any case, what do you think is happening? Also, do you think 1.36GB is on the high end? My actual application is bigger than this.

BoPeng (Contributor) commented Feb 18, 2019

I do not know. In general, Python is not good at memory management: large objects (nested dictionaries, large arrays) take a lot of RAM, and Python's garbage collection means that peak memory usage is unpredictable; the peak can involve both old (deleted but not yet released to the system) and new variables.

All these make it pretty difficult to check what is going on.
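
As a rough illustration of the "deleted but not released" point (CPython- and allocator-dependent; the exact numbers will vary by platform):

```python
# Rough illustration: freeing a large object in CPython does not necessarily
# return memory to the OS immediately, so RSS can stay elevated after `del`.
# Behavior depends on the allocator and platform; treat this as a demo only.
import gc
import os

import psutil

proc = psutil.Process(os.getpid())
print(f"baseline rss: {proc.memory_info().rss / 1024**2:.0f} MB")

big = [b"x" * 1024 for _ in range(1_000_000)]  # roughly 1GB of small objects
print(f"after alloc:  {proc.memory_info().rss / 1024**2:.0f} MB")

del big
gc.collect()
print(f"after del:    {proc.memory_info().rss / 1024**2:.0f} MB")  # often still above baseline
```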

BoPeng (Contributor) commented Feb 18, 2019

On a Ubuntu VM,

$ python monitor.py sos run main.sos -v1
[###########################################################] 59 steps processed (1945 jobs completed)

time elapsed: 328.00s
peak first occurred: 224.06s
peak last occurred: 293.01s
max vms_memory: 10.19GB
max rss_memory: 1.17GB
memory check interval: 1s
return code: 0

gaow (Member, Author) commented Feb 18, 2019

What if you set

export MEM_CHECK_INTERVAL=0.1

and run again? Actually, I've tried two systems: on the cluster (Scientific Linux 7, which is essentially Red Hat 7) it leads directly to disconnection from the cluster because the run exceeds the 2GB memory limit. On my desktop I was able to reproduce the 5.39GB memory footprint.

BoPeng (Contributor) commented Feb 18, 2019

(base) bpeng1@ubuntu:~/Downloads$ python monitor.py sos run main.sos -v1
[###########################################################] 59 steps processed (30 jobs completed, 1915 jobs ignored)

time elapsed: 16.80s
peak first occurred: 4.03s
peak last occurred: 16.63s
max vms_memory: 8.23GB
max rss_memory: 0.85GB
memory check interval: 0.1s
return code: 0

BoPeng (Contributor) commented Feb 18, 2019

Could you update to the latest master and try again? I am not sure if it is related, but I am using master.

gaow (Member, Author) commented Feb 18, 2019

Sure, this is what I got on the latest master:

time elapsed: 58.97s
peak first occurred: 32.33s
peak last occurred: 32.33s
max vms_memory: 62.61GB
max rss_memory: 4.78GB
memory check interval: 0.1s
return code: 0

...

BoPeng (Contributor) commented Feb 18, 2019

The second example from this post?

gaow (Member, Author) commented Feb 18, 2019

Yes, that post has only one example, right? The complete output is:

$ export MEM_CHECK_INTERVAL=0.1
$ python monitor.py sos run main.sos -v1 -s force
[###########################################################] 59 steps processed (1945 jobs completed)

time elapsed: 58.97s
peak first occurred: 32.33s
peak last occurred: 32.33s
max vms_memory: 62.61GB
max rss_memory: 4.78GB
memory check interval: 0.1s
return code: 0

My Python is:

$ python --version
Python 3.6.1 :: Continuum Analytics, Inc.

Maybe I should get Python 3.7???

BoPeng (Contributor) commented Feb 18, 2019

I cannot imagine such a huge difference; my earlier report was with Python 3.6.7 on a Mac.

gaow (Member, Author) commented Feb 18, 2019

You are right, Python 3.7.1 does not make it any better ...

BoPeng (Contributor) commented Feb 18, 2019

Here is the DAG of the workflow. Basically, there are a large number of nested workflows...

[image: DAG of the workflow]

I like the umbrella shape.

gaow (Member, Author) commented Feb 18, 2019

It is both beautiful and ugly :P

BoPeng (Contributor) commented Feb 18, 2019

In theory nested workflows could grow out of the step that generates them, but that is too much work...

gaow (Member, Author) commented Feb 18, 2019

Sorry I do not understand. Would you elaborate?

The nested workflow "design" is how I generate lots of benchmarks automatically via the DSC syntax. It is a stress test / challenge for SoS, but we are going to use this type of design a lot for benchmarking. Perhaps we can revisit this together later when we have more resources; any improvement at this point would be extremely helpful, though!

BoPeng (Contributor) commented Feb 18, 2019

I was just describing how the nested workflows are placed on the animated image. These subworkflows have their own DAGs (in your case a single step each), but in theory they are generated by a step of the mother workflow, so we could use some sort of dashed line to connect each subworkflow to the step on the master that generated it. To do that, though, I would have to merge graphs in the dot file, which can be quite troublesome.
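
For what it's worth, here is a toy sketch of that idea using the graphviz Python package (hypothetical node names, not SoS's actual DAG code): the nested workflow goes into a cluster subgraph, and a dashed edge links it back to the generating step.

```python
# Toy sketch: draw a nested workflow as a cluster inside the master DAG and
# connect it to its generating step with a dashed edge. Node names are made up;
# this is not how sos currently builds its dot files.
import graphviz

g = graphviz.Digraph("master")
g.edge("step_1", "step_2")
with g.subgraph(name="cluster_nested") as sub:  # "cluster" prefix draws a box
    sub.attr(label="nested workflow", style="dashed")
    sub.node("nested_step_1")
g.edge("step_2", "nested_step_1", style="dashed")  # dashed link from parent step
g.edge("nested_step_1", "step_3")
print(g.source)  # dot source that could be merged into the existing output
```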

gaow (Member, Author) commented Feb 18, 2019

I got a bigger job running, and it was killed on a compute node even with 32GB of memory requested:

exceeded memory limit (41185432 > 32768000), being killed

BoPeng (Contributor) commented Feb 18, 2019

What is your -j setting? The waiting processes do use memory though.

gaow (Member, Author) commented Feb 18, 2019

Good point -- I did not set -j, so the default is half of the CPU threads available on that machine, which is 14, I believe ...

BoPeng (Contributor) commented Feb 18, 2019

Your monitor script could write a more detailed log file with the command and usage of each process, so that we know whether the memory is used by tasks (for which sos should have less overhead) or by the main step, and whether it is spread evenly or concentrated in one process.
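
Something along these lines could be bolted onto the monitor loop sketched earlier (hypothetical log format, assumes psutil):

```python
# Possible extension of the monitor: one line per process per polling interval,
# so memory can be attributed to the sos master, step workers, or tasks.
# Hypothetical log format; not part of the actual monitor.py.
import time

import psutil


def log_process_tree(parent: psutil.Process, logfile: str) -> None:
    with open(logfile, "a") as log:
        for p in [parent] + parent.children(recursive=True):
            try:
                rss_mb = p.memory_info().rss / 1024**2
                cmd = " ".join(p.cmdline()) or p.name()
                log.write(f"{time.time():.0f}\t{p.pid}\t{rss_mb:.1f}MB\t{cmd}\n")
            except psutil.NoSuchProcess:
                continue
```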

BoPeng (Contributor) commented Apr 10, 2019

Sounds good. The number of processes is supposed to be j+1 plus the number of non-terminal nested workflows, which is a feature we allow but strongly discourage.

gaow (Member, Author) commented Apr 10, 2019

BTW, can we make a release for this (so I can ask other people to simply upgrade SoS instead of adding -c 4), if that is reasonable? Thanks!

BoPeng (Contributor) commented Apr 10, 2019

OK, there is only one real fix (#1246), but I have just released 0.19.4 since all tests pass.

BoPeng (Contributor) commented Apr 10, 2019

Hold on, my patch does not look right:

default=min(max(os.cpu_count() // 2, 8), 1),

BoPeng (Contributor) commented Apr 10, 2019

It should be

default=min(max(os.cpu_count() // 2, 1), 8)
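
To see the difference concretely (a quick check, nothing SoS-specific):

```python
import os

cpus = os.cpu_count() or 1          # e.g. 16 on a 16-thread machine
wrong = min(max(cpus // 2, 8), 1)   # the outer min(..., 1) caps the result at 1, always
right = min(max(cpus // 2, 1), 8)   # clamps cpus // 2 into the range [1, 8]
print(wrong, right)                 # 16 threads -> 1 vs 8
```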

BoPeng (Contributor) commented Apr 10, 2019

Also, is 8 a reasonable default? Since sos itself takes one process, and 4 processes seems like a reasonable size for a "multi-processing" job, it looks to me that

default=min(max(os.cpu_count() // 2, 1), 3)

could be a better default value.

gaow (Member, Author) commented Apr 10, 2019

Well, if most SoS users run pipelines as remote tasks, then possibly that is a better default. But for running everything locally, I feel min(CPU/2, 8) is a good default, at least for a desktop workstation.

gaow (Member, Author) commented Apr 10, 2019

And oops, sorry, indeed that earlier patch did not look right ...

BoPeng pushed commits that referenced this issue on Apr 10, 2019
BoPeng (Contributor) commented Apr 10, 2019

Fixed and made a new release.

gaow (Member, Author) commented Apr 10, 2019

Great, thank you!

gaow (Member, Author) commented Apr 10, 2019

Unfortunately, 4 threads did not help:

time elapsed: 698.82s
peak first occurred: 477.08s
peak last occurred: 477.08s
max vms_memory: 24.02GB
max rss_memory: 20.56GB

and I see the hanging behavior has come back. Using the ps command I see defunct processes:

16829 pts/178  00:01:06 sos <defunct>
16977 pts/178  00:01:06 sos <defunct>
16979 pts/178  00:01:06 sos <defunct>
19202 pts/178  00:03:35 sos
49618 pts/178  00:02:19 sos

and after I `ctrl-C`, I get:

16829 ? 00:01:06 sos
16977 ? 00:01:06 sos
16979 ? 00:01:06 sos
19202 ? 00:03:36 sos
49618 ? 00:02:19 sos


That is, everything is now `defunct` when `sos` should have just quit. The only thing that looks correct is the count: 5 processes matches `-j4` + 1.

I'm trying to put together a minimal working example for this.

BoPeng (Contributor) commented Apr 10, 2019

Do you mean it worked locally but failed on a compute node with -j 4?

gaow (Member, Author) commented Apr 10, 2019

This is what I'm trying to find out now. In principle, -j 4 should use the same amount of memory for task (remote) and non-task (local) runs, right? I have never had any memory issues for local runs, but I assume that is because those workflows were not large enough. I'm wrapping up something right now, but I should be able to provide an MWE offline soon.

gaow (Member, Author) commented Apr 11, 2019

Now I get:

[MW] ps -u gaow | grep sos
 5628 pts/46   00:03:16 sos
32614 pts/46   00:16:50 sos
34731 pts/46   00:04:37 sos
43916 pts/46   00:00:02 sos
51186 pts/46   00:03:16 sos <defunct>
51202 pts/46   00:03:17 sos <defunct>

When I kill it with ctrl-c this time, all of the processes are killed. The memory usage is 18GB.

I did not finish the whole run, but I think this is a good enough test? I've sent you a DM.

BoPeng (Contributor) commented Apr 11, 2019

For a real-world example, at some point I see a lot of messages like

2019-04-11 00:11:39,463 - DEBUG - WORKER 57007 - 15060.3Mb - SUBSTEP REQUESTED: 4 workers (of which 0 is blocking), 14159 requested, 7509 processed

While the last number increases slowly, the "requested" number increases rapidly. That means a large number of substeps are queued in the controller, waiting to be processed. The substep messages can be large if they contain large variables.

A likely solution is for step workers to stop sending so many substeps all at once.

gaow (Member, Author) commented Apr 11, 2019

> The substep messages can be large if they contain large variables.

Each of them contains a copy of the global variables, right? And a copy of step_input and step_output?

BoPeng (Contributor) commented Apr 11, 2019

Yes. Unlike concurrent=False, where all substeps are executed in the step worker and share the same global variables, concurrent substeps are sent to the master (and then dispatched to substep workers), each with its own copy of the global variables. So when a step worker sends tens of thousands of substeps to the master all at once, the master queues tens of thousands of copies of the global variables.
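
For a rough sense of scale (hypothetical numbers, not measurements from this workflow): if the pickled global dictionary attached to each substep message is around 10MB and roughly 2,000 substeps are queued at once, the master alone would hold on the order of 20GB of queued messages, which is the same order of magnitude as the RSS peaks reported earlier in this thread.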

BoPeng (Contributor) commented Apr 11, 2019

A persistent queue on the master side sounds like a more robust solution:

https://pypi.org/project/persist-queue/

although we should still use an in-memory queue for small workflows.
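
A minimal sketch of what using persist-queue could look like (illustrative only; this is not what SoS adopted, and the directory name and message shape are made up):

```python
# Minimal sketch of a disk-backed FIFO with persist-queue. Items are pickled
# and stored under the given directory, so queued substeps would not sit in RAM.
import persistqueue

q = persistqueue.Queue(".sos/substep_queue")
q.put({"step": "normal_1", "index": 0, "vars": {}})  # enqueue a substep message
msg = q.get()                                        # dequeue it later
q.task_done()                                        # mark it as processed
```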

BoPeng (Contributor) commented Apr 11, 2019

Just for reference, the following seems to work and confirms that the substep queue is the source of the problem:

diff --git a/src/sos/workers.py b/src/sos/workers.py
index 52eaf10..eba85d3 100755
--- a/src/sos/workers.py
+++ b/src/sos/workers.py
@@ -5,9 +5,11 @@
 
 import multiprocessing as mp
 import os
+import pickle
 import signal
 import time
 from typing import Any, Dict, Optional
+from queuelib import FifoDiskQueue
 
 import zmq
 
@@ -318,7 +320,7 @@ class WorkerManager(object):
         self._worker_alive_time = time.time()
         self._last_pending_time = {}
 
-        self._substep_requests = []
+        self._substep_requests = FifoDiskQueue('testfilefile')
         self._step_requests = {}
 
         self._worker_backend_socket = backend_socket
@@ -341,7 +343,7 @@ class WorkerManager(object):
     def add_request(self, msg_type, msg):
         self._n_requested += 1
         if msg_type == 'substep':
-            self._substep_requests.insert(0, msg)
+            self._substep_requests.push(pickle.dumps(msg))
             self.report(f'Substep requested')
         else:
             port = msg['config']['sockets']['master_port']
@@ -414,9 +416,9 @@ class WorkerManager(object):
             self._worker_backend_socket.send_pyobj(None)
             self._num_workers -= 1
             self.report(f'Blocking worker {ports} killed')
-        elif self._substep_requests:
+        elif len(self._substep_requests) > 0:
             # port is not claimed, free to use for substep worker
-            msg = self._substep_requests.pop()
+            msg = pickle.loads(self._substep_requests.pop())
             self._worker_backend_socket.send_pyobj(msg)
             self._n_processed += 1
             self.report(f'Substep processed with {ports[0]}')

although it is a bad idea to use a persistent queue all the time.
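
One way to reconcile the two (a sketch under my own assumptions, not what SoS ended up doing): keep a small in-memory buffer and only spill to a FifoDiskQueue once it fills up.

```python
# Sketch of a hybrid substep queue: in-memory below a threshold, spilling to a
# queuelib FifoDiskQueue above it. The threshold and path are made up; strict
# FIFO order across the two tiers is not preserved, which is acceptable for
# independent substeps but would matter for ordered work.
import pickle
from collections import deque

from queuelib import FifoDiskQueue


class HybridQueue:
    def __init__(self, path: str, max_in_memory: int = 1000) -> None:
        self._mem = deque()
        self._disk = FifoDiskQueue(path)
        self._max = max_in_memory

    def push(self, msg) -> None:
        if len(self._mem) < self._max:
            self._mem.append(msg)
        else:
            self._disk.push(pickle.dumps(msg))  # spill large backlogs to disk

    def pop(self):
        if self._mem:
            return self._mem.popleft()
        data = self._disk.pop()
        return None if data is None else pickle.loads(data)

    def __len__(self) -> int:
        return len(self._mem) + len(self._disk)

    def close(self) -> None:
        self._disk.close()
```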

gaow (Member, Author) commented Apr 11, 2019

So the idea is to use a file-based rather than a memory-based queue? More fundamentally, I think it might be worth trying to reduce the size of the queue itself, because 20GB on disk is also a lot of resources, not to mention the potential I/O bottleneck.

BoPeng (Contributor) commented Apr 11, 2019

This is basically some work to confirm the source of the problem. It is hard to know the performance penalty of a disk-based queue, but after an -s build run through the workflow:

  1. The on-disk queue is big:

[MW] ls -lh ~/tmp/10-Apr-2019/testfilefile/q00000
-rw-rw-r-- 1 gaow gaow 45G Apr 11 09:37 /home/gaow/tmp/10-Apr-2019/testfilefile/q00000

I did not check the source, but I believe this simple implementation just stores all processed and unprocessed messages on disk and uses a pointer to skip popped messages.

  2. The maximum memory usage of the master process is 2.5G, in contrast to >10G before.

I did not check the size of each substep, but an on-disk (or database) queue is unavoidable for large workflows. We just need to pick a solution with a reasonable compromise between performance and memory/disk usage.

BoPeng (Contributor) commented Apr 11, 2019

Note that I also had a patch that prevents step workers from submitting too many substeps to the master. Basically, the step worker knows how many substeps are being processed and can wait for results before submitting new ones. This could be a better approach because there is no point in sending a large number of substeps if we know the system is busy, but:

  1. When trunk_size is involved, say trunk_size=1000, we have to wait for at least several "batches" because step workers return results in batches. We also need to avoid a deadlock condition where the worker waits for part of a "batch" that has not been submitted.

  2. We need to consider the number of workers as well when deciding whether to send new substeps.

  3. In theory, if there are many step workers, the master will still be swamped with substeps even if each step worker contributes only, say, 1000 substeps.

I believe a proper solution should combine a client-side (step worker) throttle with a server-side on-disk cache.

Here is an untested patch for the step worker:

diff --git a/src/sos/step_executor.py b/src/sos/step_executor.py
index 3367cca8..4fe2780e 100755
--- a/src/sos/step_executor.py
+++ b/src/sos/step_executor.py
@@ -684,8 +684,15 @@ class Base_Step_Executor:
     def process_returned_substep_result(self, till=None, wait=True):
         while True:
             if not wait:
-                if not self.result_pull_socket.poll(0):
-                    return
+                # 1213
+                cur_index = env.sos_dict['_index']
+                trunk_size = env.sos_dict['_runtime']['trunk_size'] if 'trunk_size' in env.sos_dict['_runtime'] and isinstance(env.sos_dict['_runtime']['trunk_size'], int) else 1
+                pending_trunks = (cur_index - self._completed_concurrent_substeps) // trunk_size
+                if pending_trunks < 100:
+                    # if there are more than 100 pending trunks (e.g. 1000 substeps if trunk_size is 10)
+                    # we wait indefinitely for the results
+                    if not self.result_pull_socket.poll(0):
+                        return
             elif self._completed_concurrent_substeps == till:
                 return
             yield self.result_pull_socket

BoPeng (Contributor) commented Apr 11, 2019

I was glad to realize that trunk_size does not matter, so the problem was easy to fix. I just submitted a patch that makes sure there are at most 10 substeps waiting on the master side for each step worker. That means there will be at most 100 pending substeps for -j10, which is certainly tolerable and does not need to be persisted to disk.

Let me know if the patch works.

gaow (Member, Author) commented Apr 11, 2019

Great, yes it works -- with -j4 I now get 2.13GB RSS for what used to cost me >20G. Looks like we can close this issue, if I'm not being too optimistic!
