SoS memory usage for large workflows #1213
What job are you running? sos usually takes a few MB of RAM. It could use a few hundred MB if you have a complex workflow, but should not be more than that. |
Oh okay, it looks like this should not happen. I'm looking into it. It's a workflow generated automatically with DSC -- I will create a smaller example that removes the heavy computations but keeps the structure of the workflow, verify the memory usage on my own desktop (using this script), and share it for you to reproduce. |
@BoPeng Attached is an example. I removed the core computations so there should not be any issues due to computations in each step. Here is what I found:
You see the actual memory usage is 0.98GB but allocated memory is very high. The example is bundled as attached: |
I do not know how much the virtual memory counts, as I always check real memory
|
Perhaps that command above was overly minimal ... but for the attached command even the RSS memory is 5.39GB. You can test the one below:
|
I have huge VMS (virtual) memory and relatively smaller RSS memory. |
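(For illustration, a minimal sketch of how RSS and VMS can be compared with psutil; the PID is hypothetical and this is not the actual monitor script used above.)

import psutil

# Sum resident (RSS) and virtual (VMS) memory over a process and its children.
proc = psutil.Process(12345)  # hypothetical PID of the running sos process
procs = [proc] + proc.children(recursive=True)
rss = sum(p.memory_info().rss for p in procs)
vms = sum(p.memory_info().vms for p in procs)
print(f"RSS: {rss / 1e9:.2f} GB, VMS: {vms / 1e9:.2f} GB")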
Hmm, this is interesting. It looks like it's a Linux machine? Mine is Linux (Debian 9). Currently the check interval is 1s. It can be modified via a bash variable:
In any case, what do you think is happening? Also do you think 1.36GB is on the high end? My actual application is bigger than this. |
I do not know. In general, Python is not good at memory management: large objects (nested dictionaries, large arrays) take a lot of RAM, and Python's GC system means that peak memory usage is unpredictable and the peak can involve both old (deleted but not yet released to the system) and new variables. All this makes it pretty difficult to check what is going on. |
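(A small illustration of that point, assuming psutil is installed: memory freed inside Python is not necessarily returned to the operating system, so RSS can stay high after large objects are deleted.)

import gc
import os
import psutil

def rss_mb():
    # resident set size of the current process, in MB
    return psutil.Process(os.getpid()).memory_info().rss / 1e6

print(f"baseline: {rss_mb():.0f} MB")
big = {i: list(range(100)) for i in range(1_000_000)}  # large nested structure
print(f"after allocation: {rss_mb():.0f} MB")
del big
gc.collect()
print(f"after del + gc.collect(): {rss_mb():.0f} MB")  # often stays well above baseline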
On a Ubuntu VM,
|
what if you set
and run again? Actually I've tried two systems: on the cluster (Scientific Linux 7, which is essentially Redhat 7) it directly leads to disconnection from the cluster because the 2GB memory limit is exceeded. On my desktop I was able to get that 5.39GB memory footprint |
|
Can you try the latest master and run again? Not sure if anything is related, but I am using the master. |
Sure, this is what I got on the latest
... |
The second example from this post? |
Yes, that post has only one example, right? The complete output is:
My Python is:
|
I cannot imagine such a huge difference, and the earlier report was on Python 3.6.7 on a Mac. |
You are right, Python 3.7.1 does not make it any better ... |
It is both beautiful and ugly :P |
In theory nested workflows could grow out of the step that generates them, but that is too much work... |
Sorry, I do not understand. Would you elaborate? The nested workflow "design" is how I generated lots of benchmarks automatically via the DSC syntax. It is a stress test / challenge for SoS, but we are going to use this type of design a lot for benchmarking. Perhaps we can revisit this together later when we have more resources; but any improvement at this point would be extremely helpful! |
I was just describing how the nested workflows are placed on the animation image. These subworkflows have their own DAG (in your case a single step), but in theory they are generated by a step of the mother workflow, so we could use some sort of dashed line to connect each subworkflow to the step on the master that generated it; but to do that I would have to merge graphs in the dot file, which can be quite troublesome. |
I got a bigger job running and it was killed on a compute node with 32GB of memory requested!
|
What is your |
Good point -- I did not set |
Your monitor script could write a more detailed log file with the command and usage of each process, so that we know whether the memory is used by tasks (for which sos should have less overhead) or by the main step, and whether it is used evenly or mostly by one of the processes. |
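(A hedged sketch of what such a log could look like with psutil; the process filter, interval, and log path are made up, and this is not the actual monitor script.)

import time
import psutil

def log_memory(name_filter='sos', interval=5, logfile='mem_usage.log'):
    """Append a line per matching process: timestamp, pid, RSS, and command line."""
    with open(logfile, 'a') as log:
        while True:
            for p in psutil.process_iter(['pid', 'cmdline', 'memory_info']):
                cmd = ' '.join(p.info['cmdline'] or [])
                mi = p.info['memory_info']
                if name_filter in cmd and mi is not None:
                    log.write(f"{time.time():.0f}\t{p.info['pid']}\t{mi.rss / 1e6:.1f}MB\t{cmd}\n")
            log.flush()
            time.sleep(interval)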
Sounds good. The number of processes is supposed to be |
BTW, can we make a release for this (so I'll ask other people to simply upgrade SoS rather than adding
OK, only one real fix (#1246), but I have just released 0.19.4 since all tests pass. |
hold on, my patch does not look right
|
It should be
|
Also, is
can be a better "default" value. |
Well, if most SoS users run pipelines as remote tasks then possibly this is a better default. But for running on a single desktop, I feel min(CPU/2, 8) is a good default, at least for a desktop workstation. |
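(In code, that heuristic would be something like the following; just a sketch of the suggested default, not SoS's actual logic.)

import os

def default_num_workers():
    # half of the available CPUs, capped at 8, but never fewer than 1
    return max(1, min((os.cpu_count() or 2) // 2, 8))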
And oops, sorry, indeed that earlier patch does not look right ... |
Fixed and made a new release. |
Great, thank you! |
Unfortunately, 4 threads did not help:
and I see the hanging behavior coming back. Using
16829 ? 00:01:06 sos
|
Do you mean it worked locally but failed on a compute node with -j 4? |
This is what I'm trying to find out now. In principle |
Now I get:
When I kill it with ... I did not finish the whole run, but I think this is a good enough test? I've sent you the DM. |
For a real-world example, at some point I see a lot of messages like
while the last number increases slowly, the "requested" number ... A likely solution is for step workers to stop sending so many substeps all at once. |
Each of them contains a copy of the global variables, right? And a copy of |
Yes. Unlike |
A persistent queue on the master side sounds like a more robust solution (https://pypi.org/project/persist-queue/), although we should still use an in-memory queue for small applications. |
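(As a rough, untested sketch of what that could look like with persist-queue; the queue directory is arbitrary and how the messages would map onto the master's queue is an assumption.)

import persistqueue

# File-backed FIFO queue: items are serialized to disk under the given directory,
# so queued substep messages do not have to stay in the master's memory.
substeps = persistqueue.Queue('.sos/substep_queue')

substeps.put({'substep': 1})   # enqueue a pending substep message
msg = substeps.get()           # dequeue the oldest pending message
substeps.task_done()           # acknowledge it so it is not re-delivered after a restart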
Just for reference, the following seems to be working and confirms that the substep queue is the source of the problem:
diff --git a/src/sos/workers.py b/src/sos/workers.py
index 52eaf10..eba85d3 100755
--- a/src/sos/workers.py
+++ b/src/sos/workers.py
@@ -5,9 +5,11 @@
import multiprocessing as mp
import os
+import pickle
import signal
import time
from typing import Any, Dict, Optional
+from queuelib import FifoDiskQueue
import zmq
@@ -318,7 +320,7 @@ class WorkerManager(object):
self._worker_alive_time = time.time()
self._last_pending_time = {}
- self._substep_requests = []
+ self._substep_requests = FifoDiskQueue('testfilefile')
self._step_requests = {}
self._worker_backend_socket = backend_socket
@@ -341,7 +343,7 @@ class WorkerManager(object):
def add_request(self, msg_type, msg):
self._n_requested += 1
if msg_type == 'substep':
- self._substep_requests.insert(0, msg)
+ self._substep_requests.push(pickle.dumps(msg))
self.report(f'Substep requested')
else:
port = msg['config']['sockets']['master_port']
@@ -414,9 +416,9 @@ class WorkerManager(object):
self._worker_backend_socket.send_pyobj(None)
self._num_workers -= 1
self.report(f'Blocking worker {ports} killed')
- elif self._substep_requests:
+ elif len(self._substep_requests) > 0:
# port is not claimed, free to use for substep worker
- msg = self._substep_requests.pop()
+ msg = pickle.loads(self._substep_requests.pop())
self._worker_backend_socket.send_pyobj(msg)
self._n_processed += 1
self.report(f'Substep processed with {ports[0]}')
although it is a bad idea to use a persistent queue all the time. |
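(For context, queuelib's FifoDiskQueue stores raw bytes, which is why the patch pickles each message; below is a minimal standalone example of the same pattern, with an arbitrary file name.)

import pickle
from queuelib import FifoDiskQueue

q = FifoDiskQueue('substep_queue_data')  # on-disk FIFO queue
q.push(pickle.dumps({'substep': 1}))     # serialize the message before storing it
q.push(pickle.dumps({'substep': 2}))
print(len(q))                            # number of queued messages -> 2
msg = pickle.loads(q.pop())              # the oldest message comes out first
q.close()                                # flush remaining data to disk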
So the idea is to use a file-based rather than a memory-based queue? I think, more fundamentally, it might be worth trying to reduce the size of the queue, because 20GB on disk is also a lot of resources, not to mention the possible big I/O bottleneck. |
This is basically some work to confirm the source of the problem. It is hard to know the performance penalty of a disk-based queue, but after
I did not check the source, but I believe this simple implementation just stores all processed and unprocessed messages on disk and uses a pointer to skip popped messages.
I did not check the size of each substep, but an on-disk (or database) queue is unavoidable for large workflows. We just need to pick a solution with a reasonable compromise between performance and memory/disk usage. |
Note that I also had a patch that prevents step workers from submitting too many substeps to the master. Basically, the step worker knows how many substeps are being processed and can wait for results before submitting new ones. This can be a better approach because there is no point in sending a large number of substeps if we know the system is busy, but
I believe a proper solution should be a combination of a client-side (step worker) throttle and a server-side on-disk cache. Here is an untested patch for the step worker:
diff --git a/src/sos/step_executor.py b/src/sos/step_executor.py
index 3367cca8..4fe2780e 100755
--- a/src/sos/step_executor.py
+++ b/src/sos/step_executor.py
@@ -684,8 +684,15 @@ class Base_Step_Executor:
def process_returned_substep_result(self, till=None, wait=True):
while True:
if not wait:
- if not self.result_pull_socket.poll(0):
- return
+ # 1213
+ cur_index = env.sos_dict['_index']
+ trunk_size = env.sos_dict['_runtime']['trunk_size'] if 'trunk_size' in env.sos_dict['_runtime'] and isinstance(env.sos_dict['_runtime']['trunk_size'], int) else 1
+ pending_trunks = (cur_index - self._completed_concurrent_substeps) // trunk_size
+ if pending_trunks < 100:
+ # if there are more than 100 pending trunks (e.g. 1000 substeps if trunk_size is 10)
+ # we wait indefinitely for the results
+ if not self.result_pull_socket.poll(0):
+ return
elif self._completed_concurrent_substeps == till:
return
yield self.result_pull_socket |
Glad to realize that ... Let me know if the patch works. |
Great, yes it works -- with |
When I submitted a sos command to a compute node to submit jobs, I was surprised at how much memory it uses. I asked for 16GB of memory, but I got this:
I never realized it is so resource consuming. @BoPeng Do you see this on your end? Should I try to create some large test jobs, or do you have some jobs you could use to test?