
Memory Crisis with oCIS 4.0.5 #8257

Closed
dragotin opened this issue Jan 22, 2024 · 13 comments

@dragotin
Contributor

dragotin commented Jan 22, 2024

Users on central report that oCIS consumes a lot of memory when run on small hardware like a Raspberry Pi. It may be that the problem is only visible on small devices but actually occurs in every installation and could harm those as well.

One user reports that the problem can be mitigated by setting GOMEMLIMIT=999000000 when starting oCIS.

This ticket is about understanding what is going on and documenting the reason and mitigation at least in the dev docs. Does it make sense to set a GOMEMLIMIT for every installation?
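For reference (this is general Go runtime behavior, not anything oCIS-specific): the GOMEMLIMIT environment variable sets a soft limit that the runtime also exposes programmatically via runtime/debug.SetMemoryLimit. A minimal sketch using the 999000000 value reported above:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to starting the process with GOMEMLIMIT=999000000: the
	// runtime treats ~999 MB as a soft cap and runs the GC more aggressively
	// as the heap approaches it. It is not a hard limit, so the process can
	// still exceed it under sustained allocation pressure.
	previous := debug.SetMemoryLimit(999_000_000)
	fmt.Printf("previous memory limit: %d bytes\n", previous)
}
```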

Other bug reports like #6621 or #6874 may or may not be related.

@dragotin dragotin added Type:Bug Priority:p2-high Escalation, on top of current planning, release blocker labels Jan 22, 2024
@rhafer
Contributor

rhafer commented Jan 22, 2024

Just some additional information: on systems that already configure memory limits via cgroups (e.g. docker/podman running with --memory, or systemd units with MemoryLimit configured), we already set a GOMEMLIMIT via the automemlimit module (https://github.com/KimMachineGun/automemlimit).

When none of these limits is set, it's rather difficult to come up with a useful default.
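For illustration, a rough stdlib-only sketch of the idea behind automemlimit (the real module handles cgroup v1 and v2, hierarchies, and more; the file path and the 0.9 ratio here are assumptions for a simple cgroup-v2 case, not the module's actual code):

```go
package main

import (
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

// setMemLimitFromCgroup reads the cgroup-v2 memory limit and derives a
// GOMEMLIMIT from it, which is roughly what the automemlimit module does.
// "max" means no limit is configured -- the case where no useful default exists.
func setMemLimitFromCgroup(ratio float64) {
	raw, err := os.ReadFile("/sys/fs/cgroup/memory.max")
	if err != nil {
		return // no cgroup limit visible, leave GOMEMLIMIT untouched
	}
	val := strings.TrimSpace(string(raw))
	if val == "max" {
		return // unlimited cgroup: nothing sensible to derive
	}
	limit, err := strconv.ParseInt(val, 10, 64)
	if err != nil {
		return
	}
	debug.SetMemoryLimit(int64(float64(limit) * ratio))
}

func main() {
	setMemLimitFromCgroup(0.9) // keep some headroom below the cgroup limit
}
```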

@rhafer
Contributor

rhafer commented Jan 22, 2024

What's interesting is that according to https://central.owncloud.org/t/memory-usage-of-ocis/45601/20 the MemoryLimit config in systemd does not have the desired effect. This is something we should investigate.

@dragonchaser dragonchaser self-assigned this Jan 23, 2024
@dragonchaser
Member

I have tried to reproduce that on 4.0.5 with no luck so far. I uploaded 2k files through the web UI and 90k images through the client, and rcloned these across two instances. I see no significant ramp-up in memory usage.

@wkloucek
Contributor

@dragonchaser
Member

We don't know yet... I was able to reproduce some memory spiking, but I'm still unsure what is happening there.

@micbar micbar added this to the Release 5.0.0 milestone Jan 26, 2024
@dragonchaser
Member

dragonchaser commented Jan 29, 2024

What I know so far:

  • Under certain conditions the Go garbage collector becomes blocking
  • when the GC becomes blocking, memory usage ramps up to four times the value set by GOMEMLIMIT

How to reproduce:

  • create a VM for running oCIS, set it up with specs comparable to an RPi 4 (8 GiB RAM)
  • install oCIS according to: https://doc.owncloud.com/ocis/next/depl-examples/bare-metal.html
  • set GOMEMLIMIT to a ridiculously small value (e.g. 512MiB)
  • upload thousands of images (I used 90k, JPEG 1024*768, uploading 10k per folder)
  • get a trace in parallel (curl http://127.0.0.1:9205/debug/pprof/trace?seconds=60 > trace.out); see the sketch after this list for how that endpoint is exposed
    • the trace can be viewed by running a webserver on the file: go tool trace -http 0.0.0.0:30000 trace.out
    • there should be large slots of GC runs visible that block oCIS from running
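The trace endpoint used in the last step is the standard Go net/http/pprof handler; a minimal sketch of how such a debug listener is exposed in any Go service (the address mirrors PROXY_DEBUG_ADDR from the config below; the rest is illustrative, not oCIS code):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers, including /debug/pprof/trace
)

func main() {
	// With this listener running, the reproduction step above works verbatim:
	//   curl http://127.0.0.1:9205/debug/pprof/trace?seconds=60 > trace.out
	log.Fatal(http.ListenAndServe("0.0.0.0:9205", nil))
}
```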

Temporary workaround:

  • disable search by adding OCIS_EXCLUDE_RUN_SERVICES="search" to the environment variables

oCIS env configuration:

GOMEMLIMIT="512MiB"
OCIS_URL=https://<ocis_url>

PROXY_HTTP_ADDR=0.0.0.0:9200

PROXY_TLS=false
OCIS_LOG_LEVEL=debug
OCIS_CONFIG_DIR=/etc/ocis

OCIS_BASE_DATA_PATH=/var/lib/ocis

OCIS_TRACING_COLLECTOR=http://<jaeger-instance-address>/api/traces
OCIS_TRACING_ENABLED=true
OCIS_TRACING_ENDPOINT=<jaeger-instance-address>:6831
OCIS_TRACING_TYPE=jaeger

PROXY_DEBUG_PPROF="true"
PROXY_DEBUG_ZPAGES="true"
PROXY_DEBUG_ADDR=0.0.0.0:9205

PROXY_ENABLE_BASIC_AUTH="true"

OCIS_EXCLUDE_RUN_SERVICES="search"

@dragonchaser
Member

dragonchaser commented Jan 29, 2024

Conclusion:

The issue is caused by folders containing very many files. We have an environment variable that defines how many concurrent goroutines are run: STORAGE_USERS_OCIS_MAX_CONCURRENCY, which defaults to 100. On systems with limited memory this triggers the garbage collector so frequently that it becomes blocking. The only workaround for now is to set that variable to a low value (empirically, 5 worked in the case of 30k text files).
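To make the mechanism concrete, here is a sketch of the pattern such a concurrency setting typically controls (this is not the actual oCIS code): a semaphore caps the number of goroutines working on directory entries, so peak memory is roughly the cap times the per-entry working set.

```go
package main

import (
	"fmt"
	"sync"
)

// processFolder walks the entries of one folder with at most maxConcurrency
// goroutines in flight. With 30k entries and maxConcurrency=100, up to 100
// per-entry working sets are live at once; lowering the cap to 5 bounds the
// amount of memory the GC has to fight against.
func processFolder(entries []string, maxConcurrency int) {
	sem := make(chan struct{}, maxConcurrency)
	var wg sync.WaitGroup
	for _, e := range entries {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxConcurrency workers are busy
		go func(entry string) {
			defer wg.Done()
			defer func() { <-sem }()
			_ = entry // stat/propagate/index the entry here
		}(e)
	}
	wg.Wait()
}

func main() {
	entries := make([]string, 30000)
	processFolder(entries, 5) // the empirically working value mentioned above
	fmt.Println("done")
}
```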

Result:

[trace screenshot]

Expectation:

[trace screenshot]

The blue bars marked GC are the garbage collector runs.

Thanks @aduffeck @rhafer @fschade && @butonic for helping to figure this out.

@dragonchaser
Member

@mmattel can you extend the documentation to note that STORAGE_USERS_OCIS_MAX_CONCURRENCY may now need to be configured for large deployments with a remote fs?

@dragonchaser
Member

dragonchaser commented Jan 29, 2024

How is this ticket related to those other already existing tickets?

* [investigate goroutine leak in settings and graph services #6621](https://github.com/owncloud/ocis/issues/6621)

* [antivirus seems to keep files in memory #6803](https://github.com/owncloud/ocis/issues/6803)

* [OCIS becomes unresponsive when browsing images in frontend #6874](https://github.com/owncloud/ocis/issues/6874)

@wkloucek Those are unrelated; this specific case is not a memory leak, but merely "expected misbehavior" :)

@jvillafanez
Member

The issue is caused by folders containing very many files. We have an environment variable that defines how many concurrent goroutines are run: STORAGE_USERS_OCIS_MAX_CONCURRENCY, which defaults to 100.

100 as a default value seems too much. A default of 4 should be enough.
Note that I'm quite convinced the workers are reused, which means there won't be the additional overhead of having to create new workers.

For the "real" recommended number of workers, we probably have to measure how long the workers take to do their task. If a worker takes 1 second on average, we could keep spawning workers during that second, but going further than that would be overkill, because the task the 77th worker would do could just as well be done by the 1st worker, which has already finished its own task.

In addition, for real parallelism we'd need each worker to run on a different CPU. If we have 4 CPUs available, having 4 workers makes sense, assuming each worker lands on a different CPU (not sure about the guarantees though). This would mean the 4 workers do their task in parallel.
Creating more workers won't add a substantial performance gain, because the gain depends on the share of CPU time the workers get relative to the rest of the app. For example, if the app has 10 threads and one of them is a worker, the CPU time assigned to that worker is expected to be 1/10; if we add 10 more workers, the share becomes 11/20, so 55% of the CPU time is expected to go into performing the job instead of the initial 10%; of course assuming that none of the threads are sleeping.

I'd recommend (a sketch of how such a default could be derived follows this list):

  • runtime.NumCPU() * 2 if we want to push for performance without worrying about memory (pretty sure that's still lower than 100)
  • runtime.NumCPU() for performance
  • 4, or lower, if we worry about memory usage (assuming 4 is still lower than the number of CPUs available)
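As a sketch of that recommendation (defaultConcurrency is a hypothetical helper, not an existing oCIS function; only the environment variable name is taken from this thread):

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
)

// defaultConcurrency derives the worker count from the number of CPUs instead
// of a fixed 100, while still honoring an explicit override via the
// environment variable discussed above.
func defaultConcurrency() int {
	if v := os.Getenv("STORAGE_USERS_OCIS_MAX_CONCURRENCY"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	// runtime.NumCPU()*2 pushes for throughput, runtime.NumCPU() is the
	// middle ground, and a small constant like 4 favors low memory usage.
	return runtime.NumCPU()
}

func main() {
	fmt.Println("workers:", defaultConcurrency())
}
```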

@jvillafanez
Member

On systems with limited memory this triggers the garbage collector so frequently that it becomes blocking.

I have a different theory. The garbage collector isn't blocked, but it can't free memory because all of the memory is being used.

We have 100 workers picking tasks from a queue (or channel in this case). Each worker might need, say, 10MB (made-up number) of memory per task, or even more depending on the task (getting a list of 1000 files will use more memory than getting just 10 files, even if it's only holding the data in memory).
The problem is that each of those workers won't free that memory until it picks a new task (starting a new loop iteration), or the queue / channel closes and the goroutine finishes. Picking a new task frees the previous 10MB, but the new task might need 5MB or 14MB instead.

I guess this is why it's tricky to reproduce: memory usage depends on the data we're retrieving, and we have no control over which worker handles which task, so maybe only half of the workers are used sometimes, which reduces memory usage and might not hit the limit.
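A toy illustration of that theory (sizes scaled down; nothing here is oCIS code): each worker's buffer from the previous task stays reachable, and therefore uncollectable, while the worker waits for its next task.

```go
package main

import "fmt"

// worker keeps the allocation of its previous task alive until it receives
// the next task or the channel is closed. With 100 workers, up to 100 such
// buffers are live simultaneously, regardless of GOMEMLIMIT.
func worker(tasks <-chan int, done chan<- int) {
	var buf []byte // held across loop iterations
	for size := range tasks {
		buf = make([]byte, size) // the previous buf only becomes garbage here
		done <- len(buf)
	}
}

func main() {
	tasks := make(chan int)
	done := make(chan int)
	for i := 0; i < 100; i++ {
		go worker(tasks, done)
	}
	go func() {
		for i := 0; i < 1000; i++ {
			tasks <- 1 << 20 // ~1 MB per task, scaled down from the 10 MB example
		}
		close(tasks)
	}()
	for i := 0; i < 1000; i++ {
		<-done
	}
	fmt.Println("all tasks processed")
}
```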

@dragonchaser
Member

dragonchaser commented Jan 30, 2024

@jvillafanez In general I agree, but the main goal of the 100 was to counter network latency. If you have a remote fs like S3, most of the goroutines will be in a waiting state until there is a response from the remote. So technically it would make sense to "overcommit" the CPU... Also, if you run as a single binary, the calculation runtime.NumCPU() * 2 might be too optimistic.
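A small sketch of why overcommitting helps for latency-bound backends (fetchRemote is a stand-in for a call to remote storage such as S3, and the factor of 8 is purely illustrative):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// fetchRemote simulates a request to a remote storage backend: the goroutine
// spends its time waiting on the network, not using the CPU.
func fetchRemote(id int) {
	time.Sleep(100 * time.Millisecond)
	_ = id
}

func main() {
	// For latency-bound work, far more workers than CPUs keeps throughput up,
	// since waiting goroutines cost almost no CPU -- only the memory they hold
	// while in flight.
	workers := runtime.NumCPU() * 8
	sem := make(chan struct{}, workers)
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < 200; i++ {
		wg.Add(1)
		sem <- struct{}{}
		go func(id int) {
			defer wg.Done()
			defer func() { <-sem }()
			fetchRemote(id)
		}(i)
	}
	wg.Wait()
	fmt.Printf("200 simulated remote calls with %d workers took %s\n", workers, time.Since(start))
}
```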

@dragonchaser
Copy link
Member

@jvillafanez For this case I consider the issue resolved; any further discussion on best practices is happening in owncloud/docs-ocis#702.
