
Memory Crisis with oCIS 4.0.5 #8257

Closed
dragotin opened this issue Jan 22, 2024 · 13 comments

@dragotin
Contributor

dragotin commented Jan 22, 2024

Users on central report that oCIS consumes a lot of memory when run on small hardware like a Raspberry Pi. It may be that the problem is only visible on small devices but actually occurs in every installation and could harm those as well.

One user reports that the problem can be mitigated by setting GOMEMLIMIT=999000000 when starting oCIS.

This ticket is about understanding what is going on and documenting the reason and mitigation at least in the dev docs. Does it make sense to set a GOMEMLIMIT for every installation?
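For reference (this is general Go runtime behavior, not anything oCIS-specific): the GOMEMLIMIT environment variable sets a soft limit that the runtime also exposes programmatically via runtime/debug.SetMemoryLimit. A minimal sketch using the 999000000 value reported above:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to starting the process with GOMEMLIMIT=999000000: the
	// runtime treats ~999 MB as a soft cap and runs the GC more aggressively
	// as the heap approaches it. It is not a hard limit, so the process can
	// still exceed it under sustained allocation pressure.
	previous := debug.SetMemoryLimit(999_000_000)
	fmt.Printf("previous memory limit: %d bytes\n", previous)
}
```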

Other bug reports like #6621 or #6874 may or may not be related.

@dragotin dragotin added Type:Bug Priority:p2-high Escalation, on top of current planning, release blocker labels Jan 22, 2024
@rhafer
Contributor

rhafer commented Jan 22, 2024

Just some additional information: on systems that already configure memory limits via cgroups (e.g. docker/podman running with --memory, or systemd units with MemoryLimit configured), we already set a GOMEMLIMIT via the automemlimit module (https://github.com/KimMachineGun/automemlimit).

When none of these limits is set, it's rather difficult to come up with a useful default.
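For illustration, a rough stdlib-only sketch of the idea behind automemlimit (the real module handles cgroup v1 and v2, hierarchies, and more; the file path and the 0.9 ratio here are assumptions for a simple cgroup-v2 case, not the module's actual code):

```go
package main

import (
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

// setMemLimitFromCgroup reads the cgroup-v2 memory limit and derives a
// GOMEMLIMIT from it, which is roughly what the automemlimit module does.
// "max" means no limit is configured -- the case where no useful default exists.
func setMemLimitFromCgroup(ratio float64) {
	raw, err := os.ReadFile("/sys/fs/cgroup/memory.max")
	if err != nil {
		return // no cgroup limit visible, leave GOMEMLIMIT untouched
	}
	val := strings.TrimSpace(string(raw))
	if val == "max" {
		return // unlimited cgroup: nothing sensible to derive
	}
	limit, err := strconv.ParseInt(val, 10, 64)
	if err != nil {
		return
	}
	debug.SetMemoryLimit(int64(float64(limit) * ratio))
}

func main() {
	setMemLimitFromCgroup(0.9) // keep some headroom below the cgroup limit
}
```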

@rhafer
Contributor

rhafer commented Jan 22, 2024

What's interesting is that according to https://central.owncloud.org/t/memory-usage-of-ocis/45601/20 the MemoryLimit config in systemd does not have the desired effect. This is something we should investigate.

@dragonchaser dragonchaser self-assigned this Jan 23, 2024
@dragonchaser
Member

I have tried to reproduce that on 4.0.5 with no luck so far. I uploaded 2k files through the web UI and 90k images through the client, and rcloned these across two instances. I see no significant ramp-up in memory usage.

@wkloucek
Contributor

@dragonchaser
Member

We don't know yet... I was able to reproduce some memory spiking, but I'm still unsure what is happening there.

@micbar micbar added this to the Release 5.0.0 milestone Jan 26, 2024
@dragonchaser
Member

dragonchaser commented Jan 29, 2024

What I know so far:

  • Under certain conditions the Go garbage collector becomes blocking
  • when the GC becomes blocking, memory usage ramps up to four times the value set by GOMEMLIMIT

How to reproduce:

  • create a VM for running oCIS, set it up with specs comparable to an RPi 4 (8 GiB RAM)
  • install oCIS according to: https://doc.owncloud.com/ocis/next/depl-examples/bare-metal.html
  • set GOMEMLIMIT to a ridiculously small value (e.g. 512MiB)
  • upload thousands of images (I used 90k, JPEG 1024*768, uploading 10k per folder)
  • get a trace in parallel (curl http://127.0.0.1:9205/debug/pprof/trace?seconds=60 > trace.out); see the sketch after this list for how that endpoint is exposed
    • the trace can be viewed by running a webserver on the file: go tool trace -http 0.0.0.0:30000 trace.out
    • there should be large slots of GC runs visible that block oCIS from running
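The trace endpoint used in the last step is the standard Go net/http/pprof handler; a minimal sketch of how such a debug listener is exposed in any Go service (the address mirrors PROXY_DEBUG_ADDR from the config below; the rest is illustrative, not oCIS code):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers, including /debug/pprof/trace
)

func main() {
	// With this listener running, the reproduction step above works verbatim:
	//   curl http://127.0.0.1:9205/debug/pprof/trace?seconds=60 > trace.out
	log.Fatal(http.ListenAndServe("0.0.0.0:9205", nil))
}
```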

Temporary workaround:

  • disable search by adding OCIS_EXCLUDE_RUN_SERVICES="search" to the environment variables

oCIS env configuration:

GOMEMLIMIT="512MiB"
OCIS_URL=https://<ocis_url>

PROXY_HTTP_ADDR=0.0.0.0:9200

PROXY_TLS=false
OCIS_LOG_LEVEL=debug
OCIS_CONFIG_DIR=/etc/ocis

OCIS_BASE_DATA_PATH=/var/lib/ocis

OCIS_TRACING_COLLECTOR=http://<jaeger-instance-address>/api/traces
OCIS_TRACING_ENABLED=true
OCIS_TRACING_ENDPOINT=<jaeger-instance-address>:6831
OCIS_TRACING_TYPE=jaeger

PROXY_DEBUG_PPROF="true"
PROXY_DEBUG_ZPAGES="true"
PROXY_DEBUG_ADDR=0.0.0.0:9205

PROXY_ENABLE_BASIC_AUTH="true"

OCIS_EXCLUDE_RUN_SERVICES="search"

@dragonchaser
Member

dragonchaser commented Jan 29, 2024

Conclusion:

The issue is caused by folders containing very many files. We have an environment variable that defines how many concurrent goroutines are run: STORAGE_USERS_OCIS_MAX_CONCURRENCY, which defaults to 100. On systems with limited memory this triggers the garbage collector so frequently that it becomes blocking. The only workaround for now is to set that variable to a low value (empirically, 5 worked in the case of 30k text files).
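To make the mechanism concrete, here is a sketch of the pattern such a concurrency setting typically controls (this is not the actual oCIS code): a semaphore caps the number of goroutines working on directory entries, so peak memory is roughly the cap times the per-entry working set.

```go
package main

import (
	"fmt"
	"sync"
)

// processFolder walks the entries of one folder with at most maxConcurrency
// goroutines in flight. With 30k entries and maxConcurrency=100, up to 100
// per-entry working sets are live at once; lowering the cap to 5 bounds the
// amount of memory the GC has to fight against.
func processFolder(entries []string, maxConcurrency int) {
	sem := make(chan struct{}, maxConcurrency)
	var wg sync.WaitGroup
	for _, e := range entries {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxConcurrency workers are busy
		go func(entry string) {
			defer wg.Done()
			defer func() { <-sem }()
			_ = entry // stat/propagate/index the entry here
		}(e)
	}
	wg.Wait()
}

func main() {
	entries := make([]string, 30000)
	processFolder(entries, 5) // the empirically working value mentioned above
	fmt.Println("done")
}
```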

Result:

[trace screenshot]

Expectation:

[trace screenshot]

The blue bars marked GC are the garbage collector runs.

Thanks @aduffeck @rhafer @fschade && @butonic for helping to figure this out.

@dragonchaser
Member

@mmattel can you extend the documentation to note that STORAGE_USERS_OCIS_MAX_CONCURRENCY may now need to be configured for large deployments with a remote fs?

@dragonchaser
Member

dragonchaser commented Jan 29, 2024

How is this ticket related to those other already existing tickets?

* [investigate goroutine leak in settings and graph services #6621](https://github.com/owncloud/ocis/issues/6621)

* [antivirus seems to keep files in memory #6803](https://github.com/owncloud/ocis/issues/6803)

* [OCIS becomes unresponsive when browsing images in frontend #6874](https://github.com/owncloud/ocis/issues/6874)

@wkloucek Those are unrelated; this specific case is not a memory leak, but merely "expected misbehavior" :)

@jvillafanez
Member

The issue is caused by folders containing very many files. We have an environment variable that defines how many concurrent goroutines are run: STORAGE_USERS_OCIS_MAX_CONCURRENCY, which defaults to 100.

100 as a default value seems too much. A default of 4 should be enough.
Note that I'm quite convinced the workers are reused, which means there won't be the additional overhead of having to create new workers.

For the "real" recommended number of workers, we probably have to measure how long the workers take to do their task. If a worker takes 1 second on average, we could keep spawning workers during that second, but going further than that would be overkill, because the task the 77th worker would do could just as well be done by the 1st worker, which has already finished its own task.

In addition, for real parallelism we'd need each worker to run on a different CPU. If we have 4 CPUs available, having 4 workers makes sense, assuming each worker lands on a different CPU (not sure about the guarantees though). This would mean the 4 workers do their task in parallel.
Creating more workers won't add a substantial performance gain, because the gain depends on the share of CPU time the workers get relative to the rest of the app. For example, if the app has 10 threads and one of them is a worker, the CPU time assigned to that worker is expected to be 1/10; if we add 10 more workers, the share becomes 11/20, so 55% of the CPU time is expected to go into performing the job instead of the initial 10%; of course assuming that none of the threads are sleeping.

I'd recommend (a sketch of how such a default could be derived follows this list):

  • runtime.NumCPU() * 2 if we want to push for performance without worrying about memory (pretty sure that's still lower than 100)
  • runtime.NumCPU() for performance
  • 4, or lower, if we worry about memory usage (assuming 4 is still lower than the number of CPUs available)
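As a sketch of that recommendation (defaultConcurrency is a hypothetical helper, not an existing oCIS function; only the environment variable name is taken from this thread):

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
)

// defaultConcurrency derives the worker count from the number of CPUs instead
// of a fixed 100, while still honoring an explicit override via the
// environment variable discussed above.
func defaultConcurrency() int {
	if v := os.Getenv("STORAGE_USERS_OCIS_MAX_CONCURRENCY"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	// runtime.NumCPU()*2 pushes for throughput, runtime.NumCPU() is the
	// middle ground, and a small constant like 4 favors low memory usage.
	return runtime.NumCPU()
}

func main() {
	fmt.Println("workers:", defaultConcurrency())
}
```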

@jvillafanez
Member

On systems with limited memory this triggers the garbage collector so frequently that it becomes blocking.

I have a different theory. The garbage collector isn't blocked, but it can't free memory because all of the memory is being used.

We have 100 workers picking tasks from a queue (or channel in this case). Each worker might need, say, 10MB (made-up number) of memory per task, or even more depending on the task (getting a list of 1000 files will use more memory than getting just 10 files, even if it's only holding the data in memory).
The problem is that each of those workers won't free that memory until it picks a new task (starting a new loop iteration), or the queue / channel closes and the goroutine finishes. Picking a new task frees the previous 10MB, but the new task might need 5MB or 14MB instead.

I guess this is why it's tricky to reproduce: memory usage depends on the data we're retrieving, and we have no control over which worker handles which task, so maybe only half of the workers are used sometimes, which reduces memory usage and might not hit the limit.
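A toy illustration of that theory (sizes scaled down; nothing here is oCIS code): each worker's buffer from the previous task stays reachable, and therefore uncollectable, while the worker waits for its next task.

```go
package main

import "fmt"

// worker keeps the allocation of its previous task alive until it receives
// the next task or the channel is closed. With 100 workers, up to 100 such
// buffers are live simultaneously, regardless of GOMEMLIMIT.
func worker(tasks <-chan int, done chan<- int) {
	var buf []byte // held across loop iterations
	for size := range tasks {
		buf = make([]byte, size) // the previous buf only becomes garbage here
		done <- len(buf)
	}
}

func main() {
	tasks := make(chan int)
	done := make(chan int)
	for i := 0; i < 100; i++ {
		go worker(tasks, done)
	}
	go func() {
		for i := 0; i < 1000; i++ {
			tasks <- 1 << 20 // ~1 MB per task, scaled down from the 10 MB example
		}
		close(tasks)
	}()
	for i := 0; i < 1000; i++ {
		<-done
	}
	fmt.Println("all tasks processed")
}
```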

@dragonchaser
Member

dragonchaser commented Jan 30, 2024

@jvillafanez In general I agree, but the main goal of the 100 was to counter network latency. If you have a remote fs like S3, most of the goroutines will be in a waiting state until there is a response from the remote. So technically it would make sense to "overcommit" the CPU... Also, if you run as a single binary, the calculation runtime.NumCPU() * 2 might be too optimistic.
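A small sketch of why overcommitting helps for latency-bound backends (fetchRemote is a stand-in for a call to remote storage such as S3, and the factor of 8 is purely illustrative):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// fetchRemote simulates a request to a remote storage backend: the goroutine
// spends its time waiting on the network, not using the CPU.
func fetchRemote(id int) {
	time.Sleep(100 * time.Millisecond)
	_ = id
}

func main() {
	// For latency-bound work, far more workers than CPUs keeps throughput up,
	// since waiting goroutines cost almost no CPU -- only the memory they hold
	// while in flight.
	workers := runtime.NumCPU() * 8
	sem := make(chan struct{}, workers)
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < 200; i++ {
		wg.Add(1)
		sem <- struct{}{}
		go func(id int) {
			defer wg.Done()
			defer func() { <-sem }()
			fetchRemote(id)
		}(i)
	}
	wg.Wait()
	fmt.Printf("200 simulated remote calls with %d workers took %s\n", workers, time.Since(start))
}
```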

@dragonchaser
Copy link
Member

@jvillafanez For this case I consider the issue resolved; any further discussion on best practices is happening in owncloud/docs-ocis#702.
