OCIS becomes unresponsive when browsing images in frontend #6874
@siment Thanks for reporting that. We know that the thumbnail service needs memory. This is what you see in the log file. We need to take a look.
@micbar Thanks for acknowledging the report.
As an added suggestion for enhancement: orchestrate thumbnail requests so that thumbnails are not generated on the file list page, while still sending only one request on the image detail page to preview a single image. This would be an acceptable interim user experience IMO. Though I suspect even generating a single thumbnail is excessively resource intensive and should be looked into regardless.
Why isn't this labeled as a bug anymore? The thumbnail overkill leads to a completely unresponsive oCIS server that needs to be restarted, which qualifies as buggy imho. To be fair: every fixed bug is an enhancement ;)
My proposal would be a (per-user and global) rate limit for generating thumbnails. Retrieving thumbnails from the cache might be limited by a different value. Note: in single process mode (`ocis server`) you can kill a whole oCIS instance by exhausting resources when calculating thumbnails of e.g. large images. This equals a DoS attack (but requires an authenticated context). For the oCIS Helm Chart this issue is not that pressing, since you can limit the thumbnails service's resource usage individually (https://github.com/owncloud/ocis-charts/blob/563af105e62fed6b1eac33336edf1521db2dc8e3/charts/ocis/values.yaml#L1470-L1471). For production single process deployments (
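A per-user concurrency cap of the kind proposed above can be sketched in Go (the language oCIS is written in) with a buffered-channel semaphore. This is a minimal illustration, not the oCIS API; all names (`thumbLimiter`, `acquire`, etc.) are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// thumbLimiter caps concurrent thumbnail generations per user.
// All names here are illustrative; this is not part of oCIS.
type thumbLimiter struct {
	mu    sync.Mutex
	slots map[string]chan struct{}
	limit int
}

func newThumbLimiter(perUser int) *thumbLimiter {
	return &thumbLimiter{slots: make(map[string]chan struct{}), limit: perUser}
}

var errBusy = errors.New("too many concurrent thumbnail requests")

// acquire returns a release func, or errBusy if the user is at the cap.
// A handler that gets errBusy could answer 429 Too Many Requests.
func (l *thumbLimiter) acquire(user string) (func(), error) {
	l.mu.Lock()
	sem, ok := l.slots[user]
	if !ok {
		sem = make(chan struct{}, l.limit)
		l.slots[user] = sem
	}
	l.mu.Unlock()
	select {
	case sem <- struct{}{}:
		return func() { <-sem }, nil
	default:
		return nil, errBusy
	}
}

func main() {
	l := newThumbLimiter(2)
	r1, _ := l.acquire("alice")
	r2, _ := l.acquire("alice")
	_, err := l.acquire("alice") // third concurrent request is rejected
	fmt.Println(err)
	r1()
	r2()
}
```

A global limit would just be a second, shared semaphore checked before the per-user one.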
In the single binary we could move services to explicit OS threads and give them a higher niceness: https://github.com/vijayviji/executor (pretty dated). However, I don't want to build a Docker replacement... maybe we should rethink the single binary use case and just go 100% Kubernetes?
I think in the long run we can't go without rate limiting / throttling thumbnails (and probably other API endpoints, too). Making only the thumbnails service slow, or letting it die, can only be considered a workaround in my opinion.
I got hit by this issue today. It totally locked up the server at 100% CPU and RAM usage, and I had to force a cold reboot. The server specifications are the same as the OP's.
I also faced the issue of high load and failing thumbnail requests. I figured that scheduling thumbnail generation on upload, with controlled resources, might avoid the issue in my case. I started a thread in the oC forum to ask whether there is a way to enable thumbnail pre-generation. Any thoughts on this?
"controlled resource" is the key aspect from my point of view. I honestly prefer the simple lazy-thumbnails-cache generation mechanism of the thumbnails service for operations simplicity when previews have a low QoS. If a higher QoS is needed, a optional post-processing thumbnails generation step would be a good fit. |
Also, we could compare the performance of the resize library that we use against lilliput: https://github.com/discord/lilliput
fixed by #9199 |
Describe the bug
OCIS becomes unresponsive if the Thumbnails service is running while browsing images in the web frontend. I expect generating thumbnails to be resource intensive, but this seems way too inefficient considering the hardware the system is running on. For me, this is a prohibitive factor before scaling the service to production and inviting external users.
Steps to reproduce
Setting up the debugging session
First, upload 70-ish ~5 MB images to a folder in ownCloud.
Clear your thumbnails directory:
```
rm -Rf /var/lib/ocis/thumbnails/*
exit
```
In `/etc/ocis/ocis.env`:
Watch the logs and system load:
```
sudo tail -f /var/log/syslog
sudo journalctl -u ocis -f
sudo htop
```
Expected behavior
Actual behavior
Symptoms
Having done the steps above, here are the symptoms I am able to observe while browsing the ownCloud frontend:
When opening folder with images in frontend
Htop and SystemD
Htop shows 100% load on both CPUs. This can last tens of seconds or even a couple of minutes. Sometimes the system is under so much stress that systemd automatically restarts OCIS:
Browser
70-ish requests are sent to OCIS to generate thumbnails. Requests look like this:
Some of the requests succeed, thereby showing the image in the file list, but most fail with the following error:
Jaeger
In Jaeger, all traces "explode" with durations of 10 seconds or more. These long durations are caused by the heavy system load, and it is not possible, for me at least, to see exactly which processes create the load on the system.
When opening single image file in web frontend
Even when opening a single image file from the web frontend, the same number of thumbnail requests is sent to OCIS; I am guessing this is for preloading images. All the symptoms described in the section above, "When opening folder with images in frontend", also apply when opening a single image file.
Setup
System description
Alternative system configurations
All alternative system configurations yielded the same results.
Additional context
I think there is massive room for improvement here. We were able to generate thumbnails successfully in the early 2000s, so there is no reason why we should not be able to achieve the same now.
I am guessing that better orchestration of thumbnail requests is an obvious approach, but I am not excluding that the thumbnail generation process itself is massively under-optimized and inefficient.