
OCIS becomes unresponsive when browsing images in frontend #6874

Closed
siment opened this issue Jul 24, 2023 · 12 comments
Labels
Category:Enhancement (Add new functionality)

Comments

siment commented Jul 24, 2023

Describe the bug

OCIS becomes unresponsive if the Thumbnails service is running while browsing images in the web frontend. I expect generating thumbnails to be resource intensive, but this seems far too inefficient considering the hardware the system is running on. For me, this is a prohibitive factor that blocks scaling the service to production and inviting external users.

Steps to reproduce

Setting up the debugging session

  1. First upload 70-ish ~5 MB images to a folder in ownCloud.

  2. Clear your thumbnails directory:

     sudo bash
     rm -Rf /var/lib/ocis/thumbnails/*
     exit

  3. Enable tracing in /etc/ocis/ocis.env:

     OCIS_TRACING_ENABLED=true

  4. Make sure Jaeger is running. For me that would be: sudo systemctl start jaeger
  5. Restart OCIS: sudo systemctl restart ocis
  6. Tail system logs: sudo tail -f /var/log/syslog
  7. Tail JournalD logs: sudo journalctl -u ocis -f
  8. Open Htop: sudo htop
  9. Open the Jaeger UI.
  10. Optional: Open an external application monitoring dashboard.
  11. Go to the ownCloud web frontend.
  12. Open Chrome Developer Tools and activate the Network tab. Make sure "Preserve log" is checked.
  13. Browse to the folder containing the images.
  14. Scroll down to trigger requests for thumbnails.

Expected behavior

  1. Thumbnails are generated and shown in the frontend.
  2. The server is not under heavy load.
  3. OCIS does not restart because of system overload.

Actual behavior

Symptoms

Having done the steps above, here are the symptoms I am able to observe while browsing the ownCloud frontend:

When opening a folder with images in the frontend

Htop and SystemD

Htop shows 100% load on both CPUs. This can last for tens of seconds or even a couple of minutes. Sometimes the system is under so much stress that OCIS is killed by the OOM killer and automatically restarted by systemd:

systemd[1]: ocis.service: A process of this unit has been killed by the OOM killer.  
systemd[1]: ocis.service: Main process exited, code=killed, status=9/KILL  
systemd[1]: ocis.service: Failed with result 'oom-kill'.

Browser

70-ish requests are sent to OCIS to generate thumbnails. Requests look like this:

[Redacted URI]IMAGE-NAME.JPG?scalingup=0&preview=1&a=1&c=a2b7d065fb898afab734abafc3c686f8&x=36&y=36

Some of the requests succeed, thereby showing the image in the file list, but most fail with the following error:

<d:error xmlns:d="DAV" xmlns:s="http://sabredav.org/ns">
  <s:exception/>
  <s:message>{"id":"go.micro.client","code":408,"detail":"context deadline exceeded","status":"Request Timeout"}</s:message>
</d:error>
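To isolate a single preview call from the browser's batch of ~70, a request along these lines can be replayed with curl. This is only a sketch: the host and path are placeholders (the real URI is redacted above), the bearer token has to come from your own OIDC setup, and the query parameters simply reuse the ones shown in the request above (the a/c cache parameters are omitted).

# Placeholders throughout — substitute your own host, user, path and a valid access token
curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' \
  -H "Authorization: Bearer $TOKEN" \
  "https://ocis.example.com/remote.php/dav/files/USERNAME/Photos/IMAGE-NAME.JPG?scalingup=0&preview=1&x=36&y=36"

Repeating this for a handful of images in parallel (e.g. via xargs -P) reproduces the CPU and memory spike without involving the web frontend.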

Jaeger

In Jaeger, all traces "explode" with durations of 10 seconds or more. These long durations are caused by the heavy system load, and it is not possible (for me, at least) to see exactly which processes create the load on the system.

When opening a single image file in the web frontend

Even when opening a single image file from the web frontend, the same number of thumbnail requests is sent to OCIS. I am guessing this is for preloading images. All of the symptoms described in the section above, "When opening a folder with images in the frontend", also apply when opening a single image file.

Setup

System description

  1. Cloud Ubuntu 22.04 server with 4 GB RAM and 2 vCPUs.
  2. OCIS 3.0.0 running from binary as a bare metal systemd service.
  3. User storage is in S3 bucket on the same network.
  4. Authentication via Authelia OpenID Connect

Alternative system configurations

  1. I did the same tests with Authelia disabled.
  2. I did the same tests with local user storage.
  3. I did the same tests with Authelia disabled and local storage.

All alternative system configurations yielded the same results.

OCIS_BASE_DATA_PATH=/var/lib/ocis  
  
# Base configuration  
OCIS_URL=[redacted]  
OCIS_ASYNC_UPLOADS=true  
OCIS_TRACING_ENABLED=true  
OCIS_TRACING_ENDPOINT=localhost:6831  
OCIS_TRACING_COLLECTOR=http://localhost:14268/api/traces  
OCIS_INSECURE=false  
OCIS_LOG_LEVEL=warn   
OCIS_CONFIG_DIR=/etc/ocis  
OCIS_LOG_LEVEL=warn  
OCIS_LOG_COLOR=true  
OCIS_LOG_PRETTY=true  
STORAGE_USERS_OCIS_ASYNC_UPLOADS=true  
PROXY_HTTP_ADDR=0.0.0.0:9200  
PROXY_TLS=false  
PROXY_OIDC_USERINFO_CACHE_TTL=1800000000000  
IDP_SIGNING_KID=[redacted]  
  
# Using S3 storage  
# See https://doc.owncloud.com/ocis/next/deployment/general/general-info.html#using-s3-for-blobs  
# activate s3ng storage driver  
STORAGE_USERS_DRIVER=s3ng  
# Path to metadata stored on POSIX  
# Not needed because I am setting OCIS_BASE_DATA_PATH  
# STORAGE_USERS_S3NG_ROOT: /var/lib/ocis/storage/users  
# keep system data on ocis storage  
STORAGE_SYSTEM_DRIVER=ocis  
# s3ng specific settings  
STORAGE_USERS_S3NG_ENDPOINT=[redacted]
STORAGE_USERS_S3NG_REGION=[redacted]
STORAGE_USERS_S3NG_ACCESS_KEY=[redacted]  
STORAGE_USERS_S3NG_SECRET_KEY=[redacted]  
STORAGE_USERS_S3NG_BUCKET=[redacted]  
  
# Using Authelia  
OCIS_OIDC_ISSUER=[redacted] 
WEB_OIDC_CLIENT_ID=ownCloud-web  
WEB_OPTION_LOGOUT_URL=[redacted]
WEB_OIDC_POST_LOGOUT_REDIRECT_URI=[redacted]
PROXY_OIDC_REWRITE_WELLKNOWN=true  
# Without this, I got the following errors in the ownCloud log:  
# failed to verify access token: token contains an invalid number of segments  
PROXY_OIDC_ACCESS_TOKEN_VERIFY_METHOD=none  

Additional context

I think there is massive room for improvement here. We were able to generate thumbnails successfully in the early 2000s, so there is no reason why we should not be able to do the same now.

I am guessing that better orchestration of thumbnail requests is an obvious approach, but I am not excluding the possibility that the thumbnail generation process itself is massively under-optimized and inefficient.

micbar (Contributor) commented Jul 24, 2023

@siment Thanks for reporting that.

We know that the thumbnail service needs memory; that is what you are seeing in the log as the oom-kill.

We need to take a look.

micbar added the Category:Enhancement (Add new functionality) label and removed the Type:Bug label on Jul 24, 2023
siment (Author) commented Jul 24, 2023

@micbar Thanks for acknowledging the report.

siment (Author) commented Jul 24, 2023

As an added suggestion for enhancement:

Orchestrate thumbnail requests in such a way that thumbnails are not generated on the file list page, while still sending a single request on the image detail page to preview one image.

This would be an acceptable interim user experience IMO, though I suspect that even generating a single thumbnail is excessively resource intensive and should be looked into regardless.

fpauser commented Aug 16, 2023

Why isn't this labeled as a bug anymore? The thumbnail overkill leads to a completely unresponsive ocis server that needs to be restarted - which qualifies as buggy imho.

To be fair: Every fixed bug is an enhancement ;)

wkloucek (Contributor) commented Sep 5, 2023

My proposal would be a (per-user and global) rate limit for generating thumbnails. Retrieving thumbnails from the cache might be limited by a different value.

Note: In single-process mode (`ocis server`) you can kill a whole oCIS instance by exhausting resources when calculating thumbnails of e.g. large images. This amounts to a DoS attack (though it requires an authenticated context).

For the oCIS Helm Chart this issue is not that pressing, since you can limit the thumbnails service's resource usage individually (https://github.com/owncloud/ocis-charts/blob/563af105e62fed6b1eac33336edf1521db2dc8e3/charts/ocis/values.yaml#L1470-L1471).

For production single-process deployments (ocis server), I would currently propose excluding the thumbnails service (e.g. OCIS_EXCLUDE_RUN_SERVICES=thumbnails), starting it as a separate process, and adding cgroup limits to it via e.g. systemd or Docker.
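To illustrate that workaround on a systemd-based install like the one in this report, a rough sketch follows. The unit name, file paths, binary location, and limit values are assumptions for illustration, not documented oCIS defaults, and the ExecStart line assumes the single ocis binary can start the thumbnails service on its own; check ocis help for the exact invocation.

# /etc/ocis/ocis.env — keep the thumbnails service out of the main ocis process
OCIS_EXCLUDE_RUN_SERVICES=thumbnails

# /etc/systemd/system/ocis-thumbnails.service — hypothetical separate unit
[Unit]
Description=oCIS thumbnails service, run separately so it can be resource-limited
After=network-online.target

[Service]
EnvironmentFile=/etc/ocis/ocis.env
# Assumed subcommand for starting only the thumbnails service
ExecStart=/usr/bin/ocis thumbnails server
# cgroup limits: the OOM killer then targets only this process, not the whole instance
MemoryMax=1G
CPUQuota=100%
Nice=10
Restart=on-failure

[Install]
WantedBy=multi-user.target

After systemctl daemon-reload and systemctl enable --now ocis-thumbnails, the main ocis unit keeps serving requests even if thumbnail generation is killed for exceeding its memory limit.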

butonic (Member) commented Oct 9, 2023

In the single binary we could move services to explicit OS threads and give them a higher niceness: https://github.com/vijayviji/executor (pretty dated).

However, I don't want to build a Docker replacement... Maybe we should rethink the single-binary use case and just go 100% Kubernetes?

wkloucek (Contributor) commented:

> However, I don't want to build a Docker replacement... Maybe we should rethink the single-binary use case and just go 100% Kubernetes?

I think in the long run we can't go without rate limiting / throttling thumbnails (and probably other API endpoints, too). Making only the thumbnails service slow, or letting it die, can only be considered a workaround in my opinion.
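As a rough sketch of what such throttling could look like (this is not oCIS code; the handler, names, and limits are made up for illustration), a weighted semaphore can cap how many thumbnail generations run concurrently and shed load with 503 instead of letting requests pile up until the OOM killer intervenes:

package main

import (
	"context"
	"net/http"
	"time"

	"golang.org/x/sync/semaphore"
)

// thumbnailLimiter caps concurrent thumbnail generation. maxConcurrent and
// maxWait are illustrative values, not oCIS configuration options.
type thumbnailLimiter struct {
	sem     *semaphore.Weighted
	maxWait time.Duration
}

func newThumbnailLimiter(maxConcurrent int64, maxWait time.Duration) *thumbnailLimiter {
	return &thumbnailLimiter{sem: semaphore.NewWeighted(maxConcurrent), maxWait: maxWait}
}

// Middleware rejects requests with 503 when all generation slots stay busy for
// longer than maxWait, so load shedding happens before memory is exhausted.
func (l *thumbnailLimiter) Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), l.maxWait)
		defer cancel()
		if err := l.sem.Acquire(ctx, 1); err != nil {
			w.Header().Set("Retry-After", "5")
			http.Error(w, "thumbnail generation busy, retry later", http.StatusServiceUnavailable)
			return
		}
		defer l.sem.Release(1)
		next.ServeHTTP(w, r)
	})
}

func main() {
	// Hypothetical handler standing in for the real thumbnail endpoint.
	generate := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(500 * time.Millisecond) // placeholder for CPU/memory-heavy resizing
		w.Write([]byte("thumbnail bytes"))
	})

	limiter := newThumbnailLimiter(2, 2*time.Second) // e.g. at most 2 resizes at once on a 2 vCPU box
	http.Handle("/thumbnails", limiter.Middleware(generate))
	http.ListenAndServe(":8080", nil)
}

A per-user rate limit, as proposed above, could be layered on top by keeping one token bucket per user ID (golang.org/x/time/rate provides such limiters).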

ethan-tqa commented:
I got hit by this issue today. It totally locked up the server at 100% CPU and RAM usage, and I had to force a cold reboot. The server specifications are the same as the OP's.
It is difficult to imagine that we are struggling to generate thumbnails in 2023.
I think it is okay to have slow thumbnail generation if that helps avoid the issue: files could be queued up and processed one at a time, especially when the server is running in single-process mode.

tomtana commented Nov 6, 2023

I also faced the issue of high load and failing thumbnail requests. I figured that scheduling thumbnail generation on upload, with controlled resources, might avoid the issue in my case.

I started a thread in the ownCloud forum to ask whether there is a way to enable thumbnail pre-generation:
https://central.owncloud.org/t/thumbnail-preview-generation-in-ocis/45721

Any thoughts on this?

wkloucek (Contributor) commented Nov 6, 2023

> I also faced the issue of high load and failing thumbnail requests. I figured that scheduling thumbnail generation on upload, with controlled resources, might avoid the issue in my case.

"controlled resource" is the key aspect from my point of view. I honestly prefer the simple lazy-thumbnails-cache generation mechanism of the thumbnails service for operations simplicity when previews have a low QoS. If a higher QoS is needed, a optional post-processing thumbnails generation step would be a good fit.

2403905 (Contributor) commented Nov 27, 2023

We could also compare the performance of the resize library that we use against lilliput: https://github.com/discord/lilliput

micbar (Contributor) commented May 29, 2024

fixed by #9199

micbar closed this as completed May 29, 2024