
OCIS becomes unresponsive when browsing images in frontend #6874

Closed
siment opened this issue Jul 24, 2023 · 12 comments
Labels
Category:Enhancement (Add new functionality)

Comments

siment commented Jul 24, 2023

Describe the bug

OCIS becomes unresponsive if the Thumbnails service is running while browsing images in the web frontend. I expect generating thumbnails to be resource intensive, but this seems far too inefficient considering the hardware the system is running on. For me, this is a prohibitive factor that blocks scaling the service to production and inviting external users.

Steps to reproduce

Setting up the debugging session

  1. First upload 70-ish ~5 MB images to a folder in ownCloud.

  2. Clear your thumbnails directory:

     sudo bash
     rm -Rf /var/lib/ocis/thumbnails/*
     exit

  3. Enable tracing in /etc/ocis/ocis.env:

     OCIS_TRACING_ENABLED=true

  4. Make sure Jaeger is running. For me that would be: sudo systemctl start jaeger
  5. Restart OCIS: sudo systemctl restart ocis
  6. Tail system logs: sudo tail -f /var/log/syslog
  7. Tail JournalD logs: sudo journalctl -u ocis -f
  8. Open Htop: sudo htop
  9. Open the Jaeger UI.
  10. Optional: Open an external application monitoring dashboard.
  11. Go to the ownCloud web frontend.
  12. Open Chrome Developer Tools and activate the Network tab. Make sure "Preserve log" is checked.
  13. Browse to the folder containing the images.
  14. Scroll down to trigger requests for thumbnails.

Expected behavior

  1. Thumbnails are generated and shown in the frontend.
  2. The server is not under heavy load.
  3. OCIS does not restart because of system overload.

Actual behavior

Symptoms

Having done the steps above, here are the symptoms I am able to observe while browsing the ownCloud frontend:

When opening a folder with images in the frontend

Htop and SystemD

Htop shows 100% load on both CPUs. This can last for tens of seconds or even a couple of minutes. Sometimes the system is under so much stress that OCIS is killed by the OOM killer and automatically restarted by systemd:

systemd[1]: ocis.service: A process of this unit has been killed by the OOM killer.  
systemd[1]: ocis.service: Main process exited, code=killed, status=9/KILL  
systemd[1]: ocis.service: Failed with result 'oom-kill'.

Browser

70-ish requests are sent to OCIS to generate thumbnails. Requests look like this:

[Redacted URI]IMAGE-NAME.JPG?scalingup=0&preview=1&a=1&c=a2b7d065fb898afab734abafc3c686f8&x=36&y=36

Some of the requests succeed, thereby showing the image in the file list, but most fail with the following error:

<d:error xmlns:d="DAV" xmlns:s="http://sabredav.org/ns">
  <s:exception/>
  <s:message>{"id":"go.micro.client","code":408,"detail":"context deadline exceeded","status":"Request Timeout"}</s:message>
</d:error>
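To isolate a single preview call from the browser's batch of ~70, a request along these lines can be replayed with curl. This is only a sketch: the host and path are placeholders (the real URI is redacted above), the bearer token has to come from your own OIDC setup, and the query parameters simply reuse the ones shown in the request above (the a/c cache parameters are omitted).

# Placeholders throughout — substitute your own host, user, path and a valid access token
curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' \
  -H "Authorization: Bearer $TOKEN" \
  "https://ocis.example.com/remote.php/dav/files/USERNAME/Photos/IMAGE-NAME.JPG?scalingup=0&preview=1&x=36&y=36"

Repeating this for a handful of images in parallel (e.g. via xargs -P) reproduces the CPU and memory spike without involving the web frontend.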

Jaeger

In Jaeger, all traces "explode" with durations of 10 seconds or more. These long durations are caused by the heavy system load, and it is not possible (for me, at least) to see exactly which processes create the load on the system.

When opening a single image file in the web frontend

Even when opening a single image file from the web frontend, the same number of thumbnail requests is sent to OCIS. I am guessing this is for preloading images. All of the symptoms described in the section above, "When opening a folder with images in the frontend", also apply when opening a single image file.

Setup

System description

  1. Cloud Ubuntu 22.04 server with 4 GB RAM and 2 vCPUs.
  2. OCIS 3.0.0 running from binary as a bare metal systemd service.
  3. User storage is in S3 bucket on the same network.
  4. Authentication via Authelia OpenID Connect

Alternative system configurations

  1. I did the same tests with Authelia disabled.
  2. I did the same tests with local user storage.
  3. I did the same tests with Authelia disabled and local storage.

All alternative system configurations yielded the same results.

OCIS_BASE_DATA_PATH=/var/lib/ocis  
  
# Base configuration  
OCIS_URL=[redacted]  
OCIS_ASYNC_UPLOADS=true  
OCIS_TRACING_ENABLED=true  
OCIS_TRACING_ENDPOINT=localhost:6831  
OCIS_TRACING_COLLECTOR=http://localhost:14268/api/traces  
OCIS_INSECURE=false  
OCIS_LOG_LEVEL=warn   
OCIS_CONFIG_DIR=/etc/ocis  
OCIS_LOG_LEVEL=warn  
OCIS_LOG_COLOR=true  
OCIS_LOG_PRETTY=true  
STORAGE_USERS_OCIS_ASYNC_UPLOADS=true  
PROXY_HTTP_ADDR=0.0.0.0:9200  
PROXY_TLS=false  
PROXY_OIDC_USERINFO_CACHE_TTL=1800000000000  
IDP_SIGNING_KID=[redacted]  
  
# Using S3 storage  
# See https://doc.owncloud.com/ocis/next/deployment/general/general-info.html#using-s3-for-blobs  
# activate s3ng storage driver  
STORAGE_USERS_DRIVER=s3ng  
# Path to metadata stored on POSIX  
# Not needed because I am setting OCIS_BASE_DATA_PATH  
# STORAGE_USERS_S3NG_ROOT: /var/lib/ocis/storage/users  
# keep system data on ocis storage  
STORAGE_SYSTEM_DRIVER=ocis  
# s3ng specific settings  
STORAGE_USERS_S3NG_ENDPOINT=[redacted]
STORAGE_USERS_S3NG_REGION=[redacted]
STORAGE_USERS_S3NG_ACCESS_KEY=[redacted]  
STORAGE_USERS_S3NG_SECRET_KEY=[redacted]  
STORAGE_USERS_S3NG_BUCKET=[redacted]  
  
# Using Authelia  
OCIS_OIDC_ISSUER=[redacted] 
WEB_OIDC_CLIENT_ID=ownCloud-web  
WEB_OPTION_LOGOUT_URL=[redacted]
WEB_OIDC_POST_LOGOUT_REDIRECT_URI=[redacted]
PROXY_OIDC_REWRITE_WELLKNOWN=true  
# Without this, I got the following errors in the ownCloud log:  
# failed to verify access token: token contains an invalid number of segments  
PROXY_OIDC_ACCESS_TOKEN_VERIFY_METHOD=none  

Additional context

I think there is massive room for improvement here. We were able to generate thumbnails successfully in the early 2000s, so there is no reason why we should not be able to do the same now.

I am guessing that better orchestration of thumbnail requests is an obvious approach, but I am not excluding the possibility that the thumbnail generation process itself is massively under-optimized and inefficient.

micbar (Contributor) commented Jul 24, 2023

@siment Thanks for reporting that.

We know that the thumbnail service needs memory; that is what you are seeing in the log as the oom-kill.

We need to take a look.

micbar added the Category:Enhancement (Add new functionality) label and removed the Type:Bug label on Jul 24, 2023
siment (Author) commented Jul 24, 2023

@micbar Thanks for acknowledging the report.

siment (Author) commented Jul 24, 2023

As an added suggestion for enhancement:

Orchestrate thumbnail requests in such a way that thumbnails are not generated on the file list page, while still sending a single request on the image detail page to preview one image.

This would be an acceptable interim user experience IMO, though I suspect that even generating a single thumbnail is excessively resource intensive and should be looked into regardless.

fpauser commented Aug 16, 2023

Why isn't this labeled as a bug anymore? The thumbnail overkill leads to a completely unresponsive ocis server that needs to be restarted - which qualifies as buggy imho.

To be fair: Every fixed bug is an enhancement ;)

wkloucek (Contributor) commented Sep 5, 2023

My proposal would be a (per-user and global) rate limit for generating thumbnails. Retrieving thumbnails from the cache might be limited by a different value.

Note: In single-process mode (`ocis server`) you can kill a whole oCIS instance by exhausting resources when calculating thumbnails of e.g. large images. This amounts to a DoS attack (though it requires an authenticated context).

For the oCIS Helm Chart this issue is not that pressing, since you can limit the thumbnails service's resource usage individually (https://github.com/owncloud/ocis-charts/blob/563af105e62fed6b1eac33336edf1521db2dc8e3/charts/ocis/values.yaml#L1470-L1471).

For production single-process deployments (ocis server), I would currently propose excluding the thumbnails service (e.g. OCIS_EXCLUDE_RUN_SERVICES=thumbnails), starting it as a separate process, and adding cgroup limits to it via e.g. systemd or Docker.
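To illustrate that workaround on a systemd-based install like the one in this report, a rough sketch follows. The unit name, file paths, binary location, and limit values are assumptions for illustration, not documented oCIS defaults, and the ExecStart line assumes the single ocis binary can start the thumbnails service on its own; check ocis help for the exact invocation.

# /etc/ocis/ocis.env — keep the thumbnails service out of the main ocis process
OCIS_EXCLUDE_RUN_SERVICES=thumbnails

# /etc/systemd/system/ocis-thumbnails.service — hypothetical separate unit
[Unit]
Description=oCIS thumbnails service, run separately so it can be resource-limited
After=network-online.target

[Service]
EnvironmentFile=/etc/ocis/ocis.env
# Assumed subcommand for starting only the thumbnails service
ExecStart=/usr/bin/ocis thumbnails server
# cgroup limits: the OOM killer then targets only this process, not the whole instance
MemoryMax=1G
CPUQuota=100%
Nice=10
Restart=on-failure

[Install]
WantedBy=multi-user.target

After systemctl daemon-reload and systemctl enable --now ocis-thumbnails, the main ocis unit keeps serving requests even if thumbnail generation is killed for exceeding its memory limit.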

butonic (Member) commented Oct 9, 2023

In the single binary we could move services to explicit OS threads and give them a higher niceness: https://github.com/vijayviji/executor (pretty dated).

However, I don't want to build a Docker replacement... Maybe we should rethink the single-binary use case and just go 100% Kubernetes?

wkloucek (Contributor) commented:

> However, I don't want to build a Docker replacement... Maybe we should rethink the single-binary use case and just go 100% Kubernetes?

I think in the long run we can't go without rate limiting / throttling thumbnails (and probably other API endpoints, too). Making only the thumbnails service slow, or letting it die, can only be considered a workaround in my opinion.
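As a rough sketch of what such throttling could look like (this is not oCIS code; the handler, names, and limits are made up for illustration), a weighted semaphore can cap how many thumbnail generations run concurrently and shed load with 503 instead of letting requests pile up until the OOM killer intervenes:

package main

import (
	"context"
	"net/http"
	"time"

	"golang.org/x/sync/semaphore"
)

// thumbnailLimiter caps concurrent thumbnail generation. maxConcurrent and
// maxWait are illustrative values, not oCIS configuration options.
type thumbnailLimiter struct {
	sem     *semaphore.Weighted
	maxWait time.Duration
}

func newThumbnailLimiter(maxConcurrent int64, maxWait time.Duration) *thumbnailLimiter {
	return &thumbnailLimiter{sem: semaphore.NewWeighted(maxConcurrent), maxWait: maxWait}
}

// Middleware rejects requests with 503 when all generation slots stay busy for
// longer than maxWait, so load shedding happens before memory is exhausted.
func (l *thumbnailLimiter) Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), l.maxWait)
		defer cancel()
		if err := l.sem.Acquire(ctx, 1); err != nil {
			w.Header().Set("Retry-After", "5")
			http.Error(w, "thumbnail generation busy, retry later", http.StatusServiceUnavailable)
			return
		}
		defer l.sem.Release(1)
		next.ServeHTTP(w, r)
	})
}

func main() {
	// Hypothetical handler standing in for the real thumbnail endpoint.
	generate := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(500 * time.Millisecond) // placeholder for CPU/memory-heavy resizing
		w.Write([]byte("thumbnail bytes"))
	})

	limiter := newThumbnailLimiter(2, 2*time.Second) // e.g. at most 2 resizes at once on a 2 vCPU box
	http.Handle("/thumbnails", limiter.Middleware(generate))
	http.ListenAndServe(":8080", nil)
}

A per-user rate limit, as proposed above, could be layered on top by keeping one token bucket per user ID (golang.org/x/time/rate provides such limiters).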

ethan-tqa commented:
I got hit by this issue today. It totally locked up the server at 100% CPU and RAM usage, and I had to force a cold reboot. The server specifications are the same as the OP's.
It is difficult to imagine that we are struggling to generate thumbnails in 2023.
I think it is okay to have slow thumbnail generation if that helps avoid the issue: files could be queued up and processed one at a time, especially when the server is running in single-process mode.

tomtana commented Nov 6, 2023

I also faced the issue of high load and failing thumbnail requests. I figured that scheduling thumbnail generation on upload, with controlled resources, might avoid the issue in my case.

I started a thread in the ownCloud forum to ask whether there is a way to enable thumbnail pre-generation:
https://central.owncloud.org/t/thumbnail-preview-generation-in-ocis/45721

Any thoughts on this?

wkloucek (Contributor) commented Nov 6, 2023

> I also faced the issue of high load and failing thumbnail requests. I figured that scheduling thumbnail generation on upload, with controlled resources, might avoid the issue in my case.

"controlled resource" is the key aspect from my point of view. I honestly prefer the simple lazy-thumbnails-cache generation mechanism of the thumbnails service for operations simplicity when previews have a low QoS. If a higher QoS is needed, a optional post-processing thumbnails generation step would be a good fit.

2403905 (Contributor) commented Nov 27, 2023

We could also compare the performance of the resize library that we use against lilliput: https://github.com/discord/lilliput

micbar (Contributor) commented May 29, 2024

fixed by #9199

micbar closed this as completed May 29, 2024