[Bug]: extract_opengraph_data tries to parse HTML on large binary URLs (images/gifs/videos) and causes CPU spikes delaying all responses #4956

Closed
phiresky opened this issue Aug 3, 2024 · 1 comment · Fixed by #4957
Labels
bug Something isn't working

Comments

@phiresky
Collaborator

phiresky commented Aug 3, 2024

Requirements

  • Is this a bug report? For questions or discussions use https://lemmy.ml/c/lemmy_support
  • Did you check to see if this issue already exists?
  • Is this only a single bug? Do not put multiple bugs in one issue.
  • Do you agree to follow the rules in our Code of Conduct?
  • Is this a backend issue? Use the lemmy-ui repo for UI / frontend issues.

Summary

Lemmy tries to extract OpenGraph metadata from URLs referenced in posts. If the post URL is a direct link to a large binary file, it still downloads the whole file, strips all non-UTF-8 characters, and runs an HTML parser on it:

pub async fn fetch_link_metadata(url: &Url, context: &LemmyContext) -> LemmyResult<LinkMetadata> {
  info!("Fetching site metadata for url: {}", url);
  let response = context.client().get(url.as_str()).send().await?;
  let content_type: Option<Mime> = response
    .headers()
    .get(CONTENT_TYPE)
    .and_then(|h| h.to_str().ok())
    .and_then(|h| h.parse().ok());
  // Can't use .text() here, because it only checks the content header, not the actual bytes
  // https://github.com/LemmyNet/lemmy/issues/1964
  let html_bytes = response.bytes().await.map_err(LemmyError::from)?.to_vec();
  let opengraph_data = extract_opengraph_data(&html_bytes, url)
    .map_err(|e| info!("{e}"))

fn extract_opengraph_data(html_bytes: &[u8], url: &Url) -> LemmyResult<OpenGraphData> {
  let html = String::from_utf8_lossy(html_bytes);
  let mut page = HTML::from_string(html.to_string(), None)?;

This is a very expensive call for large binary files.
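For illustration, a minimal standalone sketch of that expensive path (assuming the webpage crate that provides the HTML::from_string call quoted above; the 20 MB all-ASCII buffer is a made-up stand-in, not the real GIF bytes):

// Hypothetical repro: push ~20 MB of non-HTML bytes through the same
// lossy-UTF-8 conversion + HTML parsing path as fetch_link_metadata.
use webpage::HTML;

fn main() {
  let fake_binary = vec![0x47u8; 20 * 1024 * 1024]; // 20 MB stand-in for a GIF body
  let html = String::from_utf8_lossy(&fake_binary);
  // The full 20 MB string is handed to the HTML parser, which is where
  // the CPU time goes for binary URLs.
  let _page = HTML::from_string(html.to_string(), None);
}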

Steps to Reproduce

  1. Start a Lemmy instance.
  2. Call curl -v 'localhost:8536/api/v3/post/site_metadata?url=https://i.redd.it/tdnjprab04gd1.gif' (warning: that's a 20 MB NSFW GIF).
  3. Observe 100% CPU for about 20 seconds.

Technical Details

Happens locally, but Tiff (reddthat.com) observes this regularly in production.

Version

0.19.5

Lemmy Instance URL

reddthat.com

phiresky added the bug label on Aug 3, 2024
@phiresky
Collaborator Author

phiresky commented Aug 3, 2024

My solution to this would be:

  1. Only ever fetch the first 16 kB of a URL, not the whole thing. I think this is common practice for metadata extraction, but I'm not 100% sure.
  2. Check whether the returned data is binary. I would simply check whether it contains at least one null byte (the same method ripgrep uses to detect binary data). If it is binary, don't run the extraction. A rough sketch of both ideas follows below.

The relevant code was restructured in #4035, but I'm not sure whether this bug existed before that or not.
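A minimal sketch of those two ideas (not the actual fix in #4957), assuming reqwest with the stream feature plus futures-util; MAX_METADATA_FETCH_BYTES, fetch_prefix and looks_like_binary are names made up for the example:

use futures_util::StreamExt;

const MAX_METADATA_FETCH_BYTES: usize = 16 * 1024; // hypothetical 16 kB cap

// Stream the response body and keep only the first 16 kB instead of
// buffering the whole (possibly multi-megabyte) file in memory.
async fn fetch_prefix(url: &str) -> Result<Vec<u8>, reqwest::Error> {
  let response = reqwest::get(url).await?;
  let mut stream = response.bytes_stream();
  let mut buf = Vec::with_capacity(MAX_METADATA_FETCH_BYTES);
  while let Some(chunk) = stream.next().await {
    buf.extend_from_slice(&chunk?);
    if buf.len() >= MAX_METADATA_FETCH_BYTES {
      buf.truncate(MAX_METADATA_FETCH_BYTES);
      break;
    }
  }
  Ok(buf)
}

// Null-byte heuristic for binary data (the same approach ripgrep uses).
fn looks_like_binary(bytes: &[u8]) -> bool {
  bytes.contains(&0u8)
}

extract_opengraph_data would then only run when looks_like_binary returns false, so a direct link to a GIF or video never reaches the HTML parser.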
