[Bug]: extract_opengraph_data tries to parse HTML on large binary URLs (images/gifs/videos) and causes CPU spikes delaying all responses #4956

Closed
phiresky opened this issue Aug 3, 2024 · 1 comment · Fixed by #4957
Labels
bug Something isn't working

Comments

@phiresky
Collaborator

phiresky commented Aug 3, 2024

Requirements

  • Is this a bug report? For questions or discussions use https://lemmy.ml/c/lemmy_support
  • Did you check to see if this issue already exists?
  • Is this only a single bug? Do not put multiple bugs in one issue.
  • Do you agree to follow the rules in our Code of Conduct?
  • Is this a backend issue? Use the lemmy-ui repo for UI / frontend issues.

Summary

Lemmy tries to extract OpenGraph metadata from URLs referenced in posts. If the post URL is a direct link to a large binary file, it still downloads the whole file, strips all non-UTF-8 characters, and runs an HTML parser on it:

pub async fn fetch_link_metadata(url: &Url, context: &LemmyContext) -> LemmyResult<LinkMetadata> {
  info!("Fetching site metadata for url: {}", url);
  let response = context.client().get(url.as_str()).send().await?;
  let content_type: Option<Mime> = response
    .headers()
    .get(CONTENT_TYPE)
    .and_then(|h| h.to_str().ok())
    .and_then(|h| h.parse().ok());
  // Can't use .text() here, because it only checks the content header, not the actual bytes
  // https://github.com/LemmyNet/lemmy/issues/1964
  let html_bytes = response.bytes().await.map_err(LemmyError::from)?.to_vec();
  let opengraph_data = extract_opengraph_data(&html_bytes, url)
    .map_err(|e| info!("{e}"))

fn extract_opengraph_data(html_bytes: &[u8], url: &Url) -> LemmyResult<OpenGraphData> {
  let html = String::from_utf8_lossy(html_bytes);
  let mut page = HTML::from_string(html.to_string(), None)?;

This is a very expensive call for large binary files.
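For illustration, a minimal standalone sketch of that expensive path (assuming the webpage crate that provides the HTML::from_string call quoted above; the 20 MB all-ASCII buffer is a made-up stand-in, not the real GIF bytes):

// Hypothetical repro: push ~20 MB of non-HTML bytes through the same
// lossy-UTF-8 conversion + HTML parsing path as fetch_link_metadata.
use webpage::HTML;

fn main() {
  let fake_binary = vec![0x47u8; 20 * 1024 * 1024]; // 20 MB stand-in for a GIF body
  let html = String::from_utf8_lossy(&fake_binary);
  // The full 20 MB string is handed to the HTML parser, which is where
  // the CPU time goes for binary URLs.
  let _page = HTML::from_string(html.to_string(), None);
}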

Steps to Reproduce

  1. Start a Lemmy instance.
  2. Call curl -v 'localhost:8536/api/v3/post/site_metadata?url=https://i.redd.it/tdnjprab04gd1.gif' (warning: that's a 20 MB NSFW GIF).
  3. Observe 100% CPU for about 20 seconds.

Technical Details

Happens locally, but Tiff (reddthat.com) observes this regularly in production.

Version

0.19.5

Lemmy Instance URL

reddthat.com

phiresky added the bug label on Aug 3, 2024
@phiresky
Collaborator Author

phiresky commented Aug 3, 2024

My solution to this would be:

  1. Only ever fetch the first 16 kB of a URL, not the whole thing. I think this is common practice for metadata extraction, but I'm not 100% sure.
  2. Check whether the returned data is binary. I would simply check whether it contains at least one null byte (the same method ripgrep uses to detect binary data). If it is binary, don't run the extraction. A rough sketch of both ideas follows below.

The relevant code was restructured in #4035, but I'm not sure whether this bug existed before that or not.
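A minimal sketch of those two ideas (not the actual fix in #4957), assuming reqwest with the stream feature plus futures-util; MAX_METADATA_FETCH_BYTES, fetch_prefix and looks_like_binary are names made up for the example:

use futures_util::StreamExt;

const MAX_METADATA_FETCH_BYTES: usize = 16 * 1024; // hypothetical 16 kB cap

// Stream the response body and keep only the first 16 kB instead of
// buffering the whole (possibly multi-megabyte) file in memory.
async fn fetch_prefix(url: &str) -> Result<Vec<u8>, reqwest::Error> {
  let response = reqwest::get(url).await?;
  let mut stream = response.bytes_stream();
  let mut buf = Vec::with_capacity(MAX_METADATA_FETCH_BYTES);
  while let Some(chunk) = stream.next().await {
    buf.extend_from_slice(&chunk?);
    if buf.len() >= MAX_METADATA_FETCH_BYTES {
      buf.truncate(MAX_METADATA_FETCH_BYTES);
      break;
    }
  }
  Ok(buf)
}

// Null-byte heuristic for binary data (the same approach ripgrep uses).
fn looks_like_binary(bytes: &[u8]) -> bool {
  bytes.contains(&0u8)
}

extract_opengraph_data would then only run when looks_like_binary returns false, so a direct link to a GIF or video never reaches the HTML parser.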
