Skip to content
This repository has been archived by the owner on Nov 6, 2024. It is now read-only.

Images being repeatedly sideloaded #18

Closed
tamw-wnet opened this issue Jun 14, 2022 · 28 comments
Closed

Images being repeatedly sideloaded #18

tamw-wnet opened this issue Jun 14, 2022 · 28 comments

Comments

@tamw-wnet
Copy link
Member

I've noticed since late December that my site is sideloading several extra copies of the 'primary' image for stories. I've gone from sideloading about 3-5GB of images from NPR stories per month to around 15GB. @jwcounts
I need to do some further review, but I think it might have something to do with the changes around this block of code in NPRAPIWordpress.php starting at line 393:

                  foreach( $attached_images as $att_image ) {
                    // see if the filename is very similar
                    // $att_guid = explode( '.', $att_image->guid );
                    $attach_url = wp_get_attachment_url( $att_image->ID );
                    $attach_url_parse = parse_url( $attach_url );
                    $attach_url_parts = pathinfo( $attach_url_parse['path'] );

                    $imagep_attach_url = wp_get_attachment_url( $image_post->ID );
                    $imagep_url_parse = parse_url( $imagep_attach_url );
                    $imagep_url_parts = pathinfo( $imagep_url_parse['path'] );
                    // so if the already attached image name is part of the name of the file
                    // coming in, ignore the new/temp file, it's probably the same
                    if ( strtolower( $attach_url_parts['filename'] ) == strtolower( $imagep_url_parts['filename'] ) ) {
                      @unlink( $file_array['tmp_name'] );
                      wp_delete_attachment( $image_upload_id );
                      $file_OK = FALSE;
                    }
                  }

I'm guessing what's happening is that the previous filename matching logic is breaking somehow, so every time there's a revision to a story the new image gets created.

This could well be some issue on the API end as well -- I was on vacation the week we start seeing this behavior so a code update making its way through seems.... unlikely.

@jwcounts
Copy link

Let me know what you find out, and I can start doing some inspection on my end.

@tamw-wnet
Copy link
Member Author

here's an example -- from NPR Story ID 1087462333 on March 19 I have 3 identical versions of image 0b6a9874_custom-a4f9308ca707a02ec5dcd6c9d1a3a032525f9a8d.jpg

and I have 9 versions of this March 10 nursing-related image
20220124-nprnursing-0106-edit_custom-46609f71513d9b0bd45791c9230ee8ad9624136e.jpg

12 versions of a March 3 Wuhan related image
2022_03_03_wuhanmarket-1193097829-32a0b94cbc67e11c0b6f01e030b1a2cbb53de86f.jpg

I'd be interested if you guys see the same multiple downloads for these (or other) images.

@jwcounts
Copy link

I haven't been able to verify any on my end. Do you have any recorded duplicates after March 30, 2022?

I committed a change to the repo on March 30th, which changed line 391 in NPRAPIWordpress.php. Previously, that line was using the internal Wordpress ID for the imported article, instead of the ID of the imported image. wp_get_attachment_url() will return false if the ID referenced is not an attachment, so no duplicates would ever be found.

@tamw-wnet
Copy link
Member Author

Here's two images from June 13
ap22164713613679_custom-be6cbaed21844e94defba9d0a2508a68a05207c4-3-scaled.jpg from "Here's every word of the second Jan. 6 committee hearing on its investigation" story id 1104690690

This image has 9 versions -- ap22145785567460_wide-dfbbced2e2a33aa5a0899ca7d6cd6f257e13ed37-9-scaled.jpg its from June 13 "Republican primaries show that Trump voters don't always follow his endorsements" story id 1103956855

@tamw-wnet
Copy link
Member Author

I'm using 1.9.3.1, I updated to it on June 8

@tamw-wnet
Copy link
Member Author

I've got a feeling the issue is that this is failing:

if ( strtolower( $attach_url_parts['filename'] ) == strtolower( $imagep_url_parts['filename'] ) ) {

I'm going to error_log those two values for a couple of days and see what I see.

@jwcounts
Copy link

jwcounts commented Jun 15, 2022

Okay, previous theory busted. I'll run some tests with the 2 above you referenced.

Also, comparing the filenames you referenced with the output from the API, (example: ap22164713613679_custom-be6cbaed21844e94defba9d0a2508a68a05207c4-3-scaled.jpg) I get where "-3" is coming from (since it's the third copy), but where is the "-scaled" coming from? With the function being used, Wordpress should be returning the URL to the original full-sized file, and not a scaled or resized variant.

@tamw-wnet
Copy link
Member Author

Don't worry about the '-scaled' I was using '-8-scaled' as a grep criteria against ls

@tamw-wnet
Copy link
Member Author

I'm starting to think the issue is earlier -- you're automatically sideloading the image and creating an attachment always, but then you compare the filenames of the new attachment with the old attachments; if they match you delete the new attachment. However, because the new image has already been downloaed and the filename set to incremented '-version.jpg' the match always fails.
The comparison needs to be done against the original filename before download.

I'll work on a patch.

@jwcounts
Copy link

Yeah, good point. I think that logic was created before our time, though I touched it last when I swapped around what IDs were being checked. Doing the filename check earlier sounds like a good idea.

@tamw-wnet
Copy link
Member Author

Agreed, it looks like it was 7 years ago.

@jwcounts
Copy link

I think I have a solution cooked up if need be. Otherwise, I will defer to yours.

@tamw-wnet
Copy link
Member Author

tamw-wnet commented Jun 15, 2022 via email

@jwcounts
Copy link

Okay, sent the commit to the repo. Basically made it loop through attached images right after determining $image_url from the API. If any of the attached images match the filename in $image_url, that API image is skipped.

If no attached images match, then the download/sideload/etc. actually happens. Looks like it will work, but need to run some tests.

@jwcounts
Copy link

Nope, too aggressive. My testing site isn't importing images at all now.

@jwcounts
Copy link

Fixed it. Forgot that 'filename' from pathinfo() doesn't include the file extension, whereas 'basename' does

@tamw-wnet
Copy link
Member Author

I think a related issue is introduced by the '-scaled' thing -- a built-in WordPress feature since 5.3.

Here's some debug output I got from a statement right after line 403 -- $file_array is the parsed $image_url, $attach_url_parts is from the post's existing attached images, and $imagep_url_parts is from the newly sideloaded image.

$file_array['name'] = ap22140559338813-71e6ea258db1d20a0c21b6459be57551dc1ce167.jpg
$attach_url_parts['filename'] = ap22140559338813-71e6ea258db1d20a0c21b6459be57551dc1ce167-scaled
$imagep_url_parts['filename'] = ap22140559338813-71e6ea258db1d20a0c21b6459be57551dc1ce167-1-scaled

@tamw-wnet
Copy link
Member Author

We need an additional arg to the two calls to wp_get_attachment_url() -- in both cases we need to specify the size, 'full' aka the original downloaded size
eg
$imagep_attach_url = wp_get_attachment_url( $image_post->ID , 'full');

@jwcounts
Copy link

jwcounts commented Jun 15, 2022

Unfortunately, that function only accepts the ID. However, they introduced a new function in 5.3: wp_get_original_image_url(). It also only accepts the ID, but otherwise, it functions just like wp_get_attachment_url(). I did a preliminary test on my end and it seems to work, so I'll be curious what kind of outcomes you see on your end. I updated NPRAPIWordpress.php and committed it to the repo.

@tamw-wnet
Copy link
Member Author

I'd actually thought we were using wp_get_attachment_image_src() but doh. I'll check it out.

@tamw-wnet
Copy link
Member Author

Well, I'm still seeing the issue. I added a little bit of debugging to see what's happening -- line 378 right before the decision to 'continue' I added
error_log($attach_url_parts['filename'] . " " . $imagep_url_parts['filename']);

Here's the output from the most recent run:

[16-Jun-2022 14:10:16 UTC] bmf_3885-920766ab62abd8b5da6f9356ab8af60807f72918 bmf_3885-920766ab62abd8b5da6f9356ab8af60807f72918
[16-Jun-2022 14:10:16 UTC] bmf_3885-920766ab62abd8b5da6f9356ab8af60807f72918-1 bmf_3885-920766ab62abd8b5da6f9356ab8af60807f72918
[16-Jun-2022 14:10:16 UTC] bmf_3885-920766ab62abd8b5da6f9356ab8af60807f72918-2 bmf_3885-920766ab62abd8b5da6f9356ab8af60807f72918

However, "bmf_3885-920766ab62abd8b5da6f9356ab8af60807f72918-3" ended up being added to the Media Library. I'll see if I can figure out what is going on and why it's still getting sideloaded.

@tamw-wnet
Copy link
Member Author

actually I already see the issue, the 'continue' doesn't actually stop the download from happening in line 385.

@jwcounts
Copy link

You're right. The continue pops you out of the one loop, but not the greater one. What about this? Starting at 369:

$attach_match = false;
if ( !empty( $attached_images ) ) {
	foreach( $attached_images as $att_image ) {
		$attach_url = wp_get_original_image_url( $att_image->ID );
		$attach_url_parse = parse_url( $attach_url );
		$attach_url_parts = pathinfo( $attach_url_parse['path'] );

		if ( strtolower( $attach_url_parts['filename'] ) === strtolower( $imagep_url_parts['filename'] ) ) {
			$attach_match = true;
		}
	}
}
if ( $attach_match ) {
	continue;
}

@tamw-wnet
Copy link
Member Author

easier would be to just change the orig continue to 'continue 2' I think

@tamw-wnet
Copy link
Member Author

I'm testing the 'continue 2' approach on my server now, with debugging, I'll let you know how it goes

@jwcounts
Copy link

Oh wow, I didn't know you could do that. Makes sense though.

@tamw-wnet
Copy link
Member Author

Looking good! Just got 5 story updates, none of them got downloaded, but a new story got its image.

@jwcounts
Copy link

Nice! I've got a couple of small bug fixes queued for a release, so this will round it out pretty well.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants