Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hash-based caching #11

Merged
merged 17 commits into from
Sep 8, 2024
Merged

Hash-based caching #11

merged 17 commits into from
Sep 8, 2024

Conversation

Alkarex
Copy link
Member

@Alkarex Alkarex commented Jun 23, 2024

@Alkarex Alkarex marked this pull request as ready for review August 20, 2024 14:50
@Alkarex Alkarex marked this pull request as draft August 20, 2024 21:40
src/Cache/BaseDataCache.php Outdated Show resolved Hide resolved
@Alkarex Alkarex marked this pull request as ready for review August 21, 2024 20:34
Comment on lines +58 to +62
// FreshRSS commented out, to allow HTTP 304
// // ignore data if internal cache expiration time is expired
// if ($data['__cache_expiration_time'] < time()) {
// return $default;
// }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Art4 This looks suspicious, probably broken in the current version of SimplePie. This seems to disable the possibility of conditional requests (HTTP 304).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be referring to simplepie#846

Copy link
Member Author

@Alkarex Alkarex Aug 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Art4 I think the logic is wrong. The cache is destroyed before the conditional request having a chance to kick-in. So you will (almost) never receive HTTP 304 responses.
For instance, I use a 15-minute cache. Within the next 15 minutes, the cache should be used without any request. After the 15 minutes, a conditional request should be emitted (if information was available), to potentially allow for an HTTP 304 Not Modified response.
This code very much seems to break this, by wrongly destroying the cache after the 15 minutes.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct, because this mimics the way PSR-16 works, see https://www.php-fig.org/psr/psr-16/#12-definitions

Expiration - The actual time when an item is set to go stale. This is calculated by adding the TTL to the time when an object is stored.

An item with a 300 second TTL stored at 1:30:00 will have an expiration of 1:35:00.

Implementing Libraries MAY expire an item before its requested Expiration Time, but MUST treat an item as expired once its Expiration Time is reached.

If set_data($data, $ttl) is called with $ttl=15 min, than the cache has to return null (or $default), but not a potential expired value.

Copy link

@Art4 Art4 Aug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A workaround would be to set the $ttl to 30 min, but send a request after 15 min and check for HTTP 304 response. If 304 is returned, than the cache should be written for another 30 min.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not in a position to do that at the moment

No worries :-)

'cache_expiration_time' => $this->cache_duration - 300 + time(),`

I am not sure where the 300 is coming from, but it should not be hardcoded.

One can even rename it to check_rss_feed_after

Yes, it looks to me like a bit of refactoring would be needed there. I still think the current behaviour is buggy.
Ideally, it should evolve to be HTTP-compliant, which is not the case at the moment, by properly supporting Cache-Control: max-age and other related headers.

or even more custom values

Yes, that would be needed.

I can create a PR with a minimal PSR-16 implementation based on the File cache adapter as a starting point

That would be much welcome 👍🏻 It is still unclear to me how much can be put there without changing SimplePie core's code.

When this clearing is executed is part of the cache implementation and could be triggered by a cronjob

Yes, this is what we have at the moment in FreshRSS (prior to those SimplePie changes), but without needing those extra steps:

https://github.com/FreshRSS/FreshRSS/blob/0f2023811b38fdcbc7d8ade95114ca9403636af5/lib/lib_rss.php#L418-L434

Copy link

@Art4 Art4 Aug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure where the 300 is coming from, but it should not be hardcoded.

The 300 was only an example.

I can create a PR with a minimal PSR-16 implementation based on the File cache adapter as a starting point

That would be much welcome 👍🏻 It is still unclear to me how much can be put there without changing SimplePie core's code.

Ok, I will create a PR in FreshRSS repo based on the allow-simplepie-updates-with-composer branch.

You can see the first results here: https://github.com/Art4/FreshRSS/compare/allow-simplepie-updates-with-composer...Art4:FreshRSS:create-psr-16-cache?expand=1

When this clearing is executed is part of the cache implementation and could be triggered by a cronjob

Yes, this is what we have at the moment in FreshRSS (prior to those SimplePie changes), but without needing those extra steps:

https://github.com/FreshRSS/FreshRSS/blob/0f2023811b38fdcbc7d8ade95114ca9403636af5/lib/lib_rss.php#L418-L434

It should be possible to move this logic into the Cache implementation.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Alkarex See Art4/FreshRSS#1 for a PSR-16 implementation based on the file cache driver from SimplePie.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your efforts 👍🏻

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Art4 Could you please change the target of your PR to https://github.com/FreshRSS/simplepie/tree/freshrss
I am not sure I will be able to use it as it does not seem to allow offloading much of my needed patches, while adding 500 lines of boilerplate, but I will try to give it another look later.
In particular, I still fail to see so far how to achieve your claim of Makes most changes in the cache class of the SimplePie fork not necessary

src/SimplePie.php Outdated Show resolved Hide resolved
src/SimplePie.php Outdated Show resolved Hide resolved
@Art4
Copy link

Art4 commented Aug 22, 2024

@Alkarex As you requested in FreshRSS/FreshRSS#4374 (comment) I've looked into this PR.

I think I'm understanding the purpose of introducing the hashing in the clean_hash() method, but I'm afraid that this will not provide the expected benefits since SimplePie 1.8.0. Let me explain:

Cache in SimplePie 1.7.0

Until SimplePie 1.7.0 the cache was using the mtime() and touch() methods to minimize the write processes in the File and Mysql caches. The Memcache and Redis Cache never has this optimization and always wrotes the whole data again into the cache.

The happy path for this optimization works like this: (I hope I get this correct from here.)

  1. SimplePie reads the cache for an RSS feed by calling load() and retrieves the Last-modified header.
  2. SimplePie reads the last-modified timestamp of the value using mtime() on the cache driver.
  3. SimplePie sends a GET request with if-modified-since header to the RSS feed server. If the server responses with 403 the RSS feed is not modified since the last request.
  4. SimplePie tells the cache driver to update the last-modified timestamp of the value by calling touch().
cache driver step 1. reading cache step 2. calling mtime() step 4. calling touch() sum of operations
File get content of file calls mtime() on file calls touch() on file reads: 1 (2)x; writes: 0 (1)x
Memcache reads full cache reads full cache second time reads full cache third time and writes full cache again reads: 3x; writes: 1x
Memcached reads full cache reads full cache second time reads full cache third time and writes full cache again reads: 3x; writes: 1x
MySQL reads full column in db reads only mtime column in db Updates mtime column in db reads: 1 (2)x; writes: 0x
Redis reads full cache reads full cache second time reads full cache third time and writes full cache again reads: 3x; writes: 1x

Cache in SimplePie 1.8.0

In SimplePie 1.8.0 support for PSR-16 cache was introduced and internally the cache only uses get_data(), set_data() and delete_data(). The methods mtime() and touch() are never called anymore.

Now the happy path described above works like this: (see here.)

  1. SimplePie reads the cache for an RSS feed by calling get_data() and retrieves the Last-modified header.
  2. SimplePie reads the cache_expiration_time timestamp of the returned value.
  3. SimplePie sends a GET request with if-modified-since header to the RSS feed server. If the server responses with 403 the RSS feed is not modified since the last request.
  4. SimplePie tells the cache driver to save the data (together with a new cache_expiration_time timestamp) by calling set_data().
cache driver step 1. read cache with get_data() step 2. get cache_expiration_time step 4. update cache with set_data() sum of operations
File get content of file (not called) write content to file reads: 1x; writes: 1x
Memcache reads full cache (not called) updates data in cache reads: 1x; writes: 1x
Memcached reads full cache (not called) updates data in cache reads: 1x; writes: 1x
MySQL reads full column in db (not called) updates full column in db reads: 1x; writes: 1x
Redis reads full cache (not called) updates data in cache reads: 1x; writes: 1x

Conclusion

As you can see the sum of operations changes for all cache drivers. The file drive (that is used by FreshRSS) now has 1x read and 1x write operations. Thats why I recommend to create or use a PSR-16 implementation for FreshRSS. There it should possible to optimize this newly write operation (e.g. if set_data() is called compare the new data with the old data that was loaded on the first get_data() call. ). Also your new hashing functions can be placed in this new driver (if even necessary), keeping your fork cleaner.

@Alkarex
Copy link
Member Author

Alkarex commented Aug 22, 2024

Thanks for the feedback @Art4 . I will try to dig into (I still fail to see how to move some of my patches to a PSR-16 implementation). A more pressing matter is what looks very much like a bug in the current SimplePie, preventing HTTP 304 from working at all #11 (comment)

For now, this PR seems to work for me, except for an array to string conversion in the HTTP headers parsing, which also seems to be a new SimplePie bug, but I have not investigated in details yet.

PHP Warning:  Array to string conversion in ./simplepie/simplepie/src/File.php on line 116

Debug:

{"url":"https://gizmodo.com/feed","key":"if-modified-since","value":["Thu, 22 Aug 2024 06:57:33 GMT"]}
{"url":"https://gizmodo.com/feed","key":"if-none-match","value":["W/\"067d554a3cf80ce6cbe5f2d31da11861\""]}

@Art4
Copy link

Art4 commented Aug 23, 2024

For now, this PR seems to work for me, except for an array to string conversion in the HTTP headers parsing, which also seems to be a new SimplePie bug, but I have not investigated in details yet.

PHP Warning:  Array to string conversion in ./simplepie/simplepie/src/File.php on line 116
{"url":"https://gizmodo.com/feed","key":"if-modified-since","value":["Thu, 22 Aug 2024 06:57:33 GMT"]}
{"url":"https://gizmodo.com/feed","key":"if-none-match","value":["W/\"067d554a3cf80ce6cbe5f2d31da11861\""]}

I think the syntax in the headers is wrong. Arrays are not allowed, see https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-Modified-Since#syntax

Syntax

If-Modified-Since: <day-name>, <day> <month> <year> <hour>:<minute>:<second> GMT

@Alkarex
Copy link
Member Author

Alkarex commented Aug 23, 2024

Indeed, arrays are not allowed and it should be a string. But I face a case when SimplePie ends up trying to convert an array (which should never be the case) to a string. I have not investigated yet, though, but will follow up on it

@Alkarex
Copy link
Member Author

Alkarex commented Sep 8, 2024

The problem with headers as array seems to come from those changes simplepie#815 which are now mixing headers as string and as array...

Code like that seems also to indicate that there is such a confusion of types:

https://github.com/simplepie/simplepie/blob/9385c54382edec0a768fdbe310c3e580c1b5dc81/src/SimplePie.php#L2845-L2848

            if (is_array($link_headers)) {
                $link_headers = implode(',', $link_headers);
            }

It looks quite messy to me that $this->data['headers'] has a risk of being an array of string or an array of arrays, and there are multiple sections of the code for which it is not obvious which variant it should be.

Anyway, 04e66e0 seems to fix what I need to make it work.

@Alkarex Alkarex merged commit a169376 into freshrss Sep 8, 2024
18 checks passed
@Alkarex Alkarex deleted the cache-hash branch September 8, 2024 17:29
@Alkarex
Copy link
Member Author

Alkarex commented Sep 9, 2024

Minor follow-up #17

@Alkarex Alkarex mentioned this pull request Sep 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants