Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hash-based caching #11

Merged
merged 17 commits into from
Sep 8, 2024
9 changes: 5 additions & 4 deletions src/Cache/BaseDataCache.php
Original file line number Diff line number Diff line change
Expand Up @@ -55,10 +55,11 @@ public function get_data(string $key, $default = null)
return $default;
}

// ignore data if internal cache expiration time is expired
if ($data['__cache_expiration_time'] < time()) {
return $default;
}
// FreshRSS commented out, to allow HTTP 304
// // ignore data if internal cache expiration time is expired
// if ($data['__cache_expiration_time'] < time()) {
// return $default;
// }
Comment on lines +58 to +62
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Art4 This looks suspicious, probably broken in the current version of SimplePie. This seems to disable the possibility of conditional requests (HTTP 304).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be referring to simplepie#846

Copy link
Member Author

@Alkarex Alkarex Aug 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Art4 I think the logic is wrong. The cache is destroyed before the conditional request having a chance to kick-in. So you will (almost) never receive HTTP 304 responses.
For instance, I use a 15-minute cache. Within the next 15 minutes, the cache should be used without any request. After the 15 minutes, a conditional request should be emitted (if information was available), to potentially allow for an HTTP 304 Not Modified response.
This code very much seems to break this, by wrongly destroying the cache after the 15 minutes.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct, because this mimics the way PSR-16 works, see https://www.php-fig.org/psr/psr-16/#12-definitions

Expiration - The actual time when an item is set to go stale. This is calculated by adding the TTL to the time when an object is stored.

An item with a 300 second TTL stored at 1:30:00 will have an expiration of 1:35:00.

Implementing Libraries MAY expire an item before its requested Expiration Time, but MUST treat an item as expired once its Expiration Time is reached.

If set_data($data, $ttl) is called with $ttl=15 min, than the cache has to return null (or $default), but not a potential expired value.

Copy link

@Art4 Art4 Aug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A workaround would be to set the $ttl to 30 min, but send a request after 15 min and check for HTTP 304 response. If 304 is returned, than the cache should be written for another 30 min.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not in a position to do that at the moment

No worries :-)

'cache_expiration_time' => $this->cache_duration - 300 + time(),`

I am not sure where the 300 is coming from, but it should not be hardcoded.

One can even rename it to check_rss_feed_after

Yes, it looks to me like a bit of refactoring would be needed there. I still think the current behaviour is buggy.
Ideally, it should evolve to be HTTP-compliant, which is not the case at the moment, by properly supporting Cache-Control: max-age and other related headers.

or even more custom values

Yes, that would be needed.

I can create a PR with a minimal PSR-16 implementation based on the File cache adapter as a starting point

That would be much welcome 👍🏻 It is still unclear to me how much can be put there without changing SimplePie core's code.

When this clearing is executed is part of the cache implementation and could be triggered by a cronjob

Yes, this is what we have at the moment in FreshRSS (prior to those SimplePie changes), but without needing those extra steps:

https://github.com/FreshRSS/FreshRSS/blob/0f2023811b38fdcbc7d8ade95114ca9403636af5/lib/lib_rss.php#L418-L434

Copy link

@Art4 Art4 Aug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure where the 300 is coming from, but it should not be hardcoded.

The 300 was only an example.

I can create a PR with a minimal PSR-16 implementation based on the File cache adapter as a starting point

That would be much welcome 👍🏻 It is still unclear to me how much can be put there without changing SimplePie core's code.

Ok, I will create a PR in FreshRSS repo based on the allow-simplepie-updates-with-composer branch.

You can see the first results here: https://github.com/Art4/FreshRSS/compare/allow-simplepie-updates-with-composer...Art4:FreshRSS:create-psr-16-cache?expand=1

When this clearing is executed is part of the cache implementation and could be triggered by a cronjob

Yes, this is what we have at the moment in FreshRSS (prior to those SimplePie changes), but without needing those extra steps:

https://github.com/FreshRSS/FreshRSS/blob/0f2023811b38fdcbc7d8ade95114ca9403636af5/lib/lib_rss.php#L418-L434

It should be possible to move this logic into the Cache implementation.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Alkarex See Art4/FreshRSS#1 for a PSR-16 implementation based on the file cache driver from SimplePie.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your efforts 👍🏻

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Art4 Could you please change the target of your PR to https://github.com/FreshRSS/simplepie/tree/freshrss
I am not sure I will be able to use it as it does not seem to allow offloading much of my needed patches, while adding 500 lines of boilerplate, but I will try to give it another look later.
In particular, I still fail to see so far how to achieve your claim of Makes most changes in the cache class of the SimplePie fork not necessary


// remove internal cache expiration time
unset($data['__cache_expiration_time']);
Expand Down
59 changes: 55 additions & 4 deletions src/SimplePie.php
Original file line number Diff line number Diff line change
Expand Up @@ -1634,14 +1634,48 @@ public function enable_exceptions(bool $enable = true)
$this->enable_exceptions = $enable;
}

/**
* Computes a hash of the raw feed content,
* after having cleaned it from noisy elements such as statistics or comments.
* FreshRSS
* @return string $rss A hash of the cleaned content, or empty string in case of error.
*/
function clean_hash(string $rss): string
{
if ($rss === '') {
return '';
}
//Process by chunks not to use too much memory
if (($stream = fopen('php://temp', 'r+'))
&& fwrite($stream, $rss)
&& rewind($stream)
) {
$ctx = hash_init('sha1');
while ($stream_data = fread($stream, 1048576)) {
hash_update(
$ctx, preg_replace(
[
'#<(lastBuildDate|pubDate|updated|feedDate|dc:date|slash:comments)>[^<]+</\\1>#',
'#<(media:starRating|media:statistics) [^/<>]+/>#',
'#<!--.+?-->#s',
], '', $stream_data
)
);
}
fclose($stream);
return hash_final($ctx);
}
return '';
}

/**
* Initialize the feed object
*
* This is what makes everything happen. Period. This is where all of the
* configuration options get processed, feeds are fetched, cached, and
* parsed, and all of that other good stuff.
*
* @return bool True if successful, false otherwise
* @return bool|int positive integer with modification time if using cache, boolean true if otherwise successful, false otherwise // FreshRSS
*/
public function init()
{
Expand Down Expand Up @@ -1729,7 +1763,7 @@ public function init()

// Fetch the data into $this->raw_data
if (($fetched = $this->fetch_data($cache)) === true) {
return true;
return $this->data['mtime'] ?? true; // FreshRSS
} elseif ($fetched === false) {
return false;
}
Expand Down Expand Up @@ -1803,6 +1837,8 @@ public function init()
$this->data['headers'] = $headers;
}
$this->data['build'] = Misc::get_build();
$this->data['hash'] = $this->data['hash'] ?? $this->clean_hash($this->raw_data); // FreshRSS
$this->data['mtime'] = time(); // FreshRSS

// Cache the file if caching is enabled
$this->data['cache_expiration_time'] = $this->cache_duration + time();
Expand Down Expand Up @@ -1850,6 +1886,7 @@ public function init()
*
* @param Base|DataCache|false $cache Cache handler, or false to not load from the cache
* @return array{array<string, string>, string}|array{}|bool Returns true if the data was loaded from the cache, or an array of HTTP headers and sniffed type
* @phpstan-impure
*/
protected function fetch_data(&$cache)
{
Expand Down Expand Up @@ -1901,7 +1938,8 @@ protected function fetch_data(&$cache)
$this->data = [];
}
// Check if the cache has been updated
elseif (isset($this->data['cache_expiration_time']) && $this->data['cache_expiration_time'] > time()) {
// elseif (isset($this->data['cache_expiration_time']) && $this->data['cache_expiration_time'] > time()) { // FreshRSS
elseif (empty($this->data['mtime']) || $this->data['mtime'] + $this->cache_duration <= time()) {
// Want to know if we tried to send last-modified and/or etag headers
// when requesting this file. (Note that it's up to the file to
// support this, but we don't always send the headers either.)
Expand Down Expand Up @@ -1942,6 +1980,18 @@ protected function fetch_data(&$cache)
return true;
}
}
if (isset($file)) { // FreshRSS
$hash = $this->clean_hash($file->get_body_content());
if (($this->data['hash'] ?? null) === $hash) {
$this->data['headers'] = array_map(function (array $values): string {
return implode(',', $values);
}, $file->get_headers());
$cache->set_data($cacheKey, $this->data, $this->cache_duration);
return true; // Content unchanged even though server did not send a 304
} else {
$this->data['hash'] = $hash;
}
}
}
// If the cache is still valid, just return true
else {
Expand Down Expand Up @@ -2069,6 +2119,8 @@ protected function fetch_data(&$cache)
'feed_url' => $file->get_final_requested_uri(),
'build' => Misc::get_build(),
'cache_expiration_time' => $this->cache_duration + time(),
'hash' => empty($hash) ? $this->clean_hash($file->get_body_content()) : $hash, // FreshRSS
'mtime' => time(), // FreshRSS
];

if (!$cache->set_data($cacheKey, $this->data, $this->cache_duration)) {
Expand All @@ -2081,7 +2133,6 @@ protected function fetch_data(&$cache)
}

$this->raw_data = $file->get_body_content();
$this->raw_data = trim($this->raw_data);
$this->permanent_url = $file->get_permanent_uri();

$headers = [];
Expand Down
3 changes: 2 additions & 1 deletion tests/Integration/CachingTest.php
Original file line number Diff line number Diff line change
Expand Up @@ -112,7 +112,8 @@ public function testInitWithDifferentCacheStateCallsCacheCorrectly(
$expectedDataWritten['cache_expiration_time'] = $writtenData['cache_expiration_time'];
}

$this->assertSame($expectedDataWritten, $writtenData);
// Test that $expectedDataWritten is a subset of $writtenData
$this->assertEmpty(array_udiff_assoc($expectedDataWritten, $writtenData, function ($a, $b) { return json_encode($a) <=> json_encode($b); })); // FreshRSS
}

/**
Expand Down
21 changes: 11 additions & 10 deletions tests/Unit/Cache/BaseDataCacheTest.php
Original file line number Diff line number Diff line change
Expand Up @@ -69,19 +69,20 @@ public function testGetDataWithCacheMissReturnsDefault(): void
$this->assertSame($default, $cache->get_data($key, $default));
}

public function testGetDataWithCacheExpiredReturnsDefault(): void
{
$key = 'name';
$cachedValue = ['__cache_expiration_time' => 0];
$default = new stdClass();
// FreshRSS commented out, to allow HTTP 304
// public function testGetDataWithCacheExpiredReturnsDefault(): void
// {
// $key = 'name';
// $cachedValue = ['__cache_expiration_time' => 0];
// $default = new stdClass();

$baseCache = $this->createMock(Base::class);
$baseCache->expects($this->once())->method('load')->willReturn($cachedValue);
// $baseCache = $this->createMock(Base::class);
// $baseCache->expects($this->once())->method('load')->willReturn($cachedValue);

$cache = new BaseDataCache($baseCache);
// $cache = new BaseDataCache($baseCache);

$this->assertSame($default, $cache->get_data($key, $default));
}
// $this->assertSame($default, $cache->get_data($key, $default));
// }

public function testGetDataWithCacheCorruptionReturnsDefault(): void
{
Expand Down
Loading