Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache fixes and improvements #736

Merged

Conversation

riccardoporreca
Copy link
Collaborator

@riccardoporreca riccardoporreca commented Jul 23, 2022

While trying to look at the issue spotted in #733

the rechecked internal link seems to duplicate the metadata (every run would include new metadata

I took the chance to review the current cache deletion / addition logic also following up on some of the review comments from #712 that affect the (mis)behaviour beyond performance considerations.

This also includes a small extension that would close #737.

I tried to be very detailed with individual commits and verbose messages describing the reason for each change, so this PR is best reviewed by looking at diffs and messages from individual commits.

… expiration

* This ensures additions are directly detected for links whose cache has expired.
* Expiration logged differently from deletions of obsolete URLs (debug level)
* Note that the previous approach would have added the expired cache to the links to check, which is
  * generally inaccurate in terms of metadata update (cached metadata do not reflect the current sources) and corresponding reported failure locations.
  * wrong for the internal cache, where not considering the current metadata would prevent checking a newly-broken link for an existing cache key.

This is addresses the points around metadata consistency and expiration raised when reviewing gjtorikian#712
* By extending an existing test to checking the size of metadata in the cache.
* A duplicated entry in the test fixture, now fixed, also revealed the piling up of metadata.
* This is what was seen e.g. in gjtorikian#733, with metadata for failures to be re-checked piling up across consecutive runs.
* This makes sure metadata for failures to be re-checked are not duplicated and do not pile up across consecutive runs, fixing what seen e.g. in gjtorikian#733.
* We also explicitly ignore the URLs for which no metadata addition is detected.
* This fixes the newly-extended test revealing the issues.
* The overall logic and approach is not changed, but should be easier to follow with focus on what matters, i.e. determining for which detected metadata a "matching" cached metadata exists and was found w/o failure.
* No need to set anything in the metadata for what is found in the cache.
* To better reflect the logic and making it easy to follow in the code.
* This can greatly help tracking the actual behavior of the detection logic.
@riccardoporreca riccardoporreca changed the title Feature/cache v2 improvements Cache fixes and improvements Jul 23, 2022
* Closes gjtorikian#737, as a follow-up of splitting the timeframe in gjtorikian#708.
* All cases are covered by tests.
@gjtorikian
Copy link
Owner

Looks good, thank you!

@gjtorikian gjtorikian merged commit f40cd59 into gjtorikian:main Jul 26, 2022
@riccardoporreca riccardoporreca deleted the feature/cache-v2-improvements branch July 26, 2022 06:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Disable internal/external cache altogether if not timeframe is specified
2 participants