Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Builtin cache invalidation system aka make API Platform fast as hell #952

Merged
merged 1 commit into from
May 23, 2017

Conversation

dunglas
Copy link
Member

@dunglas dunglas commented Feb 20, 2017

Q A
Bug fix? no
New feature? yes
BC breaks? no
Deprecations? no
Tests pass? yes
Fixed tickets n/a
License MIT
Doc PR todo

The usual quote:

There are only two hard things in Computer Science: cache invalidation and naming things.

-- Phil Karlton

Well, API Platform is an awesome name, so this PR try to solve the other hard thing: caching to make your API as fast as possible.

It introduces a builtin mechanism to always serve API responses from a cache (Varnish and CloudFlare are targeted), and invalid stale data in real time when a resource is updated, deleted or created.
With this mechanism, on my computer and using Docker, large and complex responses are served in ~15ms instead of ~700ms without cache.

The API Platform serializer has been tweaked to store the list of all resources included in a given response (the root document, but also embedded documents and documents appearing in lists).

The response is marked with all resources it contains in the Cache-Tags HTTP header. A md5 hash of all included IRIs is generated to prevent collisions and reduce the header size.
Then, all API's responses are stored with a high expiration time in the proxy cache. On all subsequent (read) requests from a client, the response is served from the proxy, the PHP application is not touched.

When a resource is modified (only changes made to Doctrine entities are supported for now), all responses containing or referencing it are purged from the cache.
A class to purge Varnish is provided in this PR, and a class to purge CloudFlare (enterprise plans only) will be provided later. An interface allows to add support for other cache providers.

Paged collections are also handled: if a resource is added or removed, all collection pages are purged. If a resource is edited, pages (and all other API responses, including those embedding it as a nested document) are purged.

Enabling this new feature doesn't require to change existing code. The following config enable the mechanism and mades your API instantly blazing fast:

api_platform:
    http_cache:
        enable_tags: true
        varnish_url: 'http://my-varnish-proxy'
        shared_max_age: 3600

I've also opened a PR to add Varnish with a compatible setup to the API Platform Docker setup: api-platform/api-platform#238.
Last but not least, this PR introduces some config options to set global cache settings. Example:

api_platform:
    http_cache:
        max_age: 60
        shared_max_age: 3600
        vary: ['Content-Type', 'Cookie']

Note: for advanced needs, prefer the awesome FosHttpCache library.

As stated in the quote, cache invalidation is a hard thing and this PR probably contains bugs and edge cases. Please test it and report any problem.

TODO:

  • Add unit tests

*
* @param GetResponseForControllerResultEvent $event
*/
public function onKernelView(GetResponseForControllerResultEvent $event)
Copy link
Member Author

@dunglas dunglas Feb 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably useless because it is already handled for Doctrine and not reliable enough for non-Doctrine data providers.

}
}

private function addResources(Request $request)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably useless because it is already handled for Doctrine and not reliable enough for non-Doctrine data providers.

@fbourigault
Copy link

This looks amazing, but does providing a solution coupled to specific cache solution as first class citizen a good idea?

Maybe an intermediate solution build using standard Http headers would improve interoperability.

@bendavies
Copy link
Contributor

bendavies commented Feb 20, 2017

Nice work @dunglas.

This is a very similar strategy to what i've implemented previously.
A few comments:

  1. why not just use FosHttpCache instead of implementing your own tagging/purgers etc?
  2. It does not look like this will work with authenticated apis that don't use cookies, i.e just header auth or query param (FosHttpCache has provisions for user specific caching). I'm guessing this would be for the user to implement?
  3. why do you md5 the tags?

composer.json Outdated
@@ -34,6 +34,7 @@
"doctrine/orm": "^2.5",
"doctrine/annotations": "^1.2",
"friendsofsymfony/user-bundle": "^2.0@dev",
"guzzlehttp/guzzle": "^6.0",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should use php-http/httplug for better interop.

Copy link
Member Author

@dunglas dunglas Feb 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe when it will be a PSR. But for now it has nos advantage over using Guzzle 6 for this use case: it is harder to setup (requires to install an extra Symfony bundle), adds complexity and a (small) performance overhead.
Guzzle 6 is already an abstraction layer, with builtin curl and streams implementations. We only depend of the ClientInterface, it's easy to bridge other implementations (but I think it is not worth it).

As we support PHP 7 only, we don't target older libraries such as Guzzle 5.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't catch the last point. 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dunglas IMHO, you should consider using buzz. And It's not a joke:

  • Guzzle have so many version, and these versions are not compatible each others. So people that are using a version that is incompatible with guzzle 6 will be blocked.
  • Buzz is very simple and works well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK it doesn't support async requests... and it has not been updated since 2015.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When guzzle guys will release the 7.0 version, you will be stuck with the 6.0 version. Http clients are a widely common dependency and so conflicts may appears. But ATM you could go with guzzle 6. Using httplug could be added later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a very strong (I think) argument for using HTTPlug: you do not get into hell like FOSHttpCache did (and still is) with their hard coupling to a specific Guzzle version.

@dunglas
Copy link
Member Author

dunglas commented Feb 20, 2017

@fbourigault there is an abstraction layer and and implementation (Varnish). Basically, anyone can add support for any cache provider supporting cache invalidation (it's a matter of implementing an interface). I plan to add support CloudFlare in core too.

There is no standard (HTTP headers) for cache invalidation (only expiration is supported in RFCs) but I choose to use Cache-Tags because it's the one used by CloudFlare.

@bendavies regarding FosHttpCache, for 2 reasons:

  • my implementation is a bit different (the md5 hash) and tight to the concept of "resources" and "IRIs", not present in FosHttpCache
  • the implementation is trivial (and it's possible to bridge it with FosHttpCache if wanted), it will ease our maintenance process to not have a dependency to a 3rd party library we don't maintain (our soft dependency to FosUser is a pain to maintain...).

It can work with any mechanism of authentification (it's why I've introduced a way to configure the Vary HTTP header). If the resource varies depending of the Authorization header, just set api_platform.http_cache.vary to ['Content-Type', 'Authorization'] (you'll need to tweak the Varnish config too). A key concept of REST is being stateless, a resource identified by an IRI (and some headers specified in Vary) should not vary depending of the current logged in user (in this case, the URL should not be the same).

3/ To avoid collisions (if the a resource is tagged with /foos/,/foos1 and another with only /foos/2, sending a BAN request with /foos as parameter will ban both, and it's a bug. Using md5 hashes prevent this problem. A CRC32 hash should do the trick too, but md5 is usually faster on modern servers.

@fbourigault
Copy link

There is no standard (HTTP headers) for cache invalidation (only expiration is supported in RFCs) but I choose to use Cache-Tags because it's the one used by CloudFlare.

By standard HTTP headers, I mean revalidation. But maybe in such case, caching is not efficient.

@teohhanhui
Copy link
Contributor

teohhanhui commented Feb 20, 2017 via email

@teohhanhui
Copy link
Contributor

teohhanhui commented Feb 20, 2017 via email

@dunglas
Copy link
Member Author

dunglas commented Feb 20, 2017

I'll make the header configurable but I'll keep CloudFlare compatibility by default, it's an invaluable feature. By the way this header should not be exposed to the end client.

Regarding Vary, it's exactly what I explain in my post 🙂

Regarding the hit rate, doing more advanced things like hashing cannot be automated on the API Platform side, it requires a custom development and it's easy to implement using the new vary option.
Using FosHTTPCache for such usages is also an option, it's up to the developer.

@dunglas
Copy link
Member Author

dunglas commented Feb 20, 2017

Regex (at least simple regexes) doesn't fix the issue in my example. /foos will match /foos/*, not only the collection response.

@teohhanhui
Copy link
Contributor

teohhanhui commented Feb 20, 2017 via email

@dunglas
Copy link
Member Author

dunglas commented Feb 20, 2017

But maybe a more complex regex can do the trick, I'll give it a try (I agree that plain tags are better than hashes for debug).

@dunglas
Copy link
Member Author

dunglas commented Feb 20, 2017

The comma must be handled too.

"""
Then the response status code should be 200
And the header "Cache-Tags" should not exist
And "/relation_embedders/1,/related_dummies/1,/third_levels/1" IRIs should be purged
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A PUT may change the value of an attribute used to order a responses. You should invalidate collection too

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I add this.

And I send a "DELETE" request to "/relation_embedders/1"
Then the response status code should be 204
And the header "Cache-Tags" should not exist
And "/relation_embedders/1,/relation_embedders" IRIs should be purged
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's about related_dummies who was previously linked to this entity? They should be invalidated too

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They will be, because they are marked with /relation_embedders/1.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed

"""
Then the response status code should be 201
And the header "Cache-Tags" should not exist
And "/relation_embedders,/related_dummies,/third_levels,/relation_embedders/1,/related_dummies/1,/third_levels/1" IRIs should be purged
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand in you code how you make the link between relation_embedders and third_levels? Who flaged it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scenario: Tags must be set for items
When I send a "GET" request to "/relation_embedders/1"
Then the response status code should be 200
And the header "Cache-Tags" should be equal to "aa9e2bee5be20590f7dcc520ce2dffca,12a0c94f947a680d68bd6f65e025457d,91774d67418192a057e25dae00345572"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

md5 obfuscate tests... :(
I wonder if a dedicated context could be more readable: the cache-tag header should be equals to "/relation_embedders, ...."

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, I'm working on a fix to remove the hash and use plain IRIs. I'll update the PR when it's ready.

->end()
->booleanNode('public')->defaultNull()->info('To make all responses public by default.')->end()
->booleanNode('enable_tags')->defaultFalse()->info('Add cache tags to the response.')->end()
->scalarNode('varnish_url')->defaultNull()->info('URL of the Varnish server to purge using cache tags when a resource is updated.')->end()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget user with several varnish servers

@dunglas dunglas force-pushed the cache_tags branch 2 times, most recently from 2ac0f1e to 603fa12 Compare February 20, 2017 22:55
@dunglas
Copy link
Member Author

dunglas commented Feb 20, 2017

Thank you for the reviews everyone!

  • Plain IRIs are now used instead of md5 hashes (and the regex have been updated accordingly)
  • Collections are purged even when a PUT is done

@dunglas dunglas force-pushed the cache_tags branch 2 times, most recently from ce8419e to 29c2b4f Compare February 20, 2017 23:18
Copy link
Contributor

@Simperfit Simperfit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

@soyuka soyuka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to add unit tests to all those listeners!

public function onFlush(OnFlushEventArgs $eventArgs)
{
$iriConverter = $this->container->get('api_platform.iri_converter');
$resourceManager = $this->container->get('api_platform.http_cache.resource_manager');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Circular reference is only because of IriConverter no? Can't you inject the ResourceManager properly though?

return;
}

$parts = array_map(function ($iri) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A small comment to explain what goes below would be great for future readers!

And I send a "DELETE" request to "/relation_embedders/1"
Then the response status code should be 204
And the header "Cache-Tags" should not exist
And "/relation_embedders/1,/relation_embedders" IRIs should be purged
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed

<argument type="service" id="api_platform.http_cache.resource_manager" />
<argument type="service" id="api_platform.http_cache.purger" />

<tag name="kernel.event_listener" event="kernel.terminate" method="onKernelTerminate" />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to move the purge as close as possible to the transaction (doctrine postFlush IMO)

I understand your point of view to return a response to the user as soon as possible, but this introduce cross concurrency bugs.

Example:

  • list all blogs (cache mis)
  • post a blog
  • list all blog (cache hit when called before the kernel terminate even)

I already have this issue in the my CI (agreed, this is a particular case with a lot of stress tests). but the probability of cross concurrency is higher when:

  • you have several varnishes to purge
  • you have other kernel.terminate events with an higher priority (sending an email for instance)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ That does not actually address the race condition, but only trying to avoid it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed, it fixes one case.

BTW (not related to the RC): A small improvment for the user would be to start async request on the postFlush event, and wait on the kernelFinish event.

private $vary;
private $public;

public function __construct(int $maxAge = null, int $sharedMaxAge = null, array $vary, bool $public = null)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to add a LastModified header too initialized to the default value now().
As a result, Varnish will handled it for free (by default), and returns 304 code stuff.

This small change should not interfer in your App and how you invalid cache. But will save bandwidth between the user and varnish.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Varnish already returns 304 response properly. We should not send bogus Last-Modified headers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this listener was used only with "tag" thing. I apologize, agreed, this header don't make sens here.

But I think, you missed my point, the idea, when using tagged response, is to add a validation header (could be etag or last-modified, don't care). To save bandwidth between users and varnish when cache is hit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should provide a good default config for this (in the standard edition) but let the user configure it as he wants.

Copy link
Contributor

@teohhanhui teohhanhui Mar 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jderusse:

the idea, when using tagged response, is to add a validation header (could be etag or last-modified, don't care). To save bandwidth between users and varnish when cache is hit.

That's incorrect. Varnish already sends 304 response with proper ETag in that case. There is no need for the backend to send an ETag or Last-Modified header unless the backend intends to return 304 on vcl_backend_fetch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let explain with a schema (my last try)...
Here is the actual implementation
now

My objective is to transform the last response in step 6 into a 304.

  • To do it, varnish need a if-none-match header in step 5
  • To do it, client need a etag header in step 4
  • To do it, varnish needs nothing, it just forward the response from the app from step 3
  • To do it, app must send this header <= this is my point

And Here is the target sequence
then

Your last comment suggest that varnish add this header by it self, with something like that
other

Is it what you mean? Because I doubt varnish add such header by itself (unless you implement this behavior in a VCL which is fine to me too)

Copy link
Contributor

@teohhanhui teohhanhui Mar 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I doubt varnish add such header by itself

Varnish does it by default. Try it. At least it has been doing so since 4.0 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I bootstraped an empty api-plateform from this PR api-platform/api-platform#238

curl test2_varnish_1.docker/foos -i
HTTP/1.1 200 OK
Server: nginx/1.11.10
Content-Type: application/ld+json; charset=utf-8
X-Powered-By: PHP/7.1.3
Vary: Content-Type
X-Content-Type-Options: nosniff
X-Frame-Options: deny
Cache-Control: public, s-maxage=3600
Link: <http://test2_varnish_1.docker/docs.jsonld>; rel="http://www.w3.org/ns/hydra/core#apiDocumentation"
Cache-Tags: /foos
Date: Wed, 22 Mar 2017 07:23:50 GMT
X-Varnish: 26 32789
Age: 3
Via: 1.1 varnish-v4
Accept-Ranges: bytes
Content-Length: 111
Connection: keep-alive

{"@context":"\/contexts\/Foo","@id":"\/foos","@type":"hydra:Collection","hydra:member":[],"hydra:totalItems":0}

varnish don't add validation header 😕

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jderusse You're right. I overlooked that FOSHttpCacheBundle was the one adding it. My apologies...

I think we should use the same approach. Just use the MD5 hash of the Response content as ETag. MD5 is cheap, so no reason to use a meaningless Last-Modified. 😄

Copy link
Member Author

@dunglas dunglas May 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing a md5 of the response is not ok (the response contains the date). We may do a md5 of the response's body.

{
$loader->load('http_cache.xml');

if (true !== $config['http_cache']['enable_tags']) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When enable_tags is true and maxAge > 0. We should warn the developer that the browser will cache responses and remove the benefits of cache invalidation.

Copy link
Member Author

@dunglas dunglas May 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be intended (I often do that). It allows to always serve fresh data for the first request but reduce the server load/bandwidth usage thanks to expiration. It's often not a problem when the TTL is low. I would let the user do what he wants without warning.

return;
}

$resources = $request->attributes->get('_resources', []);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tight couple the purge system with http requests context which is used like a databag.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I'll try to find a better solution but I'm not sure there is one. Anyway its called purge HTTP cache, it's not a big deal to couple it with the HTTP request.

@dunglas
Copy link
Member Author

dunglas commented May 15, 2017

All comments handled. Can you make a last review?


$rawData = parent::normalize($object, $format, $context);
if (!is_array($rawData)) {
return $rawData;
}

$data = ['_links' => ['self' => ['href' => $this->iriConverter->getIriFromItem($object)]]];
$data = ['_links' => ['self' => ['href' => $context['iri']]]];
$context['debug'] = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason for this property in the context?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, it has nothing to do here!

@@ -0,0 +1,64 @@
Feature: Cache invalidation trough HTTP Cache tags
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

through?

!$request->isMethodCacheable()
|| !$response->isCacheable()
|| (!$attributes = RequestAttributesExtractor::extractAttributes($request))
|| !$resources = $request->attributes->get('_resources')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no parenthesis here, but the line above yes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parenthesis on the previous line are mandatory because of operator priorities.

@@ -94,7 +95,7 @@ public function testNormalize()
);
$normalizer->setSerializer($serializerProphecy->reveal());

$this->assertEquals(['name' => 'hello'], $normalizer->normalize($dummy));
$this->assertEquals(['name' => 'hello'], $normalizer->normalize($dummy, null, ['resources' => []]));
Copy link
Member

@soyuka soyuka May 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we avoid changing those tests? It feels like the tests (this one and the others) won't pass without 2 more arguments (eg, null, ['resources' => []]). Is it the case / Isn't this breaking things somehow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It passes without the arguments and without change, but I've modified it to test the new behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay as long as there is a test that doesn't adds up those arguments.

@Simperfit
Copy link
Contributor

I'm going to test this with a project, ill tell if there are some bugs

@dunglas dunglas force-pushed the cache_tags branch 4 times, most recently from 11dd011 to 6075fa2 Compare May 22, 2017 15:58
@QuingKhaos
Copy link

May you use http://httplug.io/ instead of guzzle directly?

@dunglas dunglas force-pushed the cache_tags branch 2 times, most recently from d044bac to 17443aa Compare May 23, 2017 09:57
@teohhanhui
Copy link
Contributor

May you use http://httplug.io/ instead of guzzle directly?

It has been raised before, and I'd like to once again echo this.

@dunglas
Copy link
Member Author

dunglas commented May 23, 2017

HTTPPlug introduces a lot of complexity (including an extra bundle to configure... until we get Flex) for no gain here. I'm still 👎 for now.
I understand the problem with Guzzle 3, but API Platform is younger than Guzzle 3, every bundle I use are compatible with Guzzle 6, and AFAIK, there is no projet to develop Guzzle 7 soon.

@dunglas
Copy link
Member Author

dunglas commented May 23, 2017

And by the way, Guzzle is a soft dependency here. So someone wanting to use another client can do it, he just have to implement by himself the PurgerInterface, not a big deal.

@theofidry
Copy link
Contributor

As long as Guzzle is a soft dependency it's ok

@dunglas dunglas merged commit 5a81357 into api-platform:master May 23, 2017
@dunglas dunglas deleted the cache_tags branch May 23, 2017 15:57
hoangnd25 pushed a commit to hoangnd25/core that referenced this pull request Feb 23, 2018
Builtin cache invalidation system aka make API Platform fast as hell
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.