Fix unexpected route registration/unregistration messages during cf redeployment #582
Comments
We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private, so you may be unable to view the contents of the story. The labels on this GitHub issue will be updated when the story is started.
Here's a quick update on my progress. It turned out that the way I had initially recreated the issue was a bit incorrect, since I was using a single […]. I also intercepted all of the events received during an update within diego-api (pastebin). As can be seen there, there is a burst of `actual_lrp_changed` events. I then inspected the event objects thoroughly, and it turns out that there is a mismatch in the `ActualLRPNetInfo` between the `before` and `after` states. This is an example event intercepted through […]:

```json
{
"type": "actual_lrp_changed",
"data": {
"before": {
"instance": {
"process_guid": "bcadd3da-6307-4d12-93db-151c2ebf44f9-f526171a-a3af-41e9-b441-39f40a468a42",
"index": 3,
"instance_guid": "15a58566-f0f1-494c-5de4-679a",
"cell_id": "593fc7d4-f523-4b01-9219-cc28eb5ffdd9",
"address": "10.0.73.9",
"instance_address": "10.137.114.169",
"preferred_address": "HOST",
"crash_count": 0,
"state": "RUNNING",
"presence": "ORDINARY",
...
}
},
"after": {
"instance": {
"process_guid": "bcadd3da-6307-4d12-93db-151c2ebf44f9-f526171a-a3af-41e9-b441-39f40a468a42",
"index": 3,
"instance_guid": "15a58566-f0f1-494c-5de4-679a",
"cell_id": "593fc7d4-f523-4b01-9219-cc28eb5ffdd9",
"address": "10.0.137.3",
"instance_address": "10.130.191.17",
"preferred_address": "HOST",
"crash_count": 0,
"state": "RUNNING",
"presence": "ORDINARY",
...
}
}
}
}
```

As can be observed, the only fields that differ between `before` and `after` are `address` and `instance_address`. I then inserted debug logs and empirically proved that the state changed events are emitted through the ActualLRPEventCalculator.EmitEvents function, in particular through here (for new events) and here (for deprecated events). I additionally traced that these changed events happen via calls to the actual_lrp_db.StartActualLRP function, which:
- if changes have been noticed, emits an event to the respective route-emitter.

Also, every single time there is a changed event, the only difference is within the `ActualLRPNetInfo`. Sooo my brain makes me think that it is due to functional dependency anomalies, since it turns out that […]
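For readers following along, here is a minimal sketch of the compare-and-emit behaviour described above. It is not the actual BBS code; the type and function names are assumptions for illustration, with only the net-info values taken from the intercepted event:

```go
package main

import "fmt"

// NetInfo holds just the two fields that differed in the intercepted event.
type NetInfo struct {
	Address         string
	InstanceAddress string
}

// ActualLRP is a pared-down stand-in for the real BBS model.
type ActualLRP struct {
	ProcessGUID string
	Index       int
	State       string
	NetInfo     NetInfo
}

// emitIfChanged mimics the compare-and-emit step: an event goes out to the
// hub (and from there to route-emitter) only when before and after differ.
func emitIfChanged(before, after ActualLRP, emit func(b, a ActualLRP)) {
	if before != after {
		emit(before, after)
	}
}

func main() {
	before := ActualLRP{
		ProcessGUID: "bcadd3da-...-f526171a",
		Index:       3,
		State:       "RUNNING",
		NetInfo:     NetInfo{Address: "10.0.73.9", InstanceAddress: "10.137.114.169"},
	}
	after := before
	// Only the net info differs, exactly as in the intercepted event above.
	after.NetInfo = NetInfo{Address: "10.0.137.3", InstanceAddress: "10.130.191.17"}

	emitIfChanged(before, after, func(b, a ActualLRP) {
		fmt.Printf("actual_lrp_changed: %+v -> %+v\n", b.NetInfo, a.NetInfo)
	})
}
```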
First, to reflect the worries of @aminjam regarding the fact that the issue occurs not only when the bbs encryption key is being rotated: I totally forgot about that. 😄🔫 I've seen it happen only once when there was no bbs encryption key rotation, but with much weaker symptoms. It was after a regular […].
You can see the small burst in the number of registered routes after we switch to the updated `bbs`. And here we have an actual occurrence from the same environment on the same day, when the bbs encryption key had been rotated. I think it is much more obvious because the spikes reach values of 15-20. So most probably the spikes from 12:50 are something expected, whilst the ones from 3:18 obviously aren't. I had initially been confused by the small burst at 12:50 into stating that the bug actually occurred then. Sadly, all of the logs are now lost and I never had the chance to play around with them.

Secondly, the RC lies within the reEncrypt function, since it relies on a single attribute as the primary key of the table, but a primary key can consist of multiple attributes. Such is the case with the `actual_lrps` table.
On the other hand, the reEncrypt function is called only with the process_guid as a primary key, leaving us with wrong `net_info` values for some of the records.

Third but not least, I've made a PoC as part of my debugging exercise that fixes the problem. The commit is a bit polluted by debug logs but the fix is there (in the […]).
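To make the suspected anomaly concrete, here is a rough sketch of the two update patterns, assuming a plain `database/sql` store; the table and column names come from this thread, everything else is illustrative:

```go
package reencrypt

import "database/sql"

// buggyReEncrypt sketches the problematic pattern: the freshly re-encrypted
// blob is written back keyed on process_guid alone. When several rows share
// that process_guid (same app, different instance_index), the UPDATE matches
// all of them and stamps one instance's net_info over its siblings.
func buggyReEncrypt(db *sql.DB, processGUID string, reEncrypted []byte) error {
	_, err := db.Exec(
		`UPDATE actual_lrps SET net_info = ? WHERE process_guid = ?`,
		reEncrypted, processGUID,
	)
	return err
}

// fixedReEncrypt scopes the write to the full composite primary key
// (process_guid, instance_index, presence), so every instance keeps the
// net_info that actually belongs to it.
func fixedReEncrypt(db *sql.DB, processGUID string, index int32, presence string, reEncrypted []byte) error {
	_, err := db.Exec(
		`UPDATE actual_lrps SET net_info = ?
		 WHERE process_guid = ? AND instance_index = ? AND presence = ?`,
		reEncrypted, processGUID, index, presence,
	)
	return err
}
```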
I've found an easier way to recreate the problem:

```sh
mkdir simple
pushd simple &> /dev/null
echo "Hi, Diego!" > index.html
cf push simple -b staticfile_buildpack
cf scale simple -i 250
popd &> /dev/null
# Provision green BBS encryption key.
# Rotate the BBS encryption key by redeploying CF and applying the green one.
# Observe how the route_emitter.RoutesRegistered and route_emitter.RoutesUnregistered metrics burst
# after the BBS on the diego-api instance with the rotated key becomes active.
```

I also found an issue with my fix PoC. Will try to fix it ASAP.
I fixed the issue with my PoC. I was having issues with DB blob scanning, hence I had to use a bit of reflection to make the function more generic.
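For context, here is a minimal sketch of the kind of reflection-based scanning helper hinted at above. This is not the PoC's actual code; the helper name and the fields-match-columns convention are assumptions:

```go
package dbscan

import (
	"database/sql"
	"reflect"
)

// ScanIntoStruct scans the current row of rows into dest, which must be a
// pointer to a flat struct whose fields line up, in order, with the selected
// columns. One reflective helper can then serve several row shapes, including
// ones where encrypted columns arrive as raw []byte blobs.
func ScanIntoStruct(rows *sql.Rows, dest interface{}) error {
	v := reflect.ValueOf(dest).Elem()
	targets := make([]interface{}, v.NumField())
	for i := range targets {
		targets[i] = v.Field(i).Addr().Interface()
	}
	return rows.Scan(targets...)
}
```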
@IvanHristov98 thank you for reporting the issue and digging into this. You are correct that reEncrypt potentially incorrectly overwrites netinfo for some records because it only updates based on process_guid. So if there are records with the same process_guid but a different instance_index, one record's netinfo can be written over the others.
Thanks for your feedback @mariash! I have a few questions in mind though: […]
The actual_lrps table has a primary key that consists of multiple fields: process_guid, instance_index, and presence. By only using the process_guid when running updates, BBS incorrectly encodes actual LRPs that have the same process_guid but different instance_index. This results in route-emitter sending incorrect information to gorouter and can eventually cause app downtime during deploys that involve encryption key updates. Issue [#582](cloudfoundry/diego-release#582)
Hi Ivan, I just pushed a fix to BBS: cloudfoundry/bbs@6cc24b6. I could not reproduce it, so I added a unit test that exposes this. I was trying to avoid using reflect in the fix. Thank you so much for reporting this! Hopefully this will fix your deployments. Please let us know if the issue persists.
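A test along these lines (a sketch against a toy in-memory store, not the actual test in the linked commit) is enough to expose the bug: seed two instances sharing a `process_guid`, re-encrypt, and assert each instance keeps its own `net_info`:

```go
package sketch

import "testing"

// key models the composite primary key of actual_lrps.
type key struct {
	processGUID string
	index       int
	presence    string
}

// buggyReEncrypt mimics an UPDATE keyed on process_guid alone: every row
// sharing the guid receives the same re-encrypted blob.
func buggyReEncrypt(store map[key][]byte, processGUID string, blob []byte) {
	for k := range store {
		if k.processGUID == processGUID {
			store[k] = blob
		}
	}
}

// This test fails against the buggy behaviour: re-encrypting instance 0
// clobbers instance 1's net_info as well.
func TestReEncryptKeepsPerInstanceNetInfo(t *testing.T) {
	store := map[key][]byte{
		{"guid", 0, "ORDINARY"}: []byte("net-info-0"),
		{"guid", 1, "ORDINARY"}: []byte("net-info-1"),
	}

	buggyReEncrypt(store, "guid", []byte("net-info-0-reencrypted"))

	if got := string(store[key{"guid", 1, "ORDINARY"}]); got != "net-info-1" {
		t.Fatalf("instance 1 net_info was overwritten: got %q", got)
	}
}
```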
Fix unexpected route registration/unregistration messages during cf redeployment
Summary
On our live environments, when `diego-api` is updated during a `bosh deploy ...`, some apps experience `503 Service Unavailable` for a brief time period.

Chronology of events:
1. `bosh deploy ...` with the regular sprint update. (all is fine for now)
2. `diego-api (0)` with the active `bbs` instance shuts down. (all is fine for now)
3. `diego-api (0)` starts and `diego-api (1)` is shut down by `bosh`. (all is fine for now)
4. `bbs` from `diego-api (0)` becomes the active `bbs` instance. (and all turns on fire 🔥)
5. Some apps experience `503 Service Unavailable` for 1-7 minutes.

Errors in `gorouter`: […]
Hence `route-emitter` should be involved, which can actually be observed from the metrics below (step 4 was performed around 2:30): […]

Such events in `route-emitter` occur only when the `bbs` hub module sends events to `route-emitter` with ActualLRP changes (code); a sketch of watching that event stream follows below.

On some environments we see the following errors within `bbs` whenever the issue occurs: […]

On others we see: […]
Expectation:
Such "bursting" route registration/unregistration behaviour should be observed only during `diego-cell` updates and shouldn't lead to app downtime. In our case only `diego-api` is updated, separately from the `diego-cells`, which is why I would classify it as strange behaviour.

Steps to Reproduce
Not clear yet, but a good starting point is:

GIVEN a foundation with at least 10 `diego-cells`, each with ~50 containers,
WHEN you redeploy `cf` with `bosh` and `diego-api` gets updated,
THEN the `route_emitter.RoutesRegistered` and `route_emitter.RoutesUnregistered` metrics rapidly increase their values for a short period of time on almost all `diego-cells`, resulting in `x509: certificate is valid for <app-guid-1>, not <app-guid-2>` errors in `gorouter`.

💡 I'll work on finding a better way to reproduce it.
💡 I'm also not sure whether it happens during each failover of `bbs`.
.Diego repo
My suspicions are that the issue is coming from: […]
Environment Details
The issue can be seen amongst environments ranging from 2k to 60k app instances.
We're running `diego-release v2.50.0` and `routing-release v0.216.0`. We've been observing the issue for around 1-2 months already. Last observed with this `cf-deployment` version.

Possible Causes or Fixes (optional)
I thought it was due to bbs encryption key rotation because there were good correlations between `bbs.EncryptionDuration` and the duration of the outage. Anyway, I saw it happen without a key rotation, so that shouldn't be it. I have also only seen it happen during regular updates and not when `diego-api` fails over.

My gut tells me the new active `bbs` instance performs a rollback of events because it had been outdated. Also, I think `locket` might be involved in all of this, since I see a lot of errors related to it.