Fix unexpected route registration/unregistration messages during cf redeployment #582
Comments
We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private, so you may be unable to view the contents of the story. The labels on this GitHub issue will be updated when the story is started.
Here's a quick update on my progress. It turned out that the way I had initially recreated the issue was a bit incorrect, since I was using a single […]. I also intercepted all of the events received during an update within diego-api (pastebin). As can be seen there, there is a burst of `actual_lrp_changed` events. I then inspected the event objects thoroughly, and it turns out that there is a mismatch in the `ActualLRPNetInfo` between the `before` and `after` states. This is an example event intercepted through […]:

```json
{
"type": "actual_lrp_changed",
"data": {
"before": {
"instance": {
"process_guid": "bcadd3da-6307-4d12-93db-151c2ebf44f9-f526171a-a3af-41e9-b441-39f40a468a42",
"index": 3,
"instance_guid": "15a58566-f0f1-494c-5de4-679a",
"cell_id": "593fc7d4-f523-4b01-9219-cc28eb5ffdd9",
"address": "10.0.73.9",
"instance_address": "10.137.114.169",
"preferred_address": "HOST",
"crash_count": 0,
"state": "RUNNING",
"presence": "ORDINARY",
...
}
},
"after": {
"instance": {
"process_guid": "bcadd3da-6307-4d12-93db-151c2ebf44f9-f526171a-a3af-41e9-b441-39f40a468a42",
"index": 3,
"instance_guid": "15a58566-f0f1-494c-5de4-679a",
"cell_id": "593fc7d4-f523-4b01-9219-cc28eb5ffdd9",
"address": "10.0.137.3",
"instance_address": "10.130.191.17",
"preferred_address": "HOST",
"crash_count": 0,
"state": "RUNNING",
"presence": "ORDINARY",
...
}
}
}
}
```

As can be observed, the only fields that differ between `before` and `after` are `address` and `instance_address`. I then inserted debug logs and empirically proved that the state changed events are emitted through the ActualLRPEventCalculator.EmitEvents function, in particular through here (for new events) and here (for deprecated events). I additionally traced that these changed events happen via calls to the actual_lrp_db.StartActualLRP function, which:
- if changes have been noticed, emits an event to the respective route-emitter.

Also, every single time there is a changed event, the only difference is within the `ActualLRPNetInfo`. Sooo my brain makes me think that it is due to functional dependency anomalies, since it turns out that […]
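For readers following along, here is a minimal sketch of the compare-and-emit behaviour described above. It is not the actual BBS code; the type and function names are assumptions for illustration, with only the net-info values taken from the intercepted event:

```go
package main

import "fmt"

// NetInfo holds just the two fields that differed in the intercepted event.
type NetInfo struct {
	Address         string
	InstanceAddress string
}

// ActualLRP is a pared-down stand-in for the real BBS model.
type ActualLRP struct {
	ProcessGUID string
	Index       int
	State       string
	NetInfo     NetInfo
}

// emitIfChanged mimics the compare-and-emit step: an event goes out to the
// hub (and from there to route-emitter) only when before and after differ.
func emitIfChanged(before, after ActualLRP, emit func(b, a ActualLRP)) {
	if before != after {
		emit(before, after)
	}
}

func main() {
	before := ActualLRP{
		ProcessGUID: "bcadd3da-...-f526171a",
		Index:       3,
		State:       "RUNNING",
		NetInfo:     NetInfo{Address: "10.0.73.9", InstanceAddress: "10.137.114.169"},
	}
	after := before
	// Only the net info differs, exactly as in the intercepted event above.
	after.NetInfo = NetInfo{Address: "10.0.137.3", InstanceAddress: "10.130.191.17"}

	emitIfChanged(before, after, func(b, a ActualLRP) {
		fmt.Printf("actual_lrp_changed: %+v -> %+v\n", b.NetInfo, a.NetInfo)
	})
}
```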
First, to reflect the worries of @aminjam regarding the fact that the issue occurs not only when the bbs encryption key is being rotated: I totally forgot about that. 😄🔫 I've seen it happen only once when there was no bbs encryption key rotation, but with much weaker symptoms. It was after a regular […].
You can see the small burst in the number of registered routes after we switch to the updated `bbs`. And here we have an actual occurrence from the same environment on the same day, when the bbs encryption key had been rotated. I think it is much more obvious because the spikes reach values of 15-20. So most probably the spikes from 12:50 are something expected, whilst the ones from 3:18 obviously aren't. I had initially been confused by the small burst at 12:50 into stating that the bug actually occurred then. Sadly, all of the logs are now lost and I never had the chance to play around with them.

Secondly, the RC lies within the reEncrypt function, since it relies on a single attribute as the primary key of the table, but a primary key can consist of multiple attributes. Such is the case with the `actual_lrps` table.
On the other hand, the reEncrypt function is called only with the process_guid as a primary key, leaving us with wrong `net_info` values for some of the records.

Third but not least, I've made a PoC as part of my debugging exercise that fixes the problem. The commit is a bit polluted by debug logs but the fix is there (in the […]).
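To make the suspected anomaly concrete, here is a rough sketch of the two update patterns, assuming a plain `database/sql` store; the table and column names come from this thread, everything else is illustrative:

```go
package reencrypt

import "database/sql"

// buggyReEncrypt sketches the problematic pattern: the freshly re-encrypted
// blob is written back keyed on process_guid alone. When several rows share
// that process_guid (same app, different instance_index), the UPDATE matches
// all of them and stamps one instance's net_info over its siblings.
func buggyReEncrypt(db *sql.DB, processGUID string, reEncrypted []byte) error {
	_, err := db.Exec(
		`UPDATE actual_lrps SET net_info = ? WHERE process_guid = ?`,
		reEncrypted, processGUID,
	)
	return err
}

// fixedReEncrypt scopes the write to the full composite primary key
// (process_guid, instance_index, presence), so every instance keeps the
// net_info that actually belongs to it.
func fixedReEncrypt(db *sql.DB, processGUID string, index int32, presence string, reEncrypted []byte) error {
	_, err := db.Exec(
		`UPDATE actual_lrps SET net_info = ?
		 WHERE process_guid = ? AND instance_index = ? AND presence = ?`,
		reEncrypted, processGUID, index, presence,
	)
	return err
}
```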
I've found an easier way to recreate the problem:

```sh
mkdir simple
pushd simple &> /dev/null
echo "Hi, Diego!" > index.html
cf push simple -b staticfile_buildpack
cf scale simple -i 250
popd &> /dev/null
# Provision green BBS encryption key.
# Rotate the BBS encryption key by redeploying CF and applying the green one.
# Observe how the route_emitter.RoutesRegistered and route_emitter.RoutesUnregistered metrics burst
# after the BBS on the diego-api instance with the rotated key becomes active.
```

I also found an issue with my fix PoC. Will try to fix it ASAP.
I fixed the issue with my PoC. I was having issues with DB blob scanning, hence I had to use a bit of reflection to make the function more generic.
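For context, here is a minimal sketch of the kind of reflection-based scanning helper hinted at above. This is not the PoC's actual code; the helper name and the fields-match-columns convention are assumptions:

```go
package dbscan

import (
	"database/sql"
	"reflect"
)

// ScanIntoStruct scans the current row of rows into dest, which must be a
// pointer to a flat struct whose fields line up, in order, with the selected
// columns. One reflective helper can then serve several row shapes, including
// ones where encrypted columns arrive as raw []byte blobs.
func ScanIntoStruct(rows *sql.Rows, dest interface{}) error {
	v := reflect.ValueOf(dest).Elem()
	targets := make([]interface{}, v.NumField())
	for i := range targets {
		targets[i] = v.Field(i).Addr().Interface()
	}
	return rows.Scan(targets...)
}
```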
@IvanHristov98 thank you for reporting the issue and digging into this. You are correct that reEncrypt potentially incorrectly overwrites netinfo for some records because it only updates based on process_guid. So if there are records with the same process_guid but a different instance_index, one record's netinfo can be written over the others.
Thanks for your feedback @mariash! I have a few questions in mind though: […]
The actual_lrps table has a primary key that consists of multiple fields: process_guid, instance_index, and presence. By only using the process_guid when running updates, BBS incorrectly encodes actual LRPs that have the same process_guid but different instance_index. This results in route-emitter sending incorrect information to gorouter and can eventually cause app downtime during deploys that involve encryption key updates. Issue [#582](cloudfoundry/diego-release#582)
Hi Ivan, I just pushed a fix to BBS: cloudfoundry/bbs@6cc24b6. I could not reproduce it, so I added a unit test that exposes this. I was trying to avoid using reflect in the fix. Thank you so much for reporting this! Hopefully this will fix your deployments. Please let us know if the issue persists.
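A test along these lines (a sketch against a toy in-memory store, not the actual test in the linked commit) is enough to expose the bug: seed two instances sharing a `process_guid`, re-encrypt, and assert each instance keeps its own `net_info`:

```go
package sketch

import "testing"

// key models the composite primary key of actual_lrps.
type key struct {
	processGUID string
	index       int
	presence    string
}

// buggyReEncrypt mimics an UPDATE keyed on process_guid alone: every row
// sharing the guid receives the same re-encrypted blob.
func buggyReEncrypt(store map[key][]byte, processGUID string, blob []byte) {
	for k := range store {
		if k.processGUID == processGUID {
			store[k] = blob
		}
	}
}

// This test fails against the buggy behaviour: re-encrypting instance 0
// clobbers instance 1's net_info as well.
func TestReEncryptKeepsPerInstanceNetInfo(t *testing.T) {
	store := map[key][]byte{
		{"guid", 0, "ORDINARY"}: []byte("net-info-0"),
		{"guid", 1, "ORDINARY"}: []byte("net-info-1"),
	}

	buggyReEncrypt(store, "guid", []byte("net-info-0-reencrypted"))

	if got := string(store[key{"guid", 1, "ORDINARY"}]); got != "net-info-1" {
		t.Fatalf("instance 1 net_info was overwritten: got %q", got)
	}
}
```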
Fix unexpected route registration/unregistration messages during cf redeployment
Summary
On our live environments, when `diego-api` is updated during a `bosh deploy ...`, some apps experience `503 Service Unavailable` for a brief time period.

Chronology of events:
1. `bosh deploy ...` with the regular sprint update. (all is fine for now)
2. `diego-api (0)` with the active `bbs` instance shuts down. (all is fine for now)
3. `diego-api (0)` starts and `diego-api (1)` is shut down by `bosh`. (all is fine for now)
4. `bbs` from `diego-api (0)` becomes the active `bbs` instance. (and all turns on fire 🔥)
5. Some apps experience `503 Service Unavailable` for 1-7 minutes.

Errors in `gorouter`: […]
Hence `route-emitter` should be involved, which can actually be observed from the metrics below (step 4 was performed around 2:30): […]

Such events in `route-emitter` occur only when the `bbs` hub module sends events to `route-emitter` with ActualLRP changes (code); a sketch of watching that event stream follows below.

On some environments we see the following errors within `bbs` whenever the issue occurs: […]

On others we see: […]
Expectation:
Such "bursting" route registration/unregistration behaviour should be observed only during `diego-cell` updates and shouldn't lead to app downtime. In our case only `diego-api` is updated, separately from the `diego-cells`, which is why I would classify it as strange behaviour.

Steps to Reproduce
Not clear yet, but a good starting point is:

GIVEN a foundation with at least 10 `diego-cells`, each with ~50 containers,
WHEN you redeploy `cf` with `bosh` and `diego-api` gets updated,
THEN the `route_emitter.RoutesRegistered` and `route_emitter.RoutesUnregistered` metrics rapidly increase their values for a short period of time on almost all `diego-cells`, resulting in `x509: certificate is valid for <app-guid-1>, not <app-guid-2>` errors in `gorouter`.

💡 I'll work on finding a better way to reproduce it.
💡 I'm also not sure whether it happens during each failover of `bbs`.
.Diego repo
My suspicions are that the issue is coming from: […]
Environment Details
The issue can be seen amongst environments ranging from 2k to 60k app instances.
We're running `diego-release v2.50.0` and `routing-release v0.216.0`. We've been observing the issue for around 1-2 months already. Last observed with this `cf-deployment` version.

Possible Causes or Fixes (optional)
I thought it was due to bbs encryption key rotation because there were good correlations between `bbs.EncryptionDuration` and the duration of the outage. Anyway, I saw it happen without a key rotation, so that shouldn't be it. I have also only seen it happen during regular updates and not when `diego-api` fails over.

My gut tells me the new active `bbs` instance performs a rollback of events because it had been outdated. Also, I think `locket` might be involved in all of this, since I see a lot of errors related to it.