-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SO Migration] fix reindex race on multi-instance mode #104516
[SO Migration] fix reindex race on multi-instance mode #104516
Conversation
const errors = (res.body.items ?? []).filter( | ||
(item) => item.index?.error?.type !== 'version_conflict_engine_exception' | ||
); | ||
const errors = (res.body.items ?? []) | ||
.filter((item) => item.index?.error) | ||
.map((item) => item.index!.error!) | ||
.filter(({ type }) => type !== 'version_conflict_engine_exception'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not even sure how this was working previously with non-error items, as
>>({}).index?.error?.type !== 'version_conflict_engine_exception'
<< true
but In doubt, I added more steps for the sake of readability.
if (errors.every(isWriteBlockException)) { | ||
return Either.left({ | ||
type: 'target_index_had_write_block' as const, | ||
}); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's very likely that if any write_block exception is encountered, all the objects encountered it, but just in case another error was returned, we check that all errors are effectively write block exceptions.
8. Reindex the source index into the new temporary index using a 'client-side' reindex, by reading batches of documents from the source, migrating them, and indexing them into the temp index. | ||
1. Use `op_type=index` so that multiple instances can perform the reindex in parallel (last node running will override the documents, with no effect as the input data is the same) | ||
2. Ignore `version_conflict_engine_exception` exceptions as they just mean that another node was indexing the same documents | ||
3. If a `target_index_had_write_block` exception is encountered for all document of a batch, assume that another node already completed the temporary index reindex, and jump to the next step | ||
4. If a document transform throws an exception, add the document to a failure list and continue trying to transform all other documents (without writing them to the temp index). If any failures occured, log the complete list of documents that failed to transform, then fail the migration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We forgot to update the RFC for the client-side reindex. Fixed it, and added the new target_index_had_write_block
special case,
export const isWriteBlockException = ({ type, reason }: EsErrorCause): boolean => { | ||
return ( | ||
type === 'cluster_block_exception' && | ||
reason.match(/index \[.+] blocked by: \[FORBIDDEN\/8\/.+ \(api\)\]/) !== null | ||
); | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on the type of operation, the reason identifying a write block can vary
e.g
index [.kibana_dolly] blocked by: [FORBIDDEN/8/index write (api)]
index [.kibana_dolly] blocked by: [FORBIDDEN/8/moving to block index write (api)]
I extracted the function previously present in wait_for_reindex_task.ts
and made it more generic to match any FORBIDDEN/8/*** (api)
text.
).resolves.toMatchInlineSnapshot(` | ||
Object { | ||
"_tag": "Left", | ||
"left": Object { | ||
"type": "target_index_had_write_block", | ||
}, | ||
} | ||
`); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not fan of snapshots to test resolved results (especially as the resolved object is small), but this is what is done in the other tests of the file, and I didn't want to fix them all in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not for all the tests
kibana/src/core/server/saved_objects/migrationsv2/actions/integration_tests/actions.test.ts
Lines 134 to 144 in 5cbb075
expect(res.right).toEqual( | |
expect.objectContaining({ | |
existing_index_with_docs: { | |
aliases: {}, | |
mappings: expect.anything(), | |
settings: expect.anything(), | |
}, | |
}) | |
); | |
}); | |
}); |
it('rejects if there are errors', async () => { | ||
it('resolves left if there are write_block errors', async () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to also add a IT test to assert that it still rejects for other errors, but it seems that using an non-existing index surprisingly leads to a timeout (which is handled by catchRetryableEsClientErrors
) instead of a more final error, and I couldn't find a way to trigger another kind of error from ES.
If anyone has an idea, I'll take it. Else it's probably fine as it's covered in unit tests anyway.
it('migrates saved objects normally when multiple Kibana instances are started at the same time', async () => { | ||
const setupContracts = await Promise.all([rootA.setup(), rootB.setup(), rootC.setup()]); | ||
|
||
setupContracts.forEach((setup) => setup.savedObjects.registerType(fooType)); | ||
|
||
await startWithDelay([rootA, rootB, rootC], 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added tests with 0, 1, 5, 20sec
Pinging @elastic/kibana-core (Team:Core) |
}; | ||
|
||
afterAll(async () => { | ||
await new Promise((resolve) => setTimeout(resolve, 10000)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a mistic delay wandering through all the files 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea... I guess I did as everyone else, wondering what that was for, and then in doubt, still copied it 😄 . Do you know if it's safe to delete those?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, I'd test it and remove it for the tests at once
src/core/server/saved_objects/migrationsv2/integration_tests/multiple_kibana_nodes.test.ts
Outdated
Show resolved
Hide resolved
); | ||
const errors = (res.body.items ?? []) | ||
.filter((item) => item.index?.error) | ||
.map((item) => item.index!.error!) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow! so many ?
, !
😅 I'd write it as
.map((item) => item.index?.error)
.filter(Boolean)
.filter(({ type }) => type !== 'version_conflict_engine_exception');
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, the problem is that TS is stupid with map/filter.
.map((item) => item.index?.error)
.filter(Boolean)
Is not sufficient to have the | undefined
part removed from error
. The third line complains that ErrorContainer | undefined
does not have a type
property, which is why I kinda was forced to filter first then force-cast using !
describe('isWriteBlockError', () => { | ||
it('returns true for a `index write` cluster_block_exception', () => { | ||
expect( | ||
isWriteBlockException({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can add an integration test instead? Since it's easy to reproduce for an ES instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is already an integration test for the action using the function, but it doesn't hurt to do it for the helper itself.
).resolves.toMatchInlineSnapshot(` | ||
Object { | ||
"_tag": "Left", | ||
"left": Object { | ||
"type": "target_index_had_write_block", | ||
}, | ||
} | ||
`); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not for all the tests
kibana/src/core/server/saved_objects/migrationsv2/actions/integration_tests/actions.test.ts
Lines 134 to 144 in 5cbb075
expect(res.right).toEqual( | |
expect.objectContaining({ | |
existing_index_with_docs: { | |
aliases: {}, | |
mappings: expect.anything(), | |
settings: expect.anything(), | |
}, | |
}) | |
); | |
}); | |
}); |
💚 Build SucceededMetrics [docs]
History
To update your PR or re-run it, just comment with: |
* fix reindex race condition * fix some IT tests * fix reindex cause detection * add integration test * update RFC * review comments * add integration test for isWriteBlockException # Conflicts: # rfcs/text/0013_saved_object_migrations.md
* fix reindex race condition * fix some IT tests * fix reindex cause detection * add integration test * update RFC * review comments * add integration test for isWriteBlockException # Conflicts: # rfcs/text/0013_saved_object_migrations.md
… (#104761) * [SO Migration] fix reindex race on multi-instance mode (#104516) * fix reindex race condition * fix some IT tests * fix reindex cause detection * add integration test * update RFC * review comments * add integration test for isWriteBlockException # Conflicts: # rfcs/text/0013_saved_object_migrations.md * fix dataset for 7.14
#104760) * [SO Migration] fix reindex race on multi-instance mode (#104516) * fix reindex race condition * fix some IT tests * fix reindex cause detection * add integration test * update RFC * review comments * add integration test for isWriteBlockException # Conflicts: # rfcs/text/0013_saved_object_migrations.md * fix dataset for 7.15
…-of-max-results * 'master' of github.com:elastic/kibana: (36 commits) Lower Kibana app bundle limits (elastic#104688) [Security Solutions] Fixes bug with the filter query compatibility for transforms (elastic#104559) [RAC] Add mapping update logic to RuleDataClient (elastic#102586) Fix import workpad (elastic#104722) [canvas] Fix Storybook service decorator (elastic#104750) [Detection Rules] Add 7.14 rules (elastic#104772) [Enterprise Search] Fix beta notification in sidebar (elastic#104763) Fix engine routes that are meta engine or non-meta-engine specific (elastic#104757) [Fleet] Fix policy revision number getting bumped for no reason (elastic#104696) persistable state migrations (elastic#103680) [Fleet] Fix add agent in the package policy table (elastic#104749) [DOCS] Creates separate doc for security in production (elastic#103973) [SO Migration] fix reindex race on multi-instance mode (elastic#104516) [Security Solution] Update text in Endpoint Admin pages (elastic#104649) [package testing] Decrease timeout to 2 hours (elastic#104668) Fix background styling of waterfall chart sidebar tooltip. (elastic#103997) [Fleet + Integrations UI] Integrations UI Cleanup (elastic#104641) [Fleet] Link to download page of current stack version on Agent install instructions (elastic#104494) [Workplace Search] Fix Media Type field preview is unformatted bug (elastic#104684) [ML] add marker body (elastic#104672) ... # Conflicts: # x-pack/plugins/fleet/public/search_provider.test.ts
Summary
Fix #99211
Integration test inspired from #100171
When in multi-instance mode, the leading node adds the write-block on the temp index as soon as it finishes the source->temp reindex. This was potentially causing the trailing nodes to fail if they were still performing the same reindex, as the temp index was now write-blocked. If this had theoretically no impact on the migration (as long as the leading node successfully completed the next steps), it was causing the trailing nodes to fail, requiring a restart (and also outputting migration failure logs when the migration was probably correctly completed).
This PR fixes it by having the
bulkOverwriteTransformedDocuments
action identifycluster_block_exception
failures, and returning a specificleft
value in such case (instead of throwing), to allow the model to directly jump to the next step instead of failing.Checklist