CORE-7000: Node ID/UUID Override #22972
Conversation
/ci-repeat 1
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53293#019175d9-8ebf-41ec-acd5-31f4de51e6dd
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53720#01919bb1-fba1-44bd-86a9-7052d5c39c65
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53839#0191a50e-cce5-401b-b716-7e5fcc09da61
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/54169#0191d3a1-b4c7-4d85-ac8e-5eaffffd80d6
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/54331#0191e23c-3795-498f-bea1-c8d5609976d3
Force-pushed from 6b16c80 to 9d48a3f
/ci-repeat 1
@michael-redpanda - leaving this in draft pending final RFC signoff, but this is probably worth a look whenever you're ready 🙏
Force-pushed from 9d48a3f to 1f9aff4
/ci-repeat 1
looks really nice so far!
Force-pushed from 1f9aff4 to e0bea3a
force push contents:
/ci-repeat 1
Force-pushed from e0bea3a to 0e8c804
LGTM from docs
Force-pushed from 9474132 to b4656fd
force push contents:
}
static bool decode(const Node& node, type& rhs) {
    auto value = node.as<std::string>();
    auto out = [&value]() -> std::optional<model::node_uuid> {
nit: any reason for the "construct lambda then immediately call" pattern?
personal preference. "initialize this optional with the result of from_string or, if it throws, nullptr" feels a bit neater than "initialize this optional to nullptr then assign the result of from_string, unless it throws". The ID in the outer scope receives a value at exactly one code point.
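For readers skimming the thread, here is a minimal, self-contained sketch of that immediately-invoked-lambda pattern; parse_id is a hypothetical stand-in for the throwing from_string conversion, not the PR's actual helper:

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for a conversion that throws on bad input.
int parse_id(const std::string& s) {
    return std::stoi(s); // throws std::invalid_argument / std::out_of_range
}

std::optional<int> decode(const std::string& value) {
    // The optional in the outer scope receives a value at exactly one point:
    // either the parsed result, or nullopt if the conversion throws.
    auto out = [&value]() -> std::optional<int> {
        try {
            return parse_id(value);
        } catch (...) {
            return std::nullopt;
        }
    }();
    return out;
}
```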
@@ -2622,6 +2663,18 @@ void application::wire_up_and_start(::stop_signal& app_signal, bool test_mode) {
          "Running with already-established node ID {}",
          config::node().node_id());
        node_id = config::node().node_id().value();
    } else if (auto id = _node_overrides.node_id(); id.has_value()) {
I think it makes sense to try to register with the cluster (next conditional branch) in this case as well. For two reasons:
- it will get us fresh features and cluster config snapshots from the controller leader
- it will allow us to reject erroneous configurations (e.g. if the UUID is already registered with a different id).
Not sure if all the code for this is there, but looks like at least the join RPC supports passing existing node ids.
Makes sense. I'll refactor the conditionals a little bit.
Won't we hit the problem of needing a controller leader if we try to register here? Or do we also want to change members_manager::handle_join_request to not require controller leadership when the node is trying to register with a known (node uuid, node id) pair?
redpanda/src/v/cluster/members_manager.cc
Lines 1240 to 1298 in 74223be
members_manager::handle_join_request(join_node_request const req) {
    using ret_t = result<join_node_reply>;
    using status_t = join_node_reply::status_code;
    bool node_id_assignment_supported = _feature_table.local().is_active(
      features::feature::node_id_assignment);
    bool req_has_node_uuid = !req.node_uuid.empty();
    if (node_id_assignment_supported && !req_has_node_uuid) {
        vlog(
          clusterlog.warn,
          "Invalid join request for node ID {}, node UUID is required",
          req.node.id());
        co_return errc::invalid_request;
    }
    std::optional<model::node_id> req_node_id = std::nullopt;
    if (req.node.id() >= 0) {
        req_node_id = req.node.id();
    }
    if (!node_id_assignment_supported && !req_node_id) {
        vlog(
          clusterlog.warn,
          "Got request to assign node ID, but feature not active",
          req.node.id());
        co_return errc::invalid_request;
    }
    if (
      req_has_node_uuid
      && req.node_uuid.size() != model::node_uuid::type::length) {
        vlog(
          clusterlog.warn,
          "Invalid join request, expected node UUID or empty; got {}-byte "
          "value",
          req.node_uuid.size());
        co_return errc::invalid_request;
    }
    model::node_uuid node_uuid;
    if (!req_node_id && !req_has_node_uuid) {
        vlog(clusterlog.warn, "Node ID assignment attempt had no node UUID");
        co_return errc::invalid_request;
    }
    ss::sstring node_uuid_str = "no node_uuid";
    if (req_has_node_uuid) {
        node_uuid = model::node_uuid(uuid_t(req.node_uuid));
        node_uuid_str = ssx::sformat("{}", node_uuid);
    }
    vlog(
      clusterlog.info,
      "Processing node '{} ({})' join request (version {}-{})",
      req.node.id(),
      node_uuid_str,
      req.earliest_logical_version,
      req.latest_logical_version);
    if (!_raft0->is_elected_leader()) {
        vlog(clusterlog.debug, "Not the leader; dispatching to leader node");
        // Current node is not the leader have to send an RPC to leader
        // controller
        co_return co_await dispatch_rpc_to_leader(
Hmm good point, maybe it won't work then.
Yeah, the exact requirement we're trying to circumvent. I was wondering whether the cluster layer might short circuit somewhere (or be made to short circuit) on a previously known <ID, UUID> pair, but I suppose all the request routing is controller-leadership based 🤷
self.logger.debug(
    f"...and decommission ghost node [{ghost_node_id}]...")

self.admin.decommission_broker(ghost_node_id, node=to_stop[0])
nit: this won't actually wait for the decom to be successful.
if mode == TestMode.CFG_OVERRIDE:
    self.redpanda.restart_nodes(
        to_stop,
Nit: we are already kind of testing that the nodes will only adopt overrides that match the current uuid, but maybe it will be more realistic to do what k8s will do and perform a full rolling restart.
Fair point. Will do
I'm beginning to think that "rolling restart" (I used this language in the RFC and the runbook) is not quite accurate. In DT, for example, we have a RollingRestarter that a) uses maintenance mode by default and b) requires the cluster to be healthy before cycling each node. Presumably a k8s rolling restart looks somewhat similar.
Of course we can't meet either of those in this usage case. In DT we can fudge it with an unsafe param or something, but we may want to reconsider this framing with respect to support-facing docs.
It's worse than that, I think - in this test case, I believe a rolling restart is straightforwardly impossible, since restarting one node (with overrides) gives us a total of 2 live brokers - not enough to form a cluster due to the presence of ghost nodes.
My intuition is that we can start the nodes concurrently (as written) as long as the number of nodes we're restarting is <= (n_nodes + n_ghosts) / 2. What matters is that the empty nodes don't form a cluster among themselves, independent of any nodes containing a complete controller log, right?
Yeah, that makes sense to me. We want to restart the minimum number of ghost nodes to form a healthy cluster and wait for them to form a healthy controller quorum before continuing to restart the remaining nodes.
It makes sense to call out in the support docs not to use decommissioning or maintenance mode during any of this.
Perhaps the most important thing here is to not restart the healthy nodes to make sure they are available throughout the whole process.
> most important thing here is to not restart the healthy nodes

Yes, exactly. In fact, if we restart those nodes, I think they will get stuck just the same.

> call out in the support docs not to use decommissioning or maintenance mode

Yup, and that they are unavailable.
Generally, I think we're back to the point where we have a stack of assumptions that look right but would benefit from a ✅ from Alexey or @mmaslankaprv when he becomes available
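For what it's worth, here is a rough sketch of the majority arithmetic behind the "<= (n_nodes + n_ghosts) / 2" bound discussed above; the counts are made up for illustration and are not taken from the test:

```cpp
#include <cassert>

int main() {
    int n_nodes = 3;  // hypothetical: real brokers in the cluster
    int n_ghosts = 2; // hypothetical: ghost node IDs still registered in the controller group
    int voters = n_nodes + n_ghosts;          // 5 members from raft0's point of view
    int majority = voters / 2 + 1;            // 3 votes needed for a controller quorum
    int max_concurrent_restarts = voters / 2; // 2: wiped nodes stay short of a majority

    // The wiped/overridden nodes we restart concurrently must not be able to
    // form a controller quorum among themselves.
    assert(max_concurrent_restarts < majority);
    return 0;
}
```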
To enable boost::lexical_cast for program_options parsing and UUIDs in configs. Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
- config::node_id_override - config::node_override_store Includes json/yaml SerDes and unit tests Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
- node_id_overrides: std::vector<config::node_id_override> Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Force-pushed from b4656fd to 0ed66ce
force push CR changes, mostly hardening tests
Force-pushed from 0ed66ce to 2b81be0
Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
And wire them into the corresponding node configs. "--node-id-overrides uuid:uuid:id [uuid:uuid:id ...]" Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
and 'restart_nodes' Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Force-pushed from 2b81be0 to ace0d61
lgtm
brokers = self.admin.get_brokers(node=n)
for b in brokers:
    if b['node_id'] == node_id:
        results.append(b['membership_status'] != 'active')
nit: it is better to wait until the node fully disappears ("draining" membership status means that the decom just started).
/backport v24.2.x
/backport v24.1.x
Failed to create a backport PR to v24.1.x branch. I tried:
Implements on-startup node UUID and ID override via CLI options or node config, as described in 2024-08-14 - RFC - Override Node UUID on Startup.
This PR was originally meant as a fully functional proof of concept, but given the mechanical simplicity of the approach (most of the code here is tests and serdes for the new config types), it has been promoted to a full PR.
Closes CORE-7000
Closes CORE-6830
TODO:
Backports Required
Release Notes
Improvements