single-node deployment with bootstrap-in-place #565

Merged
merged 17 commits into from
Feb 4, 2021

Conversation

eranco74
Contributor

@eranco74 eranco74 commented Dec 13, 2020

As we add the new single-node production deployment, we need a way to install such a cluster without depending on an extra node for bootstrap.

This enhancement describes the flow for installing Single Node OpenShift using a live CD that performs the bootstrap logic and then reboots to become the single node.

@eranco74 eranco74 force-pushed the iBIP branch 2 times, most recently from c250c32 to 3345034 Compare December 13, 2020 18:19
@markmc
Contributor

markmc commented Dec 14, 2020

There's a lot to like here - the motivations/drawbacks/alternatives/etc are all well-captured, and the UX makes sense to me.

I think I'd need a POC to poke at the bootkube/cluster-bootstrap/static pods interactions intelligently. Can you link to the POC? (I see Eran's iBIP branches of openshift/installer and openshift/cluster-bootstrap, but I'd appreciate more hand-holding)

The fact that this doesn't address cloud use cases to begin with might be fine, but I think it would be worth describing whether we believe we can never address cloud with a model like this, or that we have a reasonable idea of how it can work and believe it should be done in a later iteration.

@romfreiman

romfreiman commented Dec 14, 2020

@dhellmann
Contributor

/retitle single-node deployment with bootstrap-in-place

@openshift-ci-robot openshift-ci-robot changed the title Add proposal for supporting single node production installation using… single-node deployment with bootstrap-in-place Dec 14, 2020
eranco74 and others added 2 commits January 27, 2021 21:34
Signed-off-by: Doug Hellmann <dhellmann@redhat.com>
@eparis
Member

eparis commented Jan 29, 2021

/approve

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 29, 2021
Contributor

@staebler staebler left a comment

/approve

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eparis, staebler

@dhellmann
Contributor

I thought the goal of this work was to bring the assisted install workflow and "regular" workflow closer together, but after discussing some implementation details with Rom today it became clear that the assisted installer would not be using the new openshift-install command described by this enhancement in any way. There are too many extra behaviors being managed by the agent on the host for it to be replaced by a simple bootstrap ignition, and both the bootstrap and master ignition artifacts are modified by the assisted service or agent before use.

If that's all correct, and CI and developer workflows would be significantly different, I'm no longer sure we need to do this work at all.

Do we need to rethink how we solve the CI and developer use cases? Or is this more useful than I've come to understand?

@romfreiman

I think it would be great to have this enhancement and everybody will gain from it being merged and implemented.
Executing coreos-installer is the least interesting and least meaningful part of this whole feature, and the way it is executed cannot affect the final result (assuming the input is valid :)).
The really innovative part, and also the tricky part, is a bootstrap process that can generate a master.ign that causes the node to reboot and become a fully functional SNO. How it is written to the disk is, I believe, a distraction from the whole discussion. coreos-installer is a tool that every UPI BM customer uses, and there are no challenges there.
Having this as an option in the openshift-installer will get 99.9% of the flow covered by CI, and will allow developers and some (non-sophisticated) users to achieve SNO without an external dependency (when one is not necessary, of course).

@markmc
Contributor

markmc commented Feb 3, 2021

In order for this feature to be meaningful to direct openshift-install users, there needs to be an automated way for the installation to be completed. It looks like we're close to consensus in the long comment thread on how to model the coreos-installer args.

However, what has emerged in that discussion is that the assisted installation workflow requires some way to generate the "enriched" master Ignition, but not execute coreos-installer. This is seen as objectionable because there is a reluctance to add a "don't run coreos-installer" interface.

I think many of us (me included) lack a detailed understanding of the assisted installer workflow and, in particular, the coupling of the assisted installer workflow to what are arguably implementation details of the openshift installer. I spent some time on this and captured some notes here: https://hackmd.io/@markmc/B1xPBZ_xu

I'm not sure whether we need to expand those notes further, and include them in this enhancement. Or perhaps a separate enhancement where we agree and track which of these openshift installer implementation details are valid for the assisted install workflow to depend on.

In any case, now that I understand the workflow, here's a pragmatic proposal ...

If bootstrap-in-place-for-live-iso.ign includes a new install-to-disk systemd unit to run coreos-installer after bootkube completes, then the assisted installer will be able to use the bootkube service to generate the enriched master Ignition and then run coreos-installer itself. In other words, the assisted workflow would just not use this install-to-disk service.

@dhellmann
Contributor

In order for this feature to be meaningful to direct openshift-install users, there needs to be an automated way for the installation to be completed. It looks like we're close to consensus in the long comment thread on how to model the coreos-installer args.

However, what has emerged in that discussion is that the assisted installation workflow requires some way to generate the "enriched" master Ignition, but not execute coreos-installer. This is seen as objectionable because there is a reluctance to add a "don't run coreos-installer" interface.

I also thought part of the point of this work was to start converging the assisted installer with the openshift installer, so that we only had 1 code path in production, even if different interfaces used different parts of that code path. So it's not just that we don't want to disable running the coreos-installer sometimes, it's that we want it to run the same way all the time. Is that not a goal?

I think many of us (me included) lack a detailed understanding of the assisted installer workflow and, in particular, the coupling of the assisted installer workflow to what are arguably implementation details of the openshift installer. I spent some time on this and captured some notes here: https://hackmd.io/@markmc/B1xPBZ_xu

I'm not sure whether we need to expand those notes further, and include them in this enhancement. Or perhaps a separate enhancement where we agree and track which of these openshift installer implementation details are valid for the assisted install workflow to depend on.

I left a couple of questions there before remembering that hackmd doesn't send notifications.

  1. The doc says "After generating the master Ignition file, and before rebooting, the assisted installer uploads logs and sends a status update to the assisted service."

    If the reboot was a separate systemd unit, could the log & status upload be added by the assisted installer after extracting the ignition config?

  2. The doc says "It also has specifies its own arguments to coreos-installer."

    What information is it passing to coreos-installer that wouldn't be available at the time the bootstrap-in-place-for-live-iso.ign file is generated?

In any case, now that I understand the workflow, here's a pragmatic proposal ...

If bootstrap-in-place-for-live-iso.ign includes a new install-to-disk systemd unit to run coreos-installer after bootkube completes, then the assisted installer will be able to use the bootkube service to generate the enriched master Ignition and then run coreos-installer itself. In other words, the assisted workflow would just not use this install-to-disk service.

That adds another implementation detail to the dependency list, which may be necessary depending on the answers to the questions above.

@openshift openshift deleted a comment from romfreiman Feb 3, 2021
@markmc
Contributor

markmc commented Feb 3, 2021

In order for this feature to be meaningful to direct openshift-install users, there needs to be an automated way for the installation to be completed. It looks like we're close to consensus in the long comment thread on how to model the coreos-installer args.
However, what has emerged in that discussion is that the assisted installation workflow requires some way to generate the "enriched" master Ignition, but not execute coreos-installer. This is seen as objectionable because there is a reluctance to add a "don't run coreos-installer" interface.

I also thought part of the point of this work was to start converging the assisted installer with the openshift installer, so that we only had 1 code path in production, even if different interfaces used different parts of that code path. So it's not just that we don't want to disable running the coreos-installer sometimes, it's that we want it to run the same way all the time. Is that not a goal?

I dunno, it seems clear (based on how the POC implementation was integrated into the assisted workflow) that there was no intention as part of this work to move away from the model of assisted-installer using a workflow of "download and extract the bootstrap ignition, run bootkube, and then pivot by writing the master ignition and rebooting"

I'd be hard pressed to say that "strive to reuse existing code and should not affect existing deployment flows" was intended to mean broader convergence of assisted vs openshift installers

I think many of us (me included) lack a detailed understanding of the assisted installer workflow and, in particular, the coupling of the assisted installer workflow to what are arguably implementation details of the openshift installer. I spent some time on this and captured some notes here: https://hackmd.io/@markmc/B1xPBZ_xu
I'm not sure whether we need to expand those notes further, and include them in this enhancement. Or perhaps a separate enhancement where we agree and track which of these openshift installer implementation details are valid for the assisted install workflow to depend on.

I left a couple of questions there before remembering that hackmd doesn't send notifications.

  1. The doc says "After generating the master Ignition file, and before rebooting, the assisted installer uploads logs and sends a status update to the assisted service."
    If the reboot was a separate systemd unit, could the log & status upload be added by the assisted installer after extracting the ignition config?

It sounds plausible, but quite a challenging change to make - splitting out the log uploading and status updating code into a new binary, passing it credentials, etc.

And I would guess I'm oversimplifying what this separate service would need to do

  1. The doc says "It also has specifies its own arguments to coreos-installer."
    What information is it passing to coreos-installer that wouldn't be available at the time the bootstrap-in-place-for-live-iso.ign file is generated?

Yeah, I haven't yet investigated exactly which coreos-installer args the assisted service sends to assisted-installer, or why.

In any case, now that I understand the workflow, here's a pragmatic proposal ...
If bootstrap-in-place-for-live-iso.ign includes a new install-to-disk systemd unit to run coreos-installer after bootkube completes, then the assisted installer will be able to use the bootkube service to generate the enriched master Ignition and then run coreos-installer itself. In other words, the assisted workflow would just not use this install-to-disk service.

That adds another implementation detail to the dependency list, which may be necessary depending on the answers to the questions above.

Right, the new dependency is that running bootkube.service from bootstrap-in-place-for-live-iso.ign will generate an enriched master ignition at /opt/openshift/master.ign

@dhellmann
Contributor

If bootstrap-in-place-for-live-iso.ign includes a new install-to-disk systemd unit to run coreos-installer after bootkube completes, then the assisted installer will be able to use the bootkube service to generate the enriched master Ignition and then run coreos-installer itself. In other words, the assisted workflow would just not use this install-to-disk service.

That adds another implementation detail to the dependency list, which may be necessary depending on the answers to the questions above.

Right, the new dependency is that running bootkube.service from bootstrap-in-place-for-live-iso.ign will generate an enriched master ignition at /opt/openshift/master.ign

Also that the bootstrap-in-place-for-live-iso.ign may contain an install-to-disk service that must not perform any steps other than installing the OS (no merging of ignition files, etc.). And that service cannot be renamed because the assisted installer needs to know the name to disable it.

@eranco74
Contributor Author

eranco74 commented Feb 3, 2021

However, what has emerged in that discussion is that the assisted installation workflow requires some way to generate the "enriched" master Ignition, but not execute coreos-installer. This is seen as objectionable because there is a reluctance to add a "don't run coreos-installer" interface.
I think that adding the install-to-disk systemd unit you suggested can mitigate that.

I also thought part of the point of this work was to start converging the assisted installer with the openshift installer, so that we only had 1 code path in production, even if different interfaces used different parts of that code path. So it's not just that we don't want to disable running the coreos-installer sometimes, it's that we want it to run the same way all the time. Is that not a goal?

I agree that for the single-node use case we should have a single way of doing things.
I don't think that this enhancement is relevant for converging the assisted installer with the openshift installer for multi-node installation.

  1. The doc says "After generating the master Ignition file, and before rebooting, the assisted installer uploads logs and sends a status update to the assisted service."
    If the reboot was a separate systemd unit, could the log & status upload be added by the assisted installer after extracting the ignition config?

Yes, this seems like a good option, the assisted-installer will use the same code by starting the relevant services once it's ready (same as it does for bootkube, approve-csr, etc today)

  1. The doc says "It also has specifies its own arguments to coreos-installer."
    What information is it passing to coreos-installer that wouldn't be available at the time the bootstrap-in-place-for-live-iso.ign file is generated?

All options are available before starting the installation so that shouldn't be an issue.

@markmc
Contributor

markmc commented Feb 3, 2021

If bootstrap-in-place-for-live-iso.ign includes a new install-to-disk systemd unit to run coreos-installer after bootkube completes, then the assisted installer will be able to use the bootkube service to generate the enriched master Ignition and then run coreos-installer itself. In other words, the assisted workflow would just not use this install-to-disk service.

That adds another implementation detail to the dependency list, which may be necessary depending on the answers to the questions above.

Right, the new dependency is that running bootkube.service from bootstrap-in-place-for-live-iso.ign will generate an enriched master ignition at /opt/openshift/master.ign

Also that the bootstrap-in-place-for-live-iso.ign may contain an install-to-disk service that must not perform any steps other than installing the OS (no merging of ignition files, etc.). And that service cannot be renamed because the assisted installer needs to know the name to disable it.

The assisted-service has an explicit list of services it will start, so it doesn't need to know to disable this new service:

        servicesToStart := []string{"bootkube.service", "approve-csr.service", "progress.service"}
        for _, service := range servicesToStart {
                err = i.ops.SystemctlAction("start", service)
                // (error handling elided in this excerpt)
        }

(Not disagreeing with the general point that dependencies like this are not ideal, just clarifying the precise implications of this one)
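
The allow-list behaviour described here can be sketched as a self-contained program. The `ops` interface and `recordingOps` type below are hypothetical stand-ins for the real systemctl wrapper, and `startServices` is an illustrative helper, not the assisted installer's actual function. The point is only that a unit absent from the list, such as install-to-disk.service, is never started by this code path.

```go
package main

import "fmt"

// ops is a hypothetical stand-in for the installer's systemctl wrapper.
type ops interface {
	SystemctlAction(action string, args ...string) error
}

// recordingOps records requested actions instead of calling systemctl,
// so the sketch can run anywhere.
type recordingOps struct {
	calls []string
}

func (r *recordingOps) SystemctlAction(action string, args ...string) error {
	for _, unit := range args {
		r.calls = append(r.calls, action+" "+unit)
	}
	return nil
}

// startServices starts exactly the listed units and stops at the first
// failure. Units that are not listed are left untouched.
func startServices(o ops, services []string) error {
	for _, service := range services {
		if err := o.SystemctlAction("start", service); err != nil {
			return fmt.Errorf("starting %s: %w", service, err)
		}
	}
	return nil
}

func main() {
	rec := &recordingOps{}
	servicesToStart := []string{"bootkube.service", "approve-csr.service", "progress.service"}
	if err := startServices(rec, servicesToStart); err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, call := range rec.calls {
		fmt.Println(call)
	}
}
```

This assumes the units are only ever started explicitly, as in the excerpt above; it says nothing about units that systemd would start on its own when a host boots directly from the Ignition config.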

@markmc
Contributor

markmc commented Feb 3, 2021

  1. The doc says "After generating the master Ignition file, and before rebooting, the assisted installer uploads logs and sends a status update to the assisted service."
    If the reboot was a separate systemd unit, could the log & status upload be added by the assisted installer after extracting the ignition config?

Yes, this seems like a good option, the assisted-installer will use the same code by starting the relevant services once it's ready (same as it does for bootkube, approve-csr, etc today)

We forgot about the "merge enriched ignition with host ignition" step, which has to happen before coreos-installer runs.

Can you help us understand what sort of edits the assisted installer needs to make to the enriched ignition? I found set requested hostname but that doesn't help with the bigger picture

@staebler
Contributor

staebler commented Feb 3, 2021

  1. The doc says "After generating the master Ignition file, and before rebooting, the assisted installer uploads logs and sends a status update to the assisted service."
    If the reboot was a separate systemd unit, could the log & status upload be added by the assisted installer after extracting the ignition config?

Yes, this seems like a good option, the assisted-installer will use the same code by starting the relevant services once it's ready (same as it does for bootkube, approve-csr, etc today)

We forgot about the "merge enriched ignition with host ignition" step, which has to happen before coreos-installer runs.

Can you help us understand what sort of edits the assisted installer needs to make to the enriched ignition? I found set requested hostname but that doesn't help with the bigger picture

What types of items need to be in the enriched ignition instead of being handled by a MachineConfig?

@romfreiman

@matthewcarleton it is used by BM customers for various adjustments: file system customization, advanced networking, and other changes that should happen before the Ignition is applied.
In AI, we provide an API that gives customers this flexibility.
But this is not only about master.ign customization. AI also relies on the fact that it executes coreos-installer itself. For example:

  1. We adjust installation timeouts according to the disk write speed. We saw that writing can take 30 minutes or more in certain cases, so we keep tracking the progress and adjusting the timeouts on the fly.
  2. We provide full progress reporting to the user: they can see, for example, that 53% of the content has already been written. It appears as AI events. Just try to deploy an AI-based cluster and you'll see the experience.
  3. We monitor all our installations and plan to adjust the behaviour according to the metrics. For example, we have data on how long it takes to write CoreOS to disk by hardware type, disk type, and so on.

Hope it provides more information.

@romfreiman

I almost forgot - there is also the minimal ISO, which is a MUST requirement for 4.8.

…temd

Signed-off-by: Doug Hellmann <dhellmann@redhat.com>
in the `install-config.yaml`, using a schema like

```yaml
coreOSInstallation:
```
Contributor

I approve of these changes. I would like to reserve the right to adjust the name of this field to bring it out of the implementation domain and into the user domain. I do not want to hold up merging this PR while we debate this name, however. We can work on this name as we work on adding additional fields, with the understanding that the name needs to be finalized soon (certainly before we ship 4.8).

Contributor

Thanks for noting.

IMO, I think that makes sense in general for all enhancements - we should bias towards merging and iterating on the details, once the fundamentals have been worked through. The fundamental here was the need for an API and its requirements, not the details of its design.

@romfreiman

romfreiman commented Feb 4, 2021 via email

@markmc
Contributor

markmc commented Feb 4, 2021

Thanks all, I think this approach of adding an install-config section for coreos-installer config, and adding a new systemd unit for running coreos-installer, is a pragmatic and important solution.

The note on "Tight coupling of assisted installer and OpenShift installer implementation" is an important observation too - I think we should discuss how to follow up on this in the near future.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 4, 2021
@openshift-merge-robot openshift-merge-robot merged commit 7ef84c9 into openshift:master Feb 4, 2021
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.