v5.0.0: backup/restore apps, overhaul TTS, add new node to existing cluster + 🐛 fixes #210

Merged: jessebot merged 346 commits into main from feat/restore-app on May 15, 2024

jessebot
Collaborator

@jessebot jessebot commented Apr 12, 2024

Features

Sensitive Values Overhaul

  • You can now specify your own sensitive values via environment variables using the new value_from map for both apps.APP_NAME.init.values and any value under apps.APP_NAME.backups
    • Bitwarden also has initial support as a source, though it's still untested, and OpenBao support is coming
    • the apps.APP_NAME.init.sensitive_values list has been removed
    • example of providing a sensitive value for init values:
apps:
  nextcloud:
    init:
      values:
        admin_user: nextcloud_admin_user
        smtp_user: my-smtp-username
        smtp_host: my-smtp-host.com
        # this value is taken from an external source
        smtp_password:
          value_from:
            # you can change this to any environment variable present at the time of running smol-k8s-lab
            env: NC_SMTP_PASSWORD

NOTE: the TUI doesn't support setting sensitive values via value_from right now. It will just pull your sensitive value and display it as dots. Support for setting value_from in the TUI will come at a later date.
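
As a rough illustration of how a value_from entry like the one above could be resolved, here is a minimal Python sketch. The helper name resolve_value and the sample data are hypothetical and are not smol-k8s-lab's actual implementation; it only handles the env source shown in the example.

```python
import os


def resolve_value(value, environ=None):
    """Return a plain value as-is, or look up a value_from.env reference.

    Hypothetical helper for illustration only; not the project's real code.
    """
    environ = os.environ if environ is None else environ

    # plain values (e.g. smtp_user: my-smtp-username) pass through unchanged
    if not isinstance(value, dict) or "value_from" not in value:
        return value

    source = value["value_from"]
    env_var = source.get("env")
    if env_var is None:
        raise ValueError(f"unsupported value_from source: {sorted(source)}")
    if env_var not in environ:
        raise KeyError(f"environment variable {env_var} is not set")
    return environ[env_var]


# sample init values mirroring the YAML above, with a fake environment
init_values = {
    "smtp_user": "my-smtp-username",
    "smtp_password": {"value_from": {"env": "NC_SMTP_PASSWORD"}},
}
fake_env = {"NC_SMTP_PASSWORD": "example-password"}
resolved = {key: resolve_value(val, fake_env) for key, val in init_values.items()}
```

Passing the environment in as a parameter keeps the sketch easy to test without touching the real process environment.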

Backups and Restores

Currently supported only for a handful of apps (nextcloud, matrix, mastodon, home assistant, and zitadel), but more are coming soon!

  • Added support for seaweedfs/app specific pvc restores, and postgresql restores.
  • Added support for specifying backup times (in cron syntax)
  • Also added support for Nextcloud maintenance window time (cron syntax for both start and end time)
  • We now support a global StorageClassName called apps_global_config.pvc_storage_class in the yaml (used by default for nextcloud and matrix right now)
  • Each app that supports backups and restores now has a tab for both on the apps screen and a section in the config.yaml.
  • The backup config.yaml section looks like this:
apps:
  nextcloud:
    backups:
      # cronjob syntax schedule to run nextcloud pvc backups
      pvc_schedule: 10 0 * * *
      # cronjob syntax (with SECONDS field) for nextcloud postgres backups
      # must happen at least 10 minutes before pvc backups, to avoid corruption
      # due to missing files. This is because the cnpg backup shows as completed
      # before it actually is, due to the wal archive it lists as its end not
      # being in the backup yet
      postgres_schedule: 0 0 0 * * *
      s3:
        # these are for pushing remote backups of your local s3 storage, for speed and cost optimization
        endpoint: s3.eu-central-003.backblazeb2.com
        bucket: my-nextcloud-bucket
        region: eu-central-003
        secret_access_key:
          value_from:
            env: NC_S3_BACKUP_SECRETKEY
        access_key_id:
          value_from:
            env: NC_S3_BACKUP_ACCESS_ID
      restic_repo_password:
        value_from:
          env: NC_RESTIC_REPO_PASS
  • The restore section looks like this:
apps:
  nextcloud:
    init:
      enabled: true
      # this is the restore section, as it's a type of initialization
      restore:
        enabled: true
        # for postgresql cluster restores using the cloud native postgresql operator
        cnpg_restore: true
        # these can all be set to any restic snapshot ID (long or short), but they default to latest
        restic_snapshot_ids:
          seaweedfs_volume: latest
          seaweedfs_filer: latest
          seaweedfs_master: latest
          nextcloud_files: latest

Overhaul the text to speech features to be their own widget called SmolAudio

  • pre-generated audio files for each common thing that would need to be said, but you can still use your own TTS CLI if you'd like by providing smol_k8s_lab.tui.accessibility.text_to_speech.speech_program
  • sets a default language of "en" for english, but dutch (nl) is also partially available in the config.yml
    • creates a config/audio/en.yaml and config/audio/nl.yaml for custom titles, descriptions, phrases, and common words
  • you can now enable and disable screen titles and/or descriptions separately
  • creates a small dev-only program called smol-tts for generating text to speech audio files for each language
  • tab names for tabbed content, switch values, and drop down menu values are now read out when pressing F5
  • we've re-created all the screen descriptions and separated out screen titles from screen descriptions

New nodes for k3s after cluster is already up

  • You can now specify an SSH port when adding new nodes to k3s installs (defaults to 22)
  • Adds a "modify nodes" button to the cluster modal screen for k3s, which opens a new screen where you can add nodes after the cluster is already installed

Release process

We will now attempt to release an appimage each time we release :) This will assist in ensuring the brew install goes more smoothly in the future. This has never been done for this project before, so we expect some initial growing pains on this. Please be patient as we get a consistent appimage.

Misc

  • we now support python 3.12!
  • clusters datatable now displays the OS platform and the kubernetes version for each row
  • in the TUI, there are now sync/delete links baked into the bottom border for each app that is available in ArgoCD and enabled. Closes #109 (Delete app via the TUI)
  • TUI config screen's k9s section is removed entirely and the whole accessibility section is now at the top of the screen
  • The k9s section of both the TUI and the CLI has been replaced with a run_command section that allows you to run commands either during smol-k8s-lab's app config phase, or after it. It looks like this:
smol_k8s_lab:
  run_command:
    # command to run after smol-k8s-lab tui is done or immediately when running
    command: k9s --command applications.argoproj.io
    # tell me which terminal you use if you'd like to use split or tab features, only supports wezterm and zellij right now. submit issue/PR for more options :)
    terminal: wezterm
    # where to run the command, options: same window, new window, new tab, split left, split right, split top, split bottom
    # if set to "same window", we just run the command in the same window after we're done with the entire smol-k8s-lab cli run
    window_behavior: split right
apps:
  zitadel:
    argo:
      # git repo to install the Argo CD app from
      repo: https://github.com/small-hack/argocd-apps
      # path in the argo repo to point to. Trailing slash very important!
      path: zitadel/app_of_apps/
      # either the branch or tag to point at in the argo repo above
      revision: add-pvc-helm-chart-for-nextcloud
      # kubernetes cluster to install the k8s app into, defaults to Argo CD default
      cluster: https://kubernetes.default.svc
      # namespace on destination cluster to install the k8s app in
      namespace: zitadel
      # recurse directories in the provided git repo
      directory_recursion: false
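
To make the run_command options above concrete, here is a small Python sketch that checks a run_command mapping against the terminals and window behaviors listed in the config comments. The function name, the "same window" default, and the rule that every other behavior needs a supported terminal are assumptions for illustration, not confirmed smol-k8s-lab behavior.

```python
# values taken from the comments in the run_command example
SUPPORTED_TERMINALS = {"wezterm", "zellij"}
WINDOW_BEHAVIORS = {
    "same window", "new window", "new tab",
    "split left", "split right", "split top", "split bottom",
}


def check_run_command(run_command):
    """Return a list of problems found in a run_command mapping (empty if none).

    Hypothetical validator; the default behavior is an assumption.
    """
    problems = []
    if not run_command.get("command"):
        problems.append("command is empty")

    behavior = run_command.get("window_behavior", "same window")
    if behavior not in WINDOW_BEHAVIORS:
        problems.append(f"unknown window_behavior: {behavior}")
    # assumed: anything other than "same window" needs a supported terminal
    elif behavior != "same window" and run_command.get("terminal") not in SUPPORTED_TERMINALS:
        problems.append("terminal must be wezterm or zellij for this window_behavior")
    return problems


example = {
    "command": "k9s --command applications.argoproj.io",
    "terminal": "wezterm",
    "window_behavior": "split right",
}
example_problems = check_run_command(example)
```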

Bug Fixes

  • CNPG operator is now installed during the operator phase of installs
  • print element web interface URL instead of matrix URL at the end of the run
  • fix defaults for Bitwarden CLI env vars: BW_HOST, BW_SESSION
  • fix bug where the unlock never happened when completely unauthenticated to bitwarden. Now, if not authenticated, we authenticate and then unlock the vault.
  • fix bug where if apps_global_config.external_secrets was set to bitwarden, the bitwarden credentials were requested even if password manager was disabled AND the external secrets operator app was disabled
  • generic device plugin could crash if it didn't have enough memory, so we switched to the helm chart for the same app
  • changes button at bottom of new nodes box to say "➕ node" for "add remote nodes" tab
  • new node and new option key bindings were previously not accessible if you were already focused on an input field. to solve this, we've changed that key binding to be ctrl+n so it's always something that can be pressed
  • changes the cancel button on the cluster modal to be in the bottom border of the screen to preserve screen real estate
  • fixes padding and gutters for the css of the add nodes tab and screen for k3s
  • fixed issue where when screen titles were enabled, they were read before the widgets on the screen were fully loaded

Misc changes

  • Cleaned up a bunch of whitespace
  • add some more comments to the default config.yaml
  • updates default input_field function to include a default empty validator
  • adds name of cluster being edited to the title of most screens
  • did a cleanup pass of the docs website

outstanding tasks

  • document backups with nextcloud and matrix
  • document restore process with nextcloud and matrix
  • add contributing page and planned/requested features page to docs
  • update generic device plugin docs
  • verify/create bitwarden docs

This PR will be merged in conjunction with: small-hack/argocd-apps#695

@jessebot jessebot added ✨ enhancement New feature request 🩹 Bug Fix labels Apr 12, 2024
@jessebot jessebot requested a review from cloudymax April 12, 2024 10:54
@jessebot jessebot self-assigned this Apr 12, 2024
@jessebot jessebot changed the title Feature: restore app + some 🐛 fixes Features: restore app, add new node to existing cluster + some 🐛 fixes Apr 14, 2024
@jessebot jessebot changed the title Features: restore app, add new node to existing cluster + some 🐛 fixes Features: restore app, add new node to existing cluster + 🐛 fixes Apr 19, 2024
@jessebot jessebot changed the title Features: restore app, add new node to existing cluster + 🐛 fixes Features: restore app, overhaul TTS, add new node to existing cluster + 🐛 fixes Apr 19, 2024
@jessebot jessebot changed the title Features: restore app, overhaul TTS, add new node to existing cluster + 🐛 fixes New Major Version: backup/restore apps, overhaul TTS, add new node to existing cluster + 🐛 fixes May 9, 2024
@jessebot
Collaborator Author

just some docs to do, and then we're all set :D

@jessebot jessebot marked this pull request as ready for review May 15, 2024 08:53
@jessebot jessebot changed the title New Major Version: backup/restore apps, overhaul TTS, add new node to existing cluster + 🐛 fixes v5.0.0: backup/restore apps, overhaul TTS, add new node to existing cluster + 🐛 fixes May 15, 2024
@jessebot jessebot merged commit 857a22f into main May 15, 2024
4 checks passed
@jessebot jessebot deleted the feat/restore-app branch May 15, 2024 10:05
jessebot added a commit that referenced this pull request May 15, 2024
… cluster + 🐛 fixes (#210)

* actually apply the seaweedfs appset after restoring the seaweedfs PVCs

* fix subproc calls for recovery job checking

* refine postgres recovery job checking a bit more

* add some color to success and failure reporting in the logs and allow more restore job checking to fail

* drastically simplify how we check that the recovery job is done by just waiting on it

* put restore into its own tab

* make sure if we can't get the deployment immediately for nextcloud, we keep trying

* add restic to required docs

* fix cap header for matrix

* fixing matrix restores and parameterizing more of nextcloud restores

* switch to using argocd as an object

* updating poetry lock file

* fix typo of recusion to recursion

* start fleshing out the new backup button to do restic pvc backups

* add cnpg backups to default supported backups

* rig up backup button for on demand backups ❇️

* databases exist outside of nextcloud

* simplify nextcloud occ commands in backup.py

* change name from postgresql to postgres

* overhaul of backups and restores. backups via the tui should work for nextcloud now

* move value_from function out of tui widget and into general utils as smol_k8s_lab.utils.value_from.extract_secret

* update value from function to do more error checking

* use backups section instead of secret keys but still update secret keys in appset secret plugin secret

* move repetitive backup processing to value_from lib and have nextcloud use it

* setup backups and restores for matrix

* finish up initial setting up of backup and restore functions for zitadel, mastodon, matrix, nextcloud, and home assistant

* catch error of being unable to get serverVersion when docker is not enabled and the cluster is k3d or kind. we now log an error and suggest enabling docker but set platform and version to unknown

* should be serverVersion not semverVersion

* generate audio for macos 64 bit arm, and unknown cluster versions

* make backup jobs unique by giving them timestamps

* update poetry lock

* finally found the perfect kubectl cmd to make backup button finish

the backup button wasn't finishing because the job wait command was timing out. set timeout to 15m because backup can take a really long time depending on how much data you are backing up and what your connection speeds are

* split off trigger_backup into its own worker method

* change where we declare the hostname for zitadel

* fix up disabling of displayed rows for restore widget

* fix home assistant header and vouch's too, but also clean up unused keycloak stuff in vouch

* fix display of snapshot grid on start if we have restores disabled

* cache getting restore_enabled and snapshots out of dicts into self for restores widget

* change RestoreAppConfig to RestoreApp and change references to restic_snapshot_ids to snapshots within that widget

* update backup tooltips and try to speed up mounting

* never print output from create secret unless there's an error

* fix variable names for vouch and comment out more keycloak stuff

* update how we do zitadel headers so we talk about explicitly syncing vs setting up zitadel

* quietly do backups in the background via the backup widget

* display an orange loading indicator while we do the backup in the background

* fix more places where we don't need spinner if this is called from the tui

* fix notify spacing and add tooltip to loading indicator for backups in tui

* speed up input widget a tad

* clean up names of smol-k8s-lab generated backups and further clean up backup notifications in tui

* fix color of header rows

* fix schedule name input for backup widget

* fix OAuth typo

* add backup credentials to default generated home assistant credentials

* fix home assistant s3 backups credentials

* fix tool tips for s3 backups section in tui to display key instead of value

* catch issue where sometimes cnpg restore is not possible at all

* fix issue where we were using _ instead of - for home assistant backups and restores

* catch more issues with _ vs -

* create home assistant pvc with new pvc capacity

* update constants for smol-tts to output audio to a config directory

* allow for running using integrated macos gpu when on arm64 machine types, else, check for cuda, and if not cuda use cpu for torch

* update poetry.lock for a mac

* only generate audio file if the old one doesn't exist

* update smol-tts to do more checking before regenerating an mp3

* fix underscore to hyphen issue, again, with home assistant

* fix restic repo password for process backup vals func

* fix backup schedule appset secret plugin updates

* always apply the external secrets for home assistant restores

* fix all the references to external_secrets_appset.yaml to be external_secrets_argocd_appset.yaml

* update to use pyglet instead of playsound or playaudio

* update poetry lock

* switching to pyglet everywhere

* add a plain non-k8up restic restore job and a recreate_pvc function to share between that and the seaweedfs pvc creation

* add timestamps to restore jobs and mount_path to plain restic restore job

* add a wait section to restore plain restic job function

* reload home assistant deployment after we restore it and template out the home assistant namespace for restores

* allow always restoring home assistant, even if it's already installed and running

* optimize getting deployments and pod names and always use defined argocd namespace for appset secret plugin

* fix create_resitc_restore_job typo to be create_restic_restore_job

* switch to using sync argocd app instead of refreshing deployment for appset secret plugin

* need to pass in HOME to get restic snapshots, need to pass in namespace to put the restore job in the right place

* fix where we get home assistant namespace and fix occurrences of tolerations_ to be toleration_ for all variables

* adding namespace to getting pod names and making sure to not get list index of pods unless the list is populated

* fix where we get sensitive values, and make sure we get restic_repo_password with a default value

* switching to pygame for audio because nothing else is consistent

* debian: verify pygame is now working appropriately for audio in the tui

* remove commented cruft

* add delete app button

* fix delete button spacing

* fix restore_seaweedfs call for nextcloud and allow rollout check to fail

* add some more logging for syncing and deleting apps

* add some error catching for if we can't find a nextcloud pod, and use our K8s lib for getting the pod

* add some more logging and checking around restores and use re-usable function for restoring app PVCs for matrix and nextcloud

* catch issue where sometimes a snapshot ID is only numbers, so we convert the int to a str

* restores: label values file with app name, reuse barman object for cnpg restore, remove trailing slash from s3 bucket destination for cnpg

* name the cnpg cluster the same as the end result when recovering

* allow anything with postgres-cluster to grab the cnpg-cluster targetRevision from argocd

* don't require getting pod to finish with return for nextcloud, add a timeout of 30 minutes to the postgres restore job

* allow extra labels for getting pod name

* update how we fix maintenance mode for nextcloud after restore

* make recovery backup and scheduled backup sections for cnpg {} instead of [] and use copy of barman_obj for recovery

* fix incorrect username used for restores of cnpg

* clean up unused values for cnpg operator

* simplify the restore dict updates after restore for cnpg cluster

* try installing alsa for linux ci

* add docs about installing alsa on debian

* attempt to get alsa working via ci

* only mess with secrets if matrix's restore is enabled

* fix post restore job for cnpg

* fix matrix namespace declaration

* simplify updating matrix pvc during restores by templating the pvc name

* set externalClusters to [] after restore of cnpg cluster

* add wal parallel back to backups and compress the restore dict a bit

* adding gzip and maxparallel 8 for wal archive for cnpg restores

untested with matrix or nextcloud

* try just disabling mixer if audio device not enabled

* add log message of no audio device found

* remove gzip from wal archives for cnpg restores

* adjust wal archives to be 4 at once instead of 8

* max said they would order pizza when this was working :fingers_crossed:

* move minio_lib to utils and add get_object and list_object methods, then make sure we pass in the backup id to restores

* always make sure the final wal archive is there for backups of cnpg databases

* update ArgoCD to have optional k8s requirement

* add backup credentials getting

* always grab the s3 endpoint if cnpg restores are enabled

* only use ArgoCD in apps_screen if this is an existing cluster

* fix namespace missing from backup

* add .decode('utf-8') to get str of pgsql s3 creds

* don't show backup now button unless this is an existing cluster

* update backups to always check for end wal file for cnpg, and clean up backups tui

* check in attempts to make restores work again

* wait an additional 30 seconds on that backup just in case

* wait for s3 to be up before applying recovery job for cnpg operator, and always download the backup.info

* retry syncs if they fail

* immediately install the argocd appset plugin before argo is fully managed by itself

* update install the argocd appset plugin

* add more logging for restores and call it restore_cnpg_cluster instead of restore_postgresql

* maybe fix appset secret plugin url

* fix missing updates of s/restore_postgresql/restore_cnpg_cluster/

* updating argocd appset plugin to create the argocd project ahead of time

* actually break out of loop checking for s3 being up for cnpg restores

* make sure to get the correct source_repos for the project, and properly template all the namespaces too

* fix unexpected key error for vouch

* switching back to immediate restore and adding more backup safety checks

* change how we do backups for cnpg to always wait till we can consistently get the correct wal, this time for real, we hope

* add a bit of a pause between checks for the backup.info file in s3 for cnpg restores

* use new kwargs format for helm class

* accommodate postgres and pvc schedule settings

* switch from s3_user and s3_pass to secretAccessKey and accessKeyId

* finish up standardizing s3 credentials

* don't check in any logs we generate locally

* simplify restores everywhere and always take postgres schedule for restores as a variable

* add a basic wait command for kubernetes and make sure we wait for seaweedfs to be fully up before continuing restore process

* fix default config to run postgres backups at midnight and file backups at midnight ten

* keep trying after a wait fails to find resources for smol-k8s-lab

* allow waits to fail for k8s and set loglevel to warn for argocd app wait/sync

* try to fix calls to helm lib

* add comment about what we're doing in argo setup func

* don't show the hello from pygame message

* always ignore the main en and nl dirs

* always ignore the full audio files

* update appimage creation process

* linting and commenting

* add tar and untar commands

* print how long it took in both english and dutch, change order of checking which options were passed in for smol-tts

* add a keys section and update unknown version text

* add project name for argocd tests

* add argocd_config['argo']['cluster'] for all the ci tests

* speed up init values loading by switching away from a collapsible

* clean up colors

* change green to explicit hex

* fix mastodon restore error of too many arguments

* fix space typo for argocd app sync command

* optimize tui loading for apps screen a bit more

* add first app is audio for selection list

* add first app is audio for selection list

* cleaning up and refactoring for speed and audio in tui

* remove a layer of vertical scroll container for apps screen

* update how we deal with unfound audio files; also add additional phrases; also fix the scroll bars and nested containers

* tidying help text screenshot

* move k9s to be run command and move (and subproc) under utils.run

also do minor refactor of both smol-k8s-lab and tui-config screens

* clean up run command some more and upgrade textual

* cleaning up screenshots

* finally finish up final_command styling and option selection

* change all - commands to have spaces instead for run_command

* actually insert the final command

* fix option evaluation for final run command

* fix ci tests to include final command test and make sure we accept same window as option for window behavior

* update credentials screen sizing and screenshot

* change size of apps config modify globals button

* add id to modify globals button so we can use it for tcss queries. add new screenshot of apps screen

* update apps screen screenshot again

* cleaning up a bit

* more cleaning

* update existing cluster screenshot

* add new start screen with existing cluster example

* update start screen screenshot

* add better logging password config and run command screenshot

* fix modal screen buttons for some font types

* update new node widget and screen

* adding new screenshots for new node widget and new nodes screen

* add modify global params modal screen screenshot

* update make screenshots script

* add modify node modal screen

* tidy up the audio for node modification screens

* add delete node modal screenshot

* docs: add new apps screen screenshots, linting, replace jessebot with small-hack org

* linting and updating descriptions

* update the add remote node screenshot and alt text

* update tui screenshots and config file examples

* add cluster parameter to all apps and change ref to revision anywhere that was left over

* update the backup sections of all the backup supported apps and also all the sensitive values for all the supported apps, and update the libraries and format of default landing page

* do a minor clean up of all the experimental apps

* add new input names for k3d and kind node inputs for audio

* finish up generating audio for all of the distro screen for both kind and k3d

* update networking tab audio to be 'networking options tab'

* add backup and restore tabs for audio generation

* add some more phrases for backups

* fix saying app bug

* fix how we say PVC

* fix more backups input audio

* more troubleshooting of restic repo password audio

* fix restic repo input audio generation

* update s3 configuration collapsible audio generation

* regenerate many input fields audio

* add more input to the ends of things

* update audio widget to process node datatable and always say input after input id is read

* add button as a default thing we say and remove button from ending of all other phrases

* add button say method

* update screen descriptions for config screens

* remove word button from focused so we don't try to say it twice

* switch to saying drop down menu if we find a select

* switch to special switch method

* add switch phrase

* try to say split better and add window behavior select phrase

* clean up more input fields to reduce words needed

* add some more links for accessibility

* adding the audio files finally

* change all the refs of feature branch back to regular main branch and change version back to v5

* switch from valueFrom to value_from to be consistent

* update docs for both nextcloud and matrix backup and restores

* add a basic roadmap

* update help image

* add more roadmap stuff

* prep for appimage test

* add logo for smol-k8s-lab, why not

* updating deps

* add latest audio tarball

* update appimage config yaml for testing

* note that brew is still wonky and disable generating audio on tag

* update home assistant and zitadel backups and restores and clean up typos in matrix and nextcloud