add boot hosts #1

yakimant · 2023-09-08T13:49:17Z

No description provided.

yakimant · 2023-09-11T11:36:14Z

Ansible issue

Not logged into Bitwarden: please run 'bw login', or 'bw unlock' and set the BW_SESSION environment variable first

Solved by:

bw login
bw unlock
export BW_SESSION=SMTH
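
A minimal sketch of a pre-flight check for the same problem, assuming a POSIX shell on the controller. `require_bw_session` is a hypothetical helper, not something from the repo; `bw unlock --raw` prints only the session key, which avoids copying it by hand.

```shell
# Hypothetical helper: fail early when the Bitwarden session is missing,
# instead of letting the Ansible lookup plugin fail mid-run.
require_bw_session() {
  if [ -z "${BW_SESSION:-}" ]; then
    echo 'BW_SESSION is not set; run: export BW_SESSION="$(bw unlock --raw)"' >&2
    return 1
  fi
}
```

Usage could look like `require_bw_session && ansible-playbook ansible/main.yml`.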

yakimant · 2023-09-11T11:37:24Z

Current issues:

SSH access issues for DO and AC, but not GC
iptables issue: Set sshguard4 doesn't exist
- not an issue after ansible/requirements.yml update

yakimant · 2023-09-11T11:56:34Z

SSH keys setup:

# module.boot.module.do-eu-amsterdam3[0].digitalocean_droplet.host["boot-01.do-ams3.boot.test"] will be created
...
      + ssh_keys             = [
          + "20671731",
        ]

# module.boot.module.ac-cn-hongkong-c[0].alicloud_instance.host["boot-01.ac-cn-hongkong-c.shards.test"] will be created
...
      + key_name                           = "jakubgs"

yakimant · 2023-09-11T13:19:51Z

For DO we can do it:
status-im/infra-tf-digital-ocean#1

For AC, it looks like only one key is allowed:
https://registry.terraform.io/providers/aliyun/alicloud/latest/docs/resources/instance

As an alternative:

- ~~we could configure it as a variable in infra-tf-multi-provider, so each devops overrides it locally~~ this will not work, the key pair is set once and forever
- have a shared SSH key (security risks)
- run Ansible only on CI with a shared key

The proper solution was to change the Ansible role locally.

yakimant · 2023-09-11T14:26:15Z

sshguard4 should be configured by sshguard automatically, I guess.

It's failing with the following in the logs:

sshguard: '/usr/lib/x86_64-linux-gnu/sshg-fw-ipset' is not executable

Need to investigate the logic:
https://github.com/status-im/infra-role-bootstrap-linux/blob/827e55412990026ad43756bd11f2cb698bdea622/templates/sshguard/sshguard.conf.j2#L3-L7

yakimant · 2023-09-12T11:12:45Z

"add ssh keys for Anton and Alexis" (infra-tf-digital-ocean#1) is merged.
For AC, the SSH key issue is fixed by editing .terraform/modules/boot.ac-cn-hongkong-c/variables.tf

yakimant · 2023-09-12T11:30:27Z

New issues:

AC: infra-role-bootstrap-linux : Make sure essential pip packages are installed TAGS: [role::bootstrap:packages] fails with AttributeError: cython_sources
- "AttributeError: cython_sources" with Cython 3.0.0a10 yaml/pyyaml#601
- Option 1: Downgrade PyYAML to 5.3.1 (from 5.4.1)
- Option 2: Official workaround: pip install "Cython<3.0" && pip install "PyYAML==5.4.1" --no-build-isolation: https://github.com/status-im/infra-role-bootstrap-linux/pull/33
- Option 3: Investigate why it doesn't reproduce on GC and DO

DO: infra-role-bootstrap-linux/raw : Install mandatory packages runs endlessly

Stuck on "Scanning processes" phase of apt install:

apt -y install python3-minimal acl
  `-apt -y install python3-minimal acl
      `-sh -c test -x /usr/lib/needrestart/apt-pinvoke && /usr/lib/needrestart/apt-pinvoke || true
          `-frontend -w /usr/share/debconf/frontend /usr/sbin/needrestart
              |-needrestart /usr/sbin/needrestart
              `-whiptail --backtitle Package configuration --title Daemons using outdated libraries --output-fd 11 --separate-output --checklist \012\012Which services should be restarted? 12 47 2 -- packagekit.service  on unattended-upgrades.service

https://github.com/status-im/infra-role-bootstrap-linux/pull/32

yakimant · 2023-09-12T14:24:30Z

whiptail is for dialogs, probably it's waiting for some input

yakimant · 2023-09-12T14:46:43Z

Looks like needrestart should be set up for non-interactive Ansible runs.
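
One way to do that, sketched under assumptions: a drop-in config that switches needrestart to automatic restarts so it never opens the whiptail dialog. The file name `50-autorestart.conf` and the helper `write_needrestart_conf` are hypothetical; `/etc/needrestart/conf.d` is where Ubuntu's needrestart package is expected to read overrides, so this should be verified on the target image.

```shell
# Hypothetical helper: write a needrestart drop-in that restarts services
# automatically ('a') instead of asking interactively ('i').
write_needrestart_conf() {
  conf_dir="${1:-/etc/needrestart/conf.d}"
  mkdir -p "$conf_dir"
  printf '%s\n' "\$nrconf{restart} = 'a';" > "$conf_dir/50-autorestart.conf"
}

# Per-command alternative (needrestart should honour NEEDRESTART_MODE):
#   NEEDRESTART_MODE=a DEBIAN_FRONTEND=noninteractive apt-get -y install python3-minimal acl
```

Writing the file before the first apt call in the raw stage would cover every later install as well.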

yakimant · 2023-09-12T16:11:46Z

Other issues

infra-role-bootstrap-linux : Docker | Install package failing with:
- ```
'/usr/bin/apt-get -y -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"       install 'docker-ce=5:24.0.6-1~ubuntu.22.04~jammy' 'docker-compose=1.29.2-1'' failed: E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 13119 (apt-get)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
```
- rerun helps
- also infra-role-wireguard : Install WireGuard packages on AC
- Links:
- Possible solutions:
  - lock_timeout apt option (since 2.12)
  - checking lock with lsof or fuser
  - pgrep for unattended or other processes (bad)
  - Official workaround before lock_timeout
```
register: apt_action
retries: 100
until: apt_action is success or ('Failed to lock apt for exclusive operation' not in apt_action.msg and '/var/lib/dpkg/lock' not in apt_action.msg)
```
  - wait for system updates to finish: systemd-run --property="After=apt-daily.service apt-daily-upgrade.service" --wait /bin/true
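
The retry approach from the list above, sketched as a plain shell function for cases where the install runs outside the apt module (for example in a raw task). `retry` is a hypothetical helper; the attempt count and delay are placeholders.

```shell
# retry MAX DELAY CMD...: run CMD until it succeeds, at most MAX times,
# sleeping DELAY seconds between attempts. Returns 1 if all attempts fail.
retry() {
  max="$1"
  delay="$2"
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

For example: `retry 30 10 apt-get -y install python3-minimal acl`. This retries on any failure, not only on lock errors, so it is blunter than the `until` condition above.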

infra-role-bootstrap-linux : Consul | Create consul config directory fails with

AnsibleError: An unhandled exception occurred while templating '{{lookup("bitwarden", "consul/cluster", field="encryption-key")}}'. Error was a <class 'ansible.errors.AnsibleError'>, original message: An unhandled exception occurred while running the lookup plugin 'bitwarden'. Error was a <class 'ansible.errors.AnsibleError'>, original message: Error decoding Bitwarden status: Expecting value: line 1 column 1 (char 0). Error decoding Bitwarden status: Expecting value: line 1 column 1 (char 0)

bw unlock and export helped

infra-role-bootstrap-linux/raw : Install mandatory packages returned on DO:

E: Could not open file /var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_jammy_multiverse_cnf_Commands-amd64 - open (2: No such file or directory)

│ E: Could not open file /var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64 - open (2: No such file or directory)
...
│ E: Problem executing scripts APT::Update::Post-Invoke-Success 'if /usr/bin/test -w /var/lib/command-not-found/ -a -e /usr/lib/cnf-update-db; then /usr/lib/cnf-update-db > /dev/null; fi'
│ E: Sub-process returned an error code

~~looks like a corrupted cache, probably rm -rf /var/lib/apt/lists/* && apt update should do the trick~~
retry helps

infra-role-bootstrap-linux : Netdata | Restart service:
- ```
Could not find the requested service netdata: host
```

yakimant · 2023-09-13T13:23:45Z

netdata.service is not installed:

# /opt/netdata.gz.run --accept --target /opt/netdata -- --dont-wait --dont-start-it --disable-https --disable-cloud --disable-telemetry
...
 --- Install netdata at system init ---
ERROR: Failed to detect what type of service manager is in use.
/opt/netdata/usr/libexec/netdata/install-service.sh: 640: install_detect_service: not found
 --- Install (but not enable) netdata updater tool ---
cat: /system/netdata-updater.timer: No such file or directory
cat: /system/netdata-updater.service: No such file or directory
Update script is located at /opt/netdata/usr/libexec/netdata/netdata-updater.sh
...

Unfortunately it doesn't fail the installation.

yakimant · 2023-09-13T14:01:12Z

This code fails to detect systemd:
https://github.com/netdata/netdata/blob/92515e41a52344fb1d346df5b54b953cb9de5055/system/install-service.sh.in#L182-L215

One of the issues:

# readlink /proc/1/exe
/usr/lib/systemd/systemd (deleted)

Note the (deleted) at the end. A restart should probably help.

The second is probably in the installer code itself: safe_pidof is not available.
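
To illustrate the first issue: a check that takes the basename of the PID 1 executable and compares it with `systemd` breaks once the ` (deleted)` suffix appears. A minimal sketch of a tolerant version, where `pid1_exe_name` is a hypothetical helper and not part of the Netdata installer:

```shell
# Hypothetical helper: print the basename of the PID 1 executable,
# ignoring the " (deleted)" suffix the kernel appends when the binary
# was replaced on disk. An explicit path can be passed for testing.
pid1_exe_name() {
  target="${1:-$(readlink /proc/1/exe 2>/dev/null)}"
  basename "$(printf '%s\n' "$target" | sed 's/ (deleted)$//')"
}
```

With this, `[ "$(pid1_exe_name)" = "systemd" ]` would hold both before and after a systemd package upgrade.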

yakimant · 2023-09-13T15:58:39Z

I don't know why it even looks at this file. Here is wakuv2.shards for example:

❯ ansible all -i ansible/inventory/shards -a 'grep jammy-backports /etc/apt/sources.list'
node-01.do-ams3.wakuv2.shards | CHANGED | rc=0 >>
deb http://mirrors.digitalocean.com/ubuntu/ jammy-backports main restricted universe multiverse
# deb-src http://mirrors.digitalocean.com/ubuntu/ jammy-backports main restricted universe multiverse
node-01.gc-us-central1-a.wakuv2.shards | CHANGED | rc=0 >>
deb http://us-central1.gce.archive.ubuntu.com/ubuntu/ jammy-backports main restricted universe multiverse
# deb-src http://us-central1.gce.archive.ubuntu.com/ubuntu/ jammy-backports main restricted universe multiverse
node-01.ac-cn-hongkong-c.wakuv2.shards | CHANGED | rc=0 >>
deb http://mirrors.cloud.aliyuncs.com/ubuntu/ jammy-backports main restricted universe multiverse
deb-src http://mirrors.cloud.aliyuncs.com/ubuntu/ jammy-backports main restricted universe multiverse

❯ ansible all -i ansible/inventory/shards -a 'ls /var/lib/apt/lists/*_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64'
node-01.do-ams3.wakuv2.shards | CHANGED | rc=0 >>
/var/lib/apt/lists/mirrors.digitalocean.com_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64
node-01.gc-us-central1-a.wakuv2.shards | CHANGED | rc=0 >>
/var/lib/apt/lists/us-central1.gce.archive.ubuntu.com_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64
node-01.ac-cn-hongkong-c.wakuv2.shards | CHANGED | rc=0 >>
/var/lib/apt/lists/mirrors.cloud.aliyuncs.com_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64

Looks like they reference the cloud-specific repo mirrors.

jakubgs · 2023-09-13T16:40:55Z

I don't get your issue with Netdata, the _check_systemd function works fine:

admin@boot-02.ac-cn-hongkong-c.shards.test:~ % head -n20 test.sh
#!/usr/bin/env sh
. ./functions.sh

_check_systemd() {
  pids=''
  p=''
  myns=''
  ns=''

  # if the directory /lib/systemd/system OR /usr/lib/systemd/system (SLES 12.x) does not exit, it is not systemd
  if [ ! -d /lib/systemd/system ] && [ ! -d /usr/lib/systemd/system ]; then
    echo "NO" && return 0
  fi

  # if there is no systemctl command, it is not systemd
  [ -z "$(command -v systemctl 2>/dev/null || true)" ] && echo "NO" && return 0

  # if pid 1 is systemd, it is systemd
  [ "$(basename "$(readlink /proc/1/exe)" 2> /dev/null)" = "systemd" ] && echo "YES" && return 0

admin@boot-02.ac-cn-hongkong-c.shards.test:~ % ./test.sh 
YES

Seems like something else is at play. Maybe just an upgrade will help, not sure tho.

jakubgs · 2023-09-13T18:23:06Z

Also, it seems like now Netdata has its own ubuntu repository we could use:

So maybe the best thing would be to ditch the shitty installer and just use their repo.

Although one disadvantage of that is that pinning a version is harder. But it does appear they provide multiple versions.

yakimant · 2023-09-14T08:45:25Z

> I don't get your issue with Netdata, the _check_systemd function works fine:

I think they don't import functions.sh and pids=$(safe_pidof systemd 2> /dev/null) silently fails.

yakimant · 2023-09-14T09:29:17Z

Another minor issue:
fuser exits with 1 if the files are not open by any process or if one of the files doesn't exist:

# fuser /var/cache/fwupd/metadata.xmlb
/var/cache/fwupd/metadata.xmlb:  6080m
# echo $?
0

# fuser /var/cache/fwupd/noneexist
Specified filename /var/cache/fwupd/noneexist does not exist.
# echo $?
1

# fuser /var/cache/apt/archives/lock
# echo $?
1

I added debug and I can see some messages like:
Specified filename /var/lib/apt/lists/lock* does not exist.

So probably this code will not work as intended in some cases, when lock file doesn't exist (yet?):
https://github.com/status-im/infra-role-bootstrap-linux/blob/109e157f8d66a61981c23cd6e006d950fe75efe2/raw/tasks/main.yml#L9
It will exit the loop as if no locks are open.
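
A minimal sketch of how the two cases could be told apart, assuming a POSIX shell on the host: only pass existing files to fuser and report "no lock file yet" separately. `existing_paths` and `lock_state` are hypothetical helpers, not code from the role.

```shell
# Print only those arguments that exist as files (unmatched globs stay
# literal in sh and are filtered out here).
existing_paths() {
  for p in "$@"; do
    if [ -e "$p" ]; then
      printf '%s\n' "$p"
    fi
  done
}

# Hypothetical helper. Exit codes: 0 = a lock is held, 1 = lock files exist
# but none is held, 2 = no lock file exists yet (fuser was not consulted).
lock_state() {
  files="$(existing_paths "$@")"
  [ -n "$files" ] || return 2
  # Word splitting is intentional; the lock paths contain no whitespace.
  if fuser $files >/dev/null 2>&1; then
    return 0
  fi
  return 1
}
```

It could be called as `lock_state /var/lib/dpkg/lock* /var/lib/apt/lists/lock* /var/cache/apt/archives/lock*`, and a waiting loop can then decide whether exit code 2 means "keep waiting" or "proceed".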

yakimant · 2023-09-14T09:31:52Z

You need to install "jmespath" prior to running json_query filter

│ TASK [infra-role-bootstrap-linux : Volume | Identify device without partitions] ***
│ fatal: [8.218.174.108]: FAILED! => {}
│
│ MSG:
│
│ You need to install "jmespath" prior to running json_query filter

Command to reproduce:

ANSIBLE_VERBOSITY=1 terraform apply -auto-approve -replace='module.boot.module.ac-cn-hongkong-c[0].alicloud_instance.host["boot-02.ac-cn-hongkong-c.shards.test"]' -target='module.boot.module.ac-cn-hongkong-c[0].null_resource.host["boot-02.ac-cn-hongkong-c.shards.test"]'

Fix:
Install on the controller node:

pip install jmespath

Follow-up:
Add it to the setup documentation or to requirements.txt / the poetry project in each fleet repo.

yakimant · 2023-09-14T09:59:45Z

Alibaba Cloud images:

shards.test:

❯ terraform show | grep "image_id\|alicloud_images\|alicloud_instance" | grep -v module
data "alicloud_images" "host" {
            image_id                = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"
resource "alicloud_instance" "host" {
    image_id                           = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"
resource "alicloud_instance" "host" {
    image_id                           = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"

wakuv2.shards:

❯ terraform show | grep "image_id\|alicloud_images\|alicloud_instance" | grep -v module
data "alicloud_images" "host" {
            image_id                = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"
resource "alicloud_instance" "host" {
    image_id                           = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"

wakuv2.test:

❯ terraform show | grep "image_id\|alicloud_images\|alicloud_instance" | grep -v module
data "alicloud_images" "host" {
            image_id                = "ubuntu_22_04_x64_20G_alibase_20230208.vhd"
resource "alicloud_instance" "host" {
    image_id                           = "ubuntu_20_04_x64_20G_alibase_20200914.vhd"

Looks ok, although old hosts need to be upgraded to 22.04 at some point.

yakimant · 2023-09-14T10:20:50Z

More on the netdata installation:

They even have a community supported playbook:
https://learn.netdata.cloud/docs/installing/install-with-a-cicd-provisioning-system/ansible
which runs the kickstart.sh script which will likely install deb from a repo.

The most popular role from Galaxy:
https://github.com/mrlesmithjr/ansible-netdata/
runs netdata-installer.sh

Why are they so obsessed with installer scripts?

yakimant · 2023-09-14T10:57:00Z

SSH fingerprint issue during the role::bootstrap:users tasks.
Can happen on different steps, e.g.:

TASK [infra-role-bootstrap-linux : Create users groups] ************************
│ fatal: [8.218.174.108]: UNREACHABLE! => {
│     "changed": false,
│     "unreachable": true
│ }
│
│ MSG:
│
│ Data could not be sent to remote host "8.218.174.108". Make sure this host can be reached over ssh: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
│ @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
│ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
│ IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
│ Someone could be eavesdropping on you right now (man-in-the-middle attack)!
│ It is also possible that a host key has just been changed.
│ The fingerprint for the ED25519 key sent by the remote host is
│ SHA256:aOSuugoc0NWC8EDVlrEujshzWdlh4TYD+SMAmUngXEo.
│ Please contact your system administrator.
│ Add correct host key in /Users/status/.ssh/known_hosts to get rid of this message.
│ Offending ED25519 key in /Users/status/.ssh/known_hosts:169
│ Agent forwarding is disabled to avoid man-in-the-middle attacks.
│ UpdateHostkeys is disabled because the host key is not trusted.
│ root@8.218.174.108: Permission denied (publickey).

Reproduced:

ANSIBLE_VERBOSITY=1  terraform apply -auto-approve -replace='module.boot.module.ac-cn-hongkong-c[0].alicloud_instance.host["boot-02.ac-cn-hongkong-c.shards.test"]' -target='module.boot.module.ac-cn-hongkong-c[0].null_resource.host["boot-02.ac-cn-hongkong-c.shards.test"]'

Workaround:
Rerun Ansible without recreating an instance.

yakimant · 2023-09-14T11:05:24Z

Which is weird, because Ansible runs ssh with -o StrictHostKeyChecking=no, which should not check the fingerprint.

yakimant · 2023-09-14T11:08:45Z

Sometimes I see the same issue, but it doesn't fail Ansible:

 TASK [infra-role-bootstrap-linux/raw : check locks] ****************************
 changed: [8.218.174.108] => {
     "changed": true,
     "rc": 0
 }

 STDERR:

 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
 @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
 IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
 Someone could be eavesdropping on you right now (man-in-the-middle attack)!
 It is also possible that a host key has just been changed.
 The fingerprint for the ED25519 key sent by the remote host is
 SHA256:aOSuugoc0NWC8EDVlrEujshzWdlh4TYD+SMAmUngXEo.
 Please contact your system administrator.
 Add correct host key in /Users/status/.ssh/known_hosts to get rid of this message.
 Offending ED25519 key in /Users/status/.ssh/known_hosts:169
 Agent forwarding is disabled to avoid man-in-the-middle attacks.
 UpdateHostkeys is disabled because the host key is not trusted.
 Shared connection to 8.218.174.108 closed.

yakimant · 2023-09-14T11:10:02Z

Didn't reproduce the netdata issue with:

ANSIBLE_VERBOSITY=1  terraform apply -auto-approve -target='module.boot.module.ac-cn-hongkong-c[0].null_resource.host["boot-01.ac-cn-hongkong-c.shards.test"]'

Need to double check with recreation of instance.

# /opt/netdata/usr/libexec/netdata/install-service.sh --show-type
Detected platform: Linux
Detected service managers:
  - systemd: YES
  - openrc: NO
  - lsb: NO
  - initd: NO
  - runit: NO
Would use systemd service management.
# readlink /proc/1/exe
/usr/lib/systemd/systemd

No (deleted), so parsed properly.

yakimant · 2023-09-14T11:21:44Z

Caught the /var/lib/dpkg/lock-frontend issue:

TASK [infra-role-bootstrap-linux : check locks] ********************************
...
│                      USER        PID ACCESS COMMAND
│ /var/lib/dpkg/lock:  root       6815 F.... unattended-upgr
│ /var/lib/dpkg/lock-frontend:
│                      root       6815 F.... unattended-upgr
...
│ TASK [infra-role-bootstrap-linux : check locks] ********************************
...
│ COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
│ unattende 6815 root    8uW  REG  252,3        0 1786 /var/lib/dpkg/lock-frontend
│ unattende 6815 root  114uW  REG  252,3        0 1665 /var/cache/apt/archives/lock
...
│ TASK [infra-role-bootstrap-linux : Install SSHGuard package] *******************
...
│ E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 6815 (unattended-upgr)
│ E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?

So it's the /usr/bin/unattended-upgrades process.

yakimant · 2023-09-14T12:06:53Z

Caught again:

│ TASK [infra-role-bootstrap-linux : check locks] ********************************
...
│                      USER        PID ACCESS COMMAND
│ /var/lib/dpkg/lock:  root       7191 F.... apt-get
│ /var/lib/dpkg/lock-frontend:
│                      root       7191 F.... apt-get
│ /var/cache/apt/archives/lock:
│                      root       7191 F.... apt-get
...
│ TASK [infra-role-bootstrap-linux : check locks] ********************************
...
│ COMMAND  PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
│ apt-get 7191 root    4uW  REG  252,1        0 71761 /var/lib/dpkg/lock-frontend
│ apt-get 7191 root    5uW  REG  252,1        0 71762 /var/lib/dpkg/lock
│ apt-get 7191 root    6uW  REG  252,1        0 69132 /var/cache/apt/archives/lock
...
│ TASK [infra-role-bootstrap-linux : Docker | Install package] *******************
...
│ E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 7191 (apt-get)
│ E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?

apt-get this time

yakimant · 2023-09-14T12:25:18Z

ssh connection refused on GC:

│ TASK [infra-role-bootstrap-linux/raw : check locks] ****************************
│ fatal: [34.135.13.87]: UNREACHABLE! => {
│     "changed": false,
│     "unreachable": true
│ }
│
│ MSG:
│
│ Failed to connect to the host via ssh: ssh: connect to host 34.135.13.87 port 22: Connection refused

Reproduce:

ANSIBLE_VERBOSITY=1  terraform apply -auto-approve -replace='module.boot.module.gc-us-central1-a[0].google_compute_instance.host["boot-01.gc-us-central1-a.shards.test"]' -target='module.boot.module.gc-us-central1-a[0].null_resource.host["boot-01.gc-us-central1-a.shards.test"]'

Maybe we need to wait a bit for the instance to be fully available via SSH.

Workaround:
Ansible rerun helps

jakubgs · 2023-09-14T12:30:17Z

I think you are really overthinking this. The sleep in the first task in bootstrap is there for a reason.

I think you should stop trying to fix alibaba nonsense locking for now. It's probably just because their bootstrap doesn't finish because the instance you're using is too slow.

jakubgs · 2023-09-14T12:31:08Z

Also, I would recommend keeping research like this in the issue, and not in the PR.

yakimant · 2023-09-14T12:33:53Z

Yeah, I stopped investigating the non-blocking issues as we agreed yesterday.
I just post whatever issues I encounter and rerun Ansible, which helps so far.

yakimant · 2023-09-14T12:36:17Z

ssh Permission denied issue during the role::bootstrap:users tasks on GC

│ TASK [infra-role-bootstrap-linux : Kill ubuntu user processes] *****************
│ fatal: [34.135.13.87]: UNREACHABLE! => {
│     "changed": false,
│     "unreachable": true
│ }
│
│ MSG:
│
│ Data could not be sent to remote host "34.135.13.87". Make sure this host can be reached over ssh: admin@34.135.13.87: Permission denied (publickey).

Reproduced on the 2nd run after instance created:

ANSIBLE_VERBOSITY=1  terraform apply -auto-approve -target='module.boot.module.gc-us-central1-a[0].null_resource.host["boot-01.gc-us-central1-a.shards.test"]'

Rerun didn't help.

Need to add keys to admin user:
https://github.com/status-im/infra-role-bootstrap-linux/pull/34

yakimant · 2023-09-14T12:38:19Z

I will create proper Issues afterwards as a follow-up.

yakimant · 2023-09-15T10:38:28Z

Trying to find an image which supports this change:
status-im/infra-role-nim-waku@0de1508
statusteam/nim-waku:deploy-wakuv2-shards
statusteam/nim-waku:deploy-wakuv2-test
don't support it:

Unrecognized option 'pubsub-topic'
Try wakunode2 --help for more information.

Will revert to 75fa7e483cacccb482c99afddc7de3c25fb8a1fc in requirements for now

yakimant · 2023-09-15T10:55:55Z

waku-peers fails to start:

$ /usr/local/bin/connect_waku_peers.py --rpc-host=localhost --rpc-port=8545 --rpc-timeout=20 --rpc-retries=5 --service='{"name": "nim-waku", "env": "shards", "stage": "test"}' --log-level=debug
[DEBUG] Connecting to Consul: localhost:8500
[INFO] Found 5 data centers.
[DEBUG] Querying: nim-waku (dc=do-ams3, node_meta={'env': 'shards', 'stage': 'test'})
[DEBUG] Found: boot-01.do-ams3.shards.test (env:shards,stage:test,nim,waku,libp2p)
[DEBUG] Querying: nim-waku (dc=aws-eu-central-1a, node_meta={'env': 'shards', 'stage': 'test'})
[DEBUG] Querying: nim-waku (dc=he-eu-hel1, node_meta={'env': 'shards', 'stage': 'test'})
[DEBUG] Querying: nim-waku (dc=gc-us-central1-a, node_meta={'env': 'shards', 'stage': 'test'})
[DEBUG] Querying: nim-waku (dc=ac-cn-hongkong-c, node_meta={'env': 'shards', 'stage': 'test'})
[INFO] Found 0 services.
Traceback (most recent call last):
  File "/usr/local/bin/connect_waku_peers.py", line 154, in <module>
    main()
  File "/usr/local/bin/connect_waku_peers.py", line 125, in main
    raise Exception('No services found')
Exception: No services found

~~probably~~ because no other nodes are started, will set up the others now

yakimant · 2023-09-15T11:18:22Z

another issue with waku-peers:

$ /usr/local/bin/connect_waku_peers.py --rpc-host=localhost --rpc-port=8545 --rpc-timeout=20 --rpc-retries=5 --service='{"name": "nim-waku", "env": "shards", "stage": "test"}' --log-level=debug

[DEBUG] RPC Call URL: http://localhost:8545
[DEBUG] RPC Call Payload: {'method': 'post_waku_v2_admin_v1_peers', 'params': [['/dns4/boot-02.do-ams3.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31', '/dns4/boot-01.gc-us-central1-a.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31', '/dns4/boot-02.gc-us-central1-a.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31', '/dns4/boot-01.ac-cn-hongkong-c.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31', '/dns4/boot-02.ac-cn-hongkong-c.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31']], 'jsonrpc': '2.0', 'id': 0}
Traceback (most recent call last):
  File "/usr/local/bin/connect_waku_peers.py", line 154, in <module>
    main()
  File "/usr/local/bin/connect_waku_peers.py", line 142, in main
    raise Exception('RPC Error: %s' % rval['error'])
Exception: RPC Error: {'code': -32000, 'message': 'post_waku_v2_admin_v1_peers raised an exception', 'data': 'Failed to connect to peer at index: 0 /dns4/boot-02.do-ams3.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31'}

Disabled the role, but probably it will fire up later

yakimant · 2023-09-15T12:52:28Z

loop_control issues, label should be a string:

❯ ansible-playbook ansible/main.yml --limit boot-01.do-ams3.shards.test --tags "open-ports" -i ansible/inventory/test -v
Using /Users/status/work/infra-shards/ansible.cfg as config file
ERROR! The field 'label' is supposed to be a string type, however the incoming data structure is a <class 'ansible.parsing.yaml.objects.AnsibleMapping'>

The error appears to be in '/Users/status/.ansible/roles/open-ports/tasks/main.yml': line 20, column 5, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

  loop_control:
    label:
    ^ here

https://github.com/status-im/infra-role-open-ports/blob/24dc30dbdf85e6758cb6924074b2f7a0f4541524/tasks/main.yml#L19-L23

Removed loop_control as a workaround

yakimant · 2023-09-15T13:00:58Z

nim_waku_node_key extraction from file fails if the key file is already created and the key is not set by a variable

ansible-playbook ansible/main.yml --limit boot-01.do-ams3.shards.test --tags "nim-waku" -i ansible/inventory/test -v
...
TASK [nim-waku : Generate random node key] ***********************************************************************
skipping: [boot-01.do-ams3.shards.test] => {
    "changed": false,
    "false_condition": "not key_file.stat.exists and nim_waku_node_key is not defined\n",
    "skip_reason": "Conditional result was False"
}

TASK [nim-waku : Save generate node key to file] *****************************************************************
skipping: [boot-01.do-ams3.shards.test] => {
    "changed": false,
    "false_condition": "not key_file.stat.exists",
    "skip_reason": "Conditional result was False"
}

TASK [nim-waku : Load existing node key from file] ***************************************************************
skipping: [boot-01.do-ams3.shards.test] => {
    "changed": false,
    "false_condition": "key_generation.skipped is not defined and nim_waku_node_key is not defined\n",
    "skip_reason": "Conditional result was False"
}

TASK [nim-waku : Extract the node key from file] *****************************************************************
fatal: [boot-01.do-ams3.shards.test]: FAILED! => {}

MSG:

The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'content'. 'dict object' has no attribute 'content'

The error appears to be in '/Users/status/.ansible/roles/nim-waku/tasks/nodekey.yml': line 39, column 3, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:


- name: Extract the node key from file
  ^ here

Load is skipped wrongly, because generation is skipped.
Maybe it should be the opposite? If generation is skipped - load from file.

https://github.com/status-im/infra-role-nim-waku/blob/75fa7e483cacccb482c99afddc7de3c25fb8a1fc/tasks/nodekey.yml#L31-L37
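
The precedence suggested above (explicit key, else load from file, else generate) can be sketched in shell. This is only an illustration of the logic, not the role's implementation: `resolve_node_key` is hypothetical, the `NIM_WAKU_NODE_KEY` environment variable stands in for the Ansible variable `nim_waku_node_key`, and the key generation is a placeholder.

```shell
# Hypothetical helper mirroring the intended precedence:
#   1. key passed explicitly (NIM_WAKU_NODE_KEY stands in for nim_waku_node_key)
#   2. key already saved in the key file
#   3. freshly generated key, saved for the next run
resolve_node_key() {
  key_file="$1"
  if [ -n "${NIM_WAKU_NODE_KEY:-}" ]; then
    printf '%s\n' "$NIM_WAKU_NODE_KEY"
  elif [ -s "$key_file" ]; then
    cat "$key_file"
  else
    # Placeholder generation: 32 random bytes as hex; the role may differ.
    key="$(od -An -tx1 -N32 /dev/urandom | tr -d ' \n')"
    printf '%s\n' "$key" > "$key_file"
    printf '%s\n' "$key"
  fi
}
```

The point is that "load from file" depends only on the file existing and the variable being unset, not on whether the generation step was skipped.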

yakimant · 2023-09-15T13:06:48Z

to debug / catch the lock issues, I was adding:

name: check locks (fuser)
raw: |
  sudo fuser --verbose /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock* || true

name: check locks (lsof)
raw: |
  sudo lsof /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock* || true

before bootstrap/raw and apt commands.

Also, I think apt has the ability to wait for locks, but not the package.
Will check in the related issue.
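
On the apt side, the wait is controlled by the `DPkg::Lock::Timeout` option (apt 1.9.11 and later, so it should be available on Ubuntu 22.04). A minimal sketch; `apt_install_wait` and the `APT_GET` override are hypothetical, the latter only there to allow a dry run:

```shell
# Hypothetical helper: install packages, letting apt wait up to TIMEOUT
# seconds for the dpkg lock instead of failing immediately.
apt_install_wait() {
  timeout="$1"
  shift
  "${APT_GET:-apt-get}" -y -o "DPkg::Lock::Timeout=${timeout}" install "$@"
}
```

For example `apt_install_wait 300 python3-minimal acl`, or `APT_GET=echo apt_install_wait 300 acl` to print the arguments without running apt.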

yakimant · 2023-09-15T14:55:32Z

Potential temporary workaround for netdata:
copy
/opt/netdata/system/netdata.service
to
/lib/systemd/system/netdata.service

jakubgs

I would split this up into separate PRs, partially because you filled this PR with comments that are research/debug comments, and partially because you are doing a LOT in one commit. I'd say PRs are for discussing the changes when reviewing, and issues are for your own research and debugging. That way a person that wants to review a PR doesn't get greeted by a wall of text of 40+ comments that have no relevance to their review.

My recommendation would be:

Remove commented out stuff unless it actually is relevant to future work needed.
Separate out general terraform changes into one commit, like secrets setup or provider setup.
Separate out into another commit the creation of the fleet and its emergency inventory.
Separate out the Ansible playbook and group variable changes.

Also, I would start without any bootstrap__active_extra_users, and grant them on case-by-case basis as they request it. Unless you were already told to grant the same access as from some other fleet.

ansible/group_vars/boot.yml

ansible/main.yml

yakimant · 2023-09-18T10:25:43Z

This PR is closed in favour of these 3 as requested by @jakubgs:

The following issues were discovered during the work on this PR:

yakimant marked this pull request as draft September 8, 2023 13:49

yakimant marked this pull request as ready for review September 15, 2023 12:41

yakimant force-pushed the add_boot_hosts branch from 9d776c6 to a7d9094 Compare September 15, 2023 12:41

yakimant force-pushed the add_boot_hosts branch from a7d9094 to 9c45c37 Compare September 15, 2023 12:56

terraform and ansible configuration for boot hosts

79077eb

yakimant force-pushed the add_boot_hosts branch from 9c45c37 to 79077eb Compare September 15, 2023 13:13

jakubgs requested changes Sep 16, 2023


yakimant closed this Sep 18, 2023
