add boot hosts #1

Closed
wants to merge 1 commit into from
yakimant
Member

@yakimant yakimant commented Sep 8, 2023

No description provided.

@yakimant yakimant marked this pull request as draft September 8, 2023 13:49
@yakimant
Member Author

Ansible issue

Not logged into Bitwarden: please run 'bw login', or 'bw unlock' and set the BW_SESSION environment variable first

Solved by:

bw login
bw unlock
export BW_SESSION=SMTH
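
The last step can be scripted so the session key is not pasted by hand. A minimal sketch, assuming the `bw` CLI is installed and already logged in; `bw_session_export` is a hypothetical helper name, and `--raw` is used on the assumption that it makes `bw unlock` print only the session key:

```shell
# Hypothetical helper: unlock the vault and export the session key
# so that Ansible's bitwarden lookups can use it.
bw_session_export() {
  # `bw unlock --raw` prompts for the master password and prints only the session key
  BW_SESSION="$(bw unlock --raw)" || return 1
  export BW_SESSION
}
```

Run `bw_session_export` in the same shell that will later run Ansible or Terraform.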

@yakimant
Member Author

yakimant commented Sep 11, 2023

Current issues:

  • SSH access issues for DO and AC, but not GC
  • iptables issue: Set sshguard4 doesn't exist
    • not an issue after ansible/requirements.yml update

@yakimant
Member Author

SSH keys setup:

# module.boot.module.do-eu-amsterdam3[0].digitalocean_droplet.host["boot-01.do-ams3.boot.test"] will be created
...
      + ssh_keys             = [
          + "20671731",
        ]

# module.boot.module.ac-cn-hongkong-c[0].alicloud_instance.host["boot-01.ac-cn-hongkong-c.shards.test"] will be created
...
      + key_name                           = "jakubgs"

@yakimant
Member Author

yakimant commented Sep 11, 2023

For DO we can do it:
status-im/infra-tf-digital-ocean#1

For AC, it looks like only one key is allowed:
https://registry.terraform.io/providers/aliyun/alicloud/latest/docs/resources/instance

As an alternative:

  • we could configure it as a variable in infra-tf-multi-provider, so each devops overrides it locally (this will not work: the key pair is set once and forever)
  • have a shared SSH key (security risks)
  • run Ansible only on CI with a shared key

The proper solution was to change the Ansible role locally.

@yakimant
Member Author

sshguard4 should be configured by sshguard automatically, I guess.

It's failing with the following in the logs:

sshguard: '/usr/lib/x86_64-linux-gnu/sshg-fw-ipset' is not executable

Need to investigate the logic:
https://github.com/status-im/infra-role-bootstrap-linux/blob/827e55412990026ad43756bd11f2cb698bdea622/templates/sshguard/sshguard.conf.j2#L3-L7

@yakimant
Member Author

yakimant commented Sep 12, 2023

New issues:

  • AC: infra-role-bootstrap-linux : Make sure essential pip packages are installed TAGS: [role::bootstrap:packages] fails with AttributeError: cython_sources
  • DO: infra-role-bootstrap-linux/raw : Install mandatory packages runs endlessly
    • Stuck on "Scanning processes" phase of apt install:
    apt -y install python3-minimal acl
      `-apt -y install python3-minimal acl
          `-sh -c test -x /usr/lib/needrestart/apt-pinvoke && /usr/lib/needrestart/apt-pinvoke || true
              `-frontend -w /usr/share/debconf/frontend /usr/sbin/needrestart
                  |-needrestart /usr/sbin/needrestart
                  `-whiptail --backtitle Package configuration --title Daemons using outdated libraries --output-fd 11 --separate-output --checklist \012\012Which services should be restarted? 12 47 2 -- packagekit.service  on unattended-upgrades.service
    

@yakimant
Member Author

whiptail is for dialogs, probably it's waiting for some input
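
If the hang really is the needrestart dialog, one thing to try is running apt non-interactively. A sketch under assumptions, untested against the role: `install_noninteractive` is a hypothetical wrapper, `DEBIAN_FRONTEND=noninteractive` tells debconf not to open dialogs, and `NEEDRESTART_MODE=a` is assumed to switch needrestart to automatic restarts instead of asking:

```shell
# Hypothetical wrapper around the install command from the process tree above.
install_noninteractive() {
  # Both variables are set only for this one apt-get invocation.
  DEBIAN_FRONTEND=noninteractive NEEDRESTART_MODE=a apt-get -y install "$@"
}
```

For example, `install_noninteractive python3-minimal acl`, run as root on the target host.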

@yakimant
Member Author

yakimant commented Sep 12, 2023

Other issues

  • infra-role-bootstrap-linux : Docker | Install package failing with:
  • infra-role-bootstrap-linux : Consul | Create consul config directory fails with
    • AnsibleError: An unhandled exception occurred while templating '{{lookup("bitwarden", "consul/cluster", field="encryption-key")}}'. Error was a <class 'ansible.errors.AnsibleError'>, original message: An unhandled exception occurred while running the lookup plugin 'bitwarden'. Error was a <class 'ansible.errors.AnsibleError'>, original message: Error decoding Bitwarden status: Expecting value: line 1 column 1 (char 0). Error decoding Bitwarden status: Expecting value: line 1 column 1 (char 0)
      
    • bw unlock and export helped
  • infra-role-bootstrap-linux/raw : Install mandatory packages returned on DO:
    • E: Could not open file /var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_jammy_multiverse_cnf_Commands-amd64 - open (2: No such file or directory)
      
    • │ E: Could not open file /var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64 - open (2: No such file or directory)
      ...
      │ E: Problem executing scripts APT::Update::Post-Invoke-Success 'if /usr/bin/test -w /var/lib/command-not-found/ -a -e /usr/lib/cnf-update-db; then /usr/lib/cnf-update-db > /dev/null; fi'
      │ E: Sub-process returned an error code
      
    • looks like a corrupted cache, probably rm -rf /var/lib/apt/lists/* && apt update should do the trick
    • retry helps
  • infra-role-bootstrap-linux : Netdata | Restart service:
    • Could not find the requested service netdata: host
      

@yakimant
Member Author

netdata.service is not installed:

# /opt/netdata.gz.run --accept --target /opt/netdata -- --dont-wait --dont-start-it --disable-https --disable-cloud --disable-telemetry
...
 --- Install netdata at system init ---
ERROR: Failed to detect what type of service manager is in use.
/opt/netdata/usr/libexec/netdata/install-service.sh: 640: install_detect_service: not found
 --- Install (but not enable) netdata updater tool ---
cat: /system/netdata-updater.timer: No such file or directory
cat: /system/netdata-updater.service: No such file or directory
Update script is located at /opt/netdata/usr/libexec/netdata/netdata-updater.sh
...

Unfortunately, it doesn't fail the installation.

@yakimant
Member Author

yakimant commented Sep 13, 2023

This code fails to detect systemd:
https://github.com/netdata/netdata/blob/92515e41a52344fb1d346df5b54b953cb9de5055/system/install-service.sh.in#L182-L215

One of the issues:

# readlink /proc/1/exe
/usr/lib/systemd/systemd (deleted)

Note the (deleted) at the end. A restart should probably help.

The second is probably in the installer code itself: safe_pidof is not available.
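
To illustrate why the (deleted) suffix matters: a check that compares the `basename` of the link target with `systemd` no longer matches once the binary has been replaced on disk. A minimal sketch; both function names are hypothetical, and they take the link target as an argument so they can be tried without reading /proc:

```shell
# Naive comparison on the basename of the /proc/1/exe target.
naive_is_systemd() {
  [ "$(basename "$1")" = "systemd" ]
}

# Variant that first strips the " (deleted)" marker appended by the kernel
# when the running binary has been unlinked or replaced.
tolerant_is_systemd() {
  target="${1%" (deleted)"}"
  [ "$(basename "$target")" = "systemd" ]
}
```

With `/usr/lib/systemd/systemd (deleted)` as input the naive variant fails and the tolerant one succeeds; on a host it would be called as `tolerant_is_systemd "$(readlink /proc/1/exe)"`.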

@yakimant
Member Author

I don't know why it even looks at this file; here is wakuv2.shards for example:

❯ ansible all -i ansible/inventory/shards -a 'grep jammy-backports /etc/apt/sources.list'
node-01.do-ams3.wakuv2.shards | CHANGED | rc=0 >>
deb http://mirrors.digitalocean.com/ubuntu/ jammy-backports main restricted universe multiverse
# deb-src http://mirrors.digitalocean.com/ubuntu/ jammy-backports main restricted universe multiverse
node-01.gc-us-central1-a.wakuv2.shards | CHANGED | rc=0 >>
deb http://us-central1.gce.archive.ubuntu.com/ubuntu/ jammy-backports main restricted universe multiverse
# deb-src http://us-central1.gce.archive.ubuntu.com/ubuntu/ jammy-backports main restricted universe multiverse
node-01.ac-cn-hongkong-c.wakuv2.shards | CHANGED | rc=0 >>
deb http://mirrors.cloud.aliyuncs.com/ubuntu/ jammy-backports main restricted universe multiverse
deb-src http://mirrors.cloud.aliyuncs.com/ubuntu/ jammy-backports main restricted universe multiverse

❯ ansible all -i ansible/inventory/shards -a 'ls /var/lib/apt/lists/*_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64'
node-01.do-ams3.wakuv2.shards | CHANGED | rc=0 >>
/var/lib/apt/lists/mirrors.digitalocean.com_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64
node-01.gc-us-central1-a.wakuv2.shards | CHANGED | rc=0 >>
/var/lib/apt/lists/us-central1.gce.archive.ubuntu.com_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64
node-01.ac-cn-hongkong-c.wakuv2.shards | CHANGED | rc=0 >>
/var/lib/apt/lists/mirrors.cloud.aliyuncs.com_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64

Looks like they reference the cloud-specific repo mirrors.

@jakubgs
Member

jakubgs commented Sep 13, 2023

I don't get your issue with Netdata, the _check_systemd function works fine:

admin@boot-02.ac-cn-hongkong-c.shards.test:~ % head -n20 test.sh
#!/usr/bin/env sh
. ./functions.sh

_check_systemd() {
  pids=''
  p=''
  myns=''
  ns=''

  # if the directory /lib/systemd/system OR /usr/lib/systemd/system (SLES 12.x) does not exit, it is not systemd
  if [ ! -d /lib/systemd/system ] && [ ! -d /usr/lib/systemd/system ]; then
    echo "NO" && return 0
  fi

  # if there is no systemctl command, it is not systemd
  [ -z "$(command -v systemctl 2>/dev/null || true)" ] && echo "NO" && return 0

  # if pid 1 is systemd, it is systemd
  [ "$(basename "$(readlink /proc/1/exe)" 2> /dev/null)" = "systemd" ] && echo "YES" && return 0

admin@boot-02.ac-cn-hongkong-c.shards.test:~ % ./test.sh 
YES

Seems like something else is at play. Maybe just an upgrade will help, not sure tho.

@jakubgs
Member

jakubgs commented Sep 13, 2023

Also, it seems like now Netdata has its own ubuntu repository we could use:

So maybe the best thing would be to ditch the shitty installer and just use their repo.

Although one disadvantage of that is that pinning a version is harder. But it does appear they provide multiple versions.

@yakimant
Member Author

I don't get your issue with Netdata, the _check_systemd function works fine:

I think they don't import functions.sh and pids=$(safe_pidof systemd 2> /dev/null) silently fails.

@yakimant
Member Author

yakimant commented Sep 14, 2023

  • Another minor issue:
    fuser exits with 1 if the files are not open by any process or if one of the files doesn't exist:
# fuser /var/cache/fwupd/metadata.xmlb
/var/cache/fwupd/metadata.xmlb:  6080m
# echo $?
0

# fuser /var/cache/fwupd/noneexist
Specified filename /var/cache/fwupd/noneexist does not exist.
# echo $?
1

# fuser /var/cache/apt/archives/lock
# echo $?
1

I added debug and I can see some messages like:
Specified filename /var/lib/apt/lists/lock* does not exist.

So this code will probably not work as intended in some cases, when the lock file doesn't exist (yet?):
https://github.com/status-im/infra-role-bootstrap-linux/blob/109e157f8d66a61981c23cd6e006d950fe75efe2/raw/tasks/main.yml#L9
It will exit the loop as if no locks are open.
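
One way to separate the two cases is to test for the file before asking fuser about it. A minimal sketch, not the role's code; `lock_state` is a hypothetical helper, and it would need root to see processes owned by other users:

```shell
# Hypothetical helper: prints "missing", "held" or "free" for one lock file path.
lock_state() {
  if [ ! -e "$1" ]; then
    echo missing    # fuser would exit 1 here, the same status as for "free"
  elif fuser "$1" >/dev/null 2>&1; then
    echo held       # fuser exits 0 when some process has the file open
  else
    echo free
  fi
}
```

An unmatched glob such as `/var/lib/apt/lists/lock*` is passed through literally, so it is reported as `missing` instead of being counted as a free lock.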

@yakimant
Member Author

yakimant commented Sep 14, 2023

  • You need to install "jmespath" prior to running json_query filter
│ TASK [infra-role-bootstrap-linux : Volume | Identify device without partitions] ***
│ fatal: [8.218.174.108]: FAILED! => {}
│
│ MSG:
│
│ You need to install "jmespath" prior to running json_query filter

Command to reproduce:

ANSIBLE_VERBOSITY=1 terraform apply -auto-approve -replace='module.boot.module.ac-cn-hongkong-c[0].alicloud_instance.host["boot-02.ac-cn-hongkong-c.shards.test"]' -target='module.boot.module.ac-cn-hongkong-c[0].null_resource.host["boot-02.ac-cn-hongkong-c.shards.test"]'

Fix:
Install on the controller node:

pip install jmespath

Follow-up:
Add it to the setup documentation or to requirements.txt / the poetry project in each fleet repo.
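
Until that follow-up is done, a pre-flight check on the controller could catch the missing module before Terraform starts. A minimal sketch; `require_python_module` is a hypothetical helper, and it assumes that `python3` on the PATH is the interpreter Ansible actually uses, which may not hold with virtualenvs:

```shell
# Hypothetical pre-flight check: fail early when a Python module is missing.
require_python_module() {
  python3 -c "import $1" 2>/dev/null || {
    echo "Missing Python module: $1 (try: pip install $1)" >&2
    return 1
  }
}
```

For example, `require_python_module jmespath && terraform apply ...`.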

@yakimant
Member Author

Alibaba Cloud images:

shards.test:

❯ terraform show | grep "image_id\|alicloud_images\|alicloud_instance" | grep -v module
data "alicloud_images" "host" {
            image_id                = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"
resource "alicloud_instance" "host" {
    image_id                           = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"
resource "alicloud_instance" "host" {
    image_id                           = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"

wakuv2.shards:

❯ terraform show | grep "image_id\|alicloud_images\|alicloud_instance" | grep -v module
data "alicloud_images" "host" {
            image_id                = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"
resource "alicloud_instance" "host" {
    image_id                           = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"

wakuv2.test:

❯ terraform show | grep "image_id\|alicloud_images\|alicloud_instance" | grep -v module
data "alicloud_images" "host" {
            image_id                = "ubuntu_22_04_x64_20G_alibase_20230208.vhd"
resource "alicloud_instance" "host" {
    image_id                           = "ubuntu_20_04_x64_20G_alibase_20200914.vhd"

Looks ok, although old hosts need to be upgraded to 22.04 at some point.

@yakimant
Member Author

yakimant commented Sep 14, 2023

More on the netdata installation:

They even have a community supported playbook:
https://learn.netdata.cloud/docs/installing/install-with-a-cicd-provisioning-system/ansible
which runs the kickstart.sh script, which will likely install a deb from a repo.

The most popular role from Galaxy:
https://github.com/mrlesmithjr/ansible-netdata/
runs netdata-installer.sh

Why are they so obsessed with installer scripts?

@yakimant
Member Author

yakimant commented Sep 14, 2023

  • SSH fingerprint issue during the role::bootstrap:users tasks.
    Can happen on different steps, eg:
TASK [infra-role-bootstrap-linux : Create users groups] ************************
│ fatal: [8.218.174.108]: UNREACHABLE! => {
│     "changed": false,
│     "unreachable": true
│ }
│
│ MSG:
│
│ Data could not be sent to remote host "8.218.174.108". Make sure this host can be reached over ssh: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
│ @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
│ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
│ IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
│ Someone could be eavesdropping on you right now (man-in-the-middle attack)!
│ It is also possible that a host key has just been changed.
│ The fingerprint for the ED25519 key sent by the remote host is
│ SHA256:aOSuugoc0NWC8EDVlrEujshzWdlh4TYD+SMAmUngXEo.
│ Please contact your system administrator.
│ Add correct host key in /Users/status/.ssh/known_hosts to get rid of this message.
│ Offending ED25519 key in /Users/status/.ssh/known_hosts:169
│ Agent forwarding is disabled to avoid man-in-the-middle attacks.
│ UpdateHostkeys is disabled because the host key is not trusted.
│ root@8.218.174.108: Permission denied (publickey).

Reproduced:

ANSIBLE_VERBOSITY=1  terraform apply -auto-approve -replace='module.boot.module.ac-cn-hongkong-c[0].alicloud_instance.host["boot-02.ac-cn-hongkong-c.shards.test"]' -target='module.boot.module.ac-cn-hongkong-c[0].null_resource.host["boot-02.ac-cn-hongkong-c.shards.test"]'

Workaround:
Rerun Ansible without recreating an instance.

@yakimant
Member Author

yakimant commented Sep 14, 2023

Which is weird, because Ansible runs ssh with -o StrictHostKeyChecking=no, which should not check the fingerprint.
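
As far as I understand the ssh_config documentation, `StrictHostKeyChecking=no` still compares against known_hosts: on a mismatch ssh prints the warning and lets the connection proceed with some features disabled, which matches the agent forwarding lines in the log. The `Permission denied (publickey)` at the end would then be a separate authentication failure. To get rid of the warning after recreating an instance, the stale entry can be dropped. A minimal sketch; `forget_host_key` is a hypothetical helper around `ssh-keygen -R`:

```shell
# Hypothetical helper: remove stale known_hosts entries for a recreated instance.
# $1 = host name or IP, $2 = known_hosts file (defaults to the user's own file)
forget_host_key() {
  ssh-keygen -R "$1" -f "${2:-$HOME/.ssh/known_hosts}"
}
```

For example, `forget_host_key 8.218.174.108` before rerunning Ansible.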

@yakimant
Member Author

Sometimes I see the same warning without it failing Ansible:

 TASK [infra-role-bootstrap-linux/raw : check locks] ****************************
 changed: [8.218.174.108] => {
     "changed": true,
     "rc": 0
 }

 STDERR:

 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
 @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
 IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
 Someone could be eavesdropping on you right now (man-in-the-middle attack)!
 It is also possible that a host key has just been changed.
 The fingerprint for the ED25519 key sent by the remote host is
 SHA256:aOSuugoc0NWC8EDVlrEujshzWdlh4TYD+SMAmUngXEo.
 Please contact your system administrator.
 Add correct host key in /Users/status/.ssh/known_hosts to get rid of this message.
 Offending ED25519 key in /Users/status/.ssh/known_hosts:169
 Agent forwarding is disabled to avoid man-in-the-middle attacks.
 UpdateHostkeys is disabled because the host key is not trusted.
 Shared connection to 8.218.174.108 closed.

@yakimant
Member Author

yakimant commented Sep 14, 2023

Didn't reproduce the netdata issue with:

ANSIBLE_VERBOSITY=1  terraform apply -auto-approve -target='module.boot.module.ac-cn-hongkong-c[0].null_resource.host["boot-01.ac-cn-hongkong-c.shards.test"]'

Need to double check with recreation of instance.

# /opt/netdata/usr/libexec/netdata/install-service.sh --show-type
Detected platform: Linux
Detected service managers:
  - systemd: YES
  - openrc: NO
  - lsb: NO
  - initd: NO
  - runit: NO
Would use systemd service management.
# readlink /proc/1/exe
/usr/lib/systemd/systemd

No (deleted), so parsed properly.

@yakimant
Member Author

yakimant commented Sep 14, 2023

Caught the /var/lib/dpkg/lock-frontend issue:

TASK [infra-role-bootstrap-linux : check locks] ********************************
...
│                      USER        PID ACCESS COMMAND
│ /var/lib/dpkg/lock:  root       6815 F.... unattended-upgr
│ /var/lib/dpkg/lock-frontend:
│                      root       6815 F.... unattended-upgr
...
│ TASK [infra-role-bootstrap-linux : check locks] ********************************
...
│ COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
│ unattende 6815 root    8uW  REG  252,3        0 1786 /var/lib/dpkg/lock-frontend
│ unattende 6815 root  114uW  REG  252,3        0 1665 /var/cache/apt/archives/lock
...
│ TASK [infra-role-bootstrap-linux : Install SSHGuard package] *******************
...
│ E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 6815 (unattended-upgr)
│ E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?

So it's the /usr/bin/unattended-upgrades process.

@yakimant
Member Author

Caught again:

│ TASK [infra-role-bootstrap-linux : check locks] ********************************
...
│                      USER        PID ACCESS COMMAND
│ /var/lib/dpkg/lock:  root       7191 F.... apt-get
│ /var/lib/dpkg/lock-frontend:
│                      root       7191 F.... apt-get
│ /var/cache/apt/archives/lock:
│                      root       7191 F.... apt-get
...
│ TASK [infra-role-bootstrap-linux : check locks] ********************************
...
│ COMMAND  PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
│ apt-get 7191 root    4uW  REG  252,1        0 71761 /var/lib/dpkg/lock-frontend
│ apt-get 7191 root    5uW  REG  252,1        0 71762 /var/lib/dpkg/lock
│ apt-get 7191 root    6uW  REG  252,1        0 69132 /var/cache/apt/archives/lock
...
│ TASK [infra-role-bootstrap-linux : Docker | Install package] *******************
...
│ E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 7191 (apt-get)
│ E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?

apt-get this time

@yakimant
Member Author

yakimant commented Sep 14, 2023

  • ssh connection refused on GC:
│ TASK [infra-role-bootstrap-linux/raw : check locks] ****************************
│ fatal: [34.135.13.87]: UNREACHABLE! => {
│     "changed": false,
│     "unreachable": true
│ }
│
│ MSG:
│
│ Failed to connect to the host via ssh: ssh: connect to host 34.135.13.87 port 22: Connection refused

Reproduce:

ANSIBLE_VERBOSITY=1  terraform apply -auto-approve -replace='module.boot.module.gc-us-central1-a[0].google_compute_instance.host["boot-01.gc-us-central1-a.shards.test"]' -target='module.boot.module.gc-us-central1-a[0].null_resource.host["boot-01.gc-us-central1-a.shards.test"]'

Maybe we need to wait a bit for the instance to become fully available via SSH.

Workaround:
Ansible rerun helps
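
The waiting could also be done explicitly instead of rerunning by hand. A minimal sketch of a generic retry loop; `retry` is a hypothetical helper, and the attempt count and delay are arbitrary:

```shell
# Hypothetical helper: run a command until it succeeds or the attempts run out.
# $1 = max attempts, $2 = delay in seconds, remaining arguments = command to run
retry() {
  attempts="$1"; delay="$2"; shift 2
  n=1
  until "$@"; do
    [ "$n" -ge "$attempts" ] && return 1   # give up after the last attempt
    n=$((n + 1))
    sleep "$delay"
  done
  return 0
}
```

For example, `retry 30 5 ssh -o ConnectTimeout=5 -o BatchMode=yes "$user@$host" true` would poll for up to a few minutes before Ansible is started, with `$user` and `$host` filled in for the new instance.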

@jakubgs
Member

jakubgs commented Sep 14, 2023

I think you are really overthinking this. The sleep in the first task in bootstrap is there for a reason.

I think you should stop trying to fix the Alibaba locking nonsense for now. It's probably just because their bootstrap doesn't finish, since the instance you're using is too slow.

@jakubgs
Member

jakubgs commented Sep 14, 2023

Also, I would recommend keeping research like this in the issue, and not in the PR.

@yakimant
Member Author

Yeah, I stopped investigating the non-blocking issues as we agreed yesterday.
I just post whatever issues I encounter and rerun Ansible, which helps so far.

@yakimant
Member Author

yakimant commented Sep 14, 2023

  • ssh Permission denied issue during the role::bootstrap:users tasks on GC
│ TASK [infra-role-bootstrap-linux : Kill ubuntu user processes] *****************
│ fatal: [34.135.13.87]: UNREACHABLE! => {
│     "changed": false,
│     "unreachable": true
│ }
│
│ MSG:
│
│ Data could not be sent to remote host "34.135.13.87". Make sure this host can be reached over ssh: admin@34.135.13.87: Permission denied (publickey).

Reproduced on the 2nd run after instance created:

ANSIBLE_VERBOSITY=1  terraform apply -auto-approve -target='module.boot.module.gc-us-central1-a[0].null_resource.host["boot-01.gc-us-central1-a.shards.test"]'

Rerun didn't help.

Need to add keys to admin user:
https://github.com/status-im/infra-role-bootstrap-linux/pull/34

@yakimant
Member Author

I will create proper Issues afterwards as a follow-up.

@yakimant
Member Author

yakimant commented Sep 15, 2023

  • Trying to find an image that supports this change:
    status-im/infra-role-nim-waku@0de1508

  • statusteam/nim-waku:deploy-wakuv2-shards

  • statusteam/nim-waku:deploy-wakuv2-test
    doesn't support:

Unrecognized option 'pubsub-topic'
Try wakunode2 --help for more information.

Will revert to 75fa7e483cacccb482c99afddc7de3c25fb8a1fc in requirements for now

@yakimant
Member Author

yakimant commented Sep 15, 2023

  • waku-peers fails to start:
$ /usr/local/bin/connect_waku_peers.py --rpc-host=localhost --rpc-port=8545 --rpc-timeout=20 --rpc-retries=5 --service='{"name": "nim-waku", "env": "shards", "stage": "test"}' --log-level=debug
[DEBUG] Connecting to Consul: localhost:8500
[INFO] Found 5 data centers.
[DEBUG] Querying: nim-waku (dc=do-ams3, node_meta={'env': 'shards', 'stage': 'test'})
[DEBUG] Found: boot-01.do-ams3.shards.test (env:shards,stage:test,nim,waku,libp2p)
[DEBUG] Querying: nim-waku (dc=aws-eu-central-1a, node_meta={'env': 'shards', 'stage': 'test'})
[DEBUG] Querying: nim-waku (dc=he-eu-hel1, node_meta={'env': 'shards', 'stage': 'test'})
[DEBUG] Querying: nim-waku (dc=gc-us-central1-a, node_meta={'env': 'shards', 'stage': 'test'})
[DEBUG] Querying: nim-waku (dc=ac-cn-hongkong-c, node_meta={'env': 'shards', 'stage': 'test'})
[INFO] Found 0 services.
Traceback (most recent call last):
  File "/usr/local/bin/connect_waku_peers.py", line 154, in <module>
    main()
  File "/usr/local/bin/connect_waku_peers.py", line 125, in main
    raise Exception('No services found')
Exception: No services found

Probably because no other nodes are started; will set up the others now.

@yakimant
Member Author

yakimant commented Sep 15, 2023

  • another issue with waku-peers:
$ /usr/local/bin/connect_waku_peers.py --rpc-host=localhost --rpc-port=8545 --rpc-timeout=20 --rpc-retries=5 --service='{"name": "nim-waku", "env": "shards", "stage": "test"}' --log-level=debug

[DEBUG] RPC Call URL: http://localhost:8545
[DEBUG] RPC Call Payload: {'method': 'post_waku_v2_admin_v1_peers', 'params': [['/dns4/boot-02.do-ams3.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31', '/dns4/boot-01.gc-us-central1-a.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31', '/dns4/boot-02.gc-us-central1-a.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31', '/dns4/boot-01.ac-cn-hongkong-c.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31', '/dns4/boot-02.ac-cn-hongkong-c.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31']], 'jsonrpc': '2.0', 'id': 0}
Traceback (most recent call last):
  File "/usr/local/bin/connect_waku_peers.py", line 154, in <module>
    main()
  File "/usr/local/bin/connect_waku_peers.py", line 142, in main
    raise Exception('RPC Error: %s' % rval['error'])
Exception: RPC Error: {'code': -32000, 'message': 'post_waku_v2_admin_v1_peers raised an exception', 'data': 'Failed to connect to peer at index: 0 /dns4/boot-02.do-ams3.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31'}

Disabled the role, but probably it will fire up later

@yakimant yakimant marked this pull request as ready for review September 15, 2023 12:41
@yakimant
Member Author

  • loop_control issues, label should be a string:
❯ ansible-playbook ansible/main.yml --limit boot-01.do-ams3.shards.test --tags "open-ports" -i ansible/inventory/test -v
Using /Users/status/work/infra-shards/ansible.cfg as config file
ERROR! The field 'label' is supposed to be a string type, however the incoming data structure is a <class 'ansible.parsing.yaml.objects.AnsibleMapping'>

The error appears to be in '/Users/status/.ansible/roles/open-ports/tasks/main.yml': line 20, column 5, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

  loop_control:
    label:
    ^ here

https://github.com/status-im/infra-role-open-ports/blob/24dc30dbdf85e6758cb6924074b2f7a0f4541524/tasks/main.yml#L19-L23

Removed loop_control as a workaround

@yakimant
Member Author

yakimant commented Sep 15, 2023

  • nim_waku_node_key extraction from file fails if the file was already created and the key is not set by variable
ansible-playbook ansible/main.yml --limit boot-01.do-ams3.shards.test --tags "nim-waku" -i ansible/inventory/test -v
...
TASK [nim-waku : Generate random node key] ***********************************************************************
skipping: [boot-01.do-ams3.shards.test] => {
    "changed": false,
    "false_condition": "not key_file.stat.exists and nim_waku_node_key is not defined\n",
    "skip_reason": "Conditional result was False"
}

TASK [nim-waku : Save generate node key to file] *****************************************************************
skipping: [boot-01.do-ams3.shards.test] => {
    "changed": false,
    "false_condition": "not key_file.stat.exists",
    "skip_reason": "Conditional result was False"
}

TASK [nim-waku : Load existing node key from file] ***************************************************************
skipping: [boot-01.do-ams3.shards.test] => {
    "changed": false,
    "false_condition": "key_generation.skipped is not defined and nim_waku_node_key is not defined\n",
    "skip_reason": "Conditional result was False"
}

TASK [nim-waku : Extract the node key from file] *****************************************************************
fatal: [boot-01.do-ams3.shards.test]: FAILED! => {}

MSG:

The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'content'. 'dict object' has no attribute 'content'

The error appears to be in '/Users/status/.ansible/roles/nim-waku/tasks/nodekey.yml': line 39, column 3, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:


- name: Extract the node key from file
  ^ here

Load is skipped wrongly, because generation is skipped.
Maybe it should be the opposite? If generation is skipped - load from file.

https://github.com/status-im/infra-role-nim-waku/blob/75fa7e483cacccb482c99afddc7de3c25fb8a1fc/tasks/nodekey.yml#L31-L37
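
To make the skip chain easier to follow, here is a small truth-table sketch of the `when` conditions as they appear in the log. The function names are hypothetical, and the proposed condition is only one reading of "if generation is skipped, load from file", not what the role does:

```shell
# Arguments are "yes"/"no": $1 = key file exists, $2 = nim_waku_node_key is defined.

# "Generate random node key": not key_file.stat.exists and nim_waku_node_key is not defined
generation_runs() { [ "$1" = no ] && [ "$2" = no ]; }

# "Load existing node key from file", current condition:
# key_generation.skipped is not defined and nim_waku_node_key is not defined
load_runs_current() { generation_runs "$1" "$2" && [ "$2" = no ]; }

# Hypothetical alternative: load whenever no key was passed in by variable,
# since the file exists either already or after the save task.
load_runs_proposed() { [ "$2" = no ]; }
```

With an existing file and no variable (`yes no`), `load_runs_current` is false while `load_runs_proposed` is true, which is the case that fails in the log.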

@yakimant
Member Author

to debug / catch the lock issues, I was adding:

- name: check locks (fuser)
  raw: |
    sudo fuser --verbose /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock* || true

- name: check locks (lsof)
  raw: |
    sudo lsof /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock* || true

before bootstrap/raw and apt commands.

Also, I think apt has the ability to wait for locks, but not the package.
Will check in the related issue.
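
For reference, the apt-side waiting I have in mind is the `DPkg::Lock::Timeout` option. A minimal sketch, assuming the apt on the 22.04 images is new enough to support it; `apt_install_waiting` is a hypothetical wrapper:

```shell
# Hypothetical wrapper: let apt-get wait up to $1 seconds for the dpkg lock
# instead of failing immediately with "Could not get lock".
apt_install_waiting() {
  timeout_s="$1"; shift
  apt-get -o DPkg::Lock::Timeout="$timeout_s" -y install "$@"
}
```

For example, `apt_install_waiting 300 sshguard`, run as root on the target host.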

@yakimant
Member Author

Potential temporary workaround for netdata:
copy
/opt/netdata/system/netdata.service
to
/lib/systemd/system/netdata.service

Member

@jakubgs jakubgs left a comment

I would split this up into separate PRs, partially because you filled this PR with comments that are research/debug comments, and partially because you are doing a LOT in one commit. I'd say PRs are for discussing the changes when reviewing, and issues are for your own research and debugging. That way a person that wants to review a PR doesn't get greeted by a wall of text of 40+ comments that have no relevance to their review.

My recommendation would be:

  • Remove commented out stuff unless it actually is relevant to future work needed.
  • Separate out general terraform changes into one commit, like secrets setup or provider setup.
  • Separate out into another commit the creation of the fleet and its emergency inventory.
  • Separate out the Ansible playbook and group variable changes.

Also, I would start without any bootstrap__active_extra_users, and grant them on a case-by-case basis as they request it. Unless you were already told to grant the same access as from some other fleet.

@yakimant yakimant closed this Sep 18, 2023