
Got Essential task overseer failed error after upgrading Kusama and Polkadot validator to v1.5.0 #2728

Closed
2 tasks done
AlexZhenWang opened this issue Dec 17, 2023 · 33 comments
Labels
I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known.

Comments

@AlexZhenWang
Contributor

AlexZhenWang commented Dec 17, 2023

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

Hi, I am trying to upgrade both our Polkadot and Kusama validators to v1.5.0, but after the upgrade I get an Essential task overseer failed error on both of them.

After downgrading back to v1.4.0, the issue is gone.
The logs:
[screenshot of the logs]


Update:
logs

2023-12-18 01:10:33 ----------------------------    
2023-12-18 01:10:33 This chain is not in any way    
2023-12-18 01:10:33       endorsed by the           
2023-12-18 01:10:33      KUSAMA FOUNDATION          
2023-12-18 01:10:33 ----------------------------    
2023-12-18 01:10:33 Parity Polkadot    
2023-12-18 01:10:33 ✌️  version 1.5.0-a3dc2f15f23    
2023-12-18 01:10:33 ❤️  by Parity Technologies <admin@parity.io>, 2017-2023    
2023-12-18 01:10:33 📋 Chain specification: Kusama    
2023-12-18 01:10:33 🏷  Node name: OnfinalityV#1    
2023-12-18 01:10:33 👤 Role: AUTHORITY    
2023-12-18 01:10:33 💾 Database: RocksDb at /chain-data/chains/ksmcc3/db/full    
2023-12-18 01:11:01 🏷  Local node identity is: 12D3KooWHcDiChr6KkEfgCQtSiLbTzcDRhyKHyqtMvn515QWrWWt    
2023-12-18 01:11:01 🔍 Discovered new external address for our node: /ip4/142.215.53.19/tcp/30000/p2p/12D3KooWHcDiChr6KkEfgCQtSiLbTzcDRhyKHyqtMvn515QWrWWt    
2023-12-18 01:11:08 🚀 Using prepare-worker binary at: "/usr/lib/polkadot/polkadot-prepare-worker"    
2023-12-18 01:11:08 🚀 Using execute-worker binary at: "/usr/lib/polkadot/polkadot-execute-worker"    
2023-12-18 01:11:09 💻 Operating system: linux    
2023-12-18 01:11:09 💻 CPU architecture: x86_64    
2023-12-18 01:11:09 💻 Target environment: gnu    
2023-12-18 01:11:09 💻 CPU: Intel Xeon Processor (Cascadelake)    
2023-12-18 01:11:09 💻 CPU cores: 16    
2023-12-18 01:11:09 💻 Memory: 32115MB    
2023-12-18 01:11:09 💻 Kernel: 5.4.0-89-generic    
2023-12-18 01:11:09 💻 Linux distribution: Ubuntu 22.04.3 LTS    
2023-12-18 01:11:09 💻 Virtual machine: yes    
2023-12-18 01:11:09 📦 Highest known block at #21034600    
2023-12-18 01:11:09 Running JSON-RPC server: addr=127.0.0.1:9900, allowed origins=["*"]    
2023-12-18 01:11:09 🏁 CPU score: 1.02 GiBs    
2023-12-18 01:11:09 🏁 Memory score: 4.26 GiBs    
2023-12-18 01:11:09 🏁 Disk score (seq. writes): 626.57 MiBs    
2023-12-18 01:11:09 🏁 Disk score (rand. writes): 327.96 MiBs    
2023-12-18 01:11:09 ⚠️  The hardware does not meet the minimal requirements Failed checks: Copy(expected: 11.49 GiBs, found: 4.26 GiBs), Seq Write(expected: 950.00 MiBs, found: 626.57 MiBs), Rnd Write(expected: 420.00 MiBs, found: 327.96 MiBs),  for role 'Authority' find out more at:
https://wiki.polkadot.network/docs/maintain-guides-how-to-validate-polkadot#reference-hardware    
2023-12-18 01:11:09 👶 Starting BABE Authorship worker    
2023-12-18 01:11:09 🚨 Your system cannot securely run a validator. 
Running validation of malicious PVF code has a higher risk of compromising this machine.
  - Cannot enable landlock, a Linux 5.13+ kernel security feature: not available: Could not fully enable: NotEnforced
  - Cannot unshare user namespace and change root, which are Linux-specific kernel security features: not available: unshare user and mount namespaces: Operation not permitted (os error 1)
You can ignore this error with the `--insecure-validator-i-know-what-i-do` command line argument if you understand and accept the risks of running insecurely. With this flag, security features are enabled on a best-effort basis, but not mandatory. 
More information: https://wiki.polkadot.network/docs/maintain-guides-secure-validator#secure-validator-mode
2023-12-18 01:11:09 🥩 BEEFY gadget waiting for BEEFY pallet to become available...    
2023-12-18 01:11:09 subsystem exited with error subsystem="candidate-validation" err=FromOrigin { origin: "candidate-validation", source: Context("could not enable Secure Validator Mode; check logs") }
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="candidate-validation"
2023-12-18 01:11:09 subsystem finished unexpectedly subsystem=Ok(())
2023-12-18 01:11:09 Received `Conclude` signal, exiting
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="pvf-checker"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="bitfield-signing"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="network-bridge-rx"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="candidate-backing"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="network-bridge-tx"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="availability-recovery"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="runtime-api"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="collation-generation"
2023-12-18 01:11:09 received `Conclude` signal, exiting
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="gossip-support"
2023-12-18 01:11:09 Conclude
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="availability-store"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="provisioner"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="bitfield-distribution"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="dispute-distribution"
2023-12-18 01:11:09 received `Conclude` signal, exiting
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="chain-selection"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="chain-api"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="statement-distribution"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="approval-distribution"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="availability-distribution"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="collator-protocol"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="approval-voting"
2023-12-18 01:11:09 discovered: 12D3KooWPEovdRYLyAw8phHWXLmJncnRYM6euSvPiahthcYKdeLv /ip4/172.17.0.1/tcp/30333    
2023-12-18 01:11:09 discovered: 12D3KooWPEovdRYLyAw8phHWXLmJncnRYM6euSvPiahthcYKdeLv /ip4/142.215.53.19/tcp/30333    
2023-12-18 01:11:10 subsystem exited with error subsystem="prospective-parachains" err=FromOrigin { origin: "prospective-parachains", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2023-12-18 01:11:10 Essential task `overseer` failed. Shutting down service.    
2023-12-18 01:11:10 subsystem exited with error subsystem="dispute-coordinator" err=FromOrigin { origin: "dispute-coordinator", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
Error: 
   0: Other: Essential task failed.

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ BACKTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                                 ⋮ 1 frame hidden ⋮                               
   2: polkadot::main::h4cca9d3491727cb7
      at <unknown source file>:<unknown line>
   3: std::sys_common::backtrace::__rust_begin_short_backtrace::h94782a592969dae8
      at <unknown source file>:<unknown line>
   4: main<unknown>
      at <unknown source file>:<unknown line>
   5: __libc_start_main<unknown>
      at <unknown source file>:<unknown line>
   6: _start<unknown>
      at <unknown source file>:<unknown line>

Run with COLORBT_SHOW_HIDDEN=1 environment variable to disable frame filtering.
Run with RUST_BACKTRACE=full to include source snippets.

parameters:
--chain=polkadot --base-path=/chain-data --rpc-cors=all --port=30333 --unsafe-rpc-external --node-key=<xxx> --rpc-methods=Unsafe --name=<name> --telemetry-url="wss://telemetry-backend.w3f.community/submit 1" --public-addr=/dns4/<xxx>/tcp/23739 --in-peers=100 --in-peers-light=0 --db-cache=512

@AlexZhenWang AlexZhenWang added I10-unconfirmed Issue might be valid, but it's not yet known. I2-bug The node fails to follow expected behavior. labels Dec 17, 2023
@bkchr
Member

bkchr commented Dec 17, 2023

Please provide more logs, and post them as text rather than a screenshot.

@AlexZhenWang
Contributor Author

> Please provide more logs, and post them as text rather than a screenshot.

Thanks for the response @bkchr.
Here are the logs

[Same logs and parameters as in the issue description above.]

@alexggh
Contributor

alexggh commented Dec 18, 2023

This is the root cause:

 - Cannot unshare user namespace and change root, which are Linux-specific kernel security features: not available: unshare user and mount namespaces: Operation not permitted (os error 1)
You can ignore this error with the `--insecure-validator-i-know-what-i-do` command line argument if you understand and accept the risks of running insecurely. With this flag, security features are enabled on a best-effort basis, but not mandatory. 
More information: https://wiki.polkadot.network/docs/maintain-guides-secure-validator#secure-validator-mode
2023-12-18 01:11:09 🥩 BEEFY gadget waiting for BEEFY pallet to become available...    
2023-12-18 01:11:09 subsystem exited with error subsystem="candidate-validation" err=FromOrigin { origin: "candidate-validation", source: Context("could not enable Secure Validator Mode; check logs") }

I think you are hitting the same problem as here: #2662

@mrcnski
Contributor

mrcnski commented Dec 18, 2023

Yes, looks like some new security features couldn't be enabled. "Operation not permitted" is interesting. Can you share details of your setup, is there anything unusual about it? Is your database path on a mount or have any special restrictions?

By the way, upgrading to Linux 5.13+ would make this part of the error go away, and the other part (unshare) becomes optional:

  - Cannot enable landlock, a Linux 5.13+ kernel security feature: not available: Could not fully enable: NotEnforced

If upgrading is not possible you can pass the CLI flag specified in the error.
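
A quick way to check whether a host clears the 5.13 threshold mentioned above is to compare the output of uname -r against it. The following is a minimal shell sketch, not something the node ships: the function name kernel_at_least is made up for illustration, and it only looks at the major and minor version numbers. A new enough kernel is necessary for landlock but does not guarantee that it is built in and enabled.

```shell
#!/bin/sh
# Hypothetical helper (not part of polkadot): succeeds if a kernel release
# string such as "5.4.0-89-generic" is at least MAJOR.MINOR.
kernel_at_least() {
    release=$1 want_major=$2 want_minor=$3
    major=${release%%.*}          # text before the first dot
    rest=${release#*.}            # text after the first dot
    minor=${rest%%[!0-9]*}        # leading digits of the remainder
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

if kernel_at_least "$(uname -r)" 5 13; then
    echo "kernel is 5.13 or newer: landlock may be available"
else
    echo "kernel is older than 5.13: landlock is not expected to be available"
fi
```

With the kernel from the logs above (5.4.0-89-generic) the check takes the second branch.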

@AlexZhenWang
Contributor Author

AlexZhenWang commented Dec 19, 2023

> Yes, looks like some new security features couldn't be enabled. "Operation not permitted" is interesting. Can you share details of your setup, is there anything unusual about it? Is your database path on a mount or have any special restrictions?
>
> By the way, upgrading to Linux 5.13+ would make this part of the error go away, and the other part (unshare) becomes optional:
>
>   - Cannot enable landlock, a Linux 5.13+ kernel security feature: not available: Could not fully enable: NotEnforced
>
> If upgrading is not possible you can pass the CLI flag specified in the error.

Thanks for the reply! @mrcnski

I am running the node as a pod in a k8s cluster, and the database is on a PVC that is mounted into the pod.

The related settings for the pod:

...
spec:
  containers:
  - args:
    - --base-path=/chain-data
...
    volumeMounts:
    - mountPath: /chain-data
      name: chaindata-volume
...
  volumes:
  - name: chaindata-volume
    persistentVolumeClaim:
      claimName: pvc-0
 ...

@mrcnski
Contributor

mrcnski commented Dec 19, 2023

I am completely unfamiliar with kubernetes, but I presume the node is running in a container. That is probably why certain operations are not allowed, and maybe it depends on the container settings. For example if there is a seccomp sandbox it could be blocking the syscall, but I think this can be turned off. What is your Linux kernel version?
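
Whether a seccomp filter is applied can be checked from inside the container: the kernel reports it in the Seccomp field of /proc/<pid>/status (0 = disabled, 1 = strict, 2 = filter, per proc(5)). A small sketch under that assumption; the helper name seccomp_mode_name is hypothetical, and a "filter" result only says that some filter is installed, not that it is the one blocking unshare.

```shell
#!/bin/sh
# Hypothetical helper: translate the numeric "Seccomp:" value from
# /proc/<pid>/status into a readable name (values as documented in proc(5)).
seccomp_mode_name() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "strict" ;;
        2) echo "filter" ;;
        *) echo "unknown" ;;
    esac
}

# Read the mode of this process (inherited from the container runtime);
# run this inside the container. Falls back to "unknown" without /proc.
mode=$(awk '/^Seccomp:/ { print $2 }' /proc/self/status 2>/dev/null) || mode=""
echo "seccomp mode: $(seccomp_mode_name "$mode")"
```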

@AlexZhenWang
Contributor Author

Thanks @mrcnski. Yeah, I think you are right. Here is my Linux kernel version

5.4.0-89-generic

@mrcnski
Contributor

mrcnski commented Dec 20, 2023

Thanks @AlexZhenWang! If it's possible, you can upgrade to Linux 5.13+; you'll still get a warning due to running in a container, but it won't be a hard error. Otherwise you'll need to pass --insecure-validator-i-know-what-i-do. Appreciate the report, it's important for us to know what is supported by actual configurations in production usage.

@bkchr
Member

bkchr commented Feb 20, 2024

#2486 (comment) posting this here as I think it belongs here.

CC @s0me0ne-unkn0wn @matthewmarcus

@maksimryndin
Contributor

> #2486 (comment) posting this here as I think it belongs here.
>
> CC @s0me0ne-unkn0wn @matthewmarcus

@matthewmarcus Hi! Could you please provide the output of the following commands and answers to the questions below?

  1. uname -a
  2. More detailed log output from the console, if possible (as text, like in #2728 (comment))
  3. In a terminal on the machine:

      echo "#include <fcntl.h>
      #include <linux/landlock.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      int main (int argc, char *argv[]) {
        /* Get supported landlock ABI */
        int abi = syscall (SYS_landlock_create_ruleset, NULL, 0, LANDLOCK_CREATE_RULESET_VERSION);
        return abi < 0 ? 1 : 0;
      }
      " > landlock_test.c

      clang landlock_test.c -o landlock_test

      ./landlock_test

      echo $?    <--- what is the output? Is it 0?

Thank you!

@s0me0ne-unkn0wn
Contributor

Landlock is optional, anyway. The main problem to me is that unshare() fails, and clone() fails as a consequence. One obvious reason for failing unshare() is running the node inside a chroot environment or another type of jail. If that's not the case, I'm out of ideas right now, but I'll look into the kernel code a bit later to learn why it may fail.
CC @koute just in case
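
To see whether the unshare() failure can be reproduced outside the node, the same kinds of namespaces can be requested from a shell with the unshare(1) tool from util-linux. This is only a rough sketch: the function name check_unshare is hypothetical, it assumes util-linux is installed, and it only approximates what the node does (it does not attempt the change of root), so a success here is not proof that Secure Validator Mode will work.

```shell
#!/bin/sh
# Hypothetical probe: try to create new user and mount namespaces as the
# current user, roughly what the failing unshare() call asks the kernel for.
check_unshare() {
    if ! command -v unshare >/dev/null 2>&1; then
        echo "unknown: unshare(1) is not installed"
        return 2
    fi
    if unshare --user --mount true 2>/dev/null; then
        echo "ok: user and mount namespaces can be unshared"
    else
        echo "failed: unshare was refused (container, seccomp profile or sysctl?)"
        return 1
    fi
}

check_unshare || echo "(exit status $?)"
```

Running it both on the host and inside the container or jail can help narrow down which layer refuses the call.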

@matthewmarcus

> > #2486 (comment) posting this here as I think it belongs here.
> > CC @s0me0ne-unkn0wn @matthewmarcus
>
> @matthewmarcus Hi! Could you please provide the output of the following commands and answers to the questions below?
>
>   1. uname -a
>   2. More detailed log output from the console, if possible (as text, like in #2728 (comment))
>   3. In a terminal on the machine:
>
>       echo "#include <fcntl.h>
>       #include <linux/landlock.h>
>       #include <sys/syscall.h>
>       #include <unistd.h>
>
>       int main (int argc, char *argv[]) {
>         /* Get supported landlock ABI */
>         int abi = syscall (SYS_landlock_create_ruleset, NULL, 0, LANDLOCK_CREATE_RULESET_VERSION);
>         return abi < 0 ? 1 : 0;
>       }
>       " > landlock_test.c
>
>       clang landlock_test.c -o landlock_test
>
>       ./landlock_test
>
>       echo $?    <--- what is the output? Is it 0?
>
> Thank you!

Hey. Thanks for the reply.

Here are the results:

uname -r = 6.7.5-060705-generic

When attempting to run clang landlock_test.c -o landlock_test I get this error:

landlock_test.c:2:16: fatal error: 'linux/landlock.h' file not found
#include <linux/landlock.h>
^~~~~~~~~~~~~~~~~~
1 error generated.
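
When the landlock header is missing like this (which can happen when the distribution's userspace kernel headers are older than the running kernel), a compile-free alternative is to look at the kernel's list of active security modules, exposed in /sys/kernel/security/lsm when securityfs is mounted. A minimal sketch; the helper name lsm_list_has is made up here, and the check only says whether landlock is active in the running kernel, not which ABI version it supports.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if NAME appears in a comma-separated LSM list
# such as "lockdown,capability,landlock,yama,apparmor".
lsm_list_has() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

lsm_file=/sys/kernel/security/lsm
if [ -r "$lsm_file" ]; then
    if lsm_list_has "$(cat "$lsm_file")" landlock; then
        echo "landlock is active"
    else
        echo "landlock is not in the active LSM list"
    fi
else
    echo "cannot read $lsm_file (securityfs not mounted, or access restricted)"
fi
```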

@s0me0ne-unkn0wn
Contributor

@matthewmarcus what Linux distribution is that, and what architecture are you running on?

@matthewmarcus

matthewmarcus commented Feb 20, 2024

> @matthewmarcus what Linux distribution is that, and what architecture are you running on?

@s0me0ne-unkn0wn

Ubuntu 20.04

Intel NUC w/ Core i7-8559U processor

@s0me0ne-unkn0wn
Contributor

@matthewmarcus Canonical hasn't officially released 6.7.5 kernel for 20.04 AFAIK. Do you use Mainline or some other kernel manager?

@s0me0ne-unkn0wn
Contributor

@matthewmarcus I haven't been using Ubuntu for quite some time now, but from what I googled quickly, for 20.04 the supported version in the HWE stack is 5.15, and the 6.7.5 most probably comes from mainline builds. The mainline builds are not supported, not guaranteed to work, and not recommended for production use. I don't say it's definitely a problem, but if you could try to run your node after booting from the officially supported 5.15 kernel from Ubuntu distro, you could probably save us a lot of debugging time :)

I personally run 6.7.0 from the Manjaro distro, and I don't have any issues with secure validator mode, but that's not exactly the same as the mainline builds.

@maksimryndin
Contributor

> @matthewmarcus Canonical hasn't officially released 6.7.5 kernel for 20.04 AFAIK. Do you use Mainline or some other kernel manager?

Yeah, I've spun up a test Ubuntu 20.04 amd64 machine, and the latest kernel available through apt is linux-image-5.8.0-63-lowlatency/focal-updates,focal-security 5.8.0-63.71~20.04.1 amd64. Could it be a manually built unsigned image, and is that why Secure Boot is disabled? @matthewmarcus

@s0me0ne-unkn0wn
Contributor

@maksimryndin how about apt install --install-recommends linux-generic-hwe-20.04? In theory, it should get you 5.15, at least if you're on the latest 20.04.5 LTS update

@maksimryndin
Contributor

> apt install --install-recommends linux-generic-hwe-20.04

@s0me0ne-unkn0wn yeah, you're right :) 5.15. Exactly!

@matthewmarcus

> @matthewmarcus Canonical hasn't officially released 6.7.5 kernel for 20.04 AFAIK. Do you use Mainline or some other kernel manager?

I used the ubuntu-mainline-kernel.sh script as described in this post:

https://askubuntu.com/questions/1388115/how-do-i-update-my-kernel-to-the-latest-one

@matthewmarcus

matthewmarcus commented Feb 21, 2024

> @matthewmarcus I haven't been using Ubuntu for quite some time now, but from what I googled quickly, for 20.04 the supported version in the HWE stack is 5.15, and the 6.7.5 most probably comes from mainline builds. The mainline builds are not supported, not guaranteed to work, and not recommended for production use. I don't say it's definitely a problem, but if you could try to run your node after booting from the officially supported 5.15 kernel from Ubuntu distro, you could probably save us a lot of debugging time :)
>
> I personally run 6.7.0 from the Manjaro distro, and I don't have any issues with secure validator mode, but that's not exactly the same as the mainline builds.

Well, the kernel we were using prior to 6.7.5 was 5.15.0-88-generic and that was giving us the same errors (see #2486 (comment)). So unless one of the minor builds after 5.15.0-88 fixed the issue, the 5.15 kernel isn't working either.

@matthewmarcus

> > @matthewmarcus Canonical hasn't officially released 6.7.5 kernel for 20.04 AFAIK. Do you use Mainline or some other kernel manager?
>
> Yeah, I've spun up a test Ubuntu 20.04 amd64 machine, and the latest kernel available through apt is linux-image-5.8.0-63-lowlatency/focal-updates,focal-security 5.8.0-63.71~20.04.1 amd64. Could it be a manually built unsigned image, and is that why Secure Boot is disabled? @matthewmarcus

That's interesting b/c we've never manually updated the kernel on this machine since originally installing Ubuntu 20.04, and the kernel it chose (?) to use was 5.15.0-88-generic. I did notice there were a boat load of other kernels on the box as well, but I removed them in an attempt to free up some disk space. Removing them, tho, did not free up any disk space. :) @maksimryndin

@matthewmarcus

matthewmarcus commented Feb 21, 2024

@maksimryndin how about apt install --install-recommends linux-generic-hwe-20.04? In theory, it should get you 5.15, at least if you're on the latest 20.04.5 LTS update

Just ran the lsb_release -a command on our box and it revealed:

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.6 LTS
Release: 20.04
Codename: focal

Sharing in case it helps with debugging. @s0me0ne-unkn0wn

Also, ran this command sudo apt install --install-recommends linux-generic-hwe-20.04 which resulted in these errors:

Reading package lists... Done
Building dependency tree
Reading state information... Done
You might want to run 'apt --fix-broken install' to correct these.
The following packages have unmet dependencies:
 linux-generic-hwe-20.04 : Depends: linux-image-generic-hwe-20.04 (= 5.15.0.94.104~20.04.50) but it is not going to be installed
                           Depends: linux-headers-generic-hwe-20.04 (= 5.15.0.94.104~20.04.50) but it is not going to be installed
 linux-headers-6.7.5-060705-generic : Depends: libc6 (>= 2.38) but 2.31-0ubuntu9.14 is to be installed
                                      Depends: libssl3 (>= 3.0.0) but it is not installable
E: Unmet dependencies. Try 'apt --fix-broken install' with no packages (or specify a solution).

@matthewmarcus

Did I scare everyone away? @s0me0ne-unkn0wn @maksimryndin

@s0me0ne-unkn0wn
Contributor

@matthewmarcus what does the OS-suggested apt --fix-broken install --install-recommends linux-generic-hwe-20.04 propose? Does it look like it wants to break your system completely if applied? 🙃

Honestly, I'd try to install Ubuntu from scratch, not using mainline builds and using the supported HWE stack. There's nothing special about NUC hardware that might prevent proper sandboxing AFAIK (@koute ?) so that's most probably a kernel issue. If you're able to sort it out with --fix-broken, that's okay, but if the system is seriously messed up, it's sometimes just easier to re-install.

@koute
Contributor

koute commented Feb 25, 2024

There's nothing special about NUC hardware that might prevent proper sandboxing AFAIK (@koute ?) so that's most probably a kernel issue.

Well, there can be a few reasons why this doesn't work, but AFAIK usually the reason is that the environment is configured to disallow unprivileged users from creating user namespaces. Some Linux distributions might be configured in such a way by default, and some containerization software (Docker/Podman/Kubernetes/insert your trendy alternative of the week) might also disallow it.

So the fundamental question here is: does this happen because of how the environment is configured, or does this happen because our code doesn't handle some corner case?

If it's the former - that's an unsupported configuration, and we should document this and tell the users how to fix it. (Ideally we could detect this exact situation and have the node print out a helpful error message.) If it's the latter - we need to fix our code.

Either way the fastest way to investigate and fix it is probably something like this:

  1. Ask the user about their environment (which Linux distro, which exact kernel, bare metal or VM, whether it runs under Docker/Podman/Kubernetes/whatever and, if so, the exact configuration, etc.).
  2. Replicate the same environment ourselves in a VM.
  3. See why it fails, and either document how users can reconfigure their environment or fix our code. (And if it doesn't fail we can then probe the user further to figure out what's different in their environment as opposed to ours, and then rinse and repeat.)

(At least that's what I would do.)
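
The environment questions in step 1 can be collected with a short script. The following is a minimal sketch, not something from the polkadot tooling: the function names `read_knob`, `userns_verdict` and `report_env` are hypothetical, and it only looks at two sysctl knobs that are known to block unprivileged user namespaces (`kernel.unprivileged_userns_clone`, which exists on Debian/Ubuntu kernels, and `user.max_user_namespaces`). Other mechanisms (container runtimes, seccomp filters, systemd unit options) are not detected by the sysctl check.

```shell
#!/bin/sh
# Sketch only: helper names below are hypothetical, not part of polkadot.

# Print a knob's value, or nothing if the file does not exist.
read_knob() {
    cat "$1" 2>/dev/null || true
}

# Interpret two sysctl knobs that can block unprivileged user namespaces.
#   $1: kernel.unprivileged_userns_clone (Debian/Ubuntu kernels; empty if absent)
#   $2: user.max_user_namespaces (empty if absent)
userns_verdict() {
    if [ "$1" = "0" ]; then
        echo "blocked: kernel.unprivileged_userns_clone=0"
    elif [ "$2" = "0" ]; then
        echo "blocked: user.max_user_namespaces=0"
    else
        echo "not blocked by sysctl"
    fi
}

# Collect the facts from step 1 in one place.
report_env() {
    echo "kernel: $(uname -r)"
    # PRETTY_NAME is the human-readable distro name in os-release.
    grep '^PRETTY_NAME=' /etc/os-release 2>/dev/null || echo "distro: unknown"
    # systemd-detect-virt prints "none" (and exits non-zero) on bare metal.
    echo "virtualization: $(systemd-detect-virt 2>/dev/null || true)"
    clone_knob=$(read_knob /proc/sys/kernel/unprivileged_userns_clone)
    max_knob=$(read_knob /proc/sys/user/max_user_namespaces)
    echo "userns sysctl: $(userns_verdict "$clone_knob" "$max_knob")"
    # Direct probe: try to create user and mount namespaces as this user.
    if command -v unshare >/dev/null 2>&1; then
        if unshare --user --map-root-user --mount true 2>/dev/null; then
            echo "unshare probe: ok"
        else
            echo "unshare probe: failed"
        fi
    fi
}
```

Calling `report_env` as the user that runs the node prints one line per fact. Run from an interactive shell, it only reflects host-level settings; restrictions imposed by a service manager apply only inside the service itself.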

@matthewmarcus

apt --fix-broken install --install-recommends linux-generic-hwe-20.04

@s0me0ne-unkn0wn I'm out of town at the moment, and don't want to issue that command until I return (this Wed) just in case it breaks the entire system. I'll let you know when I try and report back.

As for reinstalling Ubuntu, I have several nodes/systems running on this platform so doing that would be a real undertaking and result in significant down time. I would only want to do that as a last resort.

@maksimryndin has reached out and we're gonna look at the problem together in the coming days. If we figure anything out, we'll be sure to let you know.

@maksimryndin
Contributor

So with @matthewmarcus we have figured out the actual reason: polkadot was running under an overly restrictive systemd unit configuration (which also disabled the ability to create namespaces). Nothing special about the system itself.

Linux Good-KarMa 6.7.5-060705-generic #202402161836 SMP PREEMPT_DYNAMIC Fri Feb 16 19:10:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Matthew had created that restrictive systemd service long before the security features for PVF were introduced. When he updated to a release that included those security features, they couldn't be enabled.

We turned off systemd restrictions in favor of native polkadot security mechanisms and everything works as expected.
@s0me0ne-unkn0wn @bkchr @koute

@s0me0ne-unkn0wn
Contributor

@maksimryndin oh wow, thanks a lot for investigating that! Can you please elaborate on what restrictions were in force? It's probably worth mentioning in the documentation to avoid other users hitting the problem.

@matthewmarcus

matthewmarcus commented Mar 2, 2024

Yes! Many many thanks to @maksimryndin for his excellent guidance and support today. We spent several hours attempting to figure out the issue only to find, as was mentioned, my systemd config for the service was much too restrictive.

@s0me0ne-unkn0wn

Here is a portion of the config file. As you can see, once we commented out all of the unnecessary parameters, the service worked perfectly.

[Service]
User=polkadot
Group=polkadot
Type=simple
Restart=always
RestartSec=120
# MemoryHigh=5400M
# MemoryMax=5500M
# CapabilityBoundingSet=
# LockPersonality=true
# NoNewPrivileges=true
# PrivateDevices=true
# PrivateMounts=true
# PrivateTmp=true
# PrivateUsers=true
# ProtectClock=true
# ProtectControlGroups=true
# ProtectHostname=true
# ProtectKernelModules=true
# ProtectKernelTunables=true
# ProtectSystem=strict
# ReadWritePaths=/media/maxdrive/polkadot
# RemoveIPC=true
# RestrictAddressFamilies=AF_INET AF_INET6 AF_NETLINK AF_UNIX
# RestrictNamespaces=true
# RestrictSUIDSGID=true
# SystemCallArchitectures=native
# SystemCallFilter=@system-service
# SystemCallFilter=landlock_add_rule landlock_create_ruleset landlock_restrict_self seccomp
# SystemCallFilter=@sandbox
# SystemCallFilter=@obsolete
# SystemCallFilter=seccomp
# SystemCallErrorNumber=EPERM
# SystemCallFilter=~@clock @module @mount @reboot @swap @privileged
# UMask=0027
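
To see which hardening directives are still active in a unit, the uncommented lines can be filtered out of the unit file. This is a rough sketch: `active_hardening` is a hypothetical helper, and the directive patterns are simply taken from the config above. The thread does not establish which individual directive was responsible, so every match is only a candidate.

```shell
#!/bin/sh
# Sketch only: `active_hardening` is a hypothetical helper name.
# Reads a systemd unit file on stdin and prints the uncommented
# sandboxing-related directives. Lines starting with '#' never match
# because the directive name must be the first non-blank text.
active_hardening() {
    grep -E '^[[:space:]]*(CapabilityBoundingSet|LockPersonality|NoNewPrivileges|Private[A-Za-z]+|Protect[A-Za-z]+|Restrict[A-Za-z]+|SystemCall[A-Za-z]+)=' || true
}
```

Assuming the unit is named polkadot.service, `systemctl cat polkadot.service | active_hardening` lists the candidates, including any set in drop-in files. An empty result means none of these directives are set explicitly.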

@maksimryndin
Contributor

maksimryndin commented Mar 2, 2024

@maksimryndin oh wow, thanks a lot for investigating that! Can you please elaborate on what restrictions were in force? It's probably worth mentioning in the documentation to avoid other users hitting the problem.

So I would advise users who hit a similar issue to check their systemd security-related settings and turn them off in favor of native polkadot security features. And I believe we should come up with a standard template for troubleshooting this kind of thing (I can try to prepare a testing script and come up with GitHub issue template suggestions).

By the way, during our experiments (we tried to run zombienet first to avoid touching a running validator) we encountered an issue (filed here paritytech/zombienet#1737).

So,

  • come up with a simple testing script for security features and a guide for users (@maksimryndin)
  • fix an issue with zombienet default insecure mode for a validator (An option to turn off --insecure-validator-i-know-what-i-do zombienet#1737)
  • I would also suggest including a binary of undying-collator in Polkadot releases to be able to run the minimal zombienet on the target machine to check issues quickly without touching the production validator. What do you think @s0me0ne-unkn0wn @koute @bkchr ? I am not sure it is the right way but at least for me it was a viable solution to investigate.

@General-Beck

General-Beck commented Mar 11, 2024

In the case of creating a script for verification, I would recommend the following order:

  1. Check the current kernel version with uname -r at the very beginning of the script (for Ubuntu releases starting with 20.04.6 and 22.04.4, having kernel version 5.15.0-100 or higher is a sufficient condition)

  2. If necessary, update the kernel to the latest HWE version:
    sudo apt --fix-broken install --install-recommends linux-generic-hwe-20.04 for 20.04.6 and sudo apt --fix-broken install --install-recommends linux-generic-hwe-22.04 for 22.04

  3. Check that Landlock is working correctly:
    root@beck-home-desktop:/home/denis# sudo dmesg | grep landlock || journalctl -kg landlock
    [0.547818] LSM: initializing lsm=lockdown,capability,landlock,yama,apparmor,integrity
    [0.547832] landlock: Up and running.

  4. Most likely, warnings like these will still appear at startup:
    Mar 11 08:14:44 beck-home-desktop polkadot[11531]: 2024-03-11 08:14:44 🚨 Some security issues have been detected.
    Mar 11 08:14:44 beck-home-desktop polkadot[11531]: Running validation of malicious PVF code has a higher risk of compromising this machine.
    Mar 11 08:14:44 beck-home-desktop polkadot[11531]: - Optional: Cannot unshare user namespace and change root, which are Linux-specific kernel security features: not available: unshare user and mount namespaces: Operation not allowed (os error 1)
    Mar 11 08:14:44 beck-home-desktop polkadot[11531]: - Optional: Cannot call clone with all sandboxing flags, a Linux-specific kernel security features: not available: could not clone, errno: EPERM: Operation not allowed
    Mar 11 08:14:44 beck-home-desktop polkadot[11531]: 2024-03-11 08:14:44 👮♀️ Running in Secure Validator Mode. It is highly recommended that you operate according to our security guidelines.
    Mar 11 08:14:44 beck-home-desktop polkadot[11531]: More information: https://wiki.polkadot.network/docs/maintain-guides-secure-validator#secure-validator-mode
    In this case, you need to fix the systemd polkadot.service unit file, leaving only the following parameters (I think they are enough, since the binary has its own strict checks):
    root@beck-home-desktop:/home/denis# nano /etc/systemd/system/multi-user.target.wants/polkadot.service
    [Unit]
    Description=Polkadot Node
    After=network.target
    Documentation=https://github.com/paritytech/polkadot
    [Service]
    EnvironmentFile=-/etc/default/polkadot
    ExecStart=/usr/bin/polkadot $POLKADOT_CLI_ARGS
    User=polkadot
    Group=polkadot
    Restart=always
    RestartSec=120
    [Install]
    WantedBy=multi-user.target

  5. Next, reload systemd to apply the changes and restart the service:
    root@beck-home-desktop:/home/denis# systemctl daemon-reload
    root@beck-home-desktop:/home/denis# systemctl restart polkadot
    root@beck-home-desktop:/home/denis# journalctl -fu polkadot
    Mar 11 08:23:56 beck-home-desktop polkadot[11804]: 2024-03-11 08:23:56 👶 Starting BABE Authorization worker
    Mar 11 08:23:56 beck-home-desktop polkadot[11804]: 2024-03-11 08:23:56 🥩 BEEFY gadget waiting for BEEFY pallet to become available...
    Mar 11 08:23:56 beck-home-desktop polkadot[11804]: 2024-03-11 08:23:56 👮♀️ Running in Secure Validator Mode. It is highly recommended that you operate according to our security guidelines.
    Mar 11 08:23:56 beck-home-desktop polkadot[11804]: More information: https://wiki.polkadot.network/docs/maintain-guides-secure-validator#secure-validator-mode

Tested on Ubuntu 20.04.6, 22.04.4, 23.10. polkadot version 1.8.0-ec7817e5ad
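
The first and third checks from the list above could be sketched roughly as follows. The helper names `kernel_at_least`, `lsm_has_landlock` and `check_host` are hypothetical, the 5.15.0-100 threshold is the one suggested in this comment for Ubuntu kernels (it is not an official requirement), and the version comparison assumes GNU `sort -V` is available. Instead of grepping dmesg, the sketch reads the active LSM list from /sys/kernel/security/lsm, which assumes securityfs is mounted.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of polkadot.

# Succeeds if kernel release $1 (e.g. "5.15.0-88-generic") is at least
# version $2 (e.g. "5.15.0-100"). Assumes GNU sort with -V support.
kernel_at_least() {
    # Keep only the numeric part, e.g. "5.15.0-88-generic" -> "5.15.0-88".
    have=$(printf '%s\n' "$1" | sed -E 's/^([0-9]+(\.[0-9]+)*(-[0-9]+)?).*/\1/')
    # If the required version sorts first (or equal), the kernel is new enough.
    lowest=$(printf '%s\n%s\n' "$have" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Succeeds if the comma-separated LSM list in $1 contains landlock.
lsm_has_landlock() {
    case ",$1," in
        *,landlock,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Run both checks against the current host and print one line each.
check_host() {
    if kernel_at_least "$(uname -r)" "5.15.0-100"; then
        echo "kernel: ok"
    else
        echo "kernel: older than 5.15.0-100"
    fi
    if lsm_has_landlock "$(cat /sys/kernel/security/lsm 2>/dev/null)"; then
        echo "landlock: enabled"
    else
        echo "landlock: not listed in active LSMs"
    fi
}
```

Calling `check_host` prints the two results. The kernel comparison is only meaningful for Ubuntu-style release strings; other distros number their kernels differently.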

bkchr pushed a commit that referenced this issue Apr 10, 2024
* separate constants for average and worst case relay headers

* fix compilation
@acatangiu
Contributor

closing as stale

@acatangiu closed this as not planned on Oct 18, 2024

10 participants