
Parachain and Rococo relaychain randomly crash at 0.9.22 and 0.9.23 #5639

Closed
jasl opened this issue Jun 5, 2022 · 23 comments · Fixed by #5861

Comments

@jasl
Contributor

jasl commented Jun 5, 2022

  • Ubuntu 20.04 latest
  • Polkadot 0.9.23 from the official release page
  • Impacts the Khala parachain node (based on the polkadot-v0.9.23 branch) and Polkadot rococo-local (as the relaychain for polkadot-launch)
  • I've seen Polkadot (the relaychain) crash 2 times on our test net, and 2 times for my hosted khala-node (connected to the live net)
  • The panic seems random; nothing interesting in the log, just a crash report
  • First seen on polkadot-v0.9.22
  • Not sure whether polkadot-v0.9.20 and below have this issue
    • I got multiple random-crash reports last week (Khala parachain node based on the polkadot-v0.9.20 branch; restarting works, but the users did not record any logs)

The crash log:

2022-06-04 23:45:56 💤 Idle (6 peers), best: #3920 (0x6f21…68db), finalized #3917 (0xcb4c…c62a), ⬇ 3.1kiB/s ⬆ 4.2kiB/s    
2022-06-04 23:46:00 ✨ Imported #3921 (0x34d8…6d9a)    

====================

Version: 0.9.23-a7e188cd966

   0: sp_panic_handler::set::{{closure}}
   1: std::panicking::rust_panic_with_hook
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:702:17
   2: std::panicking::begin_panic::{{closure}}
   3: std::sys_common::backtrace::__rust_end_short_backtrace
   4: std::panicking::begin_panic
   5: <quicksink::SinkImpl<S,F,T,A,E> as futures_sink::Sink<A>>::poll_ready
   6: <libp2p::bandwidth::BandwidthConnecLogging<TInner> as futures_io::if_std::AsyncWrite>::poll_write
   7: <libp2p_noise::io::framed::NoiseFramed<T,S> as futures_sink::Sink<&alloc::vec::Vec<u8>>>::poll_ready
   8: <libp2p_noise::io::framed::NoiseFramed<T,S> as futures_sink::Sink<&alloc::vec::Vec<u8>>>::poll_flush
   9: <libp2p_noise::io::NoiseOutput<T> as futures_io::if_std::AsyncWrite>::poll_flush
  10: multistream_select::negotiated::Negotiated<TInner>::poll
  11: <multistream_select::negotiated::Negotiated<TInner> as futures_io::if_std::AsyncWrite>::poll_close
  12: yamux::connection::Connection<T>::next_stream::{{closure}}
  13: <futures_util::stream::try_stream::ErrInto<St,E> as futures_core::stream::Stream>::poll_next
  14: <libp2p_core::muxing::Wrap<T> as libp2p_core::muxing::StreamMuxer>::close
  15: <futures_util::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
  16: tokio::runtime::task::harness::poll_future
  17: tokio::runtime::task::raw::poll
  18: tokio::runtime::thread_pool::worker::Context::run_task
  19: tokio::runtime::thread_pool::worker::run
  20: tokio::runtime::task::raw::poll
  21: std::sys_common::backtrace::__rust_begin_short_backtrace
  22: core::ops::function::FnOnce::call_once{{vtable.shim}}
  23: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/alloc/src/boxed.rs:1853:9
      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/alloc/src/boxed.rs:1853:9
      std::sys::unix::thread::Thread::new::thread_start
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/sys/unix/thread.rs:108:17
  24: start_thread
  25: clone


Thread 'tokio-runtime-worker' panicked at 'SinkImpl::poll_ready called after error.', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/quicksink-0.1.2/src/lib.rs:158

This is a bug. Please report it at:

        https://github.com/paritytech/polkadot/issues/new
@jasl changed the title from "Parachain and Rococo relaychain randomly crash at 0.9.23" to "Parachain and Rococo relaychain randomly crash at 0.9.22 and 0.9.23" on Jun 5, 2022
@ordian
Member

ordian commented Jun 5, 2022

Seems like a panic in libp2p? Could be related to paritytech/substrate#11009? cc @kpp @tomaka

@jasl
Contributor Author

jasl commented Jun 5, 2022

If it relates to libp2p: in my local net I ran 2 parachain nodes in different VMs;
the test net is built with polkadot-launch, so one VM runs 4 Polkadot validators and 3 collators.

@kpp
Contributor

kpp commented Jun 5, 2022

Seems like this is libp2p/rust-libp2p#2598. cc @mxinden

@mxinden
Contributor

mxinden commented Jun 6, 2022

Upstream bug report: libp2p/rust-libp2p#2599

@doutv

doutv commented Jun 13, 2022

Same issue when running the Substrate tutorial: https://docs.substrate.io/tutorials/v3/cumulus/start-relay/

Both relay chain validator and parachain node panic.

@NZT48

NZT48 commented Jul 24, 2022

I have the same issue after updating the parachain to 0.9.26. @jasl have you managed to fix this and run the parachain on higher versions, and does this only happen on Rococo-connected parachains?

@jasl
Contributor Author

jasl commented Jul 25, 2022

I have the same issue after updating the parachain to 0.9.26. @jasl have you managed to fix this and run the parachain on higher versions, and does this only happen on Rococo-connected parachains?

This is a libp2p bug and it is not fixed yet, so the latest code is affected too. It affects not only the Rococo relaychain but Polkadot & Kusama as well, so IMO this issue should be of medium or high importance.

@kpp
Contributor

kpp commented Jul 25, 2022

According to nimiq/core-rs-albatross#732 (comment):

The original issue was observed during low/no available memory left on some old nodes.
It has not been observed with the public and new internal devnet nodes, so I'm closing this issue for now.

Do you observe similar things with the given conditions?

@jasl
Contributor Author

jasl commented Jul 25, 2022

According to nimiq/core-rs-albatross#732 (comment):

The original issue was observed during low/no available memory left on some old nodes.
It has not been observed with the public and new internal devnet nodes, so I'm closing this issue for now.

Do you observe similar things with the given conditions?

For us it might not be memory-related; we have 16 GB of memory and 16 GB of swap.

[screenshot]

The problem occurred on this server more than 5 times, but the latest week has been good.

I got a bunch of reports from our users; I think I may ask them.

@jasl
Contributor Author

jasl commented Jul 25, 2022

Ah, I forgot I've downgraded to Polkadot 0.9.18, so it's good.
@kpp this is the typical mem usage of our testnet; should I switch to Polkadot 0.9.26 to see whether it crashes?

@jasl
Contributor Author

jasl commented Jul 25, 2022

nimiq/core-rs-albatross#732 (comment)

Phala-Network/khala-parachain#150 (comment)

Here's what a user replied:

CONTAINER ID   NAME         CPU %     MEM USAGE / LIMIT     MEM %     NET I/O          BLOCK I/O         PIDS
948e79c3a2f6   phala-node   106.31%   28.84GiB / 125.7GiB   22.95%    816GB / 4.88TB   2.33TB / 2.99TB   218

@NZT48

NZT48 commented Jul 25, 2022

@kpp, @jasl: we at OriginTrail experienced the issue on our testnet nodes with Polkadot v0.9.26 and memory usage of around 20%.

@mxinden
Contributor

mxinden commented Jul 27, 2022

Hi everyone, can someone provide logs covering the 10 seconds before the panic, at debug level (RUST_LOG=debug)? See libp2p/rust-libp2p#2598 (comment)
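
If it helps, one minimal way to capture such logs (a sketch only; the binary name, chain, and log file below are placeholders for your own setup):

# Run the node with debug-level logging and keep a copy on disk,
# so the ~10 seconds before a panic can be extracted afterwards.
RUST_LOG=debug ./polkadot --chain rococo-local 2>&1 | tee node-debug.log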

@mxinden
Contributor

mxinden commented Aug 1, 2022

Cross-referencing libp2p/rust-yamux#137 here. Help testing would be very much appreciated.

@jasl
Contributor Author

jasl commented Aug 1, 2022

Cross-referencing libp2p/rust-yamux#137 here. Help testing would be very much appreciated.

I can't reliably reproduce the problem in a short time, but I'll try to build a testnet and run it for a few days.

@jasl
Contributor Author

jasl commented Aug 1, 2022

Cross-referencing libp2p/rust-yamux#137 here. Help testing would be very much appreciated.

How can we prove the issue is fixed? Just run it and watch for no more crashes?

@mxinden
Contributor

mxinden commented Aug 1, 2022

In case you no longer see any panics with the patched version in an environment where you would previously see panics, I would consider this issue fixed.

@gonzamontiel

Hey guys, do you consider this bug fixed then? We are experiencing the same error message in v0.9.24 while connecting our collator to Rococo. Just to be sure of the right approach: would this patch in our node's Cargo.toml be enough to test that it's fixed?

[patch.crates-io]
yamux = { git = 'https://github.com/mxinden/yamux', branch = "next-frame-error" }

@mxinden
Contributor

mxinden commented Aug 4, 2022

Hey guys, do you consider this bug fixed then?

👋 It is not yet fixed. I prepared a dirty patch ("next-frame-error" branch) to validate my suspicion. According to the above testing, my suspicion is correct. Thus I created libp2p/rust-yamux#138 which is a patch I would consider ready for merge (i.e. not dirty).

In case you want to help this process, you could run libp2p/rust-yamux#138 on one of your test networks via the below in your Cargo.toml:

[patch.crates-io]
yamux = { git = 'https://github.com/mxinden/yamux', branch = "next-result-immediately" }

Once I release libp2p/rust-yamux#138, you only need to run cargo update -p yamux to upgrade to the patched released version.
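
For reference, that update and a quick check of the resolved version could look like this (the grep is just one way to inspect the lockfile):

# Update only the yamux crate in Cargo.lock, then show the resolved version.
cargo update -p yamux
grep -A 1 'name = "yamux"' Cargo.lock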

@mxinden
Contributor

mxinden commented Aug 5, 2022

Heads up, yamux v0.10.2 is released. See libp2p/rust-yamux#138 (comment) for details.

Please upgrade to the new version. Thanks for the help everyone!

@jasl
Contributor Author

jasl commented Aug 5, 2022

@ordian could you help to release a new polkadot v0.9.27 binary?

@bkchr
Member

bkchr commented Aug 5, 2022

As commented here, no need to create a new polkadot release: libp2p/rust-libp2p#2598 (comment)

@gonzamontiel

Thanks @mxinden, I can confirm that applying the patch yesterday solved our problem; we have now updated yamux to 0.10.2.
