Skip to content
This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Fix for cleos and keosd race condition #8671

Merged
merged 6 commits into from
Mar 2, 2020
Merged

Fix for cleos and keosd race condition #8671

merged 6 commits into from
Mar 2, 2020

Conversation

allenhan2
Copy link
Contributor

@allenhan2 allenhan2 commented Feb 20, 2020

Change Description

issue: #8614
When keosd is not running, cleos will start keosd, then immediately try to connect to its socket, then fails because wallet_api has not started yet right after http port starts taking incoming traffic.

The fix is to let http_plugin startup yield to others' (including wallet api) startup finish.

Consensus Changes

  • Consensus Changes

API Changes

  • API Changes

Documentation Additions

  • Documentation Additions

Copy link
Contributor

@swatanabe-b1 swatanabe-b1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a better approach would be something like Matt's old PR #5705.
You can avoid the issues with threading that he mentioned with something along these lines:

  1. open a pipe
  2. fork
    3.1) in the child, start appbase, then close the pipe.
    3.2) in the parent, wait on the pipe for EOF and then exit.

try_local_port(2000);

std::string keos_msg;
constexpr size_t TOTOAL_RETRIES = 10;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TOTAL.

for(; tries<TOTOAL_RETRIES; ++tries)
{
std::this_thread::sleep_for(std::chrono::milliseconds(KEOS_WAIT_TIME_MS));
std::getline(keos_out, keos_msg);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getline is blocking. There's no need to wait.

Copy link
Contributor Author

@allenhan2 allenhan2 Feb 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It behaves as syncing, if keosd returns -1 before (daemonizing) sending "keosd::config::keosd_started_msg", pstream will make getline return immediately. Then, cleos knows that retry fails, print out err msg, then move on, no wait.

if keosd is already there before cleos runs, cleos won't wait. Otherwise, cleos starts keosd, and wait for keosd finishing startup (right before daemonizing). For any reason, if keosd hangs within startup, cleos will hang there too.

If cleos should run independently from keosd, then correct, will change to non-block read instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new approach is making http_plug startup yield to wallet-api startup. So, no change here.

@allenhan2
Copy link
Contributor Author

The problem seems to me is that the cleos does not know when keosd finishes startup (before daemonizing). not sure why using fork and pipe helps sync the state.

current pstream is the wrapper around pipe between parent and child. Using async read will make cleos and keosd independent.

@swatanabe-b1
Copy link
Contributor

The problem seems to me is that the cleos does not know when keosd finishes startup (before daemonizing). not sure why using fork and pipe helps sync the state.

Did you look at #5705? I meant to use fork at startup in keosd.

current pstream is the wrapper around pipe between parent and child. Using async read will make cleos and keosd independent.

You shouldn't abuse stdout for this synchronization. keosd may write other things to stdout and will break if it's a pipe whose other end is closed instead of /dev/null.

@allenhan2
Copy link
Contributor Author

thanks steve, only saw matt's issue description but did not see his PR. understand what you mean now.

@allenhan2
Copy link
Contributor Author

allenhan2 commented Feb 21, 2020

chatted with @spoonincode, and suggestion is focusing on why it is broken and not before, could be related to some subtle asio priority issue. At the same time, chatted with @swatanabe-b1, and could be the issue is always there. I will do more research on this before further action.

@allenhan2
Copy link
Contributor Author

From code, the issue is cleos thinks keosd is ready when http port is pingable. But, very often wallet api has not finished loading yet when port is pingable.
Since I can not pass the logging setting from cleos -> keos ->http_plugin->wallet api loading, I can not verify my theory.
However, it helps cleos to sync with keosd state by adding signal sync. It is minimal change to make cleos and keosd working.
The syncing issue is application level, need to refactor and test across all applications in the future.

Copy link
Contributor

@heifner heifner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll let Steven comment on approach. These are mainly style comments.

static constexpr uint32_t WAIT_DURATION_MS = 8000;
};
} // cleos
} // eosio
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not go in this file. This file is copied from an open source project.

if (!cleos::KeosdSignalHndl::wait_keosd_start())
{
std::cerr << "Failed to launch " << binPath << std::endl;
}
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow code style found in files you modify.

@@ -77,6 +77,7 @@ Usage: ./cleos create account [OPTIONS] creator name OwnerKey ActiveKey
#include <fc/exception/exception.hpp>
#include <fc/variant_object.hpp>
#include <fc/static_variant.hpp>
#include <thread>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

include not needed.

@@ -43,12 +60,23 @@ int main(int argc, char** argv)
.default_unix_socket_path = keosd::config::key_store_executable_name + ".sock",
.default_http_port = 0
});
const int parent_id = get_parent_pid(argc, argv);
const int argc_no_parent_id = (parent_id > 0) ? argc - 2 : argc;
char** const argv_no_parent_id = (parent_id > 0) ? &argv[2] : argv;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems very fragile. What if it is not the first argument?

{
break;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use consistent code style found used in files you modify.


namespace eosio {
namespace cleos {
struct KeosdSignalHndl {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use snake case for all struct/classes. Follow code style found in existing code.

return (ready_signal_received == KEOSD_READY);
}
static state ready_signal_received;
static constexpr uint32_t WAIT_DURATION_MS = 8000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All caps is used for macros not constants.

namespace eosio {
namespace cleos {
struct KeosdSignalHndl {
enum state : size_t
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use enum class

enum state : size_t
{
KEOSD_NOT_READY = 0,
KEOSD_READY = 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not use all caps for constants

}
return (ready_signal_received == KEOSD_READY);
}
static state ready_signal_received;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

init variable

@spoonincode
Copy link
Contributor

The current approach, as cheesy at it is, has worked correctly in the past. From a quick code review, I am a little suspicious that the http thread pool may have introduced this problem. The thread_pool starts event handling immediately, so when the HTTP server starts listening in http_pligin::plugin_startup() it is able to service incoming requests immediately. However, it's not until wallet_api_plugin::plugin_startup() that the wallet endpoints are registered. I'm pretty sure http_plugin startup will always run before wallet_api_plugin::plugin_startup(), so it's ready to serve requests before wallet endpoints are registered.

Previously back when there was no http thread pool, the event loop http plugin was using wouldn't have been pumped until all the plugin startup()s had been called.

I wonder if we simply moved the add_api() calls in wallet_api's startup to wallet_api's initialize if everything would be fixed? It's a little icky because I think it means the wallet plugin may be sent requests before wallet's startup()....... but the wallet plugin doesn't have a startup() so I don't think we'd run afoul of anything at the moment?

Though, fwiw, I'm much in favor of a better startup solution. But the approach in this PR and the previous daemonization approach I did don't support win32. That's something that in my opinion we need to be mindful about.

@spoonincode
Copy link
Contributor

Well, if the above theory is really the problem, probably having http_plugin post() its startup routine to the main thread event loop is even better.

@allenhan2
Copy link
Contributor Author

@spoonincode made a quick try, your theory works. Let me see how to make the http_plugin post()

}

static void handle_keosd_start_signal(int) {
ready_signal_received = KEOSD_READY;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This causes a data race. See [intro.races]
"Two actions are potentially concurrent if... they are unsequenced, at least one is performed by a signal handler, and they are not both performed by the same signal handler invocation."

"The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other..."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thanks, it is good reading.


if (parent_id > 0)
{
const int ret_sig_que = kill(parent_id, SIGUSR1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the parent has already exited?
What happens if keosd initialization fails (e.g. someone else launched keosd concurrently).

@allenhan2
Copy link
Contributor Author

Added post to delay the threadpool and related port initialization in http_plugin's startup. If any exception during delayed initialization, shutdown the app. Tested on my mac desktop, seems working well. Please let me know any other tests or test cases.
thanks, @spoonincode @swatanabe-b1 @heifner

if(my->listen_endpoint) {
app().post(appbase::priority::high, [this] ()
{
my->thread_pool.emplace( "http", my->thread_pool_size );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be moved into the try-catch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@allenhan2 allenhan2 dismissed swatanabe-b1’s stale review February 26, 2020 16:26

The new approach is to make http start_up yield to wallet api start_up, no change on cleos side

@allenhan2
Copy link
Contributor Author

@swatanabe-b1 could you please help to give a review of this new approach. @heifner asks your review on the approach (even it was for the previous one),
BTW, dismissed your comments on the old approach though.
thanks!

@allenhan2 allenhan2 merged commit a58e485 into develop Mar 2, 2020
@allenhan2 allenhan2 deleted the cleos_keods_race branch March 2, 2020 19:50
revl added a commit that referenced this pull request May 8, 2020
These changes were supposed to fix the problem of keosd auto-start by
cleos, which manifests itself in the following error message:

    $ ./programs/cleos/cleos wallet list
    .../programs/keosd/keosd" launched
    Unable to connect to keosd, if keosd is running please kill the
    ...process and try again.

I cannot confirm that the fix works, but the patch from #8671 is applied
verbatim without conflicts.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants