Upgrade thespian to fix hang on macOS #1653
base: master
Conversation
I haven't delved into it, but this upgrade breaks Rally completely on macOS:
$ esrally race --track=geopoint --pipeline=benchmark-only --target-hosts=https://localhost:9222 --challenge=append-fast-with-conflicts --telemetry=node-stats --track-params="bulk_indexing_clients:32" --client-options="timeout:180,verify_certs:false,basic_auth_user:'elastic',basic_auth_password:'changeme'" --kill-running-processes --test-mode
    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/
[INFO] Race id is [0557ed5b-3bb2-46e9-a857-df177e17697b]
DEBUG:esrally.actor:Checking capabilities [{'Thespian ActorSystem Name': 'multiprocQueueBase', 'Thespian ActorSystem Version': 1, 'Python Version': (3, 8, 10, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1673927835624'}] against requirements [{'coordinator': True}] failed.
[WARNING] Could not terminate all internal processes within timeout. Please check and force-terminate all Rally processes.
[ERROR] Cannot race. This race ended with a fatal crash.
Getting further help:
*********************
* Check the log files in /Users/bradleydeam/.rally/logs for errors.
* Read the documentation at https://esrally.readthedocs.io/en/latest/.
* Ask a question on the forum at https://discuss.elastic.co/tags/c/elastic-stack/elasticsearch/rally.
* Raise an issue at https://github.com/elastic/rally/issues and include the log files in /Users/bradleydeam/.rally/logs.
--------------------------------
[INFO] FAILURE (took 25 seconds)
--------------------------------
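The failure mode is visible in the DEBUG line above: Thespian only places an actor on a system whose capabilities satisfy the actor's requirements, and Rally gates this through Thespian's `actorSystemCapabilityCheck` hook (the "Checking capabilities [...] failed" message comes from `esrally/actor.py`). A minimal sketch of that mechanism; `CoordinatorActor` here is a hypothetical stand-in for Rally's actor classes:

```python
from thespian.actors import Actor, ActorSystem

class CoordinatorActor(Actor):
    @staticmethod
    def actorSystemCapabilityCheck(capabilities, requirements):
        # Thespian calls this hook when deciding whether an actor system may
        # host the actor; Rally's version logs the "Checking capabilities
        # [...] failed" message seen above.
        return capabilities.get("coordinator", False)

    def receiveMessage(self, message, sender):
        self.send(sender, "ack")

# A system started with the capability can host the actor ...
asys = ActorSystem("multiprocTCPBase", capabilities={"coordinator": True})
try:
    actor = asys.createActor(CoordinatorActor,
                             targetActorRequirements={"coordinator": True})
    print(asys.ask(actor, "ping", 5))  # -> "ack"
finally:
    asys.shutdown()
# ... whereas the offline fallback ("multiprocQueueBase" above) is started
# without {"coordinator": True}, so the same createActor call raises
# NoCompatibleSystemForActor there.
```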
Rally logs
2023-01-17 03:57:10,587 -not-actor-/PID:98405 esrally.rally INFO Actor system already running locally? [False]
2023-01-17 03:57:10,587 -not-actor-/PID:98405 esrally.actor INFO Starting actor system with system base [multiprocTCPBase] and capabilities [{'coordinator': True, 'ip': '127.0.0.1', 'Convention Address.IPv4': '127.0.0.1:1900'}].
2023-01-17 03:57:10,615 -not-actor-/PID:98420 root INFO ++++ Actor System gen (3, 10) started, admin @ ActorAddr-(T|:1900)
2023-01-17 03:57:15,616 -not-actor-/PID:98405 esrally.actor ERROR Could not initialize internal actor system.
Traceback (most recent call last):
File "/Users/bradleydeam/perf/github.com/b-deam/rally/esrally/actor.py", line 263, in bootstrap_actor_system
return thespian.actors.ActorSystem(system_base, logDefs=log.load_configuration(), capabilities=capabilities)
File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/actors.py", line 637, in __init__
systemBase = self._startupActorSys(
File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/actors.py", line 678, in _startupActorSys
systemBase = sbc(self, logDefs=logDefs)
File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/multiprocTCPBase.py", line 28, in __init__
super(ActorSystemBase, self).__init__(system, logDefs)
File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/multiprocCommon.py", line 83, in __init__
super(multiprocessCommon, self).__init__(system, logDefs)
File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/systemBase.py", line 326, in __init__
self._startAdmin(self.adminAddr,
File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/multiprocCommon.py", line 115, in _startAdmin
raise InvalidActorAddress(adminAddr,
thespian.actors.InvalidActorAddress: ActorAddr-(T|:1900) is not a valid ActorSystem admin
2023-01-17 03:57:15,621 -not-actor-/PID:98405 esrally.rally INFO Falling back to offline actor system.
2023-01-17 03:57:15,636 -not-actor-/PID:98433 root INFO ++++ Actor System gen (3, 10) started, admin @ ActorAddr-Q.ThespianQ
2023-01-17 03:57:15,650 -not-actor-/PID:98405 esrally.racecontrol INFO Race id is [0557ed5b-3bb2-46e9-a857-df177e17697b]
2023-01-17 03:57:15,651 -not-actor-/PID:98405 esrally.racecontrol INFO User specified pipeline [benchmark-only].
2023-01-17 03:57:15,651 -not-actor-/PID:98405 esrally.racecontrol INFO Using configured hosts [{'host': 'localhost', 'port': 9222, 'use_ssl': True}]
2023-01-17 03:57:18,660 -not-actor-/PID:98405 esrally.rally INFO Attempting to shutdown internal actor system.
2023-01-17 03:57:18,684 -not-actor-/PID:98433 root INFO ---- Actor System shutdown
2023-01-17 03:57:18,685 -not-actor-/PID:98405 esrally.rally INFO Actor system is still running. Waiting...
2023-01-17 03:57:19,691 -not-actor-/PID:98405 esrally.rally INFO Actor system is still running. Waiting...
2023-01-17 03:57:20,696 -not-actor-/PID:98405 esrally.rally INFO Actor system is still running. Waiting...
2023-01-17 03:57:21,702 -not-actor-/PID:98405 esrally.rally INFO Actor system is still running. Waiting...
2023-01-17 03:57:22,704 -not-actor-/PID:98405 esrally.rally INFO Actor system is still running. Waiting...
2023-01-17 03:57:23,709 -not-actor-/PID:98405 esrally.rally INFO Actor system is still running. Waiting...
2023-01-17 03:57:24,715 -not-actor-/PID:98405 esrally.rally INFO Actor system is still running. Waiting...
2023-01-17 03:57:25,716 -not-actor-/PID:98405 esrally.rally INFO Actor system is still running. Waiting...
2023-01-17 03:57:26,721 -not-actor-/PID:98405 esrally.rally INFO Actor system is still running. Waiting...
2023-01-17 03:57:27,723 -not-actor-/PID:98405 esrally.rally INFO Actor system is still running. Waiting...
2023-01-17 03:57:28,725 -not-actor-/PID:98405 esrally.rally INFO Actor system is still running. Waiting...
2023-01-17 03:57:29,731 -not-actor-/PID:98405 esrally.rally INFO Actor system is still running. Waiting...
2023-01-17 03:57:30,734 -not-actor-/PID:98405 esrally.rally INFO Actor system is still running. Waiting...
2023-01-17 03:57:31,738 -not-actor-/PID:98405 esrally.rally INFO Actor system is still running. Waiting...
2023-01-17 03:57:32,741 -not-actor-/PID:98405 esrally.rally INFO Actor system is still running. Waiting...
2023-01-17 03:57:33,742 -not-actor-/PID:98405 esrally.rally WARNING Shutdown timed out. Actor system is still running.
2023-01-17 03:57:33,742 -not-actor-/PID:98405 esrally.rally ERROR Cannot run subcommand [race].
Traceback (most recent call last):
File "/Users/bradleydeam/perf/github.com/b-deam/rally/esrally/racecontrol.py", line 369, in run
pipeline(cfg)
File "/Users/bradleydeam/perf/github.com/b-deam/rally/esrally/racecontrol.py", line 71, in __call__
self.target(cfg)
File "/Users/bradleydeam/perf/github.com/b-deam/rally/esrally/racecontrol.py", line 308, in benchmark_only
return race(cfg, external=True)
File "/Users/bradleydeam/perf/github.com/b-deam/rally/esrally/racecontrol.py", line 256, in race
benchmark_actor = actor_system.createActor(BenchmarkActor, targetActorRequirements={"coordinator": True})
File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/actors.py", line 704, in createActor
return self._systemBase.newPrimaryActor(actorClass,
File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/systemBase.py", line 205, in newPrimaryActor
raise NoCompatibleSystemForActor(
thespian.actors.NoCompatibleSystemForActor: No compatible ActorSystem could be found for Actor <class 'esrally.racecontrol.BenchmarkActor'>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/bradleydeam/perf/github.com/b-deam/rally/esrally/rally.py", line 1171, in dispatch_sub_command
race(cfg, args.kill_running_processes)
File "/Users/bradleydeam/perf/github.com/b-deam/rally/esrally/rally.py", line 919, in race
with_actor_system(racecontrol.run, cfg)
File "/Users/bradleydeam/perf/github.com/b-deam/rally/esrally/rally.py", line 949, in with_actor_system
runnable(cfg)
File "/Users/bradleydeam/perf/github.com/b-deam/rally/esrally/racecontrol.py", line 378, in run
raise exceptions.RallyError("This race ended with a fatal crash.").with_traceback(tb)
File "/Users/bradleydeam/perf/github.com/b-deam/rally/esrally/racecontrol.py", line 369, in run
pipeline(cfg)
File "/Users/bradleydeam/perf/github.com/b-deam/rally/esrally/racecontrol.py", line 71, in __call__
self.target(cfg)
File "/Users/bradleydeam/perf/github.com/b-deam/rally/esrally/racecontrol.py", line 308, in benchmark_only
return race(cfg, external=True)
File "/Users/bradleydeam/perf/github.com/b-deam/rally/esrally/racecontrol.py", line 256, in race
benchmark_actor = actor_system.createActor(BenchmarkActor, targetActorRequirements={"coordinator": True})
File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/actors.py", line 704, in createActor
return self._systemBase.newPrimaryActor(actorClass,
File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/systemBase.py", line 205, in newPrimaryActor
raise NoCompatibleSystemForActor(
esrally.exceptions.RallyError: This race ended with a fatal crash.
I tried it against a non-loopback interface and got the same error. I wonder if @inqueue has the same issue?
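To make the logs above easier to follow, this is roughly the control flow they show; a paraphrasing sketch under the assumption that Rally tries the TCP base first and falls back to the offline queue base, not Rally's actual code:

```python
import thespian.actors

def bootstrap_with_fallback():
    capabilities = {"coordinator": True, "ip": "127.0.0.1",
                    "Convention Address.IPv4": "127.0.0.1:1900"}
    try:
        return thespian.actors.ActorSystem("multiprocTCPBase",
                                           capabilities=capabilities)
    except thespian.actors.InvalidActorAddress:
        # "ActorAddr-(T|:1900) is not a valid ActorSystem admin", as in the
        # traceback above. The offline fallback never receives
        # {"coordinator": True}, which is why the later
        # createActor(..., targetActorRequirements={"coordinator": True})
        # raises NoCompatibleSystemForActor.
        return thespian.actors.ActorSystem("multiprocQueueBase")
```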
FWIW, I tested with
Thanks! Interesting. If it's not too much work, would you mind trying again with thespian debug logs activated and sharing the output?
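For reference, Thespian's internal debug logging is controlled by environment variables that must be set before the actor system starts; a minimal sketch with illustrative values (Rally normally points this file into its own log directory):

```python
import os

# Thespian's thesplog reads these when the ActorSystem is created; the path
# shown is illustrative. Setting THESPLOG_THRESHOLD=DEBUG enables the "dbg"
# lines shown in the output below.
os.environ["THESPLOG_FILE"] = os.path.expanduser(
    "~/.rally/logs/actor-system-internal.log")
os.environ["THESPLOG_THRESHOLD"] = "DEBUG"
```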
Sure! I think the key line is here, but I haven't dug into why my system is no longer meeting the capability requirements. In addition to testing with

$ tail -f actor-system-internal.log
2023-01-17 17:24:23.715134 p11971 dbg ++++ Starting Admin from /Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/__init__.py
2023-01-17 17:24:23.715477 p11971 I ++++ Admin started @ ActorAddr-(T|:1900) / gen (3, 10)
2023-01-17 17:24:28.757419 p12025 dbg ++++ Starting Admin from /Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/__init__.py
2023-01-17 17:24:28.761218 p12025 I ++++ Admin started @ ActorAddr-Q.ThespianQ / gen (3, 10)
2023-01-17 17:24:28.770882 p12025 dbg Admin of ReceiveEnvelope(from: ActorAddr-Q.ThespianQ.a, <class 'thespian.system.messages.multiproc.LoggerConnected'> msg: <thespian.system.messages.multiproc.LoggerConnected object at 0x106bd5dc0>)
2023-01-17 17:24:28.771869 p12025 dbg actualTransmit of TransportIntent(ActorAddr-Q.~-pending-ExpiresIn_0:04:59.999780-<class 'thespian.system.messages.multiproc.EndpointConnected'>-<thespian.system.messages.multiproc.EndpointConnected object at 0x106bd5f10>-quit_0:04:59.999757)
2023-01-17 17:24:28.774415 p11963 dbg actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ-pending-ExpiresIn_0:04:59.999684-<class 'thespian.system.messages.admin.QueryExists'>-<thespian.system.messages.admin.QueryExists object at 0x106bcaf70>-quit_0:04:59.999662)
2023-01-17 17:24:28.775267 p12025 dbg Admin of ReceiveEnvelope(from: ActorAddr-Q.~, <class 'thespian.system.messages.admin.QueryExists'> msg: <thespian.system.messages.admin.QueryExists object at 0x106be6040>)
2023-01-17 17:24:28.775951 p12025 dbg Attempting intent TransportIntent(ActorAddr-Q.~-pending-ExpiresIn_0:04:59.999735-<class 'thespian.system.messages.admin.QueryAck'>-<thespian.system.messages.admin.QueryAck object at 0x106bd5d60>-quit_0:04:59.999715)
2023-01-17 17:24:28.776373 p12025 dbg actualTransmit of TransportIntent(ActorAddr-Q.~-pending-ExpiresIn_0:04:59.999303-<class 'thespian.system.messages.admin.QueryAck'>-<thespian.system.messages.admin.QueryAck object at 0x106bd5d60>-quit_0:04:59.999289)
2023-01-17 17:24:28.779637 p11963 dbg actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ-pending-ExpiresIn_0:04:59.999913-<class 'thespian.system.messages.admin.PendingActor'>-PendingActor#1_of_None-quit_0:04:59.999898)
2023-01-17 17:24:28.780059 p12025 dbg Admin of ReceiveEnvelope(from: ActorAddr-Q.~1, <class 'thespian.system.messages.admin.PendingActor'> msg: PendingActor#1_of_None)
2023-01-17 17:24:28.780399 p12025 I Pending Actor request received for esrally.racecontrol.BenchmarkActor reqs {'coordinator': True} from ActorAddr-Q.~1
2023-01-17 17:24:28.781272 p12025 dbg actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ.a-pending-ExpiresIn_0:04:59.999898-<class 'logging.LogRecord'>-<LogRecord: esrally.actor, 10, /Users/bradleydeam/perf/github.com/b-deam/rally/esrally/actor.py, 122, "Checking capabilities [{'Thespian ActorSystem N...-quit_0:04:59.999882)
2023-01-17 17:24:28.782002 p12025 Warn no system has compatible capabilities for Actor esrally.racecontrol.BenchmarkActor
2023-01-17 17:24:28.782706 p12025 dbg Attempting intent TransportIntent(ActorAddr-Q.~1-pending-ExpiresIn_0:04:59.999777-<class 'thespian.system.messages.admin.PendingActorResponse'>-PendingActorResponse(for ActorAddr-Q.~1 inst# 1) errCode 3586 actual None-quit_0:04:59.999756)
2023-01-17 17:24:28.783091 p12025 dbg actualTransmit of TransportIntent(ActorAddr-Q.~1-pending-ExpiresIn_0:04:59.999380-<class 'thespian.system.messages.admin.PendingActorResponse'>-PendingActorResponse(for ActorAddr-Q.~1 inst# 1) errCode 3586 actual None-quit_0:04:59.999366)
2023-01-17 17:24:31.786774 p11963 I ActorSystem shutdown requested.

Also worth adding a stripped-down repro using
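A stripped-down reproduction along those lines might look like the following; it assumes the truncated comment above refers to driving Thespian directly with `multiprocTCPBase`, without Rally:

```python
# Hypothetical minimal repro, independent of Rally: start a TCP-based actor
# system on the convention port Rally uses. On an affected macOS setup this
# is expected to fail with InvalidActorAddress (or hang on shutdown),
# mirroring the logs above.
from thespian.actors import ActorSystem

asys = ActorSystem("multiprocTCPBase",
                   capabilities={"coordinator": True,
                                 "Convention Address.IPv4": "127.0.0.1:1900"})
asys.shutdown()
```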
Thanks a lot! The only information I still need to report the bug to Thespian is the same logs, but on thespian 3.10.6. Also, why specifically
I see the same issue, and I also saw it with a patched 3.10.6 during testing. Unfortunately, I originally validated the changes from https://github.com/thespianpy/Thespian/tree/issue70 by patching the version currently in use by Rally (3.10.1). The error has been referenced at thespianpy/Thespian#70 (comment).
For what it's worth, I'm seeing
I'm hoping to be able to re-spin this PR after thespianpy/Thespian#77 is addressed.
Closes #1575
See thespianpy/Thespian#70 for more context.