[3.0] Manage global resources and executor services, fix zk client connections #9033
Conversation
-            applicationDeployer.checkStarted(startFuture);
+            applicationDeployer.checkStarted();
+            // complete module start future after application state changed, fix #9012 ?
+            startFuture.complete(true);
No. Your applicationDeployer.checkStarted() only sets the starting flag and returns directly if one module is starting; it does not mean all modules have started after applicationDeployer.checkStarted() returns.
The DefaultApplicationDeployer.checkStarted() method checks the state of all modules and notifies the start-checking logic.
If modules are started manually in some scenarios, some modules may already be started while the application stays in STARTING until all modules are started.
private void doStart() {
startModules();
// prepare application instance
prepareApplicationInstance();
executorRepository.getSharedExecutor().submit(() -> {
while (true) {
// notify on each module started
synchronized (startedLock) {
try {
startedLock.wait(500);
} catch (InterruptedException e) {
// ignore
}
}
// if has new module, do start again
if (hasPendingModule()) {
startModules();
continue;
}
DeployState state = checkState();
if (!(state == DeployState.STARTING || state == DeployState.PENDING)) {
// start finished or error
break;
}
}
});
}
....
public void checkStarted() {
// TODO improve state checking
DeployState _state = checkState();
switch (_state) {
case STARTED:
onStarted();
break;
case STARTING:
onStarting();
break;
case PENDING:
setPending();
break;
}
// notify started
synchronized (startedLock) {
startedLock.notifyAll();
}
}
The startFuture of DefaultModuleDeployer just monitors the start action of the module and has no strong relationship with the application.
Set a breakpoint at checkStarted() and at DubboBootstrapMultiInstanceTest.testMultiModuleDeployAndReload line 306:
moduleDeployer1.start().get();
Debug DubboBootstrapMultiInstanceTest.testMultiModuleDeployAndReload as a JUnit test and you will see checkStarted is called twice.
At the first call, if you let the main thread continue, the test will fail.
Debug DubboBootstrapMultiInstanceTest.testMultiModuleDeployAndReload as a JUnit test and you will see checkStarted is called twice.
The first call is for the started internal module, the second is for the started module of serviceConfig1; the application state is STARTING and the test passes. This behavior is expected.
DubboBootstrapMultiInstanceTest.testMultiModuleDeployAndReload:
ModuleDeployer internalModuleDeployer = applicationModel.getInternalModule().getDeployer();
Assertions.assertTrue(internalModuleDeployer.isStarted());
<== this will fail if it runs before the second call to checkStarted
Please check out this PR and test again.
Module start processing has changed; the internal module is now guaranteed to start before pub modules, so the check Assertions.assertTrue(internalModuleDeployer.isStarted()) is ok.
You are right. I modified my fix based on your code to await the internal module deployer finishing.
@@ -69,7 +69,6 @@
    private Map<String, Object> attributes;
    private AtomicBoolean destroyed = new AtomicBoolean(false);
    private volatile boolean stopping;
see #9000
If the ApplicationModel is stopping, the value of isDestroyed() is true, so there is no need to add a stopping field. Furthermore, I want to add a state field to ApplicationModel and ModuleModel.
See what AlbumenJ said in #9001: the metadata refresh future should be canceled after unregistering the service instance, which should be called before applicationModel.destroy().
Let's take a look at the new processing of application destroy:
- Destroy by application
ApplicationModel.destroy() (change destroyed to true)
-> ApplicationModel.onDestroy()
-> ApplicationDeployer.preDestroy()
-> set state to stopping
-> destroy all modules and remove self from frameworkModel
-> ApplicationDeployer.postDestroy()
-> cancel asyncMetadataFuture and unregisterServiceInstance
-> ...
-> set state to stopped
-> frameworkModel.tryDestroy()
ApplicationDeployer is designed to handle application start/stop; it is an associated object of ApplicationModel, and the relationship becomes clearer now that it is destroyed from ApplicationModel.destroy().
- Destroy by module
ModuleModel.destroy()
-> ModuleModel.onDestroy()
-> ModuleDeployer.preDestroy()
-> set module state to stopping
-> remove module from application
-> ModuleDeployer.postDestroy()
-> unexport and unrefer services
-> set module state to stopped
-> applicationModel.tryDestroy()
Pub modules are destroyed one by one; when only the internal module remains, ApplicationModel.destroy() is called.
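The two-phase destroy flow above can be sketched as a minimal, self-contained model. Note this is illustrative only: the class, field, and trace messages are simplified stand-ins, not the actual Dubbo types. The AtomicBoolean guard makes destroy() idempotent, preDestroy() flips the state to STOPPING, and postDestroy() cancels background work before marking STOPPED.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Simplified sketch of the two-phase destroy flow described above.
public class DestroySketch {
    enum State { STARTED, STOPPING, STOPPED }

    static class AppModel {
        // guard: only the first destroy() call runs the destroy sequence
        private final AtomicBoolean destroyed = new AtomicBoolean(false);
        final List<String> trace = new ArrayList<>();
        volatile State state = State.STARTED;

        void destroy() {
            if (destroyed.compareAndSet(false, true)) {
                onDestroy();
            }
        }

        private void onDestroy() {
            preDestroy();
            postDestroy();
        }

        private void preDestroy() {
            // set state to stopping, then destroy modules / detach from framework
            state = State.STOPPING;
            trace.add("preDestroy: destroy modules, remove self from framework");
        }

        private void postDestroy() {
            // cancel background work (e.g. the metadata refresh future) last
            trace.add("postDestroy: cancel asyncMetadataFuture, unregister instance");
            state = State.STOPPED;
            trace.add("stopped");
        }
    }

    public static void main(String[] args) {
        AppModel app = new AppModel();
        app.destroy();
        app.destroy(); // second call is a no-op thanks to the AtomicBoolean guard
        System.out.println(app.trace.size() + " " + app.state); // prints "3 STOPPED"
    }
}
```

The compareAndSet guard is what keeps "destroy by application" and "destroy by module" from running the sequence twice when both paths converge on the same model.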
It seems better to cancel asyncMetadataFuture in ApplicationDeployer.preDestroy().
@@ -324,7 +328,7 @@ public void blockUntilUpdated() {
            metadataSemaphore.drainPermits();
            updateLock.writeLock().lock();
        } catch (InterruptedException e) {
-           if (!applicationModel.isStopping()) {
+           if (!applicationModel.isDestroyed()) {
AlbumenJ said in #9001:
Metadata refresh future should be canceled after unregister service instance, which should be called before applicationModel.destroy().
The entry point for destroying an application is ApplicationModel.destroy(), so checking destroyed is ok.
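The pattern in the diff above — treating an interrupt as benign once the model has been destroyed — can be sketched as follows. The field and method names here are simplified stand-ins, not the real ApplicationModel/MetadataInfo API:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Sketch: a blocking metadata wait treats InterruptedException as benign
// once the model has been destroyed, instead of logging an error.
public class InterruptOnDestroy {
    static final AtomicBoolean destroyed = new AtomicBoolean(false);
    static final Semaphore metadataSemaphore = new Semaphore(0);

    // returns "updated" on a normal wake-up, "shutdown" if interrupted
    // after destroy, and "unexpected-interrupt" otherwise
    static String blockUntilUpdated() {
        try {
            metadataSemaphore.acquire();
            return "updated";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return destroyed.get() ? "shutdown" : "unexpected-interrupt";
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicReference<String> result = new AtomicReference<>();
        Thread refresher = new Thread(() -> result.set(blockUntilUpdated()));
        refresher.start();
        Thread.sleep(100);      // let the refresher block on the semaphore
        destroyed.set(true);    // the ApplicationModel.destroy() equivalent
        refresher.interrupt();  // destroy interrupts the blocked refresh thread
        refresher.join();
        System.out.println(result.get()); // prints "shutdown"
    }
}
```

Setting the destroyed flag before interrupting the waiter is what makes `isDestroyed()` a reliable check inside the catch block.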
See #9033 (comment)
@@ -83,7 +82,9 @@ public String getInternalProperty(String key) {
        @Override
        protected void doClose() throws Exception {
-           zkClient.close();
+           // TODO zkClient is shared in framework, should not close it here?
+           // zkClient.close();
See #8993.
The send thread of ZooKeeper's ClientCnxn will keep trying to reconnect to the closed zk server if you don't close the zk client, and it will cause more integration-testing problems, because a flood of failed-reconnect events will block Curator's global event loop processing.
See here: org.apache.dubbo.remoting.zookeeper.AbstractZookeeperTransporter#destroy()
All zk clients are created and destroyed in ZookeeperTransporter.
@Override
public void destroy() {
// only destroy zk clients here
for (ZookeeperClient client : zookeeperClientMap.values()) {
client.close();
}
zookeeperClientMap.clear();
}
The ZookeeperTransporter is shared in framework scope, so zk clients are shared in the framework and cannot be destroyed when one application shuts down while another application is still alive.
I tested SingleRegistryCenterInjvmIntegrationTest (using the registry only, as the config-center test has no registry) and the zkClient was closed eventually, so the fix is ok for the zookeeper client. But zkClient is just one kind of dynamic configuration, and the others are not closed for now.
Let's say we have two applications, A and B. A connects to zk A, B connects to zk B. A is a short-lived app which only runs for 1 minute; the zkClient of A should be closed when A exits, otherwise it still causes the Curator reconnection-failure event surge problem.
If applications A and B do not share the same FrameworkModel, they use two isolated instances of ZookeeperTransporter, so there are no redundant ZK connections.
If A and B use the same FrameworkModel, the ZK connections are shared, and we cannot simply close part of them when application B stops: some ZK connections may be used by both A and B.
If we want to close ZK connections on application shutdown, we must first record which applications each ZK connection is used by, and then we can safely close idle ZK connections. You can submit a PR later for fine-grained application-level ZK connection management.
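The record-and-release idea above can be sketched as follows. ClientRegistry, Client, and the method names are hypothetical illustrations, not Dubbo APIs: each zk address tracks the set of applications using it, and a client is closed only when that set becomes empty.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: record which applications use each ZK address,
// and close a client only when no application references it anymore.
public class ClientRegistry {
    static class Client {
        final String address;
        boolean closed;
        Client(String address) { this.address = address; }
        void close() { closed = true; }
    }

    private final Map<String, Client> clients = new HashMap<>();
    private final Map<String, Set<String>> usersByAddress = new HashMap<>();

    // each acquire records the application as a user of the address
    synchronized Client acquire(String app, String address) {
        usersByAddress.computeIfAbsent(address, a -> new HashSet<>()).add(app);
        return clients.computeIfAbsent(address, Client::new);
    }

    // on application shutdown, close only clients no other application uses
    synchronized void releaseApplication(String app) {
        usersByAddress.values().forEach(users -> users.remove(app));
        usersByAddress.entrySet().removeIf(e -> {
            if (e.getValue().isEmpty()) {
                clients.remove(e.getKey()).close();
                return true;
            }
            return false;
        });
    }

    public static void main(String[] args) {
        ClientRegistry registry = new ClientRegistry();
        Client shared = registry.acquire("A", "zk-1:2181");
        registry.acquire("B", "zk-1:2181");          // same server: shared client
        Client onlyA = registry.acquire("A", "zk-2:2181");
        registry.releaseApplication("A");            // app A exits
        System.out.println(shared.closed + " " + onlyA.closed); // prints "false true"
    }
}
```

This captures the short-lived-app scenario from the discussion: app A's dedicated connection closes as soon as A exits, while the connection shared with B stays open.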
A and B use the same FrameworkModel but connect to different zk servers; keeping all the clients in the zookeeper transporter will cause Curator to keep retrying reconnects if only one of the applications exits.
A and B use the same FrameworkModel but connect to different zk servers; keeping all the clients in the zookeeper transporter will cause Curator to keep retrying reconnects if only one of the applications exits.
ZookeeperTransporter has been changed to application scope to solve the zk client leak problem after an application is stopped.
You'd better split this PR based on different subjects; otherwise it might revert other merged PRs' work.
Most of the problems have been solved. Because the processing flow has changed greatly, some of the previous fixes are unnecessary; if other problems remain, they need to be dealt with again.
@@ -82,8 +82,10 @@ public String getInternalProperty(String key) {
        @Override
        protected void doClose() throws Exception {
-           // TODO zkClient is shared in framework, should not close it here?
+           // zkClient is shared in framework, should not close it here
Keeping the zkClient independent in each application might be better; not all apps connect to the same registry or have the same lifecycle.
#9015
@AlbumenJ @zrlw
There are two ideas:
- ZK clients are shared within application scope and isolated among different applications
- Record the applications associated with each ZK client; when an application is stopped, close the ZK clients that are no longer used
If there is no strong need to share ZK clients within the framework, I think the first one is better.
Now DubboTestChecker force-destroys all FrameworkModels after every test class finishes, so no zk client leaks are expected in unit tests.
@AlbumenJ @zrlw There are two ideas:
- ZK clients are shared within application scope and isolated among different applications
- Record the applications associated with each ZK client; when an application is stopped, close the ZK clients that are no longer used
If there is no strong need to share ZK clients within the framework, I think the first one is better.
#9015 closes zookeeper clients that are no longer used by any application.
The advantage is that the zk connection is closed immediately once it is not used anymore.
App A -> zk server IP-A, app B -> zk server IP-B. Notice: IP-A is not equal to IP-B!
The zk connection of app A will be closed immediately if A exits while B keeps running.
If you put them in the global zookeeper transporter, the close operation must wait for the other app to exit, which will cause the Curator client to keep trying to reconnect to the closed zk server and trigger a flood of Curator connection-failure events.
NO. The zk client is shared between ZookeeperRegistry, ZookeeperDynamicConfiguration and ZookeeperMetadataReport; if you close it in one place, the other references to the zk client may throw an already-destroyed exception.
The safe way is to close zk clients in ApplicationDeployer.postDestroy(), after releasing all zk components and before onStopped(), something like: zookeeperTransporter.closeClientsOfApplication(application)
Now #9015 destroys zkClients that are no longer used, at DefaultApplicationDeployer.destroy().
Codecov Report
@@ Coverage Diff @@
## 3.0 #9033 +/- ##
============================================
+ Coverage 63.75% 64.23% +0.47%
- Complexity 313 322 +9
============================================
Files 1168 1174 +6
Lines 50192 50569 +377
Branches 7465 7518 +53
============================================
+ Hits 32001 32481 +480
+ Misses 14703 14540 -163
- Partials 3488 3548 +60
Continue to review full report at Codecov.
LGTM
What is the purpose of the change
- Manage global resources and executor services: EventLoopGroup, HashedWheelTimer, etc.
- Change ZookeeperTransporter to application scope, see #9033 (comment)
- MetadataReportFactory
- mvn test -DcheckThreads=true, see org.apache.dubbo.test.check.DubboTestChecker
- dubbo-native
Brief changelog
Verifying this change