[3.0] Manage global resources and executor services, fix zk client connections #9033
Conversation
-            applicationDeployer.checkStarted(startFuture);
+            applicationDeployer.checkStarted();
+            // complete module start future after application state changed, fix #9012 ?
+            startFuture.complete(true);
No. Your applicationDeployer.checkStarted() only sets the starting flag and returns directly if one module is starting; it does not mean all modules have started after applicationDeployer.checkStarted() returns.
The DefaultApplicationDeployer.checkStarted() method checks the state of all modules and notifies the start-checking logic.
If modules are started manually in some scenarios, some modules may already be started while the application stays in STARTING until all modules are started.
private void doStart() {
startModules();
// prepare application instance
prepareApplicationInstance();
executorRepository.getSharedExecutor().submit(() -> {
while (true) {
// notify on each module started
synchronized (startedLock) {
try {
startedLock.wait(500);
} catch (InterruptedException e) {
// ignore
}
}
// if has new module, do start again
if (hasPendingModule()) {
startModules();
continue;
}
DeployState state = checkState();
if (!(state == DeployState.STARTING || state == DeployState.PENDING)) {
// start finished or error
break;
}
}
});
}
....
public void checkStarted() {
// TODO improve state checking
DeployState _state = checkState();
switch (_state) {
case STARTED:
onStarted();
break;
case STARTING:
onStarting();
break;
case PENDING:
setPending();
break;
}
// notify started
synchronized (startedLock) {
startedLock.notifyAll();
}
}
The startFuture of DefaultModuleDeployer just monitors the start action of the module and has no strong relationship with the application.
Set a breakpoint at checkStarted() and at DubboBootstrapMultiInstanceTest.testMultiModuleDeployAndReload line 306:
moduleDeployer1.start().get();
Debug DubboBootstrapMultiInstanceTest.testMultiModuleDeployAndReload as a JUnit test and you will see checkStarted is called twice.
At the first call, if you let the main thread continue, the test will fail.
Debug DubboBootstrapMultiInstanceTest.testMultiModuleDeployAndReload as a JUnit test and you will see checkStarted is called twice.
The first call is for the started internal module, the second is for the started module of serviceConfig1; the application state is STARTING and the test passes. This behavior is expected.
DubboBootstrapMultiInstanceTest.testMultiModuleDeployAndReload:
ModuleDeployer internalModuleDeployer = applicationModel.getInternalModule().getDeployer();
Assertions.assertTrue(internalModuleDeployer.isStarted());
<== this will fail if it runs before the second call to checkStarted
Please check out this PR and test again.
Module start processing has changed; the internal module is now guaranteed to start before pub modules, so the check Assertions.assertTrue(internalModuleDeployer.isStarted()) is ok.
You are right. I modified my fix based on your code to await the internal module deployer finishing.
@@ -69,7 +69,6 @@
    private Map<String, Object> attributes;
    private AtomicBoolean destroyed = new AtomicBoolean(false);
    private volatile boolean stopping;
see #9000
If the ApplicationModel is stopping, the value of isDestroyed() is true, so there is no need to add a stopping field. Furthermore, I want to add a state field to ApplicationModel and ModuleModel.
See what AlbumenJ said in #9001: the metadata refresh future should be canceled after unregistering the service instance, which should be called before applicationModel.destroy().
Let's take a look at the new processing of application destroy:
- Destroy by application
ApplicationModel.destroy() (change destroyed to true)
-> ApplicationModel.onDestroy()
-> ApplicationDeployer.preDestroy()
-> set state to stopping
-> destroy all modules and remove self from frameworkModel
-> ApplicationDeployer.postDestroy()
-> cancel asyncMetadataFuture and unregisterServiceInstance
-> ...
-> set state to stopped
-> frameworkModel.tryDestroy()
ApplicationDeployer is designed to handle application start/stop; it is an associated object of ApplicationModel, and the relationship becomes clearer now that it is destroyed from ApplicationModel.destroy().
- Destroy by module
ModuleModel.destroy()
-> ModuleModel.onDestroy()
-> ModuleDeployer.preDestroy()
-> set module state to stopping
-> remove module from application
-> ModuleDeployer.postDestroy()
-> unexport and unrefer services
-> set module state to stopped
-> applicationModel.tryDestroy()
Pub modules are destroyed one by one; when only the internal module remains, ApplicationModel.destroy() is called.
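The two-phase destroy flow above can be sketched as a minimal, self-contained model. Note this is illustrative only: the class, field, and trace messages are simplified stand-ins, not the actual Dubbo types. The AtomicBoolean guard makes destroy() idempotent, preDestroy() flips the state to STOPPING, and postDestroy() cancels background work before marking STOPPED.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Simplified sketch of the two-phase destroy flow described above.
public class DestroySketch {
    enum State { STARTED, STOPPING, STOPPED }

    static class AppModel {
        // guard: only the first destroy() call runs the destroy sequence
        private final AtomicBoolean destroyed = new AtomicBoolean(false);
        final List<String> trace = new ArrayList<>();
        volatile State state = State.STARTED;

        void destroy() {
            if (destroyed.compareAndSet(false, true)) {
                onDestroy();
            }
        }

        private void onDestroy() {
            preDestroy();
            postDestroy();
        }

        private void preDestroy() {
            // set state to stopping, then destroy modules / detach from framework
            state = State.STOPPING;
            trace.add("preDestroy: destroy modules, remove self from framework");
        }

        private void postDestroy() {
            // cancel background work (e.g. the metadata refresh future) last
            trace.add("postDestroy: cancel asyncMetadataFuture, unregister instance");
            state = State.STOPPED;
            trace.add("stopped");
        }
    }

    public static void main(String[] args) {
        AppModel app = new AppModel();
        app.destroy();
        app.destroy(); // second call is a no-op thanks to the AtomicBoolean guard
        System.out.println(app.trace.size() + " " + app.state); // prints "3 STOPPED"
    }
}
```

The compareAndSet guard is what keeps "destroy by application" and "destroy by module" from running the sequence twice when both paths converge on the same model.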
It seems better to cancel asyncMetadataFuture in ApplicationDeployer.preDestroy().
@@ -324,7 +328,7 @@ public void blockUntilUpdated() {
            metadataSemaphore.drainPermits();
            updateLock.writeLock().lock();
        } catch (InterruptedException e) {
-           if (!applicationModel.isStopping()) {
+           if (!applicationModel.isDestroyed()) {
AlbumenJ said in #9001:
Metadata refresh future should be canceled after unregister service instance, which should be called before applicationModel.destroy().
The entry point for destroying an application is ApplicationModel.destroy(), so checking destroyed is ok.
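The pattern in the diff above — treating an interrupt as benign once the model has been destroyed — can be sketched as follows. The field and method names here are simplified stand-ins, not the real ApplicationModel/MetadataInfo API:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Sketch: a blocking metadata wait treats InterruptedException as benign
// once the model has been destroyed, instead of logging an error.
public class InterruptOnDestroy {
    static final AtomicBoolean destroyed = new AtomicBoolean(false);
    static final Semaphore metadataSemaphore = new Semaphore(0);

    // returns "updated" on a normal wake-up, "shutdown" if interrupted
    // after destroy, and "unexpected-interrupt" otherwise
    static String blockUntilUpdated() {
        try {
            metadataSemaphore.acquire();
            return "updated";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return destroyed.get() ? "shutdown" : "unexpected-interrupt";
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicReference<String> result = new AtomicReference<>();
        Thread refresher = new Thread(() -> result.set(blockUntilUpdated()));
        refresher.start();
        Thread.sleep(100);      // let the refresher block on the semaphore
        destroyed.set(true);    // the ApplicationModel.destroy() equivalent
        refresher.interrupt();  // destroy interrupts the blocked refresh thread
        refresher.join();
        System.out.println(result.get()); // prints "shutdown"
    }
}
```

Setting the destroyed flag before interrupting the waiter is what makes `isDestroyed()` a reliable check inside the catch block.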
See #9033 (comment)
@@ -83,7 +82,9 @@ public String getInternalProperty(String key) {
        @Override
        protected void doClose() throws Exception {
-           zkClient.close();
+           // TODO zkClient is shared in framework, should not close it here?
+           // zkClient.close();
See #8993.
The send thread of ZooKeeper's ClientCnxn will keep trying to reconnect to the closed zk server if you don't close the zk client, and it will cause more integration-testing problems, because a flood of failed-reconnect events will block Curator's global event loop processing.
See here: org.apache.dubbo.remoting.zookeeper.AbstractZookeeperTransporter#destroy()
All zk clients are created and destroyed in ZookeeperTransporter.
@Override
public void destroy() {
// only destroy zk clients here
for (ZookeeperClient client : zookeeperClientMap.values()) {
client.close();
}
zookeeperClientMap.clear();
}
The ZookeeperTransporter is shared in framework scope, so zk clients are shared in the framework and cannot be destroyed when one application shuts down while another application is still alive.
I tested SingleRegistryCenterInjvmIntegrationTest (using the registry only, as the config-center test has no registry) and the zkClient was closed eventually, so the fix is ok for the zookeeper client. But zkClient is just one kind of dynamic configuration, and the others are not closed for now.
Let's say we have two applications, A and B. A connects to zk A, B connects to zk B. A is a short-lived app which only runs for 1 minute; the zkClient of A should be closed when A exits, otherwise it still causes the Curator reconnection-failure event surge problem.
If applications A and B do not share the same FrameworkModel, they use two isolated instances of ZookeeperTransporter, so there are no redundant ZK connections.
If A and B use the same FrameworkModel, the ZK connections are shared, and we cannot simply close part of them when application B stops: some ZK connections may be used by both A and B.
If we want to close ZK connections on application shutdown, we must first record which applications each ZK connection is used by, and then we can safely close idle ZK connections. You can submit a PR later for fine-grained application-level ZK connection management.
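The record-and-release idea above can be sketched as follows. ClientRegistry, Client, and the method names are hypothetical illustrations, not Dubbo APIs: each zk address tracks the set of applications using it, and a client is closed only when that set becomes empty.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: record which applications use each ZK address,
// and close a client only when no application references it anymore.
public class ClientRegistry {
    static class Client {
        final String address;
        boolean closed;
        Client(String address) { this.address = address; }
        void close() { closed = true; }
    }

    private final Map<String, Client> clients = new HashMap<>();
    private final Map<String, Set<String>> usersByAddress = new HashMap<>();

    // each acquire records the application as a user of the address
    synchronized Client acquire(String app, String address) {
        usersByAddress.computeIfAbsent(address, a -> new HashSet<>()).add(app);
        return clients.computeIfAbsent(address, Client::new);
    }

    // on application shutdown, close only clients no other application uses
    synchronized void releaseApplication(String app) {
        usersByAddress.values().forEach(users -> users.remove(app));
        usersByAddress.entrySet().removeIf(e -> {
            if (e.getValue().isEmpty()) {
                clients.remove(e.getKey()).close();
                return true;
            }
            return false;
        });
    }

    public static void main(String[] args) {
        ClientRegistry registry = new ClientRegistry();
        Client shared = registry.acquire("A", "zk-1:2181");
        registry.acquire("B", "zk-1:2181");          // same server: shared client
        Client onlyA = registry.acquire("A", "zk-2:2181");
        registry.releaseApplication("A");            // app A exits
        System.out.println(shared.closed + " " + onlyA.closed); // prints "false true"
    }
}
```

This captures the short-lived-app scenario from the discussion: app A's dedicated connection closes as soon as A exits, while the connection shared with B stays open.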
A and B use the same FrameworkModel but connect to different zk servers; keeping all the clients in the zookeeper transporter will cause Curator to keep retrying reconnects if only one of the applications exits.
A and B use the same FrameworkModel but connect to different zk servers; keeping all the clients in the zookeeper transporter will cause Curator to keep retrying reconnects if only one of the applications exits.
ZookeeperTransporter has been changed to application scope to solve the zk client leak problem after an application is stopped.
You'd better split this PR based on different subjects; otherwise it might revert other merged PRs' work.
Most of the problems have been solved. Because the processing flow has changed greatly, some of the previous fixes are unnecessary; if other problems remain, they need to be dealt with again.
@@ -82,8 +82,10 @@ public String getInternalProperty(String key) {
        @Override
        protected void doClose() throws Exception {
-           // TODO zkClient is shared in framework, should not close it here?
+           // zkClient is shared in framework, should not close it here
Keeping the zkClient independent in each application might be better; not all apps connect to the same registry or have the same lifecycle.
#9015
@AlbumenJ @zrlw
There are two ideas:
- ZK clients are shared within application scope and isolated among different applications
- Record the applications associated with each ZK client; when an application is stopped, close the ZK clients that are no longer used
If there is no strong need to share ZK clients within the framework, I think the first one is better.
Now DubboTestChecker force-destroys all FrameworkModels after every test class finishes, so no zk client leaks are expected in unit tests.
@AlbumenJ @zrlw There are two ideas:
- ZK clients are shared within application scope and isolated among different applications
- Record the applications associated with each ZK client; when an application is stopped, close the ZK clients that are no longer used
If there is no strong need to share ZK clients within the framework, I think the first one is better.
#9015 closes zookeeper clients that are no longer used by any application.
The advantage is that the zk connection is closed immediately once it is not used anymore.
App A -> zk server IP-A, app B -> zk server IP-B. Notice: IP-A is not equal to IP-B!
The zk connection of app A will be closed immediately if A exits while B keeps running.
If you put them in the global zookeeper transporter, the close operation must wait for the other app to exit, which will cause the Curator client to keep trying to reconnect to the closed zk server and trigger a flood of Curator connection-failure events.
NO. The zk client is shared between ZookeeperRegistry, ZookeeperDynamicConfiguration and ZookeeperMetadataReport; if you close it in one place, the other references to the zk client may throw an already-destroyed exception.
The safe way is to close zk clients in ApplicationDeployer.postDestroy(), after releasing all zk components and before onStopped(), something like: zookeeperTransporter.closeClientsOfApplication(application)
Now #9015 destroys zkClients that are no longer used, at DefaultApplicationDeployer.destroy().
Codecov Report
@@ Coverage Diff @@
## 3.0 #9033 +/- ##
============================================
+ Coverage 63.75% 64.23% +0.47%
- Complexity 313 322 +9
============================================
Files 1168 1174 +6
Lines 50192 50569 +377
Branches 7465 7518 +53
============================================
+ Hits 32001 32481 +480
+ Misses 14703 14540 -163
- Partials 3488 3548 +60
Continue to review full report at Codecov.
LGTM
What is the purpose of the change
- Manage global resources and executor services: EventLoopGroup, HashedWheelTimer, etc.
- Change ZookeeperTransporter to application scope, see #9033 (comment)
- MetadataReportFactory
- mvn test -DcheckThreads=true, see org.apache.dubbo.test.check.DubboTestChecker
- dubbo-native
Brief changelog
Verifying this change