
close binary-proto-lookup connection after connection-lifetime #252

Closed
wants to merge 2 commits

Conversation

rdhabalia
Contributor

Motivation

PulsarClient creates a separate, dedicated connection to a broker for a given service-lookup URL. In case of a cold restart of the brokers, all clients may end up creating their lookup connection to the same broker, and that connection persists for the entire lifetime of the client, which can overload that one broker with lookups. Therefore, the lookup connection should not be persistent and should be closed after some time.

Modifications

  • disconnect the lookup connection after a configurable lifetime (default: 10 minutes)

Result

The lookup connection is no longer persistent, which prevents a single broker from being overloaded with lookup requests.
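The mechanism is small enough to sketch in plain Java. The snippet below is illustrative only (the class and method names are mine, not the PR's); the actual change schedules `cnx.channel().disconnect()` on the client's Netty event loop, while this sketch uses a stdlib `ScheduledExecutorService`:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal sketch: a lookup connection that schedules its own disconnect
// after a configurable lifetime, forcing the next lookup to re-resolve a
// (possibly different) broker.
public class LookupConnectionLifetime {
    private final AtomicBoolean open = new AtomicBoolean(true);
    private final CountDownLatch closed = new CountDownLatch(1);

    void disconnect() {
        if (open.compareAndSet(true, false)) {
            closed.countDown(); // in the real client: cnx.channel().disconnect()
        }
    }

    boolean isOpen() {
        return open.get();
    }

    /** Schedules a one-shot close after the configured connection lifetime. */
    void scheduleClose(ScheduledExecutorService timer, long lifetime, TimeUnit unit) {
        timer.schedule(this::disconnect, lifetime, unit);
    }

    /** Waits until the scheduled close has fired, up to the given timeout. */
    boolean awaitClose(long timeout, TimeUnit unit) throws InterruptedException {
        return closed.await(timeout, unit);
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        LookupConnectionLifetime cnx = new LookupConnectionLifetime();
        cnx.scheduleClose(timer, 100, TimeUnit.MILLISECONDS); // PR default is 10 minutes
        boolean closedInTime = cnx.awaitClose(5, TimeUnit.SECONDS);
        System.out.println("closed=" + closedInTime + " open=" + cnx.isOpen());
        timer.shutdown();
    }
}
```

After the timer fires, the next lookup has to open a fresh connection, which is exactly the behavior debated below (extra TCP connect plus re-authentication after a failover).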

@rdhabalia rdhabalia added area/client type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages labels Feb 28, 2017
@rdhabalia rdhabalia added this to the 1.17 milestone Feb 28, 2017
@rdhabalia rdhabalia self-assigned this Feb 28, 2017
@merlimat
Contributor

So, in case of broker's cold-restart, all clients may end-up creating lookup-connection to same one broker

Uhm, all the lookups from a single client will end up on the same broker/discovery-service instance, but there is no reason why lookups from multiple clients should do the same.

@rdhabalia
Contributor Author

but there is no reason why lookups from multiple clients should do the same.

In case of a cold restart, as soon as the first broker comes up, all the clients try to do lookups and end up creating their lookup connection with that one broker. Since the lookup connection is persistent, each client will use it forever, so that one broker ends up serving all the lookup requests. So, does it make sense to not keep the lookup connection persistent and close it after a fixed time?

@rdhabalia rdhabalia requested a review from saandrews March 1, 2017 00:46
/**
*
* @param address
* remote client {@link InetSocketAddress} of the to connect
Contributor

sentence seems to be incomplete?

@@ -377,6 +378,7 @@ SocketAddress serverAddrees() {
ctx.writeAndFlush(cmd).addListener(writeFuture -> {
if (!writeFuture.isSuccess()) {
log.warn("{} Failed to send request to broker: {}", ctx.channel(), writeFuture.cause().getMessage());
pendingRequests.remove(requestId);
Contributor

Shouldn't it be getAndRemovePendingLookupRequest?

Contributor Author

No, actually we had a bug here: if we fail to publish the message on the socket, we should remove the request from the queue. Line 341 removes it from the lookup pending-requests map, and this one removes it from pendingRequests.

cnx.channel().disconnect();
}
}, connectionLifetimeInSecond, TimeUnit.SECONDS);
}
Contributor

Are we closing the connection irrespective of whether it's active or not? We should close it only if it has been inactive for the last n minutes.

Contributor Author

I think our goal is to not keep the lookup connection persistent forever and to close it after n minutes. However, we can add a check to close the connection only if it has not been active for the last n minutes. I will add another commit.
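For illustration, the inactivity check being discussed could look roughly like the following. This is a hypothetical stand-alone sketch, not the PR's code; `recordLookup`, the class name, and the millisecond lifetimes are my own:

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the refinement: only close the lookup connection if it has seen
// no requests for the whole lifetime window; otherwise re-arm the timer and
// check again later.
public class IdleAwareClose {
    private final AtomicLong lastRequestNanos = new AtomicLong(System.nanoTime());
    private final AtomicBoolean open = new AtomicBoolean(true);

    /** Called on every lookup request sent over this connection. */
    void recordLookup() {
        lastRequestNanos.set(System.nanoTime());
    }

    boolean isOpen() {
        return open.get();
    }

    void scheduleIdleCheck(ScheduledExecutorService timer, long lifetimeMs) {
        timer.schedule(() -> {
            long idleMs = TimeUnit.NANOSECONDS.toMillis(
                    System.nanoTime() - lastRequestNanos.get());
            if (idleMs >= lifetimeMs) {
                open.set(false);                      // idle long enough: disconnect
            } else {
                scheduleIdleCheck(timer, lifetimeMs); // still in use: check again later
            }
        }, lifetimeMs, TimeUnit.MILLISECONDS);
    }
}
```

The design choice here is that an actively used connection keeps re-arming the timer, so only a connection that has gone quiet for the full window is torn down.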

Contributor Author

@saandrews updated PR with the change.

@merlimat
Contributor

merlimat commented Mar 1, 2017

In case of a cold restart, as soon as the first broker comes up, all the clients try to do lookups and end up creating their lookup connection with that one broker.

Good point. In that case all the connections would stick to the same broker.

My only concern is that closing the connection after X time will make every lookup after a failover open a new TCP connection (one per client) and redo authentication, whereas if we keep the connection open it would be ready for use. I'm not sure that would be a big deal, though.

@saandrews
Contributor

In that case, it would behave similarly to our HTTP lookup for the first request. Should be OK, I think.

@saandrews
Contributor

There are a few more scenarios to address: handling failure in case the broker is unable to serve the lookup (similar to an HTTP 500), and closing the connection only if it has been inactive for the configured duration. These will be addressed in a separate PR.

@merlimat
Contributor

merlimat commented Mar 1, 2017

In that case, it would behave similarly to our HTTP lookup for the first request. Should be OK, I think.

True, it would be kind of the same. One other option could be to react to lookups being rejected by the broker (from #181) and close the connection only if errors are received within a certain timeframe.

@rdhabalia
Contributor Author

One other option could be to react to lookups being rejected by the broker (from #181) and close the connection only if errors are received within a certain timeframe.

Does it mean the client should close the connection when the broker returns TooManyRequestError? That implies:

  1. If the client doesn't receive TooManyRequestError, it will keep this persistent connection forever. So there is still a chance that all clients go to the same broker and create load on it, as long as lookup traffic stays below 20K (maxConcurrentLookupRequest=20000).

But I agree with the idea that if the client gets TooManyRequestError within a certain timeframe, it should disconnect and try to connect to a different broker (as that broker is already overloaded).

As in #265, where we close the HTTP connection on an internal server error, we need to do the same for the binary lookup. So I will combine two changes in a new PR:

  1. close the binary-proto-lookup connection on a 500 error
  2. close the connection if the client receives TooManyLookupRequestError within a certain timeframe.

And if we decide we don't need this PR's change (closing the persistent connection), we can close this one.

Any thoughts?
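A rough sketch of what the second item could look like, assuming a simple count-in-window policy (the class name, threshold, and window values here are hypothetical, not from the PR):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: track TooManyRequests-style lookup rejections in a sliding time
// window and signal that the connection should be dropped once a threshold
// is hit, so the client can re-resolve and spread load to other brokers.
public class LookupErrorWindow {
    private final Deque<Long> errorTimesMs = new ArrayDeque<>();
    private final long windowMs;
    private final int threshold;

    LookupErrorWindow(long windowMs, int threshold) {
        this.windowMs = windowMs;
        this.threshold = threshold;
    }

    /** Records a rejected lookup; returns true when the connection should be closed. */
    synchronized boolean onLookupError(long nowMs) {
        errorTimesMs.addLast(nowMs);
        // Drop errors that have fallen out of the window.
        while (!errorTimesMs.isEmpty() && nowMs - errorTimesMs.peekFirst() > windowMs) {
            errorTimesMs.removeFirst();
        }
        return errorTimesMs.size() >= threshold;
    }
}
```

With this policy a healthy connection is never torn down, and a burst of rejections within the window is what triggers the reconnect, matching the reasoning in the comments above.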

@merlimat
Contributor

merlimat commented Mar 2, 2017

My reasoning is:

  • All clients will end up on a single broker only if it's the only one left in the cluster
  • At that point the broker could be overwhelmed by lookups; if it is not... then we should be good anyway
  • If we force-close the connections, we give the clients a chance to reconnect to the other brokers as they are brought back up

  1. close the binary-proto-lookup connection on a 500 error
  2. close the connection if the client receives TooManyLookupRequestError within a certain timeframe.

Yep, that's what I was thinking.

@merlimat
Contributor

@rdhabalia Should this PR be closed in favor of #274?

@rdhabalia
Contributor Author

Yes, let's close this one.

@rdhabalia rdhabalia closed this Mar 10, 2017
@merlimat merlimat removed this from the 1.17 milestone Mar 10, 2017
sijie pushed a commit to sijie/pulsar that referenced this pull request Mar 4, 2018
hangc0276 added a commit to hangc0276/pulsar that referenced this pull request May 26, 2021
Fix apache#252 

### Motivation
When a bundle unload is triggered, the `consumerTopicManagers` cache isn't evicted, so it returns the old `KafkaTopicConsumerManager` instance when handling the next fetch request. However, after a bundle unload, the producers/consumers/managedLedgers of the topics in the related bundle are closed. If we use the old `KafkaTopicConsumerManager` instance to read messages, it returns a `managedLedger has been closed` exception.

### Changes
1. Change the `consumerTopicManagers`, `topics`, and `references` maps to static `ConcurrentHashMap` attributes.
2. Evict the related cache entries for topics whose bundle triggered an unload.
3. Turn on `DistributedClusterTest`.