Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JAVA-3145: Implement some notion of retry logic around call to metadata service for Astra clients #1902

Open
wants to merge 2 commits into
base: 4.x
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@
*/
package com.datastax.oss.driver.internal.core.config.cloud;

import com.datastax.oss.driver.api.core.DriverTimeoutException;
import com.datastax.oss.driver.api.core.metadata.EndPoint;
import com.datastax.oss.driver.internal.core.metadata.SniEndPoint;
import com.datastax.oss.driver.internal.core.ssl.SniSslEngineFactory;
Expand Down Expand Up @@ -58,6 +59,8 @@
@ThreadSafe
public class CloudConfigFactory {
private static final Logger LOG = LoggerFactory.getLogger(CloudConfigFactory.class);
private static final int METADATA_RETRY_MAX_ATTEMPTS = 4;
private static final int METADATA_RETRY_INITIAL_DELAY_MS = 250;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: per an earlier conversation @SiyaoIsHiding and I decided to leave these as constants for now. We can move them to configurable values if an explicit need prevents itself (or perhaps as a future enhancement).

/**
* Creates a {@link CloudConfig} with information fetched from the specified Cloud configuration
* URL.
Expand Down Expand Up @@ -225,22 +228,51 @@ protected TrustManagerFactory createTrustManagerFactory(
@NonNull
protected BufferedReader fetchProxyMetadata(
@NonNull URL metadataServiceUrl, @NonNull SSLContext sslContext) throws IOException {
try {
HttpsURLConnection connection = (HttpsURLConnection) metadataServiceUrl.openConnection();
HttpsURLConnection connection = null;
int attempt = 0;
while (attempt < METADATA_RETRY_MAX_ATTEMPTS) {
attempt++;
connection = (HttpsURLConnection) metadataServiceUrl.openConnection();
connection.setSSLSocketFactory(sslContext.getSocketFactory());
connection.setRequestMethod("GET");
connection.setRequestProperty("host", "localhost");
return new BufferedReader(
new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8));
} catch (ConnectException e) {
throw new IllegalStateException(
"Unable to connect to cloud metadata service. Please make sure your cluster is not parked or terminated",
e);
} catch (UnknownHostException e) {
throw new IllegalStateException(
"Unable to resolve host for cloud metadata service. Please make sure your cluster is not terminated",
e);
// if this is the last attempt, throw
// else if the response code is 401, 421, or 5xx, retry
// else, throw
if (attempt < METADATA_RETRY_MAX_ATTEMPTS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you'd be better off handling the attempt >= METADATA_RETRY_MAX_ATTEMPTS first. The benefit with that approach is that you're guaranteed that everything afterwards has not yet hit the max number of attempts (and is therefore subject to retry):

int rc = connection.getResponseCode();
if (attempt >= METADATA_RETRY_MAX_ATTEMPTS) {
  if (rc == HttpURLConnection.HTTP_OK) {
      try {
        return new BufferedReader(
            new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8));
      } catch (ConnectException e) {
        throw new IllegalStateException(
            "Unable to connect to cloud metadata service. Please make sure your cluster is not parked or terminated",
            e);
      } catch (UnknownHostException e) {
        throw new IllegalStateException(
            "Unable to resolve host for cloud metadata service. Please make sure your cluster is not terminated",
            e);
      }
  }
  else {
        // TODO: Might also want to add some logging here (or maybe above) about what the observed return code was
        throw new IllegalStateException(
            "Unable to fetch metadata from cloud metadata service. Please make sure your cluster is not parked or terminated");
      }
  }
}

assert attempt < METADATA_RETRY_MAX_ATTEMPTS;

&& (connection.getResponseCode() == 401
|| connection.getResponseCode() == 421
|| connection.getResponseCode() >= 500)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here (and in the other spots throughout this code) you'd be better off using the constants defined by HttpURLConnection such as HTTP_OK for 200 (and so on). It makes the code a bit more descriptive about what's going on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic should also probably be pulled out into a simple helper method, something like this:

private boolean shouldRetry(int rc) {
   return connection.getResponseCode() == 401
              || connection.getResponseCode() == 421
              || connection.getResponseCode() >= 500;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not immediately clear to me that this is the extent of response codes we want to handle here.

I agree with retrying on any 5xx error. It also probably does make sense to retry on a 401 and 421 given what we've seen from this service. But there are a whole set of 4xx response codes all of which indicate that the client did something wrong (from the server's perspective). We could see any of these codes coming off of the metadata service... what do we want to do in that case? We may need to spend some more time looking at this list and thinking about that.

There are also a lot of 3xx errors but most of those involve the server telling the client that the thing it's asking for is somewhere else. In that case retrying that specific request probably doesn't make sense... but I would expect a fully-formed HTTP client to send another (slightly different) request based on the feedback it got from the server. We're not building a fully-formed HTTP client, though, so I think I'm okay with not retrying 3xx errors for now.

There is part of me wondering if we shouldn't put a real HTTP client in here for this...

try {
Thread.sleep(METADATA_RETRY_INITIAL_DELAY_MS);
continue;
} catch (InterruptedException interruptedException) {
throw new IOException(
"Interrupted while waiting to retry metadata fetch", interruptedException);
}
}

// last attempt, still 421 or 401
if (connection.getResponseCode() == 421 || connection.getResponseCode() == 401) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we want to throw an exception of some kind in this case as long as the response code isn't a 200? I'm trying to imagine a situation in which we would want to proceed with trying to get an InputStream from the connection if we got a 3xx, 4xx or 5xx response.

There are other 2xx responses besides 200 (the spec has some more details) but none of those seem to apply in our case.

throw new IllegalStateException(
"Unable to fetch metadata from cloud metadata service. Please make sure your cluster is not parked or terminated");
}

try {
return new BufferedReader(
new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8));
} catch (ConnectException e) {
throw new IllegalStateException(
"Unable to connect to cloud metadata service. Please make sure your cluster is not parked or terminated",
e);
} catch (UnknownHostException e) {
throw new IllegalStateException(
"Unable to resolve host for cloud metadata service. Please make sure your cluster is not terminated",
e);
}
}
throw new DriverTimeoutException(
"Unable to fetch metadata from cloud metadata service"); // dead code
}

@NonNull
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@
import static com.github.tomakehurst.wiremock.client.WireMock.urlEqualTo;
import static com.github.tomakehurst.wiremock.client.WireMock.urlPathEqualTo;
import static com.github.tomakehurst.wiremock.core.WireMockConfiguration.wireMockConfig;
import static com.github.tomakehurst.wiremock.stubbing.Scenario.STARTED;
import static org.assertj.core.api.Assertions.catchThrowable;

import com.datastax.oss.driver.internal.core.ssl.SniSslEngineFactory;
Expand Down Expand Up @@ -139,6 +140,57 @@ public void should_throw_when_metadata_not_found() throws Exception {
assertThat(t).isInstanceOf(FileNotFoundException.class).hasMessageContaining("metadata");
}

@Test
public void should_retry_when_metadata_return_non_200() throws Exception {
// given, first attempt 421, subsequent attempts 200
mockHttpSecureBundle(secureBundle());
URL configFile = new URL("http", "localhost", wireMockRule.port(), BUNDLE_PATH);
stubFor(
any(urlPathEqualTo("/metadata"))
.inScenario("retry")
.whenScenarioStateIs(STARTED)
.willReturn(aResponse().withStatus(421))
.willSetStateTo("second-200"));
stubFor(
any(urlPathEqualTo("/metadata"))
.inScenario("retry")
.whenScenarioStateIs("second-200")
.willReturn(
aResponse()
.withStatus(200)
.withHeader("Content-Type", "application/json")
.withBody(jsonMetadata())));
// when
CloudConfigFactory cloudConfigFactory = new CloudConfigFactory();
CloudConfig cloudConfig = cloudConfigFactory.createCloudConfig(configFile);
// then
assertCloudConfig(cloudConfig);
}

@Test
public void should_throw_illegal_state_when_421() throws Exception {
// given status code 421
mockHttpSecureBundle(secureBundle());
URL configFile = new URL("http", "localhost", wireMockRule.port(), BUNDLE_PATH);
stubFor(any(urlPathEqualTo("/metadata")).willReturn(aResponse().withStatus(421)));
// when
CloudConfigFactory cloudConfigFactory = new CloudConfigFactory();
Throwable t = catchThrowable(() -> cloudConfigFactory.createCloudConfig(configFile));
assertThat(t).isInstanceOf(IllegalStateException.class).hasMessageContaining("metadata");
}

@Test
public void should_throw_IOException_when_500() throws Exception {
// given status code 500
mockHttpSecureBundle(secureBundle());
URL configFile = new URL("http", "localhost", wireMockRule.port(), BUNDLE_PATH);
stubFor(any(urlPathEqualTo("/metadata")).willReturn(aResponse().withStatus(500)));
// when
CloudConfigFactory cloudConfigFactory = new CloudConfigFactory();
Throwable t = catchThrowable(() -> cloudConfigFactory.createCloudConfig(configFile));
assertThat(t).isInstanceOf(IOException.class).hasMessageContaining("500");
}

@Test
public void should_throw_when_metadata_not_readable() throws Exception {
// given
Expand Down