Fix #376: AwsXrayRemoteSampler doesn’t poll for update #377

felixscheinost
Contributor

@felixscheinost felixscheinost commented Jun 28, 2022

Description:

In the default configuration pollingIntervalNanos = 3 * 10^11, so pollingIntervalNanos / 100 = 3 * 10^9 > Integer.MAX_VALUE.

Switch to storing everything in milliseconds instead. This should be a non-breaking change as all the nanoseconds related logic isn’t part of the public API.
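
The arithmetic behind the bug can be reproduced in isolation. The following is a minimal sketch, not the sampler's actual code: it assumes the pre-fix logic narrowed roughly 1% of the polling interval to an `int` and passed it to `Random.nextInt`. The class and method names (`JitterOverflowDemo`, `truncatedJitter`, `nextIntRejects`) are hypothetical.

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;

final class JitterOverflowDemo {

  // Assumption: mirrors a pre-fix computation that narrowed ~1% of the
  // polling interval (in nanoseconds) to an int.
  static int truncatedJitter(long pollingIntervalNanos) {
    return (int) (pollingIntervalNanos / 100);
  }

  // Random.nextInt(bound) throws IllegalArgumentException when bound <= 0.
  static boolean nextIntRejects(int bound) {
    try {
      new Random().nextInt(bound);
      return false;
    } catch (IllegalArgumentException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    long fiveMinutesNanos = TimeUnit.MINUTES.toNanos(5); // 3 * 10^11
    int jitter = truncatedJitter(fiveMinutesNanos);
    // 3 * 10^9 does not fit in an int, so the cast wraps around to -1294967296.
    System.out.println("jitter as int: " + jitter);
    System.out.println("nextInt rejects it: " + nextIntRejects(jitter));
  }
}
```

If the exception is thrown inside a task run by a scheduled executor, the next poll is never scheduled, which would match the behavior reported in #376.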

Existing Issue(s):

#376

Testing:

I don't know how this could be tested as all the logic isn't part of the public API.

Best would be a sort of integration test where time could be skipped forward and we could observe a request. Is there some code already set up in this project to do something like this?
EDIT: I noticed that there are both real integration tests and a test that exercises polling with a very short polling interval.

I tested this manually by shadowing the two affected classes in my project and manually testing there.

@felixscheinost felixscheinost requested a review from a team June 28, 2022 15:43
@linux-foundation-easycla

linux-foundation-easycla bot commented Jun 28, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: felixscheinost / name: Felix Scheinost (2539e61)

@github-actions github-actions bot requested review from anuraaga and willarmiros June 28, 2022 15:43
@felixscheinost felixscheinost force-pushed the feature/fix-aws-xray-remote-sampler branch 2 times, most recently from 27d8b19 to 6051652 Compare June 29, 2022 18:29
Contributor

@willarmiros willarmiros left a comment

Thanks for catching this and contributing this fix! Just to confirm, did you run the integration tests in the package as part of this change?

Comment on lines 114 to 116
clock.nanoTime()
+ TimeUnit.MILLISECONDS.toNanos(
AwsXrayRemoteSampler.DEFAULT_TARGET_INTERVAL_MILLIS));
Contributor

What if this value is greater than INT_MAX? Is that a possibility?

@@ -148,8 +149,8 @@ private void getAndUpdateSampler() {
   }

   private void scheduleSamplerUpdate() {
-    long delay = pollingIntervalNanos + RANDOM.nextInt(jitterNanos);
-    pollFuture = executor.schedule(this::getAndUpdateSampler, delay, TimeUnit.NANOSECONDS);
+    long delay = pollingIntervalMillis + RANDOM.nextInt(jitterMillis);
Contributor

Can we stick with nanos and switch to Random.longs(0, jitterNanos) instead of switching to millis? It's nice to avoid silently dropping precision when possible, even if in practice it generally shouldn't matter here.
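
A minimal sketch of this suggestion, assuming the jitter is roughly 1% of the polling interval and stays a `long` in nanoseconds. `LongJitterDemo` and `delayNanos` are hypothetical names used for illustration only; `Random.longs(origin, bound)` and `LongStream.findFirst()` are standard JDK calls.

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;

final class LongJitterDemo {
  private static final Random RANDOM = new Random();

  // Returns the polling interval plus a random jitter in [0, interval / 100).
  // Note: Random.longs(origin, bound) requires origin < bound, so this sketch
  // assumes an interval of at least 100 ns.
  static long delayNanos(long pollingIntervalNanos) {
    long jitterBoundNanos = pollingIntervalNanos / 100;
    // The stream is unlimited, so findFirst() always holds a value.
    long jitter = RANDOM.longs(0, jitterBoundNanos).findFirst().getAsLong();
    return pollingIntervalNanos + jitter;
  }

  public static void main(String[] args) {
    long interval = TimeUnit.MINUTES.toNanos(5);
    System.out.println("delay (ns): " + delayNanos(interval));
  }
}
```

Because nothing is narrowed to an `int`, a 5-minute interval no longer produces a negative bound.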

Contributor Author

Good point! This requires significantly fewer changes. I pushed an updated commit.

@felixscheinost felixscheinost force-pushed the feature/fix-aws-xray-remote-sampler branch from 6051652 to fbf3c6e Compare June 30, 2022 09:29
* returns the duration until the next scheduled sampler update or null if no next update is scheduled yet
*/
@Nullable
public Duration getNextSamplerUpdateScheduledDuration() {
Contributor Author

This is currently only required for the test. Would that be okay?

I don't know how to test this properly, because we don't want to wait 5 minutes in the test, right? Only long polling intervals trigger this bug, so the test has to use a long polling interval.

Contributor

Hmmm I think the maintainers would have final say but I don't think we want to expose additional public methods in the sampler like this. Is it possible to mock the system clock and trick the test into thinking 5 minutes have passed? Or just otherwise verifying the interval isn't negative when set to 5 minutes?

Member

can it be package-private and still accessed from test?

Contributor Author

Good point, I made the method package-private.

@felixscheinost felixscheinost force-pushed the feature/fix-aws-xray-remote-sampler branch from fbf3c6e to bf09a9a Compare June 30, 2022 09:32
@felixscheinost
Contributor Author

@willarmiros Does running the integration tests mean running ./gradlew :aws-xray:awsTest?

If yes, running this command returns an error on my machine:

> Task :aws-xray:compileAwsTestJava FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':aws-xray:compileAwsTestJava'.
> Could not resolve all files for configuration ':aws-xray:awsTestCompileClasspath'.
   > Could not find io.opentelemetry:opentelemetry-exporter-otlp-trace:.
     Required by:
         project :aws-xray

Contributor

@willarmiros willarmiros left a comment

@felixscheinost for the test failure it looks like you're not resolving the package version for some reason... Maybe you need to do a build first? https://github.com/open-telemetry/opentelemetry-java-contrib#getting-started

Not sure if @anuraaga would have any other ideas why this could happen.

@anuraaga
Contributor

anuraaga commented Jul 1, 2022

Sorry for the confusion on awsTest - it's not run with the build currently since we don't have AWS credentials for integration tests in this repo. It seems to have drifted and doesn't build anymore... It's also not a true integration test as it just polls and expects manually checking the logs after operating the console. @willarmiros I'd suggest someone at AWS checks out the PR and runs it manually until we could have AWS creds in this repo for running integ tests.

@SuppressWarnings(
"OptionalGetWithoutIsPresent") // getAsLong: RANDOM.longs should always provide a value, so
// we can do an unchecked unwrap here
long delay = pollingIntervalNanos + RANDOM.longs(0, jitterNanos).findFirst().getAsLong();
Contributor

Let's replace the jitterNanos field with RANDOM.longs(0, jitterNanos) so that the stream isn't constructed on every call.

Contributor Author

Sorry, I am not very used to working with streams. How would I get the next value from a stream? From reading examples and a bit of the documentation, it seems streams aren't supposed to be used that way?

Contributor

I think "unlimited streams" like this RANDOM.longs() aren't very common so examples out there may look a bit weird. But in this case, just creating the stream once and using it over and over should work OK.

Contributor Author

I used .iterator() to convert the stream to an Iterator<Long> and call next() to get the next item. As the stream should be infinite, calling hasNext() should not be necessary.
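
The iterator approach described here might look roughly like the sketch below. It is an illustration under assumptions, not the merged code: `JitterIteratorDemo` and `nextDelayNanos` are hypothetical names, and the sketch uses `PrimitiveIterator.OfLong` (what `LongStream.iterator()` returns) so that `nextLong()` avoids boxing.

```java
import java.util.PrimitiveIterator;
import java.util.Random;
import java.util.concurrent.TimeUnit;

final class JitterIteratorDemo {
  private final long pollingIntervalNanos;
  // Created once and reused; the underlying stream is unlimited.
  private final PrimitiveIterator.OfLong jitterNanos;

  JitterIteratorDemo(long pollingIntervalNanos) {
    this.pollingIntervalNanos = pollingIntervalNanos;
    // Jitter of roughly 1% of the interval, kept as a long.
    this.jitterNanos = new Random().longs(0, pollingIntervalNanos / 100).iterator();
  }

  // Assumption: called from one thread at a time; the iterator is not synchronized.
  long nextDelayNanos() {
    return pollingIntervalNanos + jitterNanos.nextLong();
  }

  public static void main(String[] args) {
    JitterIteratorDemo demo = new JitterIteratorDemo(TimeUnit.MINUTES.toNanos(5));
    System.out.println("first delay (ns): " + demo.nextDelayNanos());
    System.out.println("second delay (ns): " + demo.nextDelayNanos());
  }
}
```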

@felixscheinost
Contributor Author

I think one way to test that the scheduling works correctly without the getNextSamplerUpdateScheduledDuration method I added would be to add a (also package-private) way to override the scheduler. The scheduler could also be set in the builder and passed into the sampler.

This way we could use a mocked scheduler that could verify what's scheduled.
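
One way to picture this idea is a scheduler test double that records the requested delay instead of waiting for it. The sketch below is hypothetical and independent of the sampler: `RecordingScheduler` and `recordedDelayNanos` are made-up names, and the sampler's builder has no such hook in the code shown in this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class RecordingSchedulerDemo {

  // Hypothetical test double: remembers every delay passed to schedule().
  static final class RecordingScheduler extends ScheduledThreadPoolExecutor {
    final List<Long> delaysNanos = new ArrayList<>();

    RecordingScheduler() {
      super(1);
    }

    @Override
    public ScheduledFuture<?> schedule(Runnable command, long delay, TimeUnit unit) {
      delaysNanos.add(unit.toNanos(delay));
      return super.schedule(command, delay, unit);
    }
  }

  // Stand-in for the code under test: schedules one update with ~1% jitter
  // and returns the delay the scheduler was asked for.
  static long recordedDelayNanos(long pollingIntervalNanos) {
    RecordingScheduler scheduler = new RecordingScheduler();
    try {
      long jitter = new Random().longs(0, pollingIntervalNanos / 100).findFirst().getAsLong();
      scheduler.schedule(() -> {}, pollingIntervalNanos + jitter, TimeUnit.NANOSECONDS);
      return scheduler.delaysNanos.get(0);
    } finally {
      scheduler.shutdownNow(); // cancel the pending task so nothing waits 5 minutes
    }
  }

  public static void main(String[] args) {
    long interval = TimeUnit.MINUTES.toNanos(5);
    long delay = recordedDelayNanos(interval);
    System.out.println("scheduled delay is at least the interval: " + (delay >= interval));
  }
}
```

A test built this way can assert that the delay is at least the polling interval without waiting for the interval to elapse.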

@felixscheinost felixscheinost force-pushed the feature/fix-aws-xray-remote-sampler branch from bf09a9a to c3f9ab2 Compare July 4, 2022 08:32
Contributor

@anuraaga anuraaga left a comment

Thanks - just small nit left

// https://github.com/open-telemetry/opentelemetry-java-contrib/issues/376
@Test
void testJitterTruncation() {
sampler.close();
Contributor

Let's use the same pattern as defaultInitialSampler(), not operating on the test's "base sampler" and defining a local one within try/resources.

In the default configuration `pollingIntervalNanos = 3 * 10^11`, so `pollingIntervalNanos / 100 > Integer.MAX_VALUE`.

Switch to storing the jitter in a `long` as well.
@felixscheinost felixscheinost force-pushed the feature/fix-aws-xray-remote-sampler branch from c3f9ab2 to 87a675d Compare July 4, 2022 09:25
@anuraaga anuraaga merged commit dd4d335 into open-telemetry:main Jul 6, 2022
4 participants