
stats: minor review comments #3279

Merged
merged 12 commits into from
May 7, 2018

Conversation

ramaraochavali
Contributor

Signed-off-by: Rama rama.rao@salesforce.com
Description:
Fixes remaining minor comments in PR #3262

Risk Level: Low
Testing: Automated tests
Docs Changes: NA
Release Notes: NA

Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
@ramaraochavali
Contributor Author

@mattklein123 @mrice32 @jmarantz
Addressed the review comments in the last PR. PTAL.

  • Specified the correct type
  • Removed lock in tlsHistogram
  • Made getTagsForName const

Signed-off-by: Rama <rama.rao@salesforce.com>
@@ -94,7 +94,8 @@ class MetricsServiceIntegrationTest : public HttpIntegrationTest,
}
EXPECT_TRUE(known_counter_exists);
EXPECT_TRUE(known_gauge_exists);
EXPECT_TRUE(known_histogram_exists);
// TODO : Fix the timing issue with stats flush - This is flaky.
Contributor Author

This check is flaky. Sometimes it fails because of a timing issue, as mentioned in #3238; it fails not just in Valgrind but sometimes on Mac as well. Will look into this.

Member

I would think there should be some way to ensure that a sample was recorded and that a flush was called strictly after that, but I'm not familiar with the internals of this test.

Contributor Author

I need to look into this in more detail. I suspect that sometimes (though very rarely) the initial flush does not include this histogram. While the flakiness is very rare (I have observed it only once), I am a little concerned that other PRs might be affected by this, so I am commenting it out for now. BTW, I am travelling today and over the weekend and mostly busy next week with meetings; I will look into this as soon as I can. @mattklein123 WDYT? Let me know if you have any better ideas.

Contributor

It would be very good to fix this. Flaky tests are a drag on developer velocity. OTOH, commenting out the test means that although test coverage may look OK, you really aren't testing the code :)

Any chance you can dive in and debug as part of this PR?

Contributor Author

As I mentioned, I cannot do it immediately, as I am travelling now and through the early part of next week. If it is OK to leave this PR open, I can dig deeper, but I am a little concerned about other PRs. If you have any quick ideas, do let me know and I can try them quickly.

Member

The thing that comes to mind is to just wait on the histogram existing, vs. it existing after a single metrics request. I.e., allow there to be more than one request until you find the histogram. This will paper over the timing issue if we don't have it in time for the first flush.
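The suggestion above can be sketched as a small polling helper. This is only an illustration of the shape, not Envoy's test code: `waitForStat` and the predicate are hypothetical names, and in the real test the loop body would drain the next metrics request.

```cpp
#include <functional>

// Sketch: poll across multiple flush cycles instead of asserting after the
// first one. Names here are illustrative, not Envoy's actual test API.
bool waitForStat(const std::function<bool()>& stat_seen, int max_flushes) {
  for (int i = 0; i < max_flushes; ++i) {
    if (stat_seen()) {
      return true; // The stat arrived in some flush; stop waiting.
    }
    // In the real test, this is where the next metrics request is drained.
  }
  return false; // Never seen within the allowed number of flushes.
}
```

Bounding the number of flushes keeps a regression from hanging the test forever, which is the concern raised later in this thread.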

Member

@mrice32 mrice32 left a comment

This LGTM other than the flaky test.


@@ -297,7 +299,6 @@ Histogram& ThreadLocalStoreImpl::ScopeImpl::tlsHistogram(const std::string& name
return **tls_ref;
}

std::unique_lock<std::mutex> lock(parent_.lock_);
Member

I think parent_.tls_ must be non-null above. Can you remove that check? Also, this makes me wonder if we can get in here when we are shutting down? I'm guessing maybe? I kind of wonder if during shutdown we should not create a TLS histogram at all, and instead return some type of null histogram to write into, to avoid potentially creating a bunch of TLS histograms that we never store anywhere. TBH I think a similar problem exists for stats/gauges, but it could be much more pronounced with how the histogram code is now written, given that we always do this TLS fetch on every write.
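The null-histogram idea floated above could look roughly like the following. This is a hedged sketch of the design, not Envoy's actual Stats interfaces: all class names here are hypothetical.

```cpp
#include <cstdint>

// Hypothetical minimal interface a histogram writer depends on.
class HistogramSink {
public:
  virtual ~HistogramSink() = default;
  virtual void recordValue(uint64_t value) = 0;
};

// During shutdown, hand writers this no-op sink instead of allocating a real
// TLS histogram; samples are silently dropped and nothing is stored.
class NullHistogram : public HistogramSink {
public:
  void recordValue(uint64_t) override {}
};

// Stand-in for a real TLS histogram, just to show the contrast: it actually
// retains state about recorded samples.
class CountingHistogram : public HistogramSink {
public:
  void recordValue(uint64_t) override { ++samples_; }
  uint64_t samples() const { return samples_; }

private:
  uint64_t samples_ = 0;
};
```

The point of the design is that write paths never need a shutdown check; they always get some sink, and only its behavior differs.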

Contributor Author

@ramaraochavali ramaraochavali May 4, 2018

I have removed the parent_.tls_ nullptr check. Regarding creating the TLS histogram during shutdown, I will need to think it through more, so I am not doing that now.

Contributor Author

Reverted this check, as this test case https://github.com/envoyproxy/envoy/blob/master/test/common/stats/thread_local_store_test.cc#L208 fails. Do we need this test at all?

Contributor Author

I think it is better to merge this PR if we are good with the metrics service test changes, since it removes the flakiness. We can do another PR for the parent_.tls_ null check and think about creating the TLS histogram in the shutdown case there.


Signed-off-by: Rama <rama.rao@salesforce.com>
@ramaraochavali
Contributor Author

@mattklein123 Thanks, that is a good idea. I have changed the test to wait for the histogram, counter, and gauges as well. That should fix the flaky behaviour.

Signed-off-by: Rama <rama.rao@salesforce.com>
Contributor

@jmarantz jmarantz left a comment

Thanks for fixing the flake. I'm not sure what's async that's causing this, but it seems like a good remedy.

@@ -135,7 +136,8 @@ void ThreadLocalStoreImpl::releaseScopeCrossThread(ScopeImpl* scope) {
}
}

std::string ThreadLocalStoreImpl::getTagsForName(const std::string& name, std::vector<Tag>& tags) {
std::string ThreadLocalStoreImpl::getTagsForName(const std::string& name,
std::vector<Tag>& tags) const {
Contributor

This whitespace change makes me think you are running the format-fixing script with the clang setup on your machine rather than via docker. In particular it re-wrapped the line to fit in 80 columns when Envoy standard is 100. You probably want to run the format-fixer like this:

./ci/run_envoy_docker.sh './ci/do_ci.sh fix_format'

Contributor Author

I think it is because, with the const addition, the length went up to 106; that is why it wrapped.

Contributor

Sorry. I stared at that for 1 less second than needed & missed the 'const' addition :)

ASSERT(metrics_family.metric(0).has_timestamp_ms());
if (known_counter_exists && known_gauge_exists && known_histogram_exists) {
break;
while (known_counter_exists && known_gauge_exists && known_histogram_exists) {
Contributor

Maybe bound this to a fixed number of loops, or time-bound it, so unit tests don't hang or spin if there's a problem.

Contributor Author

I think it will timeout if not found...

Contributor

I just tried running 'bazel test' where I injected a "while (true) {}" in a test method, and it seems to be running for a very long time. What do you expect would cause such a timeout?

Would you lose anything by (say) having this loop terminate after 100 iterations?

Contributor

My infinite-loop test finally timed out after 5 minutes. I bet you can make the test tighter than that. However, I'd also be happy if we actually plan to fix this robustly in the future, with a TODO.

Contributor Author

I think that makes sense. I have added a TODO to fix it in a more robust way.

Member

We don't have to fix it here, but I think just setting a reasonable-but-not-ridiculous timeout, like a 20 second timer, would be the right way to make this more robust.
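The timer idea suggested here could look roughly like this sketch: keep polling, but give up after a fixed wall-clock deadline so a regression fails within seconds instead of hitting the multi-minute bazel test timeout. The helper name and shape are assumptions, not the real test code.

```cpp
#include <chrono>
#include <functional>

// Sketch: poll `done` until it returns true or the deadline elapses. In the
// real test, a deadline like 20 seconds would replace the unbounded wait.
bool waitWithDeadline(const std::function<bool()>& done,
                      std::chrono::milliseconds deadline) {
  const auto start = std::chrono::steady_clock::now();
  while (std::chrono::steady_clock::now() - start < deadline) {
    if (done()) {
      return true;
    }
  }
  return false; // Deadline hit; subsequent EXPECTs report what is missing.
}
```

Using a monotonic clock (steady_clock) matters here; a wall clock can jump and make the deadline meaningless.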

if (known_counter_exists && known_gauge_exists && known_histogram_exists) {
break;
while (known_counter_exists && known_gauge_exists && known_histogram_exists) {
envoy::service::metrics::v2::StreamMetricsMessage request_msg;
Contributor

Could you also comment why the loop is necessary?

Contributor Author

added

Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
@@ -61,6 +61,7 @@ class MetricsServiceIntegrationTest : public HttpIntegrationTest,
bool known_gauge_exists = false;
// Sometimes stats do not come in the first flush cycle, this loop ensures that we wait till
// required stats are flushed.
// TODO: Figure out a more robust way to find out all required stats have been flushed.
Contributor

TODO(ramaraochavali)?

Contributor Author

sorry. added.

Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
// required stats are flushed.
// TODO(ramaraochavali): Figure out a more robust way to find out all required stats have been
// flushed.
while (known_counter_exists && known_gauge_exists && known_histogram_exists) {
Member

I don't think this makes sense? Won't this never execute? I think you want the negation?

Contributor Author

Ah, sorry. Good catch. Corrected the condition now and added the EXPECT checks at the end as an additional validation.
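The corrected shape of the loop can be sketched as below: iterate while any stat is still missing (the original condition was inverted, so the body never ran). `check_once` is a hypothetical stand-in for draining one metrics request and updating the flags.

```cpp
#include <functional>

// Sketch: keep checking until all three stats have been observed.
// Returns how many passes were needed, for illustration.
int pollUntilAllSeen(const std::function<void(bool&, bool&, bool&)>& check_once) {
  bool counter = false, gauge = false, histogram = false;
  int iterations = 0;
  // Negated condition: loop runs while at least one stat is still missing.
  while (!(counter && gauge && histogram)) {
    check_once(counter, gauge, histogram);
    ++iterations;
  }
  return iterations;
}
```

With the buggy `while (counter && gauge && histogram)` form, the flags start false, so the loop exits immediately and the wait never happens, which is exactly what the review caught.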

Signed-off-by: Rama <rama.rao@salesforce.com>
@jmarantz
Contributor

jmarantz commented May 6, 2018

When this is merged I suspect we can close #3238 as well, but it should be verified.

@mattklein123 mattklein123 merged commit 2a4840f into envoyproxy:master May 7, 2018
@mattklein123
Member

@jmarantz can you check if #3238 is fixed?

@ramaraochavali ramaraochavali deleted the fix/minor_comments branch May 8, 2018 13:04