stats: minor review comments #3279
Conversation
Signed-off-by: Rama <rama.rao@salesforce.com>
@mattklein123 @mrice32 @jmarantz
Signed-off-by: Rama <rama.rao@salesforce.com>
@@ -94,7 +94,8 @@ class MetricsServiceIntegrationTest : public HttpIntegrationTest,
  }
  EXPECT_TRUE(known_counter_exists);
  EXPECT_TRUE(known_gauge_exists);
  EXPECT_TRUE(known_histogram_exists);
  // TODO : Fix the timing issue with stats flush - This is flaky.
This check is flaky. Sometimes it fails because of a timing issue, as mentioned in #3238, and not just under Valgrind: it sometimes fails on Mac as well. I will look into this.
I would think there should be some way to ensure that a sample was recorded and that a flush was called strictly after that, but I'm not familiar with the internals of this test.
I need to look into it in more detail. I suspect that sometimes (though very rarely) the initial flush does not include this histogram. While the flakiness is very rare (I have observed it only once), I am a little concerned that other PRs might be affected, so I am commenting it out for now. BTW, I am travelling today and over the weekend, and I will be mostly busy next week with meetings; I will look into this as soon as I can. @mattklein123 WDYT? Let me know if you have any better ideas.
It would be very good to fix this. Flaky tests are a drag on developer velocity. OTOH, commenting out the test means that although the test coverage may look OK, you really aren't testing the code :)
Any chance you can dive in and debug as part of this PR?
As I mentioned, I cannot do it immediately, as I am travelling now and through the early part of next week. If it is OK to leave this PR open, I can dig deeper, but I am a little concerned about other PRs. If you have any quick ideas, do let me know and I can try them quickly.
The thing that comes to mind is to just wait on the histogram existing, vs. it existing after a single metrics request. I.e., allow more than one request until you find the histogram. This will paper over the timing issue if the histogram doesn't arrive in time for the first flush.
This LGTM other than the flaky test.
@@ -297,7 +299,6 @@ Histogram& ThreadLocalStoreImpl::ScopeImpl::tlsHistogram(const std::string& name
  return **tls_ref;
}

std::unique_lock<std::mutex> lock(parent_.lock_);
I think parent_.tls_ must be non-null above. Can you remove that check? Also, this makes me wonder whether we can get in here while we are shutting down? I'm guessing maybe? I kind of wonder if, during shutdown, we should not create a TLS histogram at all and instead return some type of null histogram to write into, to avoid potentially creating a bunch of TLS histograms that we never store anywhere. TBH I think a similar problem exists for stats/gauges, but it could be much more pronounced with how the histogram code is now written, given that we always do this TLS fetch on every write.
I have removed the parent_.tls_ nullptr check. Regarding creating a TLS histogram during shutdown, I will need to think that through more, so I am not doing it now.
Reverted this check, as the test case at https://github.com/envoyproxy/envoy/blob/master/test/common/stats/thread_local_store_test.cc#L208 fails without it. Do we need this test at all?
I think it is better to merge this PR if we are good with the metrics service test changes, since it removes the flakiness. We can do another PR for the parent_.tls_ null check and think about the create-TLS-histogram-during-shutdown case there.
Signed-off-by: Rama <rama.rao@salesforce.com>
@mattklein123 Thanks, that is a good idea, I think. I have changed the test to wait for the histogram, and for the counters and gauges as well. That should fix the flaky behaviour.
Signed-off-by: Rama <rama.rao@salesforce.com>
Thanks for fixing the flake. I'm not sure what's async that's causing this, but it seems like a good remedy.
@@ -135,7 +136,8 @@ void ThreadLocalStoreImpl::releaseScopeCrossThread(ScopeImpl* scope) {
  }
}

std::string ThreadLocalStoreImpl::getTagsForName(const std::string& name, std::vector<Tag>& tags) {
std::string ThreadLocalStoreImpl::getTagsForName(const std::string& name,
                                                 std::vector<Tag>& tags) const {
This whitespace change makes me think you are running the format-fixing script with the clang setup on your machine rather than via docker. In particular, it re-wrapped the line to fit in 80 columns, when the Envoy standard is 100. You probably want to run the format fixer like this:
./ci/run_envoy_docker.sh './ci/do_ci.sh fix_format'
I think it is because, with the const addition, the line length went up to 106, which is why it wrapped.
Sorry. I stared at that for 1 less second than needed & missed the 'const' addition :)
ASSERT(metrics_family.metric(0).has_timestamp_ms());
if (known_counter_exists && known_gauge_exists && known_histogram_exists) {
  break;
while (known_counter_exists && known_gauge_exists && known_histogram_exists) {
Maybe bound this to a fixed number of loops, or time-bound it, so unit tests don't hang or spin if there's a problem.
I think it will time out if the stats are not found...
I just tried running 'bazel test' where I injected a "while (true) {}" in a test method, and it seems to be running for a very long time. What do you expect would cause such a timeout?
Would you lose anything by (say) having this loop terminate after 100 iterations?
My infinite-loop test finally timed out after 5 minutes. I bet you can make the test tighter than that. However, I'd also be happy if we actually plan to fix this robustly in the future, with a TODO.
I think that makes sense. I have added a TODO to fix it in a more robust way.
We don't have to fix it here, but I think just setting a reasonable-but-not-ridiculous timeout, like a 20-second timer, would be the right way to make this more robust.
if (known_counter_exists && known_gauge_exists && known_histogram_exists) {
  break;
while (known_counter_exists && known_gauge_exists && known_histogram_exists) {
envoy::service::metrics::v2::StreamMetricsMessage request_msg;
Could you also comment why the loop is necessary?
added
Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
@@ -61,6 +61,7 @@ class MetricsServiceIntegrationTest : public HttpIntegrationTest,
  bool known_gauge_exists = false;
  // Sometimes stats do not come in the first flush cycle, this loop ensures that we wait till
  // required stats are flushed.
  // TODO: Figure out a more robust way to find out all required stats have been flushed.
TODO(ramaraochavali)?
Sorry. Added.
Signed-off-by: Rama <rama.rao@salesforce.com>
// required stats are flushed.
// TODO(ramaraochavali): Figure out a more robust way to find out all required stats have been
// flushed.
while (known_counter_exists && known_gauge_exists && known_histogram_exists) {
I don't think this makes sense? Won't this never execute? I think you want the negation?
Ah, sorry. Good catch. Corrected the condition now and added the EXPECT checks at the end as an additional validation.
Signed-off-by: Rama <rama.rao@salesforce.com>
When this is merged, I suspect we can close #3238 as well, but that should be verified.
Signed-off-by: Rama <rama.rao@salesforce.com>
Description:
Fixes remaining minor comments in PR #3262.
Risk Level: Low
Testing: Automated tests
Docs Changes: NA
Release Notes: NA