
stats: minor review comments #3279

Merged
merged 12 commits into from
May 7, 2018

Conversation

ramaraochavali
Contributor

Signed-off-by: Rama rama.rao@salesforce.com
Description:
Fixes remaining minor comments in PR #3262

Risk Level: Low
Testing: Automated tests
Docs Changes: NA
Release Notes: NA

Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
@ramaraochavali
Contributor Author

@mattklein123 @mrice32 @jmarantz
Addressed the review comments in the last PR. PTAL.

  • Specified the correct type
  • Removed lock in tlsHistogram
  • Made getTagsForName const

Signed-off-by: Rama <rama.rao@salesforce.com>
@@ -94,7 +94,8 @@ class MetricsServiceIntegrationTest : public HttpIntegrationTest,
}
EXPECT_TRUE(known_counter_exists);
EXPECT_TRUE(known_gauge_exists);
EXPECT_TRUE(known_histogram_exists);
// TODO : Fix the timing issue with stats flush - This is flaky.
Contributor Author

This check is flaky. Sometimes it fails because of a timing issue, as mentioned in #3238; it fails not just in Valgrind but sometimes on Mac as well. Will look into this.

Member

I would think there should be some way to ensure that a sample was recorded and that a flush was called strictly after that, but I'm not familiar with the internals of this test.

Contributor Author

I need to look into this in more detail. I suspect that sometimes (though very rarely) the initial flush does not include this histogram. While the flakiness is very rare (I have observed it only once), I am a little concerned that other PRs might be affected by this, so I am commenting it out for now. BTW, I am travelling today and over the weekend and mostly busy next week with meetings; I will look into this as soon as I can. @mattklein123 WDYT? Let me know if you have any better ideas.

Contributor

It would be very good to fix this. Flaky tests are a drag on developer velocity. OTOH, commenting out the test means that although test coverage may look OK, you really aren't testing the code :)

Any chance you can dive in and debug as part of this PR?

Contributor Author

As I mentioned, I cannot do it immediately, as I am travelling now and through the early part of next week. If it is OK to leave this PR open, I can dig deeper, but I am a little concerned about other PRs. If you have any quick ideas, do let me know and I can try them quickly.

Member

The thing that comes to mind is to just wait on the histogram existing, vs. it existing after a single metrics request. I.e., allow there to be more than one request until you find the histogram. This will paper over the timing issue if we don't have it in time for the first flush.
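The suggestion above can be sketched as a small polling helper. This is only an illustration of the shape, not Envoy's test code: `waitForStat` and the predicate are hypothetical names, and in the real test the loop body would drain the next metrics request.

```cpp
#include <functional>

// Sketch: poll across multiple flush cycles instead of asserting after the
// first one. Names here are illustrative, not Envoy's actual test API.
bool waitForStat(const std::function<bool()>& stat_seen, int max_flushes) {
  for (int i = 0; i < max_flushes; ++i) {
    if (stat_seen()) {
      return true; // The stat arrived in some flush; stop waiting.
    }
    // In the real test, this is where the next metrics request is drained.
  }
  return false; // Never seen within the allowed number of flushes.
}
```

Bounding the number of flushes keeps a regression from hanging the test forever, which is the concern raised later in this thread.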

Member

@mrice32 mrice32 left a comment

This LGTM other than the flaky test.


@@ -297,7 +299,6 @@ Histogram& ThreadLocalStoreImpl::ScopeImpl::tlsHistogram(const std::string& name
return **tls_ref;
}

std::unique_lock<std::mutex> lock(parent_.lock_);
Member

I think parent_.tls_ must be non-null above. Can you remove that check? Also, this makes me wonder if we can get in here when we are shutting down? I'm guessing maybe? I kind of wonder if during shutdown we should not create a TLS histogram at all, and instead return some type of null histogram to write into, to avoid potentially creating a bunch of TLS histograms that we never store anywhere. TBH I think a similar problem exists for stats/gauges, but it could be much more pronounced with how the histogram code is now written, given that we always do this TLS fetch on every write.
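The null-histogram idea floated above could look roughly like the following. This is a hedged sketch of the design, not Envoy's actual Stats interfaces: all class names here are hypothetical.

```cpp
#include <cstdint>

// Hypothetical minimal interface a histogram writer depends on.
class HistogramSink {
public:
  virtual ~HistogramSink() = default;
  virtual void recordValue(uint64_t value) = 0;
};

// During shutdown, hand writers this no-op sink instead of allocating a real
// TLS histogram; samples are silently dropped and nothing is stored.
class NullHistogram : public HistogramSink {
public:
  void recordValue(uint64_t) override {}
};

// Stand-in for a real TLS histogram, just to show the contrast: it actually
// retains state about recorded samples.
class CountingHistogram : public HistogramSink {
public:
  void recordValue(uint64_t) override { ++samples_; }
  uint64_t samples() const { return samples_; }

private:
  uint64_t samples_ = 0;
};
```

The point of the design is that write paths never need a shutdown check; they always get some sink, and only its behavior differs.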

Contributor Author

@ramaraochavali ramaraochavali May 4, 2018

I have removed the parent_.tls_ nullptr check. Regarding creating the TLS histogram during shutdown, I will need to think it through more, so I am not doing that now.

Contributor Author

Reverted this check, as this test case https://github.com/envoyproxy/envoy/blob/master/test/common/stats/thread_local_store_test.cc#L208 fails. Do we need this test at all?

Contributor Author

I think it is better to merge this PR if we are good with the metrics service test changes, since it removes the flakiness. We can do another PR for the parent_.tls_ null check and think about creating the TLS histogram in the shutdown case there.


Signed-off-by: Rama <rama.rao@salesforce.com>
@ramaraochavali
Contributor Author

@mattklein123 Thanks, that is a good idea. I have changed the test to wait for the histogram, counter, and gauges as well. That should fix the flaky behaviour.

Signed-off-by: Rama <rama.rao@salesforce.com>
Contributor

@jmarantz jmarantz left a comment

Thanks for fixing the flake. I'm not sure what's async that's causing this, but it seems like a good remedy.

@@ -135,7 +136,8 @@ void ThreadLocalStoreImpl::releaseScopeCrossThread(ScopeImpl* scope) {
}
}

std::string ThreadLocalStoreImpl::getTagsForName(const std::string& name, std::vector<Tag>& tags) {
std::string ThreadLocalStoreImpl::getTagsForName(const std::string& name,
std::vector<Tag>& tags) const {
Contributor

This whitespace change makes me think you are running the format-fixing script with the clang setup on your machine rather than via docker. In particular it re-wrapped the line to fit in 80 columns when Envoy standard is 100. You probably want to run the format-fixer like this:

./ci/run_envoy_docker.sh './ci/do_ci.sh fix_format'

Contributor Author

I think it is because, with the const addition, the length went up to 106; that is why it wrapped.

Contributor

Sorry. I stared at that for 1 less second than needed & missed the 'const' addition :)

ASSERT(metrics_family.metric(0).has_timestamp_ms());
if (known_counter_exists && known_gauge_exists && known_histogram_exists) {
break;
while (known_counter_exists && known_gauge_exists && known_histogram_exists) {
Contributor

Maybe bound this to a fixed number of loops, or time-bound it, so unit tests don't hang or spin if there's a problem.

Contributor Author

I think it will timeout if not found...

Contributor

I just tried running 'bazel test' where I injected a "while (true) {}" in a test method, and it seems to be running for a very long time. What do you expect would cause such a timeout?

Would you lose anything by (say) having this loop terminate after 100 iterations?

Contributor

My infinite-loop test finally timed out after 5 minutes. I bet you can make the test tighter than that. However, I'd also be happy if we actually plan to fix this robustly in the future, with a TODO.

Contributor Author

I think that makes sense. I have added a TODO to fix it in a more robust way.

Member

We don't have to fix it here, but I think just setting a reasonable-but-not-ridiculous timeout, like a 20 second timer, would be the right way to make this more robust.
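The timer idea suggested here could look roughly like this sketch: keep polling, but give up after a fixed wall-clock deadline so a regression fails within seconds instead of hitting the multi-minute bazel test timeout. The helper name and shape are assumptions, not the real test code.

```cpp
#include <chrono>
#include <functional>

// Sketch: poll `done` until it returns true or the deadline elapses. In the
// real test, a deadline like 20 seconds would replace the unbounded wait.
bool waitWithDeadline(const std::function<bool()>& done,
                      std::chrono::milliseconds deadline) {
  const auto start = std::chrono::steady_clock::now();
  while (std::chrono::steady_clock::now() - start < deadline) {
    if (done()) {
      return true;
    }
  }
  return false; // Deadline hit; subsequent EXPECTs report what is missing.
}
```

Using a monotonic clock (steady_clock) matters here; a wall clock can jump and make the deadline meaningless.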

if (known_counter_exists && known_gauge_exists && known_histogram_exists) {
break;
while (known_counter_exists && known_gauge_exists && known_histogram_exists) {
envoy::service::metrics::v2::StreamMetricsMessage request_msg;
Contributor

Could you also comment why the loop is necessary?

Contributor Author

added

Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
@@ -61,6 +61,7 @@ class MetricsServiceIntegrationTest : public HttpIntegrationTest,
bool known_gauge_exists = false;
// Sometimes stats do not come in the first flush cycle, this loop ensures that we wait till
// required stats are flushed.
// TODO: Figure out a more robust way to find out all required stats have been flushed.
Contributor

TODO(ramaraochavali)?

Contributor Author

sorry. added.

Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
// required stats are flushed.
// TODO(ramaraochavali): Figure out a more robust way to find out all required stats have been
// flushed.
while (known_counter_exists && known_gauge_exists && known_histogram_exists) {
Member

I don't think this makes sense? Won't this never execute? I think you want the negation?

Contributor Author

Ah, sorry. Good catch. Corrected the condition now and added the EXPECT checks at the end as an additional validation.
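The corrected shape of the loop can be sketched as below: iterate while any stat is still missing (the original condition was inverted, so the body never ran). `check_once` is a hypothetical stand-in for draining one metrics request and updating the flags.

```cpp
#include <functional>

// Sketch: keep checking until all three stats have been observed.
// Returns how many passes were needed, for illustration.
int pollUntilAllSeen(const std::function<void(bool&, bool&, bool&)>& check_once) {
  bool counter = false, gauge = false, histogram = false;
  int iterations = 0;
  // Negated condition: loop runs while at least one stat is still missing.
  while (!(counter && gauge && histogram)) {
    check_once(counter, gauge, histogram);
    ++iterations;
  }
  return iterations;
}
```

With the buggy `while (counter && gauge && histogram)` form, the flags start false, so the loop exits immediately and the wait never happens, which is exactly what the review caught.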

Signed-off-by: Rama <rama.rao@salesforce.com>
@jmarantz
Contributor

jmarantz commented May 6, 2018

When this is merged I suspect we can close #3238 as well, but it should be verified.

@mattklein123 mattklein123 merged commit 2a4840f into envoyproxy:master May 7, 2018
@mattklein123
Member

@jmarantz can you check if #3238 is fixed?

@ramaraochavali ramaraochavali deleted the fix/minor_comments branch May 8, 2018 13:04