Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: simulated time block on sleep #10551

Merged
merged 30 commits into from
Apr 9, 2020

Conversation

jmarantz
Copy link
Contributor

@jmarantz jmarantz commented Mar 26, 2020

Description: SimulatedTimeSystem::sleep() would advance time to trigger firing of timers, but does block on the threads executing those timers completing. That is the behavior does not match real-time sleep(), making it hard to retarget existing integration tests from real to simulated time.

However, there were some unit tests that are not multi-threaded, and used simulated time sleep in fashion where the timer callbacks would run on the same thread, and these need to retain the existing behavior. For these, a new API advanceTime() is now added, which enables callbacks but does not block waiting for them to complete.

Note that this PR does not yet solve all issues with sim-time in integration tests (#10568), as the infrastructure still relies on real-time sleeps in ways that are not fully understood. But it is a step in that direction.
Risk Level: low -- this is is a test/...-only PR.
Testing: /test/... in normal mode, and with tsan.
Docs Changes: n/a
Release Notes: n/a
Fixes: #10567

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
…s in FakeUpstream

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
…which seems to work for the cache test.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz
Copy link
Contributor Author

failure in asan on IpVersions/ProxyFilterIntegrationTest.UpstreamTls/IPv4 in //test/extensions/filters/http/dynamic_forward_proxy:proxy_filter_integration_test

2020-03-30T15:59:20.7841856Z [ RUN      ] IpVersions/ProxyFilterIntegrationTest.UpstreamTls/IPv4
2020-03-30T15:59:20.7843072Z [2020-03-30 15:58:37.264][17][critical][assert] [test/integration/http_integration.cc:376] assert failure: result. Details: Expected new stream event, but got a different event.
2020-03-30T15:59:20.7843919Z AddressSanitizer:DEADLYSIGNAL
2020-03-30T15:59:20.7844544Z =================================================================
2020-03-30T15:59:20.7845498Z ==17==ERROR: AddressSanitizer: ABRT on unknown address 0xfffe00000011 (pc 0x7f33af45e428 bp 0x7ffcbc765f90 sp 0x7ffcbc765668 T0)
2020-03-30T15:59:20.7846545Z     #0 0x7f33af45e427 in raise (/lib/x86_64-linux-gnu/libc.so.6+0x35427)
2020-03-30T15:59:20.7847694Z     #1 0x7f33af460029 in abort (/lib/x86_64-linux-gnu/libc.so.6+0x37029)
2020-03-30T15:59:20.7849259Z     #2 0x105158fc in Envoy::HttpIntegrationTest::waitForNextUpstreamRequest(std::__1::vector<unsigned long, std::__1::allocator<unsigned long> > const&, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >) /proc/self/cwd/test/integration/http_integration.cc:376:3

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
…p on 0ms real time for now

Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz jmarantz marked this pull request as ready for review April 5, 2020 23:28
@jmarantz
Copy link
Contributor Author

jmarantz commented Apr 5, 2020

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s), but failed to run 2 pipeline(s).

@jmarantz jmarantz changed the title WiP test: simulated time block on sleep test: simulated time block on sleep Apr 6, 2020
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this is awesome. A few small comments and I will definitely like @alyssawilk to do a review also if possible. Also, can you try to do a big --runs_per_test run over all the tests that use simulated time to make sure we don't have any flakes? It should be all the tests you modified I guess.

/wait

Comment on lines +31 to +33
* Advances time forward by the specified duration. Timers may be triggered on
* their threads, but unlike advanceTimeWait(), this method does not block
* waiting for them to complete.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some text to each function as to why a test writer would want to use each one? It may not be immediately clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done: added:

   * This function should be used in multi-threaded tests, where other
   * threads are running dispatcher loops. Integration tests should usually
   * use this variant.

and

   * This function should be used in single-threaded tests, in scenarios where
   * after time is advanced, the main test thread will run a dispatcher
   * loop. Unit tests will often use this variant.

Comment on lines 255 to 257
// This real-time polling delay should not be necessary. But without it,
// /test/extensions/filters/http/cache:cache_filter_integration_test fails
// about 40% of the time.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a TODO?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes -- referencing #10568 . Changed.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz
Copy link
Contributor Author

jmarantz commented Apr 8, 2020

Running all tests that referenced SimulatedTime with tsan and runs_per_test=100. Will check back tomorrow.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz
Copy link
Contributor Author

jmarantz commented Apr 8, 2020

All the tests that mention SimulatedTime (which covers more than what was modified in this PR) passed with tsan and --runs_per_test=100.

There are still races (where there is still real-time there are races) I believe but they didn't come out in that run.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz
Copy link
Contributor Author

jmarantz commented Apr 8, 2020

@alyssawilk can you take another look? thanks.

mattklein123
mattklein123 previously approved these changes Apr 8, 2020
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this LGTM.

Copy link
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm mostly trusting Matt on what's called where, so only two remaining nits. The comments are a huge help, thanks.

As said on slack I'd love to see more than 100 runs since flakes generally show up 1:1000 or less.
As a random aside, the only_one_thread naming makes me sad because the function which is supposed to be called by multi-threaded tests checks "only one thread" and I had to spelunk in order to figure out why that was Ok.

const Duration real_time_poll_delay(
std::min(std::chrono::duration_cast<Duration>(std::chrono::milliseconds(50)), duration));
const MonotonicTime end_time = monotonicTime() + duration;

while (true) {
bool cont = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better naming please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to timeout_not_reached

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you forget to upload? I'm not seeing it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope; I thought I'd try intentionally not uploading until you were fine with my proposed changes, to avoid wasted CI churn :)

if (alarms_.empty()) {
// If no alarms are pending, sleep till the end time.
setMonotonicTimeAndUnlock(end_time);
mutex.unlock();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you comment on the unusual lock pattern?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment:

    // This function runs with the caller-provided mutex held. We need to
    // hold this->mutex_ while accessing the timer-queue and blocking on
    // callbacks completing. To avoid potential deadlock we must drop
    // the caller's mutex before taking ours. We also must care to avoid
    // break/continue/return/throw during this non-RAII lock operation.

@mattklein123
Copy link
Member

As a random aside, the only_one_thread naming makes me sad because the function which is supposed to be called by multi-threaded tests checks "only one thread" and I had to spelunk in order to figure out why that was Ok.

+1 to change this and make it more clear. I had to look it up also.

@jmarantz
Copy link
Contributor Author

jmarantz commented Apr 9, 2020

I made some renaming comment changes per suggestion. PTAL just at the comments; won't push till we finish iterating on the wording.

I also successfully ran 1000 iters of tsan on 67 tests that mentioned SimulatedTime. That took 19 hours.

I have not done anything about the only_one_thread confusion. I'll take responsibility for that but it's not really part of this PR; can we iterate on better naming for that in a follow-up? Do you have a specific suggestion as to what would be clearer to you?

Copy link
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay on tests, thanks for the overnight run!
Also thanks for picking up the rename - I'd totally prefer a follow-up so you're good there.

const Duration real_time_poll_delay(
std::min(std::chrono::duration_cast<Duration>(std::chrono::milliseconds(50)), duration));
const MonotonicTime end_time = monotonicTime() + duration;

while (true) {
bool cont = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you forget to upload? I'm not seeing it?

@alyssawilk
Copy link
Contributor

oh bleh, you even said, but I only saw the last email. yeah SGTM.

@alyssawilk
Copy link
Contributor

Also fwiw this is all I had to say - headed afk to see if I can rest my headache away but I'm fine with Matt merging if he signs off else I'll TAL when I'm feeling better

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz
Copy link
Contributor Author

jmarantz commented Apr 9, 2020

@mattklein123 you want to take a final look (maybe just at the local-var name-change and comment?)

@mattklein123 mattklein123 merged commit f11be0b into envoyproxy:master Apr 9, 2020
@jmarantz jmarantz deleted the simtime-block-on-sleep branch April 9, 2020 23:04
oschaaf added a commit to oschaaf/nighthawk that referenced this pull request Apr 10, 2020
Make unit tests to use advanceTimeWait in favor of sleep

(see envoyproxy/envoy#10551)

Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>
mum4k pushed a commit to envoyproxy/nighthawk that referenced this pull request Apr 11, 2020
Make unit tests to use advanceTimeWait in favor of sleep

(see envoyproxy/envoy#10551)

Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

test: SimulatedTimeSystem::sleep() should block on timer-callbacks all completing
3 participants