Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

integration tests should use simulated time #4160

Open
jmarantz opened this issue Aug 15, 2018 · 12 comments · Fixed by #4340 or #4343
Open

integration tests should use simulated time #4160

jmarantz opened this issue Aug 15, 2018 · 12 comments · Fixed by #4340 or #4343
Labels

Comments

@jmarantz
Copy link
Contributor

jmarantz commented Aug 15, 2018

Description:

A slow testing environment causes test failures and flakes, such as the one described in #4135. It also makes it hard to run tests under a debugger, which, in addition to helping to debug tests, is a strategy I use sometimes to learn code.

My feeling is that the root of this problem (and others) is singleton patterns like:
ProdMonotonicTimeSource::instance_

In the past I've used strategies where these get passed downward as the world gets constructed, so singletons are never referenced in code that needs to be tested (or even exist, usually).

My hope is that injecting MonotonicTime refs into key classes in the system will eliminate the need to reference singletons deeply.

@jmarantz
Copy link
Contributor Author

jmarantz commented Aug 15, 2018

related: I'm wondering if there's a strong reason to have SystemTimeSource and MonotonicTimeSource as separate class hierarchies, as opposed to just having monotonicTime() and currentTime() as two separate methods on the same class? This might make injection a little less cumbersome.

@mattklein123
Copy link
Member

@jmarantz +1. Kill those statics! 😉

related: I'm wondering if there's a strong reason to have SystemTimeSource and MonotonicTimeSource as separate class hierarchies, as opposed to just having monotoniceTime() and currentTime() as two separate methods on the same classes? This might make injection a little less cumbersome.

No reason. I would merge.

@jmarantz
Copy link
Contributor Author

Here's my rough plan:

  1. test infrastructure: inject time everywhere #4180 which adds TimeSource as a container around SystemTimeSource and MonotonicTimeSource and removes the static time-sources, changing zero behavior but touching 120+ files. The production changes here are minimal but the test changes are pervasive, and a new helper class TestTime had to be instantiated in many places, and uses real-time (which is what most tests use).

  2. Unit test consistency: currently there are many tests that today use a mixture of mock time and real time. We should migrate tests incrementally to use mock-time (or maybe some kind of fake time) by default, so this can be done in a few different PRs.

  3. Migrate integration tests to mock (or fake) time. I think we will also need need a test-version of DispatcherImpl and/or TimerImpl that work with simulated time.

  4. Remove SystemTimeSource and MonotonicTimeSource as individual classes, passing only TimeSource everywhere.

htuch pushed a commit that referenced this issue Aug 23, 2018
…4226)

The management of the timer_ variable in the test class was a little funky, and looked leak-prone. This PR cleans this up a little,. Without this PR, #4180 memory-leaks on this test, I think due to overwriting timer_. This is all toward resolution of #4160

Risk Level: Low.
Testing: //test/common/config:grpc_mux_impl_test
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>
htuch pushed a commit that referenced this issue Aug 24, 2018
This is one step toward addressing #4160 though this strategy may need to be backed out and re-worked. See the bug comments for more detail on the plan. The main value of this PR is to get rid of the Prod*TimeSource::instance_ statics, so we can inject time into tests. This PR doesn't actually inject time into tests differently; it just makes time injectable.

Apologies about the gigantic PR. The follow-ups can be more incremental.

Risk Level: medium: this should be non-functional, but a lot of lines of code changed. There is a fairly significant risk of merge issues too, so when merging, after this, we should be careful.
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>
htuch pushed a commit that referenced this issue Aug 26, 2018
…ime from source. (#4248)

Time should be injected in all production code, so that it can ultimately be tested using mock time or simulated time. Production code that directly instantiates ProdSystemTimeSource makes this impossible, and this PR ensures that 'format' checks fail if that is violated. This partially addresses issue #4160.

Risk Level: low
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz
Copy link
Contributor Author

jmarantz commented Aug 26, 2018

Note to self:

  % rgrepn 'std::chrono::steady_clock::now()' "*.cc"|wc -l
  35
  % rgrepn 'std::chrono::steady_clock::now()' "*.h"|wc -l
  24

@jmarantz
Copy link
Contributor Author

jmarantz commented Aug 28, 2018

Jotting down a rough roadmap to resolving this:

  1. test infrastructure: inject time everywhere #4180 and and time: Add 'format' test to ensure no one directly instantiates Prod*Time from source. #4248 merged, establishing TimeSource injection in many places.
  2. Cleanup PR test infra: Remove timeSource() from the ClusterManager api #4247 pending, addressing some of @htuch's comments from test infrastructure: inject time everywhere #4180
  3. PR time: make TimeSource virtual, and remove MonotonicTimeSource and ProdTimeSource #4261, merging MonotonicTimeSource and SystemTimeSource into just TimeSource.
  4. Will request a review for time: Event::TimeSystem abstraction to make it feasible to inject time with simulated timers #4257, which derives Event::TimeSystem from TimeSource, establishing a simulated system of timers, used by Dispatcher.
  5. Create a new SimulatedTimeSystem variant, which maintains a heap of pending timers and depends on testbeds to advance time manually, e.g. through TimeSystem::sleep(duration), a wrapper around CondVar::waitWithTimeout or similar: WIP: create SimulateTimeSystem for running tests with simulated time. #4284
  6. rip through the 59 references to std::chrono::steady_clock::now() and make them use the time system: time: Remove all usages of chrono::*::now() and instead plumb in TimeSystem or TimeSource #4434

As I look at integration tests now, blocking with timeouts is done in the main test thread and it's not clear to me which thread should be advancing simulated time yet.

htuch pushed a commit that referenced this issue Aug 28, 2018
Follow-up to #4180 -- removes ClusterManager::timeSource() as an API. Another step toward resolving #4160

Risk Level: low
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>
htuch pushed a commit that referenced this issue Sep 5, 2018
…e with simulated timers (#4257)

Unifies the TimeSource so that system and monotonic sources are always handled similarly (mocks, real-time, or in the future, simulated), and derives a new interface, Event::TimeSystem, which managers timers. This is a pure refactor; none of the operation or semantics of any production or test code is changed with this PR. This is another step in resolving #4160.

Risk Level: medium, just because PR is large and merge conflicts happen, but it should be pretty safe
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>
htuch pushed a commit that referenced this issue Sep 7, 2018
…stem() as appropriate (#4343)

In #4257, Event::TimeSystem was introduced to augment the TimeSource abstraction to include timers. Most places in the system that had previously held onto a TimeSource held it in a member variable time_source_ and, if appropriate, exposed it as a method timeSource(), and #4257 changed those to Event::TimeSystem without renaming them, in order to minimize diffs. This PR just renames the members and methods to time_system_ and timeSystem() as appropriate. This is part of the chain of CLs to resolve #4160.

Risk Level: low -- no functional risk, but depending on timing of when this gets merged, an easily-remedied build breakage could arise.
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz
Copy link
Contributor Author

jmarantz commented Sep 7, 2018

Can we leave this open? We still don't have steps 5 and 6 above done.

@htuch
Copy link
Member

htuch commented Sep 7, 2018

Reopening as suggested by @jmarantz.

@htuch htuch reopened this Sep 7, 2018
zuercher pushed a commit that referenced this issue Sep 11, 2018
This makes it possible to incorporate simulated time (#4340) into integration tests, driving toward resolution of #4160 .

Risk Level: low
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz
Copy link
Contributor Author

One of the symptoms of this not being resolved, I suspect, is a tendency for this test to timeout sometimes:

TIMEOUT: //test/extensions/filters/network/thrift_proxy:integration_test (Summary)

htuch pushed a commit that referenced this issue Sep 17, 2018
…System or TimeSource (#4434)

In order to fully control time during tests, direct calls to system-provided time functions must not be used. These had previously not been fixed. In at least one case I think there was real-time deltas being compared in a test, which really needed to be mocked, so that change had to occur in this PR. Another step toward resolution of #4160

Risk Level: medium, due to size of PR.
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>
htuch pushed a commit that referenced this issue Sep 17, 2018
Add an alternate TimeSystem implementation which keeps manually-adjusted time and timer-callbacks making sense together. This is intended for use in most if not all unit tests, and many integration tests. This is another step to resolve #4160.

Risk Level: low -- this is a new test-only class that's not used anywhere yet.
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz jmarantz reopened this Jan 28, 2019
@jmarantz jmarantz added the no stalebot Disables stalebot from closing an issue label Jan 28, 2019
jmarantz added a commit to jmarantz/envoy that referenced this issue Jan 28, 2019
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@mattklein123 mattklein123 removed no stalebot Disables stalebot from closing an issue labels Feb 21, 2019
mpuncel added a commit to mpuncel/envoy that referenced this issue Feb 26, 2019
This partially addresses envoyproxy#4160

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
dnoe pushed a commit that referenced this issue Feb 27, 2019
This partially addresses #4160

Signed-off-by: Michael Puncel mpuncel@squareup.com

Description: Migrate to simulated time instead of real time in router tests
Risk Level: low
Testing: unit tests
Docs Changes: n/a
Release Notes: n/a
Partially Fixes #4160
fredlas pushed a commit to fredlas/envoy that referenced this issue Mar 5, 2019
This partially addresses envoyproxy#4160

Signed-off-by: Michael Puncel mpuncel@squareup.com

Description: Migrate to simulated time instead of real time in router tests
Risk Level: low
Testing: unit tests
Docs Changes: n/a
Release Notes: n/a
Partially Fixes envoyproxy#4160

Signed-off-by: Fred Douglas <fredlas@google.com>
@jmarantz jmarantz reopened this Mar 29, 2020
@jmarantz
Copy link
Contributor Author

This was not fully closed. There are still a few barriers to enabling sim-time for all integration tests.

@zhxie
Copy link
Contributor

zhxie commented Jan 26, 2022

As mentioned in #18157, TestRealTimeSystem (generated by GlobalTimeSystem) in MockDispatcher and SimulatedTimeSystem in MockStreamInfo are conflict and may cause errors in tests. Since we have planned to switch the time system in GlobalTimeSystem, I can take a try if no one is working on it.

@jmarantz
Copy link
Contributor Author

We can only use one time system at a time, either a Simulated one or a Real one.

The "GlobalTimeSystem" is just a helper class to get access to a time-system reference to the Real or Simulated system that already has been declared (or to declare a default real-time system if one is not already declared).

To make a particular test use SimulatedTime you should only need to make its test-class derive from Event::TestUsingSimulatedTime, from test/test_common/simulated_time_system.h .

You may experience issues with simulated time if there are parts of the test that depend on real time and don't provide a mechanism to inject time. One place I think this might come up with is gRPC which doesn't (afaik) provide a time-injection hook, but may use sleeps and real-time timeouts that don't work well with simulated time. In the absence of code under test with a dependence on real time elapsing, simulated time should make tests run faster with fewer flakes.

@rgs1
Copy link
Member

rgs1 commented Apr 25, 2022

FWIW, I haven't seen //test/extensions/filters/network/thrift_proxy:integration_test fail in a while. Hopefully #20471 helped too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants