test: long-running nexmark on madsim #5170

Closed
Tracked by #4180
wangrunji0408 opened this issue Sep 7, 2022 · 5 comments

@wangrunji0408
Contributor

We hope to run nexmark for a long period in deterministic simulation to find more stability issues.

Some potential challenges:

  • limited compute resources: one test can only run on 1 CPU core due to the determinism requirement.
  • limited memory capacity: a long-running task will generate a huge amount of data, more than fits in memory. We may need to dump data to disk.
@lmatz
Contributor

lmatz commented Nov 8, 2022

> limited compute resources: one test can only run on 1 CPU core due to the determinism requirement.

For the set of computations without any side effects, i.e. pure expression evaluation, can they be executed concurrently on multiple cores?

@wangrunji0408
Contributor Author

wangrunji0408 commented Nov 15, 2022

I missed it. Sorry for the late response.

> For the set of computations without any side effects, i.e. pure expression evaluation, can they be executed concurrently on multiple cores?

Yes. But unfortunately, it's hard for madsim to identify which tasks are pure computation. In practice, tasks without any side effects hardly exist. To some degree, they all interact with each other through channels or shared state. Once that happens, we have to fix the order of the two tasks, otherwise determinism is broken. If we could intervene every time a task makes a side effect, then parallel execution seems possible. But I feel that it would take a lot of effort, determinism would be hard to guarantee, and I'm afraid it could not be parallelized well given the ubiquitous dependencies. 🥹
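To illustrate why the order has to be fixed, here is a minimal sketch of a single-threaded scheduler driven by a seeded PRNG. It is a hypothetical toy, not madsim's implementation: the names `Lcg` and `run` are made up for this example, and the "tasks" only append their id to a shared log.

```rust
/// Small linear congruential generator, so that the seed alone decides
/// every scheduling choice. Hypothetical helper, not part of madsim.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        // Multiplier and increment from Knuth's MMIX generator.
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Run `tasks` toy tasks; each appends its id to a shared log `steps` times.
/// Every append is a side effect on shared state, so the order matters.
fn run(seed: u64, tasks: usize, steps: usize) -> Vec<usize> {
    let mut rng = Lcg(seed);
    let mut remaining = vec![steps; tasks];
    let mut log = Vec::new(); // shared state touched by every task
    loop {
        // Tasks that still have work to do.
        let runnable: Vec<usize> = (0..tasks).filter(|&t| remaining[t] > 0).collect();
        if runnable.is_empty() {
            break;
        }
        // The scheduler, not the OS, picks who runs next.
        let pick = runnable[(rng.next_u64() as usize) % runnable.len()];
        remaining[pick] -= 1;
        log.push(pick);
    }
    log
}
```

Calling `run(42, 3, 4)` twice yields the same log, because every choice comes from the seed. If the tasks ran on several OS threads instead, the order of the appends would depend on thread timing, and the same seed would no longer reproduce the same run.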

Looking at it from the other side, simply speeding up the execution may not be the right direction for this problem. Concurrency bugs usually have a small depth, which means they can happen within a few steps if you carefully construct the schedule sequence. So they should be found quickly by massive simulations with different seeds. If they aren't, the reason could be that some precondition is not satisfied. For example, the storage data is not large enough to trigger compaction. The only way to meet this condition from scratch is to run data ingestion for a long time. However, why do we have to run from scratch? If our simulator supports loading from a checkpoint, we can prepare a large dataset in advance and start directly from there. That's what we plan to do next.
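The point about shallow bugs and seed sweeps can be sketched with a toy lost-update race. This is only an illustration under assumed names (`Lcg`, `simulate`, `find_failing_seed` are hypothetical and not madsim or RisingWave code): two tasks each perform a non-atomic increment, and a seeded scheduler decides the interleaving.

```rust
/// Small seeded PRNG so that a seed fully determines the schedule.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Two tasks each do "read counter, then write back read + 1".
/// Returns the final counter: 2 is correct, 1 means an update was lost.
fn simulate(seed: u64) -> u32 {
    let mut rng = Lcg(seed);
    let mut counter = 0u32;
    let mut local: [Option<u32>; 2] = [None, None]; // value each task has read
    let mut done = [false, false];
    while !(done[0] && done[1]) {
        let t = (rng.next_u64() % 2) as usize;
        if done[t] {
            continue; // this task has finished, pick again
        }
        let state = local[t];
        match state {
            // Step 1: read the shared counter.
            None => local[t] = Some(counter),
            // Step 2: write back the stale value plus one.
            Some(v) => {
                counter = v + 1;
                done[t] = true;
            }
        }
    }
    counter
}

/// Sweep seeds 0..limit and return the first one whose schedule loses an update.
fn find_failing_seed(limit: u64) -> Option<u64> {
    (0..limit).find(|&seed| simulate(seed) != 2)
}
```

Any schedule in which both reads happen before either write loses an update, so the bug sits only two steps deep and a sweep over a modest number of seeds is likely to hit it. Once a failing seed is found, calling `simulate` with it replays exactly the same schedule. A bug that additionally needs a large precondition, such as enough data to trigger compaction, would not show up this way no matter how many seeds are tried, which is where starting from a checkpoint comes in.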

@lmatz
Contributor

lmatz commented Nov 16, 2022

Thanks for the detailed explanation!

> If our simulator supports loading from a checkpoint, we can prepare a large dataset in advance and start directly from there. That's what we plan to do next.

It makes sense!

@TennyZhuang
Contributor

Any updates?

@wangrunji0408
Contributor Author

After some rethinking, I decided to make this issue low-priority, since long-running tests also make bugs slow to reproduce. We wouldn't gain much from it compared with the existing longevity tests. Instead, I have been adding more short-term fault injection tests (e.g. #7623) so that problems can be found more efficiently.

fuyufjh removed this from the release-1.1 milestone Aug 8, 2023
wangrunji0408 closed this as not planned Jul 31, 2024