Asynchronous cgroup controller configuration #17

Open
tsturzl opened this issue May 22, 2021 · 13 comments
@tsturzl
Collaborator

tsturzl commented May 22, 2021

We should be able to configure each of the cgroup subsystems concurrently and hopefully see some performance improvement. One thing to note: write order within a controller is important to keep track of, since the kernel can validate values against each other. Initially it may be best to write each controller's configs in series while the controllers themselves execute concurrently with one another, and eventually move to more granular concurrency.
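
A rough sketch of that first step (hypothetical trait and names, not youki's actual API), where each controller awaits its own writes in order but the controllers run concurrently:

```rust
use std::path::Path;

use async_trait::async_trait;
use futures::future::try_join_all;

#[async_trait]
trait AsyncController {
    // Writes this controller's settings one after another, in order.
    async fn apply(&self, cgroup_path: &Path) -> anyhow::Result<()>;
}

async fn apply_all(
    controllers: &[Box<dyn AsyncController + Send + Sync>],
    cgroup_path: &Path,
) -> anyhow::Result<()> {
    // All controllers are polled concurrently; ordering is only preserved
    // *within* each controller's `apply`.
    try_join_all(controllers.iter().map(|c| c.apply(cgroup_path))).await?;
    Ok(())
}
```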

@utam0k
Member

utam0k commented May 22, 2021

I think this would be a wonderful thing to achieve. I would love to verify and implement it.

@tsturzl
Collaborator Author

tsturzl commented May 25, 2021

I'll probably be building out some kind of profiling and benchmarking on this to see what kind of improvements this feature brings. I've got a lot of the initial work done for making the manager async. I need to do some work to get the unit tests working again. Hoping to have the integration tests running on this by tonight.

@utam0k
Member

utam0k commented May 26, 2021

I was trying this out on my local machine, and I can clearly see that cgroups is making things run slower than before.
So I still think that improving performance with async/await will be quite an interesting challenge.
For benchmarking, this may be helpful:
TaKO8Ki/frum#16

@tsturzl
Collaborator Author

tsturzl commented Jun 10, 2021

I'm going to reattempt this. I found out that almost all async runtimes actually handle file operations by just moving file IO off into a thread pool; I had naively assumed they chose between different kernel APIs depending on the build target. In our case that isn't great: the cost of starting a thread pool diminishes, and possibly outright negates, the potential performance improvement of concurrent cgroups configuration. I'd like to approach this from a lower level. It would be an interesting experiment to write a small program to see whether AIO or epoll work well for interacting with a cgroup controller; otherwise io_uring might address all of the possible caveats and dead ends in the other async IO kernel features. As far as I'm aware, AIO and epoll are both incapable of doing the more typical buffered IO, which some virtual filesystems support exclusively. If the cgroup vfs handles direct IO well, then AIO or epoll might be preferable simply because older kernel versions support them, while io_uring is relatively new and only available in kernel 5.1 or greater.

However, io_uring is a much better way of doing async IO: the only system calls required are the ones that set up the shared ring buffers with the kernel. That means less context switching; you just add file operations to the submission ring and listen for completions on the completion ring. I also have some recent experience hacking on io_uring, and it's probably a safer interface than AIO and faster than both epoll and AIO. None of these does us much good for supporting Windows, though, so it might be preferable to hide this feature behind a compile-time feature flag. With all of the active development on cgroups v1 and v2, it might actually be better to break the async controllers out into their own half of the cgroup module so they can trail behind the progress of the synchronous implementation.
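
For reference, a minimal sketch of that submit/complete flow using the rio crate (the cgroup path and value are made up; double-check rio's current API before relying on this):

```rust
use std::fs::OpenOptions;

fn main() -> std::io::Result<()> {
    // Set up the shared submission/completion rings once per process.
    let ring = rio::new()?;

    // Example cgroup file; the path is illustrative only.
    let file = OpenOptions::new()
        .write(true)
        .open("/sys/fs/cgroup/memory/mycontainer/memory.limit_in_bytes")?;

    // Queue the write on the submission ring...
    let buf: &[u8] = b"1073741824";
    let completion = ring.write_at(&file, &buf, 0);

    // ...and block (or `.await`) until the completion entry arrives.
    let written = completion.wait()?;
    println!("wrote {} bytes", written);
    Ok(())
}
```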

@utam0k
Member

utam0k commented Jun 11, 2021

@tsturzl
I didn't follow all of it, because I don't know much about this area yet, but I basically agree with you.
My one concern is that 5.1 is too recent a kernel; if possible, it would be better to change the behavior depending on the kernel version.
That said, I think it is an interesting challenge to use advanced features of the Linux kernel that no other container runtime uses.

@tsturzl
Collaborator Author

tsturzl commented Jun 12, 2021

Further development on this issue will be postponed and effectively blocked by work happening on the cgroups implementations #9 and #78.

Some details and notes on the desired implementation. I think the ring buffers should be set up in the manager and passed through to each controller, which will likely mean some changes to the controller trait. Additionally, the opening of file handles will need to move to the point at which we await the writes, likely in the Controller::apply function. This is because the file descriptor must remain open and outlive the IO operation until a completion is received, so opening the file earlier in the process and passing those file descriptors by reference will likely be required to prevent a use-after-free. For the most part rio provides a lot of tolerance against misuse, with the only caveat I'm aware of being the potential for a use-after-free. With the design outlined above we should be able to avoid that completely, and in cases where there are multiple writes to a single file it is a better design anyway.
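
A hypothetical sketch of that shape, with the manager owning the ring and each controller opening its files only where the writes are awaited (controller and file names are illustrative, not the real trait):

```rust
use std::fs::OpenOptions;
use std::path::Path;

struct Manager {
    // Created once with rio::new() and shared with every controller.
    ring: rio::Rio,
}

struct CpuController;

impl CpuController {
    fn apply(&self, ring: &rio::Rio, cgroup_dir: &Path) -> std::io::Result<()> {
        // The File is owned here so it outlives the queued IO; the
        // completion only borrows it, which rules out a use-after-free.
        let file = OpenOptions::new()
            .write(true)
            .open(cgroup_dir.join("cpu.shares"))?;
        let buf: &[u8] = b"1024";
        ring.write_at(&file, &buf, 0).wait()?;
        Ok(())
    }
}

fn apply_all(manager: &Manager, cgroup_dir: &Path) -> std::io::Result<()> {
    CpuController.apply(&manager.ring, cgroup_dir)
}
```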

@utam0k
Member

utam0k commented Jun 12, 2021

@tsturzl Why not create a feature branch and work on it there?

@tsturzl
Collaborator Author

tsturzl commented Jun 12, 2021

@utam0k I expect things to change quite a bit with cgroups, and it seems like it would be hard to keep resolving conflicts between the async branch and the baseline implementation. It couldn't be merged until most of the cgroup work is done anyway without disrupting other work in flight. I might decide to start before the cgroups are complete, but I feel it's more worth the effort once cgroups are close to completion. I might work on some of the boilerplate in the meantime, since much of common is going to need some rework for cgroups anyway. Once cgroups are complete I don't expect the async implementation to take me much time; I assume most of my time will be spent testing. For that it would be nice to have good unit and integration tests for cgroups v2 as well, since I know runtime-tools doesn't yet support v2.

@tsturzl
Collaborator Author

tsturzl commented Aug 14, 2021

I'm providing a long overdue update on the progress here. As previously mentioned, I had halted my attempts at this. To reiterate the earlier hurdles: pretty much every async framework for Rust pushes file operations off into a threadpool instead of doing true async file IO. It turns out Linux has only recently gained a decent way of doing async file IO, io_uring (exposed through liburing), which basically shares two ring buffers with the kernel, allowing IO operations and their completions to be queued. I've long followed the Rio project, which currently seems like the safest way to use this interface from Rust. I picked this effort up again recently and have been making progress with Rio now that development on cgroups is mostly complete.

Where I'm at now: I have a majority of the cgroups v1 features implemented on top of io_uring. Unfortunately it's a bit slow moving, as the interactions with the filesystem have had to change because uring and typical blocking buffered IO behave differently. There have been a few caveats. For example, instead of making several writes to the devices allow and deny files, I now write the entire contents in a single operation, since the different means of writing seem to behave inconsistently with each other. In fact the devices files are very odd: if you write to a normal file the same way you write to the devices files, you end up continuously overwriting the last entry, whereas each write to the devices files somehow ends up on a new line instead of overwriting. So in general the cgroup filesystem doesn't work the way a normal filesystem would, and there are lots of caveats to doing asynchronous IO on these files. With these changes I've also had to rework some tests, which has taken time. I suspect that by the time I get to integration testing v1 cgroups I'll have a lot of issues to resolve.
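
To illustrate the single-write approach for the devices files (the helper name and rule strings here are made up, not youki's actual code):

```rust
use std::fs::OpenOptions;
use std::path::Path;

// Build the whole rule list up front and submit it as one write,
// rather than issuing one write per rule.
fn write_device_rules(ring: &rio::Rio, cgroup_dir: &Path, rules: &[&str]) -> std::io::Result<()> {
    let file = OpenOptions::new()
        .write(true)
        .open(cgroup_dir.join("devices.allow"))?;
    let payload = rules.join("\n") + "\n";
    let buf: &[u8] = payload.as_bytes();
    // One submission, one completion for the entire list.
    ring.write_at(&file, &buf, 0).wait()?;
    Ok(())
}
```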

I have yet to set up profiling or benchmarks, but I plan to start benchmarking once cgroups v1 is complete. My hope is to finish v1, bring it up to date with the main branch, and then run comparative benchmarks using criterion to see whether there are any significant performance improvements. I also plan to set up flamegraph to get useful performance profiles for youki and a real understanding of the difference between async cgroups and blocking IO.
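
A minimal sketch of what that comparative benchmark could look like with criterion (`apply_blocking` and `apply_uring` are placeholders, not existing youki functions):

```rust
use criterion::{criterion_group, criterion_main, Criterion};

fn cgroup_apply_benchmark(c: &mut Criterion) {
    c.bench_function("cgroups v1 apply (blocking)", |b| {
        b.iter(|| {
            // apply_blocking(&spec)  // placeholder for the synchronous manager
        })
    });
    c.bench_function("cgroups v1 apply (io_uring)", |b| {
        b.iter(|| {
            // apply_uring(&spec)     // placeholder for the rio-based manager
        })
    });
}

criterion_group!(benches, cgroup_apply_benchmark);
criterion_main!(benches);
```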

If cgroups v1 passes tests and shows notable performance gains, I will hopefully create a PR for it before the end of the month. I will then look at applying the same changes to cgroups v2, probably in a separate PR.

@utam0k
Member

utam0k commented Aug 15, 2021

@tsturzl Excellent progress. Is it possible to create a draft PR, even if it doesn't work yet? I'm very interested in this experiment and excited to see how the code turns out.

@utam0k
Member

utam0k commented Aug 18, 2021

@tsturzl
I've noticed something we need to be aware of, so I'll share it with you.
If this work involves multi-threading, you may need to be careful with namespaces.

@tsturzl
Collaborator Author

tsturzl commented Aug 18, 2021

@utam0k I'm avoiding threads, since I think they'd have more overhead than benefit, so hopefully that won't be an issue, though that is very interesting. I'm currently working through some issues with my implementation using Rio. I am, however, somewhat worried about some of its runtime behaviors and caveats. I've just seen that tokio released their uring implementation, and I'm curious to try that as well to see if it has fewer pitfalls. I believe I can get the Rio implementation working, but one thing that worries me is maintainability. Tokio's uring implementation looks more like the standard Rust file IO API and handles a lot of the pitfalls; however, it is the very first release, and Rio is the more mature project. I think it's still worth investigating both solutions.
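
For comparison, a sketch of the same kind of write using tokio-uring's ownership-passing API (the path and value are illustrative; check the crate docs, since this is its first release):

```rust
use tokio_uring::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    tokio_uring::start(async {
        // Illustrative cgroup path only.
        let file = File::create("/sys/fs/cgroup/memory/mycontainer/memory.limit_in_bytes").await?;

        // The buffer is moved into the operation and handed back with the
        // result, which avoids rio's use-after-free caveat by construction.
        let buf = b"1073741824".to_vec();
        let (res, _buf) = file.write_at(buf, 0).await;
        res?;
        Ok(())
    })
}
```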

@utam0k utam0k removed this from the First release milestone Nov 18, 2021
@yihuaf
Collaborator

yihuaf commented Jul 21, 2023

For future reference, youki should be single-threaded. We use fork to create and set up the container init process, and forking without immediately exec'ing into a new binary in a multi-threaded environment is considered undefined behavior by the libc authors.
