
Revisit new TLB optimizations #61

Open
edubart opened this issue May 29, 2023 · 0 comments

edubart commented May 29, 2023

Context

The TLB (Translation Lookaside Buffer) is a critical piece of the emulator: it caches recent virtual address translations so that fetch, load, and store instructions can be very fast. Making an efficient TLB is an area of research in its own right, and there are many possible optimizations. In the past we have experimented with many possible TLB redesigns to speed up virtual address translation, and we could continue this effort to improve emulator performance. The paper Optimizing Memory Translation Emulation in Full System Emulators is a good introduction to the topic and also provides many algorithms we could experiment with, a few of which we have tried already.
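
For reference, the fast path of a translation boils down to one index and one tag compare. Below is a minimal sketch of a direct-mapped TLB in C++; the names and sizes (`TLB_SIZE`, `tlb_entry`, `host_offset`) are illustrative assumptions, not the emulator's actual layout:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical direct-mapped TLB; names and sizes are illustrative.
constexpr size_t TLB_SIZE = 256;    // must be a power of two
constexpr uint64_t PAGE_SHIFT = 12; // 4 KiB pages
constexpr uint64_t PAGE_MASK = ~((UINT64_C(1) << PAGE_SHIFT) - 1);

struct tlb_entry {
    uint64_t vaddr_page;   // tag: virtual page address (invalid entries must
                           // hold a tag that can never match a real page)
    uintptr_t host_offset; // host_offset + vaddr = host pointer into the page
};

static tlb_entry tlb[TLB_SIZE];

// Fast path of a fetch/load/store: one index plus one compare on a hit;
// on a miss the caller falls back to the slow page-table walk.
static inline unsigned char *tlb_translate(uint64_t vaddr) {
    const uint64_t vpage = vaddr & PAGE_MASK;
    const tlb_entry &e = tlb[(vaddr >> PAGE_SHIFT) & (TLB_SIZE - 1)];
    if (e.vaddr_page == vpage) { // hit
        return reinterpret_cast<unsigned char *>(e.host_offset + vaddr);
    }
    return nullptr; // miss
}
```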

Here I will share my admittedly biased opinions after experimenting with different TLB redesigns, and possible future directions we could follow:

  • On making use of ASID in the TLB
    My conclusion: Handling ASID yields no meaningful performance gains, despite increasing the hit ratio slightly. The kernel does not perform enough context switches for this optimization to shine, even when spawning hundreds of processes; the context switch frequency is low, so context switches impact the hit ratio only slightly. In summary, the implementation complexity does not pay off its negligible performance gains (see the ASID sketch after this list). This optimization could be revisited in the future if the kernel is reconfigured to be more preemptive with a higher context switch frequency, if lots of interrupts start to happen for some reason, or if there is more demand for apps spawning hundreds of processes.
  • On making a TLB per privilege mode
    My conclusion: Privilege mode switches do not happen often enough, so this optimization yields negligible gains; most CPU intensive applications rarely trigger privilege mode switches.
  • On making use of megapages in the TLB
    My conclusion: The common apps and benchmarks that I have do not need or use megapages, and the extra cost of handling megapages in every hit check downgrades performance for all of them (see the megapage sketch after this list). However, with new benchmarks that force the use of megapages, by allocating large amounts of memory and accessing it randomly, performance improves considerably (up to a 2x speedup) for some benchmarks. Bear in mind that to make the kernel use megapages I had to recompile it with "Transparent Huge Page Support", explicitly enable transparent huge pages in userspace applications, and use custom bootargs. This means our kernel does not use megapages in userspace out of the box; it must be explicitly enabled, and enabling it globally has drawbacks (more memory resources are used). In summary, while megapages can improve performance for some apps that make good use of large memory address spaces, they will downgrade performance for apps that use small memory address spaces. Unless there is real demand for large memory address spaces, to the point that downgrading performance for apps using small address spaces makes sense, I don't see the need to enable this optimization as of now.
  • On implementing a victim buffer in the TLB
    My conclusion: A victim buffer improves the hit ratio with little cost in the hit check, and its implementation is quite simple (see the victim sketch after this list), but there is still a small cost in the fetch instruction code path that makes it not worth it right now, even though it is checked only after the miss branch. This optimization could be revisited in the future after a fetch instruction cache optimization. Update: We have introduced a simple fetch instruction page cache, so this could be revisited.
  • Any other optimization is yet to be researched
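
The ASID experiment amounts to tagging each entry with the current address space identifier, so a context switch only changes the current ASID instead of flushing the whole TLB. A sketch, reusing the hypothetical entry layout from the first sketch (field names are assumptions):

```cpp
#include <cstdint>

// Hypothetical ASID-tagged entry; names are illustrative.
struct tlb_entry_asid {
    uint64_t vaddr_page;   // tag: virtual page address
    uint16_t asid;         // address space identifier (e.g. from satp.ASID)
    uintptr_t host_offset;
};

// The hit check now compares two fields instead of one; the payoff is
// that a context switch only updates current_asid, with no TLB flush.
static inline bool tlb_hit(const tlb_entry_asid &e, uint64_t vpage,
                           uint16_t current_asid) {
    return e.vaddr_page == vpage && e.asid == current_asid;
}
```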
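Supporting megapages means each entry must carry its own page size, so every hit check pays for an extra mask load and AND even when no megapages are in use. A sketch under the same hypothetical layout:

```cpp
#include <cstdint>

// Hypothetical variable-page-size entry; names are illustrative.
struct tlb_entry_mp {
    uint64_t vaddr_page;   // tag, already masked with page_mask
    uint64_t page_mask;    // e.g. ~0xfffULL for 4 KiB, ~0x1fffffULL for 2 MiB
    uintptr_t host_offset;
};

// The per-entry mask replaces a global PAGE_MASK constant; loading and
// applying it on every check is the cost that hurts 4 KiB-only workloads.
static inline bool tlb_hit_mp(const tlb_entry_mp &e, uint64_t vaddr) {
    return (vaddr & e.page_mask) == e.vaddr_page;
}
```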
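A victim TLB is a small fully associative buffer that receives entries evicted from the main TLB and is searched only after a main-TLB miss, so the common hit path stays untouched. A sketch, assuming the tlb_entry and PAGE_MASK definitions from the first sketch above:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical victim buffer; size is illustrative.
constexpr size_t VICTIM_SIZE = 8;
static tlb_entry victim[VICTIM_SIZE];

// Searched linearly, and only on a main-TLB miss; entries evicted from
// the main TLB are parked here before being lost for good.
static unsigned char *victim_lookup(uint64_t vaddr) {
    const uint64_t vpage = vaddr & PAGE_MASK;
    for (size_t i = 0; i < VICTIM_SIZE; ++i) {
        if (victim[i].vaddr_page == vpage) {
            return reinterpret_cast<unsigned char *>(victim[i].host_offset + vaddr);
        }
    }
    return nullptr; // true miss: fall back to the page-table walk
}
```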

These conclusions were drawn before the interpreter loop was optimized and before we introduced a fetch instruction page cache, so the gains judged negligible above may be less negligible by now, and they will be even less negligible after we improve the instruction decoder (issue #48).

Possible solutions

We should create a set of benchmarks representing real workloads, revisit each TLB redesign, and re-check whether it pays off. Any future experimentation should happen after we optimize the instruction decoder (issue #48), because only then will the TLB start to consume a greater share of CPU usage, giving any optimization here more opportunity to shine.

In the hypervisor context, we should experiment only after issues #62 and #60 are done.

@edubart edubart added the optimization Optimization label May 29, 2023
@edubart edubart self-assigned this May 29, 2023
@edubart edubart moved this to Todo in Machine Emulator SDK May 30, 2023