Context
The TLB (Translation Lookaside Buffer) is a very important piece of the emulator: it caches recent virtual address translations so that fetch, load, and store instructions can translate addresses very quickly. Making an efficient TLB is an area of research of its own, and there are many possible optimizations. In the past we have experimented with many possible TLB redesigns to speed up virtual address translation, and we could continue this effort to improve emulator performance. The paper Optimizing Memory Translation Emulation in Full System Emulators is a good introduction to the topic and also provides many algorithms we could experiment with; we have already tried a few of them.
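To ground the discussion, here is a minimal sketch of how a direct-mapped software TLB hot path typically looks in an interpreter. This is not the emulator's actual code; all names (`tlb_entry`, `TLB_SIZE`, `translate_slow`, ...) are hypothetical, and the slow path is stubbed out:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical constants and names, for illustration only.
constexpr uint64_t PAGE_SHIFT = 12;                          // 4 KiB pages
constexpr uint64_t PAGE_MASK = (UINT64_C(1) << PAGE_SHIFT) - 1;
constexpr size_t TLB_SIZE = 256;                             // power of two, direct-mapped

struct tlb_entry {
    uint64_t vpn = 0;   // virtual page number tagged in this slot
    uint64_t ppn = 0;   // cached physical page number
    bool valid = false;
};

static tlb_entry tlb[TLB_SIZE];

// Stand-in for the real page-table walk (slow path).
static uint64_t translate_slow(uint64_t vaddr) {
    return vaddr;  // identity mapping, just so the sketch is self-contained
}

// Hot path: one index and one compare on a hit; refill the slot on a miss.
static uint64_t translate(uint64_t vaddr) {
    const uint64_t vpn = vaddr >> PAGE_SHIFT;
    tlb_entry &e = tlb[vpn & (TLB_SIZE - 1)];
    if (e.valid && e.vpn == vpn) {                           // fast hit check
        return (e.ppn << PAGE_SHIFT) | (vaddr & PAGE_MASK);
    }
    const uint64_t paddr = translate_slow(vaddr);            // page-table walk
    e = {vpn, paddr >> PAGE_SHIFT, true};                    // cache the result
    return paddr;
}
```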
Below I will share my (admittedly biased) opinions after experimenting with different TLB redesigns, along with possible future directions we could follow:
On making use of ASID in the TLB
My conclusion: Handling ASIDs yields no meaningful performance gains, despite increasing the hit ratio slightly. The kernel does not perform enough context switches for this optimization to shine, even when spawning hundreds of processes. Since the context switch frequency is quite low, the hit ratio is impacted only slightly by context switches. In summary, the implementation complexity doesn't pay off given the negligible performance gains. This optimization could be revisited in the future if the kernel is reconfigured to be more preemptive with a higher context switch frequency, if lots of interrupts start to happen for some reason, or if there is demand for apps spawning hundreds of processes.
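For reference, the ASID-tagged design extends the sketch above roughly as follows (hypothetical names again): each entry is tagged with the address-space ID, so a context switch only updates `current_asid` instead of flushing the whole TLB, at the price of one extra compare on every hit:

```cpp
#include <cstdint>

// Hypothetical sketch of an ASID-tagged entry; not the emulator's actual layout.
struct tlb_entry_asid {
    uint64_t vpn = 0;
    uint64_t ppn = 0;
    uint16_t asid = 0;   // address-space ID, e.g. from RISC-V satp.ASID
    bool valid = false;
};

// Updated on satp writes; stale entries from other address spaces simply
// stop matching instead of requiring a flush.
static uint16_t current_asid = 0;

static bool tlb_hit(const tlb_entry_asid &e, uint64_t vpn) {
    // The extra `asid` compare is the whole per-hit cost of ASID support.
    return e.valid && e.vpn == vpn && e.asid == current_asid;
}
```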
On making a TLB per privilege mode
My conclusion: The number of privilege mode switches is not high enough, so this optimization yields negligible gains; most CPU-intensive applications do not trigger privilege mode switches that often.
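The idea, sketched below under the same assumptions (hypothetical names, direct-mapped design), is simply to keep one TLB array per privilege mode, so that entries translated under different modes never collide and a mode switch never forces a flush:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical sketch: one direct-mapped TLB per privilege mode.
enum priv_mode : unsigned { PRV_U = 0, PRV_S = 1, PRV_M = 2, PRV_COUNT = 3 };

struct tlb_slot {
    uint64_t vpn = 0;
    uint64_t ppn = 0;
    bool valid = false;
};

constexpr size_t TLB_WAYS = 256;
static tlb_slot mode_tlb[PRV_COUNT][TLB_WAYS];

static tlb_slot &lookup_slot(priv_mode mode, uint64_t vpn) {
    // Selecting the per-mode array is one extra index on every lookup; the
    // payoff only materializes if privilege switches are frequent, which
    // they were not in the workloads measured.
    return mode_tlb[mode][vpn & (TLB_WAYS - 1)];
}
```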
On making use of megapages in the TLB
My conclusion: The common apps and benchmarks that I have do not need or use megapages, and the extra cost of handling megapages on every hit check degrades performance for all of them. However, with new benchmarks that force the use of megapages by allocating large amounts of memory and accessing it randomly, performance improves considerably (up to 2x speedup) for some benchmarks. Bear in mind that to make the kernel use megapages, it had to be recompiled with "Transparent Huge Page Support", transparent huge pages had to be explicitly enabled in userspace applications, and custom bootargs were needed. This means our kernel does not use megapages in userspace out of the box; they must be explicitly enabled, and enabling them globally has drawbacks (more memory resources are used). In summary, while megapages can improve performance for some apps that make good use of large memory address spaces, they degrade performance for apps that use small memory address spaces. Unless there is real demand for large address spaces to the point that degrading performance for small-address-space apps is acceptable, I don't see the need to enable this optimization as of now.
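The per-hit overhead comes from the tag check no longer being a fixed-shift compare: each entry has to carry the mask for its own page size (4 KiB, 2 MiB, or 1 GiB under Sv39), so every lookup pays an extra load and AND. A hedged sketch of what the hit check becomes, with hypothetical names:

```cpp
#include <cstdint>

// Hypothetical sketch of a megapage-aware entry. Instead of a single fixed
// PAGE_MASK, each entry stores the mask for its own page size, which every
// hit check must load and apply.
struct tlb_entry_mp {
    uint64_t vbase = 0;      // virtual base, aligned to the entry's page size
    uint64_t pbase = 0;      // physical base with the same alignment
    uint64_t size_mask = 0;  // ~(page_size - 1); 4 KiB, 2 MiB or 1 GiB on Sv39
    bool valid = false;
};

static bool megapage_hit(const tlb_entry_mp &e, uint64_t vaddr) {
    // Per-entry mask instead of a compile-time constant: this is the extra
    // cost paid on every single lookup, hit or miss.
    return e.valid && ((vaddr & e.size_mask) == e.vbase);
}

static uint64_t megapage_translate(const tlb_entry_mp &e, uint64_t vaddr) {
    return e.pbase | (vaddr & ~e.size_mask);  // keep the in-page offset
}
```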
On implementing a victim TLB
My conclusion: A victim TLB improves the hit ratio with little cost in the hit check, and its implementation is quite simple, but there is still a small cost in the instruction fetch code path that makes it not worthwhile right now, even though the victim lookup is coded only after the miss branch. This optimization could be revisited in the future after an instruction fetch cache optimization. Update: We have since introduced a simple instruction fetch page cache, so this could be revisited.
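For context, a victim TLB is a small, usually fully associative buffer that receives entries evicted from the primary TLB and is probed only on a primary miss, so the hit path stays untouched; the residual cost mentioned above is the extra branch and refill logic in the fetch path. A rough sketch, again with hypothetical names:

```cpp
#include <cstdint>
#include <cstddef>
#include <utility>

// Hypothetical sketch of a victim TLB probed only after a primary miss.
struct vtlb_entry {
    uint64_t vpn = 0;
    uint64_t ppn = 0;
    bool valid = false;
};

constexpr size_t VICTIM_SIZE = 8;       // small and fully associative
static vtlb_entry victim[VICTIM_SIZE];

// Called on a primary-TLB miss, before falling back to the page-table walk.
// On a victim hit, the entry is swapped back into the primary slot, demoting
// whatever was there; the hit path above never sees any of this.
static bool victim_lookup(uint64_t vpn, vtlb_entry &primary_slot) {
    for (size_t i = 0; i < VICTIM_SIZE; ++i) {
        if (victim[i].valid && victim[i].vpn == vpn) {
            std::swap(victim[i], primary_slot);  // promote / demote
            return true;
        }
    }
    return false;  // genuine miss: walk the page tables and refill
}
```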
Any other optimizations are yet to be researched
These conclusions were drawn before the interpreter loop was optimized and before we introduced the instruction fetch page cache, so the gains deemed negligible above may be less negligible by now, and they should become even less so after we improve the instruction decoder (issue #48).
Possible solutions
We should create a set of benchmarks representing real workloads, revisit each TLB redesign, and re-check whether it pays off. Any future experimentation should wait until we optimize the instruction decoder (issue #48), because only then will the TLB start to consume a greater share of CPU usage, giving any optimization here a better opportunity to shine.
In the hypervisor context, we should experiment only after issues #62 and #60 are done.