State of the art branch predictors perform significantly worse than they should #1341

useredsa · 2024-07-10T16:52:06Z


Discussion Topic

gem5 supposedly offers state-of-the-art branch predictor implementations such as TAGE-SC-L and the multiperspective perceptron. However, several of these implementations lack proper support for speculative execution and history unwinding in some or all of their components. For example, the basic TAGE class has speculative execution, but the statistical corrector (-SC) does not. The problem is that if a branch predictor predicts the outcome of a branch based only on committed history, not only does the predictor have less information, but the number of instructions between the predicted branch and the last committed instruction is also variable (it depends on the state of execution). Thus, there is not a single context (input data or state) under which the predictor has to learn the outcome of a branch, but a myriad of them. The result is a huge MPKI and a performance degradation bigger than if these (supposedly better) versions were not used. The impact is so big that the TAGE-SC-L predictor, the one usually used in articles (because it is supposedly better than the rest), performs similarly to a basic tournament predictor.
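
To make the "myriad of contexts" point concrete, here is a small self-contained Python sketch (not gem5 code). It assumes a toy loop branch with the repeating pattern taken, taken, taken, not-taken, and a hypothetical helper `ambiguous_histories` that records which 3-bit history patterns are followed by both outcomes. A lag of 0 stands in for a speculatively updated history; a lag that varies between 0 and 2 stands in for a committed-only history with a varying number of in-flight branches. The pattern, the history length and the lag values are illustrative assumptions.

```python
def ambiguous_histories(outcomes, hist_len, lags):
    """Return the history patterns that are followed by BOTH outcomes.

    outcomes: list of 0/1 branch outcomes in program order.
    hist_len: number of history bits the toy predictor looks at.
    lags:     cycle of lag values; a lag of k means the k most recent
              outcomes are not yet visible when branch i is predicted.
    """
    seen = {}
    for i in range(hist_len + max(lags), len(outcomes)):
        lag = lags[i % len(lags)]
        end = i - lag  # history stops `lag` outcomes before branch i
        hist = tuple(outcomes[end - hist_len:end])
        seen.setdefault(hist, set()).add(outcomes[i])
    return {h for h, outs in seen.items() if len(outs) > 1}


# Toy loop branch: taken three times, then not taken (assumed pattern).
pattern = [1, 1, 1, 0] * 50

# Up-to-date (speculative) history: lag is always 0.
speculative = ambiguous_histories(pattern, hist_len=3, lags=[0])

# Committed-only history: lag varies with the state of execution.
committed = ambiguous_histories(pattern, hist_len=3, lags=[0, 1, 2])

print("ambiguous patterns, speculative history:", sorted(speculative))
print("ambiguous patterns, committed history:  ", sorted(committed))
```

With the up-to-date history, every pattern maps to exactly one outcome, so a table indexed by history can learn this branch. With the lagging history, patterns such as (0, 1, 1) are followed by both outcomes, so no table indexed by that history alone can always be right.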

We would like to discuss how bad the issue is, what its impact is, and how to proceed.

Results

gem5 can be configured with the 7 branch predictors included in the simulator. They are the following (sorted by expected performance):

1. Local
2. BiMode
3. Tournament
4. L-TAGE
5. Multiperspective Perceptron
6. Multiperspective Perceptron TAGE
7. TAGE-SC-L

Predictors 1-3 are very basic. They are of course interesting examples, and were very relevant in their time, but they are not close to today's state-of-the-art predictors. Predictor 4 is the first of a series of modern predictors which use multiple history lengths and other techniques to achieve good performance. Predictor 7 is the one considered to be the most performant, the winner of the last Championship Branch Prediction. It is based on L-TAGE but adds a statistical correction unit that can override the prediction of the main component.

The previous figure shows the MPKI (of committed branches) and execution times of the 7 branch predictors that come with gem5, and of a fixed version that I'll discuss later. As one can see, L-TAGE performs much better than predictors 1-3: it has roughly 35 % less MPKI. However, predictors 5-7 (including TAGE-SC-L) behave worse than L-TAGE. In fact, the multiperspective perceptron performs even worse than the most basic branch predictor. The reason for this is the lack of speculative execution.

We have adapted a TAGE-SC-L implementation with speculative execution from the Scarab simulator. The results for this implementation are shown below. Both implementations mimic the TAGE-SC-L from the Championship Branch Prediction 5 (which is non-speculative). However, this one supports unwinding the history updates to the statistical corrector. This version has 56 % less MPKI and is 7 % faster than the TAGE-SC-L implementation included with gem5. Compared to the L-TAGE included in gem5, it has 13.5 % less MPKI and is more than 1 % faster.
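
For readers who want to reproduce this kind of comparison from their own stats, here is a minimal Python sketch of the arithmetic. MPKI is mispredictions per thousand committed instructions, and "X % less MPKI" is a relative reduction against a baseline. The function names and the counts in the example are made up for illustration; they are not measurements from this study.

```python
def mpki(mispredictions, committed_instructions):
    """Mispredictions per kilo (1000) committed instructions."""
    return 1000.0 * mispredictions / committed_instructions


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical counts, chosen only to illustrate the calculation.
baseline_mpki = mpki(mispredictions=5_000, committed_instructions=1_000_000)
fixed_mpki = mpki(mispredictions=2_200, committed_instructions=1_000_000)

reduction = relative_reduction(baseline_mpki, fixed_mpki)
print(f"baseline MPKI = {baseline_mpki:.2f}")
print(f"fixed MPKI    = {fixed_mpki:.2f}")
print(f"reduction     = {100 * reduction:.1f} %")
```

Both runs must count mispredictions over the same committed instruction stream for the comparison to be meaningful.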

Impact and Actions

gem5 is used for research, both to try out ideas and to share them in the form of articles. Typical gem5 users define a high-performance core, implement the changes they need to test their idea (possibly not even in the branch predictor), and compare the results to a previous version. Therefore, most users have used and will use a TAGE-SC-L branch predictor. However, they are in reality using a broken version which behaves like a tournament predictor. We have seen this in already published articles and we believe it will continue to happen. The solution is to fix the implementations, but this will probably take time.

As an action, I suggest removing the broken branch predictors 5-7 from the simulator until they are fixed. Otherwise, they will continue to be used. People should use L-TAGE if a better version is not available.

I would also like to provide our adaptation of Scarab's implementation, hosted in this repository, so that people can use a TAGE-SC-L implementation right away, at least until the included versions are fixed. The maintainers of gem5 are free to incorporate it into the repository or adapt it to their needs.

pranith · 2024-07-16T05:50:37Z


Thanks for the review of the existing implementations in gem5, @useredsa. The maintainers of gem5 usually only review the code submitted to the project for inclusion. Students and researchers (users like yourself) who want to improve the project file bugs found during their research and submit patches to gem5 to fix them. This part is usually a long, drawn-out process for large changes and requires significant dedication on the part of the submitter. Would you please file the issues identified as bugs and submit your fixes as patches that can then be reviewed and merged?

0 replies

useredsa · 2024-07-16T07:11:42Z

Author

Hello, @pranith,

Actually, I am not asking for a code review. There were two reasons to open a discussion instead of a merge request. The first was to discuss the steps to follow beforehand. The second was to discuss which quick actions are available. Just as you mentioned, the merge request process is usually long and drawn out. For that reason, we think it is reasonable to remove the broken implementations rather than keep them, because they are harming articles that use the current version. Then, at some point, whatever works can be added. But maybe there are other actions that can be taken, or arguments against this. We can discuss them here.

Best regards,

0 replies

giactra · 2024-07-16T13:18:15Z

Maintainer

Tagging @powerjg and @andysan here.
Thanks @useredsa for reporting this major issue. We had a gem5 developer meeting last Thursday. Unfortunately, I only saw this discussion afterwards, so we didn't have a chance to properly comment on it...

IMHO we could eventually consider disabling TAGE-SC-L. However, since your report suggests the issue is confined to the statistical corrector, and since we should have some time before the next release, I propose finding resources to fix this before 24.1 (and making it a priority). I wonder whether full speculative support in the SC is required to see a speedup in the FDIP PR attempt.

7 replies

mattsinc Jul 16, 2024
Maintainer

I am teaching our graduate-level microarchitecture course in the fall and students always need project suggestions. @andysan's suggestions of more/different regression testing as well as @useredsa's suggestion about fixes both seem like things that students in the class could do ... if we need resources. Happy to discuss further or see how we could enable that, if you all don't think you have the bandwidth.

Of course if we feel this requires experts to fix, then students in my course are not the right choice. Just a thought ...

OdnetninI Jul 16, 2024

I like your solution @andysan
Marking modules and configurations as "broken, experimental, ..." seems like a neat solution to make users understand what they are running.
I am guessing that a SimObject attribute that can be updated at runtime could be the solution. Then a simple flag when running the simulator (or another SimObject attribute such as ignore_broken/ignore_experimental) would be enough.
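
A rough sketch of that idea in plain Python, just to make the proposal concrete. This is not gem5's SimObject API: `Status`, `ComponentStatusError` and `check_component` are hypothetical names, and the `ignore_broken` / `ignore_experimental` flags only mirror the names suggested above.

```python
from enum import Enum


class Status(Enum):
    STABLE = "stable"
    EXPERIMENTAL = "experimental"
    BROKEN = "broken"


class ComponentStatusError(RuntimeError):
    """Raised when a non-stable component is used without opting in."""


def check_component(name, status, ignore_broken=False, ignore_experimental=False):
    """Return None for stable components, a warning string for components
    the user explicitly opted into, and raise for everything else."""
    if status is Status.STABLE:
        return None
    allowed = (status is Status.BROKEN and ignore_broken) or (
        status is Status.EXPERIMENTAL and ignore_experimental
    )
    if not allowed:
        raise ComponentStatusError(
            f"{name} is marked {status.value}; "
            f"set ignore_{status.value}=True to use it anyway"
        )
    return f"warning: {name} is marked {status.value}"


# Example: a predictor tagged as broken needs an explicit opt-in.
print(check_component("TAGE_SC_L_64KB", Status.BROKEN, ignore_broken=True))
```

Failing by default and requiring an opt-in means a user cannot pick a component marked as broken without noticing, which is the behaviour the suggestion is after.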

@mattsinc, I am not sure about this TAGE_SC_L fix, but other parts of gem5 are not complex; they just require "resources".

mattsinc Jul 16, 2024
Maintainer

Writing functional and correctness regression testing indeed requires resources, but also requires someone to understand the behavior well enough to write said tests -- which I believe makes this a reasonable contribution for a class project. But again, not a hard requirement -- happy for others to do this if they feel it's too important or too hard.

aperais Aug 27, 2024

Just a tidbit here, related to unwinding the predictor state on mispredictions. There are different API calls for flushing because of a branch misprediction and because of a memory dependency order violation, and if I recall correctly, the latter does not go through the exact same paths in the TAGE code (tage.hh/cc and tage_base.hh/cc). This means that even if the history were correctly restored and recomputed on control mispredictions, it would still be stale after memory order violations. Unfortunately I don't have time to dig deeper into stable or develop right now, but I wanted to mention it as this is another aspect that causes branch predictor implementations to be incorrect.
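
As a sketch of what "going through the same path" could look like, here is a toy speculative global history in Python. It is not the gem5 TAGE code: the class and method names are hypothetical, and the history is a plain shift register with one checkpoint per in-flight branch. The point is only that both kinds of squash share the same unwinding routine, so neither leaves stale speculative bits behind.

```python
class SpeculativeHistory:
    """Toy global history with per-branch checkpoints (illustrative only)."""

    def __init__(self, bits=16):
        self.mask = (1 << bits) - 1
        self.value = 0
        # (seq_num, history value before the branch was inserted), oldest first
        self.inflight = []

    def _push(self, taken):
        self.value = ((self.value << 1) | int(taken)) & self.mask

    def predict(self, seq_num, predicted_taken):
        """Speculatively insert the predicted direction of a branch."""
        self.inflight.append((seq_num, self.value))
        self._push(predicted_taken)

    def _unwind_younger_than(self, seq_num):
        """Shared path: undo every branch younger than seq_num."""
        while self.inflight and self.inflight[-1][0] > seq_num:
            _, before = self.inflight.pop()
            self.value = before

    def squash_mispredict(self, seq_num, actual_taken):
        """Branch seq_num was mispredicted: unwind, then fix its own bit."""
        self._unwind_younger_than(seq_num)
        if not self.inflight or self.inflight[-1][0] != seq_num:
            raise ValueError(f"branch {seq_num} is not in flight")
        self.value = self.inflight[-1][1]
        self._push(actual_taken)

    def squash_memory_order(self, seq_num):
        """Non-branch squash: everything younger than seq_num is refetched."""
        self._unwind_younger_than(seq_num)

    def commit(self, seq_num):
        """Drop checkpoints of branches up to and including seq_num."""
        self.inflight = [(sn, h) for sn, h in self.inflight if sn > seq_num]


h = SpeculativeHistory()
h.predict(10, True)
h.predict(20, False)
h.predict(30, True)
h.squash_memory_order(15)   # branches 20 and 30 are undone
print(bin(h.value))
```

If `squash_memory_order` skipped `_unwind_younger_than`, the bits of branches 20 and 30 would stay in the history after the refetch, which is the kind of stale state described above.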

pranith Oct 11, 2024

* In our study to pinpoint this issue, we also found that the BTB is a major bottleneck. The default BTB is direct-mapped; even with 4k entries, it is a big problem.

This is now fixed with the merging of PR #1537.
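
To illustrate why a direct-mapped BTB can hurt even with 4k entries, here is a toy model in Python. It is not gem5's BTB: the index function (`pc >> 2` modulo the number of sets), the full-PC tags and the LRU replacement are assumptions made for the sketch, and the two branch addresses are chosen so that they map to the same set.

```python
from collections import OrderedDict


class ToyBTB:
    """Set-associative BTB model with LRU replacement (ways=1 is direct-mapped)."""

    def __init__(self, num_entries, ways):
        self.ways = ways
        self.num_sets = num_entries // ways
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, pc, target):
        """Return True on a hit; on a miss, insert the entry."""
        entries = self.sets[(pc >> 2) % self.num_sets]
        if pc in entries:
            entries.move_to_end(pc)      # mark as most recently used
            entries[pc] = target
            return True
        if len(entries) >= self.ways:
            entries.popitem(last=False)  # evict the least recently used entry
        entries[pc] = target
        return False


def count_misses(btb, trace):
    return sum(not btb.access(pc, target) for pc, target in trace)


# Two branches that alias in a 4096-entry table, executed alternately.
branch_a = (0x1000, 0x2000)
branch_b = (0x1000 + 4096 * 4, 0x3000)
trace = [branch_a, branch_b] * 100

direct_mapped_misses = count_misses(ToyBTB(4096, ways=1), trace)
two_way_misses = count_misses(ToyBTB(4096, ways=2), trace)
print("direct-mapped misses:", direct_mapped_misses)
print("2-way misses:        ", two_way_misses)
```

In this constructed trace the two branches keep evicting each other in the direct-mapped table, while two ways are enough to hold both after the first pass. Real workloads alias less regularly, so the gap will differ.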

dhschall · 2024-11-11T18:13:01Z


Question: has someone tried the TAGE-SC-L version running with the Atomic core and compared it against L-TAGE? The speculative update shouldn't matter there, and TAGE-SC-L should perform better.
I have a student who tested it, and he observed a significant performance difference (TAGE-SC-L is worse). Is that in line with your results, or do we have a bug somewhere?
If TAGE-SC-L is indeed worse than L-TAGE on the atomic core, it's not only the speculative update that is buggy but also the actual implementation.

3 replies

TIANYU-Li-HFUT Nov 18, 2024

Firstly, I haven't actually tested Atomic, but in gem5, LTAGE has a storage budget of nearly 32KB, so it naturally performs better than the 8KB TAGE_SC_L.

yongjiehuang Nov 19, 2024

To TIANYU-Li-HFUT:
Thanks for the idea. But actually, what we used is L-TAGE with 64KB TAGE and TAGE_SC_L_64KB, instead of 32KB and 8KB.

TIANYU-Li-HFUT Nov 19, 2024

This is an interesting result. If possible, could you share some detailed parameter configurations and experimental results? (I am also debugging the branch predictor in gem5.) Additionally, the information I found indicates that AtomicCPU is a non-pipelined CPU. Is branch prediction meaningless for a single-cycle CPU?
