SRAM: repipeline the TLRAM into a 3 cycle RMW state machine #2582
Conversation
Here is a picture of the change to the pipeline: https://app.lucidchart.com/invitations/accept/da44b89c-a93c-45e9-ba5e-cb6f9140d84e

Compared to the old pipeline, occupancy is increased from 2 cycles to 3 for:
- atomics
- sub-ECC-granularity writes
- repaired ECC values

In exchange for this occupancy increase, a new register (REG) was added:

sram data output => *REG* => ecc-correction => ALU => sram write setup

This path was sufficiently long that it limited fMAX on many designs. In designs without ECC and without atomics, this pipeline is optimized away.

Compared to the old pipeline, response latency is unchanged (by default) for:
- reads (1)
- atomics (1)
- writes (1)
- ECC-repaired reads (2)
- ECC-repaired atomics (2)

Added a knob (sramReg) to set latency for all operations to 2.

With this knob disabled (the default), as in the old pipeline:
- output data can flow uncorrected from the SRAM
- output valid depends on correct ECC decode of SRAM output

With the knob enabled:
- data flows from an ECC correction fed by registers
- valid is a register
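The sramReg trade-off described above might look roughly like this in the output stage (a hedged sketch only; apart from `sramReg`, the register and signal names here are illustrative, not the actual rocket-chip implementation):

```scala
// Sketch of the sramReg knob, assuming a SyncReadMem `mem`, an ECC `code`,
// and a Scala Boolean parameter `sramReg` resolved at elaboration time.
val d_raw   = mem.read(addr, ren)      // SRAM data output
val d_reg   = RegEnable(d_raw, ren)    // the new REG in the 3-cycle RMW path
val decoded = code.decode(if (sramReg) d_reg else d_raw)

// sramReg = false (default): data flows uncorrected from the SRAM, and
// valid depends on a correct ECC decode of the SRAM output (the fast path).
// sramReg = true: data comes from a correction fed by registers, valid is
// itself a register, at the cost of one extra cycle of response latency.
val d_data = if (sramReg) decoded.corrected else d_raw
```

Because `sramReg` is an elaboration-time parameter rather than a hardware signal, the unused path is simply never generated.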
There were some follow-up changes I wanted to make that put down a combination of Xbar + TLRAM, but I think keeping those separate is probably a good idea.
@@ -23,6 +24,7 @@ class TLRAM(
    atomics: Boolean = false,
    beatBytes: Int = 4,
    ecc: ECCParams = ECCParams(),
    sramReg: Boolean = false, // drive SRAM data output directly into a register => 1 cycle longer response
`pipeline` instead of `sramReg`?
I'm not sure that name is much better. What is actually happening is that I remove the 'ecc-ok' fast path.
BTW, it has been suggested to me offline that perhaps in the pipelined case we should increase ECC repair latency to 3. Thus, we could take the output data uncorrected from registers. Of course, the valid signal would still need to include a suppression for the correction-required case. Do you think this change is worth it?
I think the trade-off is between latency in valid vs. data. The current design has a flop for valid and correction on the data. The proposal has flops for data and detection on valid. This is the same trade-off we made for the old and new 'fast path'. If it was better there, why is it not also better here?
d_raw_data := mem.read(addr, ren)
when (wen) { mem.write(addr, coded, sel) }
val index = Cat(mask.zip((addr >> log2Ceil(beatBytes)).asBools).filter(_._1).map(_._2).reverse)
r_raw_data := mem.read(index, ren) holdUnless RegNext(ren)
Not that it matters much, but a while back I defined `mem.readAndHold(index, ren)` to summarize this pattern.
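For reference, the `readAndHold` helper mentioned here can be expressed as an extension method built on `holdUnless` (a sketch of the pattern; the actual definition in rocket-chip's util package may differ in detail):

```scala
// Sketch: read when `en` is asserted; otherwise hold the last value read.
// `holdUnless` retains the previous value whenever its condition is false;
// the condition is RegNext(en) because SyncReadMem data appears one cycle
// after the read enable.
implicit class SeqMemHelper[T <: Data](mem: SyncReadMem[T]) {
  def readAndHold(addr: UInt, en: Bool): T =
    mem.read(addr, en) holdUnless RegNext(en)
}
```

With this in scope, `r_raw_data := mem.read(index, ren) holdUnless RegNext(ren)` collapses to `r_raw_data := mem.readAndHold(index, ren)`.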
* When stage D needs to perform a write (AMO, sub-ECC write, or ECC correction):
*   - there is a WaW or WaR hazard vs. the operation in stage R
*   - for sub-ECC writes and atomics, we ensure stage R has a bubble
*   - for ECC correction, we cause stage R to be replayed (and reject stage A twice)
What's the rationale for not using replay for both cases? If using a bubble instead of a replay would improve performance, I'd understand. But since there's a structural hazard from the D stage on the following cycle anyway, is the bubble actually better than the replay?
You're right that the performance in terms of latency and throughput would be the same. However, a replay requires reading the SRAM twice. I assumed that sub-ECC-granule writes might be common, so paying 2x power for them was unwise.
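The bubble-versus-replay choice discussed in this thread can be summarized with a small behavioral sketch (hypothetical signal names; this is not the PR's actual code):

```scala
// Hypothetical sketch: when stage D must write back (AMO, sub-ECC write, or
// ECC repair), it contends with stage R for the single SRAM port.
val d_writeback = d_amo || d_sub_ecc_write || d_ecc_repair

// Sub-ECC writes and atomics get a bubble in stage R: the operation is held
// off rather than re-issued, so the SRAM is read only once. These cases are
// assumed common, so avoiding a second (2x-power) read matters.
val r_bubble = d_amo || d_sub_ecc_write

// ECC repair replays stage R (re-reading the SRAM) and rejects stage A twice.
// Repairs are assumed rare, so the extra read's power cost is acceptable.
val r_replay = d_ecc_repair
```

Latency and throughput are identical either way; the split is purely a power optimization for the common case.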
Type of change: other enhancement
Impact: API addition (no impact on existing code)
Development Phase: implementation
Release Notes
Improved cycle time for designs involving TLRAM.