SRAM: repipeline the TLRAM into a 3 cycle RMW state machine #2582
Conversation
Here is a picture of the change to the pipeline: https://app.lucidchart.com/invitations/accept/da44b89c-a93c-45e9-ba5e-cb6f9140d84e

Compared to the old pipeline, occupancy is increased from 2 cycles to 3 for:
- atomics
- sub-ECC-granularity writes
- repaired ECC values

In exchange for this occupancy increase, a new register (REG) was added:

sram data output => *REG* => ecc-correction => ALU => sram write setup

This path was sufficiently long that it limited fMAX on many designs. In designs without ECC and without atomics, this pipeline is optimized away.

Compared to the old pipeline, response latency is unchanged (by default) for:
- reads (1)
- atomics (1)
- writes (1)
- ECC-repaired reads (2)
- ECC-repaired atomics (2)

Added a knob (sramReg) to set latency for all operations to 2.

With this knob disabled (the default), as in the old pipeline:
- output data can flow uncorrected from the SRAM
- output valid depends on correct ECC decode of SRAM output

With the knob enabled:
- data flows from an ECC correction fed by registers
- valid is a register
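The sramReg trade-off described above might look roughly like this in the output stage (a hedged sketch only; apart from `sramReg`, the register and signal names here are illustrative, not the actual rocket-chip implementation):

```scala
// Sketch of the sramReg knob, assuming a SyncReadMem `mem`, an ECC `code`,
// and a Scala Boolean parameter `sramReg` resolved at elaboration time.
val d_raw   = mem.read(addr, ren)      // SRAM data output
val d_reg   = RegEnable(d_raw, ren)    // the new REG in the 3-cycle RMW path
val decoded = code.decode(if (sramReg) d_reg else d_raw)

// sramReg = false (default): data flows uncorrected from the SRAM, and
// valid depends on a correct ECC decode of the SRAM output (the fast path).
// sramReg = true: data comes from a correction fed by registers, valid is
// itself a register, at the cost of one extra cycle of response latency.
val d_data = if (sramReg) decoded.corrected else d_raw
```

Because `sramReg` is an elaboration-time parameter rather than a hardware signal, the unused path is simply never generated.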
There were some follow-up changes I wanted to make that put down a combination of Xbar + TLRAM, but I think keeping those separate is probably a good idea.
@@ -23,6 +24,7 @@ class TLRAM(
    atomics: Boolean = false,
    beatBytes: Int = 4,
    ecc: ECCParams = ECCParams(),
    sramReg: Boolean = false, // drive SRAM data output directly into a register => 1 cycle longer response
`pipeline` instead of `sramReg`?
I'm not sure that name is much better. What is actually happening is that I remove the 'ecc-ok' fast path.
BTW, it has been suggested to me offline that perhaps in the pipelined case we should increase ECC repair latency to 3. Thus, we could take the output data uncorrected from registers. Of course, the valid signal would still need to include a suppression for the correction-required case. Do you think this change is worth it?
I think the trade-off is between latency in valid vs. data. The current design has a flop for valid and correction on the data. The proposal has flops for data and detection on valid. This is the same trade-off we made for the old and new 'fast path'. If it was better there, why is it not also better here?
d_raw_data := mem.read(addr, ren)
when (wen) { mem.write(addr, coded, sel) }
val index = Cat(mask.zip((addr >> log2Ceil(beatBytes)).asBools).filter(_._1).map(_._2).reverse)
r_raw_data := mem.read(index, ren) holdUnless RegNext(ren)
Not that it matters much, but a while back I defined `mem.readAndHold(index, ren)` to summarize this pattern.
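For reference, the `readAndHold` helper mentioned here can be expressed as an extension method built on `holdUnless` (a sketch of the pattern; the actual definition in rocket-chip's util package may differ in detail):

```scala
// Sketch: read when `en` is asserted; otherwise hold the last value read.
// `holdUnless` retains the previous value whenever its condition is false;
// the condition is RegNext(en) because SyncReadMem data appears one cycle
// after the read enable.
implicit class SeqMemHelper[T <: Data](mem: SyncReadMem[T]) {
  def readAndHold(addr: UInt, en: Bool): T =
    mem.read(addr, en) holdUnless RegNext(en)
}
```

With this in scope, `r_raw_data := mem.read(index, ren) holdUnless RegNext(ren)` collapses to `r_raw_data := mem.readAndHold(index, ren)`.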
* When stage D needs to perform a write (AMO, sub-ECC write, or ECC correction):
*   - there is a WaW or WaR hazard vs. the operation in stage R
*   - for sub-ECC writes and atomics, we ensure stage R has a bubble
*   - for ECC correction, we cause stage R to be replayed (and reject stage A twice)
What's the rationale for not using replay for both cases? If using a bubble instead of a replay would improve performance, I'd understand. But since there's a structural hazard from the D stage on the following cycle anyway, is the bubble actually better than the replay?
You're right that the performance in terms of latency and throughput would be the same. However, a replay requires reading the SRAM twice. I assumed that sub-ECC-granule writes might be common, so paying 2x power for them was unwise.
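The bubble-versus-replay choice discussed in this thread can be summarized with a small behavioral sketch (hypothetical signal names; this is not the PR's actual code):

```scala
// Hypothetical sketch: when stage D must write back (AMO, sub-ECC write, or
// ECC repair), it contends with stage R for the single SRAM port.
val d_writeback = d_amo || d_sub_ecc_write || d_ecc_repair

// Sub-ECC writes and atomics get a bubble in stage R: the operation is held
// off rather than re-issued, so the SRAM is read only once. These cases are
// assumed common, so avoiding a second (2x-power) read matters.
val r_bubble = d_amo || d_sub_ecc_write

// ECC repair replays stage R (re-reading the SRAM) and rejects stage A twice.
// Repairs are assumed rare, so the extra read's power cost is acceptable.
val r_replay = d_ecc_repair
```

Latency and throughput are identical either way; the split is purely a power optimization for the common case.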
Type of change: other enhancement
Impact: API addition (no impact on existing code)
Development Phase: implementation
Release Notes
Improved cycle time for designs involving TLRAM.