Synthesis with -nowidelut gives drastically better results #4798

t-wallet · 2024-12-05T11:36:49Z

Version

0.45+139

On which OS did this happen?

Linux

Reproduction Steps

While synthesizing a SHA3 design for the Colorlight 5A-75B board (Lattice ECP5 FPGA), I noticed that LUT usage was far higher than when I synthesized the same design with Vivado for a Xilinx board I own. It turns out that synthesizing with -nowidelut drastically reduces resource usage and significantly improves timing as well.

The problem was in the theta step of the algorithm:

module sha3theta (
  input  wire[4:0][4:0][63:0] i_state,
  output wire[4:0][4:0][63:0] o_state
);

  wire[4:0][63:0] sum_sheet;

  genvar i;
  generate
  for (i = 0; i < 5; i++) begin
    assign sum_sheet[i] =
      i_state[0][i] ^
      i_state[1][i] ^
      i_state[2][i] ^
      i_state[3][i] ^
      i_state[4][i];
  end
  endgenerate

  genvar row, col;
  generate
  for (row = 0; row < 5; row++) begin
    for (col = 0; col < 5; col++) begin
      assign o_state[row][col] =
        i_state[row][col] ^
        sum_sheet[(col - 1) % 5] ^
        {sum_sheet[(col + 1) % 5][62:0], sum_sheet[(col + 1) % 5][63:63]};
    end
  end
  endgenerate

endmodule
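
For cross-checking the module in simulation, here is a minimal software sketch of the same theta step. It assumes the same `state[row][col]` layout as `i_state`; the names `theta` and `rotl1` are made up for this sketch and are not part of the design. One caveat: Python's `%` returns a non-negative result for `(col - 1) % 5` at `col = 0`, whereas Verilog's `%` takes the sign of its left operand, so the two need not agree at that index.

```python
MASK64 = (1 << 64) - 1


def rotl1(lane):
    """Rotate a 64-bit lane left by one bit, like {x[62:0], x[63]}."""
    return ((lane << 1) | (lane >> 63)) & MASK64


def theta(state):
    """Apply the theta step to a 5x5 grid of 64-bit lanes, state[row][col]."""
    # Column parity: XOR of the five lanes in each column (sum_sheet in the module).
    sum_sheet = [0] * 5
    for col in range(5):
        for row in range(5):
            sum_sheet[col] ^= state[row][col]

    # Each lane is XORed with the parity of the column to its left and the
    # rotated parity of the column to its right (indices wrap modulo 5).
    return [
        [
            state[row][col]
            ^ sum_sheet[(col - 1) % 5]
            ^ rotl1(sum_sheet[(col + 1) % 5])
            for col in range(5)
        ]
        for row in range(5)
    ]
```

Feeding the same input lanes to this function and to a simulation of `sha3theta` and comparing the outputs lane by lane should show whether the hardware matches the intended wrap-around behaviour.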

Synthesizing this module with -nowidelut gives the following resource usage:

Number of cells: 1972
     LUT4 1972

And without the flag:

Number of cells: 9512
     L6MUX21 1188
     LUT4 5656
     PFUMX 2668
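
To compare the two runs side by side, a small helper can build both Yosys invocations. This is only a sketch: the file name `sha3theta.sv` and the helper name `yosys_command` are assumptions, and the exact command line used for the numbers above is not given in this report.

```python
import shlex


def yosys_command(source="sha3theta.sv", top="sha3theta", nowidelut=False):
    """Build a yosys invocation that synthesizes for ECP5 and prints cell stats.

    The source file name is hypothetical; adjust it to the actual design file.
    """
    synth = f"synth_ecp5 -top {top}"
    if nowidelut:
        synth += " -nowidelut"
    script = f"read_verilog -sv {source}; {synth}; stat"
    return ["yosys", "-p", script]


# Print both command lines so they can be run and their stat output compared.
for flag in (False, True):
    print(shlex.join(yosys_command(nowidelut=flag)))
```

Each returned list can be passed to `subprocess.run` directly if Yosys is on the PATH.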

Expected Behavior

I would expect the wide muxes available on the ECP5 to improve at least the timing.

Actual Behavior

Synthesizing the full design with -nowidelut improved timing: Fmax was 90 MHz with the flag versus 60 MHz without it. I verified that this module was in the critical path.

Ravenslofty · 2024-12-09T17:32:18Z

A situation like this occurs when ABC9 has a mismatch between predicted and actual delay.

Without -nowidelut, ABC9 predicts 1.225 ns of delay, while with -nowidelut ABC9 predicts 1.457 ns of delay.
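
To make the mismatch concrete, the sketch below puts the ABC9 predictions next to the Fmax figures from the original report. It is rough arithmetic only: the predictions are for this module, the Fmax values are for the full design, so only the direction of the change is comparable. The variable names are illustrative.

```python
# ABC9 predicted delay for the module, in nanoseconds (from this comment).
predicted_ns = {"wide": 1.225, "nowidelut": 1.457}

# Fmax of the full design after place and route, in MHz (from the report).
measured_mhz = {"wide": 60.0, "nowidelut": 90.0}


def period_ns(fmax_mhz):
    """Convert a maximum clock frequency in MHz to a clock period in ns."""
    return 1000.0 / fmax_mhz


# Ratio > 1 means -nowidelut is slower, ratio < 1 means it is faster.
predicted_ratio = predicted_ns["nowidelut"] / predicted_ns["wide"]
measured_ratio = period_ns(measured_mhz["nowidelut"]) / period_ns(measured_mhz["wide"])

print(f"predicted: -nowidelut delay is {predicted_ratio:.2f}x the wide-LUT delay")
print(f"measured:  -nowidelut period is {measured_ratio:.2f}x the wide-LUT period")
```

The prediction says -nowidelut should be roughly 19% slower, while the measured clock period is about a third shorter, which is the kind of disagreement described above.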

For fun, I set the ABC9 delay target to 10 ns, and got a solution using 1920 LUT4s.

Out of curiosity, what's your testing methodology here? Did you just run the design once under nextpnr?

t-wallet added the pending-verification label (this issue is pending verification and/or reproduction) on Dec 5, 2024
