
Use TCO of C compiler to speed up emulation #95

Merged: 1 commit, Dec 20, 2022

Conversation

qwe661234
Collaborator

We need to refactor the function emulate into a recursive version to meet the requirement of tail-call optimization (TCO). To achieve this, I add a variable is_tail to the struct rv_insn_t to help us determine whether the basic block has terminated. As a result, we can use this variable to rewrite the function emulate as a self-recursive function.

Running the CoreMark and Dhrystone benchmarks now produces faster results than before.

Contributor

@jserv jserv left a comment


Run clang-format-12 -i src/*.[ch] to indent.
Please read https://github.com/sysprog21/rv32emu/blob/master/CONTRIBUTING.md carefully.

@jserv jserv changed the title Use TCO to accelerate funciton emulate Use TCO of C compiler to speed up emulation Dec 12, 2022
@jserv
Contributor

jserv commented Dec 12, 2022

We need to refactor the function emulate into a recursive version to meet the requirement of tail-call optimization (TCO). To achieve this, I add a variable is_tail to the struct rv_insn_t to help us determine whether the basic block has terminated. As a result, we can use this variable to rewrite the function emulate as a self-recursive function.

Quote from Wikipedia:

code refactoring is the process of restructuring existing computer code—changing the factoring—without changing its external behavior. Refactoring is intended to improve the design, structure, and/or implementation of the software (its non-functional attributes), while preserving its functionality.

"Refactoring" should not be considered because you do change the dispatching and instruction emulation behavior.

@jserv
Contributor

jserv commented Dec 12, 2022

For clang support (the major compiler on macOS), we can explicitly define MUST_TAIL in src/common.h:

#if defined(__has_attribute) && __has_attribute(musttail)
/* Clang requires a special tail recursion attribute to use tail recursion. */
#define MUST_TAIL __attribute__((musttail))
#else
#define MUST_TAIL
#endif

See https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/port_def.inc

@jserv
Contributor

jserv commented Dec 13, 2022

You should provide performance metrics for both clang and gcc, at least running on Ubuntu Linux.

TODO:

  1. Ask @eecheng87 for eMag (Arm-based Workstation) access.
  2. We are concerned about the performance metrics of Clang versus GCC on x86-64 and AArch64 architectures, respectively.
  3. __attribute__((musttail)) must be enabled in clang-based builds to ensure visible benefits.
  4. Build generic code with -O2 (no more aggressive optimization level is allowed at the moment). We may use __attribute__((optimize("O3"))) for certain functions. Be aware of the negative impact on debugging.

Contributor

@jserv jserv left a comment


Clarify the performance metrics:

-----------------------------------------------------------------------
Test environment 3: Ubuntu Linux 20.04 on ThunderX2
Compiler: gcc 9.4.0
CoreMark test result:
Previous: 260.543173 Iterations/Sec
Now: 286.504547 Iterations/Sec
-----------------------------------------------------------------------
Test environment 4: Ubuntu Linux 20.04 on ThunderX2
Compiler: gcc 9.4.0
CoreMark test result:
Previous: 239.773443 Iterations/Sec
Now: 285.154751 Iterations/Sec

We shall check clang vs. gcc. By the way, ThunderX2 is based on an older microarchitecture; Arm64-specific experiments should be carried out on eMag. Drop the ThunderX2-related items.

@jserv
Contributor

jserv commented Dec 13, 2022

Performance metrics

| Microprocessor | Compiler | CoreMark w/ commit 285a988 | CoreMark w/ PR #95 | Speedup |
|----------------|----------|----------------------------|--------------------|---------|
| Core i7-8700   | gcc-9    | 870.317 iter/s             | 920.675 iter/s     | +5.8%   |
| Core i7-8700   | clang-16 | 805.702 iter/s             | 849.445 iter/s     | +5.4%   |
| eMag 8180      | gcc-11   | 311.436 iter/s             | 313.900 iter/s     | +0.8%   |
| eMag 8180      | clang-16 | 273.265 iter/s             | 297.255 iter/s     | +8.8%   |

The experiments should be amended as follows to make the comparisons more self-explanatory.

  • Unify the gcc version: use the latest stable release, e.g., gcc-12. You can install it via the Ubuntu PPA.
  • Unify the clang version: use the latest stable release, e.g., clang-15. You can install prebuilt packages.
  • Instead of running CoreMark once, execute it numerous times and average the results.

That should also help explain the gcc aarch64 build's lack of TCO. For now, change the state of this pull request to "draft" because it is not as effective as computed-goto in terms of the improvement ratio.

@qwe661234 qwe661234 marked this pull request as draft December 14, 2022 05:08
qwe661234 added a commit to qwe661234/rv32emu that referenced this pull request Dec 14, 2022
We need to modify the function emulate into a recursive version to
meet the requirement of tail-call optimization (TCO). To achieve this,
I add a variable is_tail to the struct rv_insn_t to help us determine
whether the basic block has terminated. As a result, we can use this
variable to rewrite the function emulate into a self-recursive
function.

Running the CoreMark benchmark now produces faster results than
before; the test results are shown below.

| Microprocessor | Compiler | CoreMark w/ commit 285a988 | CoreMark w/ PR sysprog21#95 | Speedup |
|----------------|----------|----------------------------|-----------------------------|---------|
| Core i7-8700   | clang-15 | 811.6384112                | 838.7883352                 | +3.3%   |
| Core i7-8700   | gcc-11   | 848.3487534                | 900.1869588                 | +6.1%   |
| eMag 8180      | clang-15 | 272.723566                 | 295.1729862                 | +8.3%   |
| eMag 8180      | gcc-11   | 308.3846342                | 313.7543564                 | +1.7%   |

Previously, when the function emulate terminated, it returned to the
function block_emulate because the calling sequence was rv_step
-> block_emulate -> emulate -> block_emulate -> emulate -> ... .
So, each time the function emulate was called, a stack frame was
created. However, the calling sequence is now rv_step -> emulate ->
emulate -> ..., so the function emulate can reuse the same stack
frame because of TCO. That is, any instruction in a basic block can
execute the function emulate within the same stack frame, saving the
overhead of creating stack frames.
@jserv
Contributor

jserv commented Dec 19, 2022

We can eliminate the trailing rv->csr_cycle++; return true; in the implementations of the RISC-V instructions that can branch. Consider the changes below:

--- a/Makefile
+++ b/Makefile
@@ -5,6 +5,7 @@ OUT ?= build
 BIN := $(OUT)/rv32emu
 
 CFLAGS = -std=gnu99 -O2 -Wall -Wextra
+CFLAGS += -Wno-unused-label
 CFLAGS += -include src/common.h
 
 # Set the default stack pointer
--- a/src/decode.h
+++ b/src/decode.h
@@ -166,6 +166,13 @@ enum {
 #undef _
 };
 
+/* can-branch information for each RISC-V instruction */
+enum {
+#define _(inst, can_branch) __rv_insn_##inst##_canbranch = can_branch,
+    RISCV_INSN_LIST
+#undef _
+};
+
 /* clang-format off */
 /* instruction decode masks */
 enum {
--- a/src/emulate.c
+++ b/src/emulate.c
@@ -259,7 +259,14 @@ static inline bool insn_is_misaligned(uint32_t pc)
     static bool do_##inst(riscv_t *rv UNUSED, const rv_insn_t *ir UNUSED) \
     {                                                                     \
         rv->X[rv_reg_zero] = 0;                                           \
-        code rv->PC += ir->insn_len;                                      \
+        code;                                                             \
+        if (__rv_insn_##inst##_canbranch) {                               \
+            /* can branch */                                              \
+            rv->csr_cycle++;                                              \
+            return true;                                                  \
+        }                                                                 \
+    nextop:                                                               \
+        rv->PC += ir->insn_len;                                           \
         rv->csr_cycle++;                                                  \
         if (ir->tailcall)                                                 \
             return true;

Then, we can rewrite the BEQ implementation, for example:

 /* BEQ: Branch if Equal */
 RVOP(beq, {
     const uint32_t pc = rv->PC;
-    if (rv->X[ir->rs1] == rv->X[ir->rs2]) {
-        rv->PC += ir->imm;
-        /* check instruction misaligned */
-        if (unlikely(insn_is_misaligned(rv->PC))) {
-            rv->compressed = false;
-            rv_except_insn_misaligned(rv, pc);
-            return false;
-        }
-        /* increment the cycles csr */
-        rv->csr_cycle++;
-        /* can branch */
-        rv->csr_cycle++;
-        return true;
+    if (rv->X[ir->rs1] != rv->X[ir->rs2])
+        goto nextop;
+
+    rv->PC += ir->imm;
+    /* check instruction misaligned */
+    if (unlikely(insn_is_misaligned(rv->PC))) {
+        rv->compressed = false;
+        rv_except_insn_misaligned(rv, pc);
+        return false;
     }
 })
 
 /* BNE: Branch if Not Equal */

Code duplication should be avoided at all times. Each RISC-V instruction ought to treat the statement rv->csr_cycle++ as implicit. If CSR support is completely disabled in some configurations, we may even be able to turn off its generation later.

@jserv
Contributor

jserv commented Dec 19, 2022

According to Ampere's product lines, eMAG 8180 is superior to ThunderX2. The git commit message should be amended. Be case-sensitive.

qwe661234 added a commit to qwe661234/rv32emu that referenced this pull request Dec 20, 2022
We adhere to the wasm3 implementation, which separates all instruction
emulations, and organize them into a function table. After doing
performance analysis, we discovered that the emulator took a long time
to calculate the offset into the function table. We therefore alter
struct rv_insn_t so that we can directly assign the instruction
emulation to the IR by adding the member opfunc.

Running the CoreMark benchmark now produces faster results than
before; the test results are shown below.

| Microprocessor | Compiler | CoreMark w/ commit f2da162 | CoreMark w/ PR sysprog21#95 | Speedup |
|----------------|----------|----------------------------|-----------------------------|---------|
| Core i7-8700   | clang-15 | 836.4849530                | 971.9516670                 | +13.9%  |
| Core i7-8700   | gcc-12   | 888.3423808                | 963.3369450                 | +7.8%   |
| eMAG 8180      | clang-15 | 286.0007652                | 335.396515                  | +20.5%  |
| eMAG 8180      | gcc-12   | 259.6389222                | 332.561175                  | +14.0%  |

Previously, we had to calculate the jumping address using a method such
as switch-case, computed-goto, or function table, but this is no longer
necessary.
@qwe661234 qwe661234 requested a review from jserv December 20, 2022 13:22
Contributor

@jserv jserv left a comment


The git commit messages were outdated. Measure based on the latest code changes and rework the descriptions. In particular, potential concerns about TCO should be addressed.

@jserv jserv marked this pull request as ready for review December 20, 2022 13:59
To meet the requirement of tail-call optimization (TCO), we must
convert the function emulate into a recursive version. To accomplish this, we
add a variable tailcall to the struct rv_insn_t to assist us in
determining whether or not the basic block is terminated. As a result,
we can rewrite function emulate into a self-recursive function using
this variable. However, after performing performance analysis, we
discovered that the emulator required a significant amount of time to
calculate the jumping address. As a result, we stick with the wasm3
implementation, which separates all instruction emulations, and modify
struct rv_insn_t so that we can directly assign instruction emulation to
IR by adding member impl.

Running the CoreMark benchmark now produces faster results than
before; the test results are shown below.

| Microprocessor | Compiler | CoreMark w/ commit f2da162 | CoreMark w/ PR sysprog21#95 | Speedup |
|----------------|----------|----------------------------|-----------------------------|---------|
| Core i7-8700   | clang-15 | 836.4849530                | 971.9516670                 | +13.9%  |
| Core i7-8700   | gcc-12   | 888.3423808                | 963.3369450                 | +7.8%   |
| eMAG 8180      | clang-15 | 286.0007652                | 335.396515                  | +20.5%  |
| eMAG 8180      | gcc-12   | 259.6389222                | 332.561175                  | +14.0%  |

Previously, when the function emulate terminated, it returned to the
function block_emulate because the calling sequence was rv_step
-> block_emulate -> emulate -> block_emulate -> emulate -> ... .
As a result, a stack frame was created each time the function emulate
was called. In addition, the jumping address had to be calculated in
the function emulate using a method such as switch-case or
computed-goto. However, because we can now invoke the instruction
emulation directly and the calling sequence is rv_step -> instruction
emulation -> instruction emulation -> ..., the instruction emulation
functions can share the same stack frame due to TCO. That is, any
instruction in a basic block can run its emulation within the same
stack frame, saving the overhead of creating stack frames.
@qwe661234 qwe661234 requested review from jserv and removed request for Risheng1128 December 20, 2022 15:34
@jserv jserv merged commit 81674be into sysprog21:master Dec 20, 2022
jserv pushed a commit that referenced this pull request Dec 20, 2022
To meet the requirement of tail-call optimization (TCO), we must
convert the function emulate into a recursive version. To accomplish this, we
add a variable tailcall to the struct rv_insn_t to assist us in
determining whether or not the basic block is terminated. As a result,
we can rewrite function emulate into a self-recursive function using
this variable. However, after performing performance analysis, we
discovered that the emulator required a significant amount of time to
calculate the jumping address. As a result, we stick with the wasm3
implementation, which separates all instruction emulations, and modify
struct rv_insn_t so that we can directly assign instruction emulation to
IR by adding member impl.

CoreMark results:

| Model        | Compiler | f2da162 | PR #95  | Speedup |
|--------------+----------+---------+---------+---------|
| Core i7-8700 | clang-15 | 836.484 | 971.951 | +13.9%  |
|--------------+----------+---------+---------+---------|
| Core i7-8700 | gcc-12   | 888.342 | 963.336 |  +7.8%  |
|--------------+----------+---------+---------+---------|
| eMAG 8180    | clang-15 | 286.000 | 335.396 | +20.5%  |
|--------------+----------+---------+---------+---------|
| eMAG 8180    | gcc-12   | 259.638 | 332.561 | +14.0%  |

Previously, when function "emulate" terminated, it returned to
function "block_emulate" because the previous calling sequence was
    rv_step ->
        block_emulate ->
            emulate ->
                block_emulate ->
                    emulate ->
                        ...

As a result, a function stack frame was created each time function
"emulate" was invoked. In addition, the jumping address had to be
calculated using a method such as switch-case or computed-goto in
function "emulate". However, because we can now invoke instruction
emulation directly and the current calling route is
    rv_step ->
        instruction emulation ->
            instruction emulation ->
                ...

The instruction emulation can now use the same function stack frame
due to TCO. That is, any instruction in a basic block can run its
emulation within the same stack frame, saving the overhead of creating
function stack frames.
jserv pushed a commit that referenced this pull request Dec 20, 2022
To meet the requirement of tail-call optimization (TCO), we must
convert the function emulate into a recursive version. To accomplish this, we
add a variable tailcall to the struct rv_insn_t to assist us in
determining whether or not the basic block is terminated. As a result,
we can rewrite function emulate into a self-recursive function using
this variable. However, after performing performance analysis, we
discovered that the emulator required a significant amount of time to
calculate the jumping address. As a result, we stick with the wasm3
implementation, which separates all instruction emulations, and modify
struct rv_insn_t so that we can directly assign instruction emulation to
IR by adding member impl.

CoreMark results:

| Model        | Compiler | f2da162 | TCO     | Speedup |
|--------------+----------+---------+---------+---------|
| Core i7-8700 | clang-15 | 836.484 | 971.951 | +13.9%  |
|--------------+----------+---------+---------+---------|
| Core i7-8700 | gcc-12   | 888.342 | 963.336 |  +7.8%  |
|--------------+----------+---------+---------+---------|
| eMAG 8180    | clang-15 | 286.000 | 335.396 | +20.5%  |
|--------------+----------+---------+---------+---------|
| eMAG 8180    | gcc-12   | 259.638 | 332.561 | +14.0%  |

Previously, when function "emulate" terminated, it returned to
function "block_emulate" because the previous calling sequence was
    rv_step ->
        block_emulate ->
            emulate ->
                block_emulate ->
                    emulate ->
                        ...

As a result, a function stack frame was created each time function
"emulate" was invoked. In addition, the jumping address had to be
calculated using a method such as switch-case or computed-goto in
function "emulate". However, because we can now invoke instruction
emulation directly and the current calling route is
    rv_step ->
        instruction emulation ->
            instruction emulation ->
                ...

The instruction emulation can now use the same function stack frame
due to TCO. That is, any instruction in a basic block can run its
emulation within the same stack frame, saving the overhead of creating
function stack frames.
qwe661234 added a commit to qwe661234/rv32emu that referenced this pull request Dec 22, 2022
In the previous implementation, fencei was treated as a branch
instruction, but its value was missing in the new branch list.
As a result, the emulator fails to pass the Zifencei test.

See: sysprog21#95
qwe661234 added a commit to qwe661234/rv32emu that referenced this pull request Dec 22, 2022
In the previous implementation, fencei was treated as a branch
instruction, but its value was missing in the new branch list.
As a result, the emulator fails to pass the Zifencei test.

See: sysprog21#95
@qwe661234 qwe661234 deleted the accelerate_rv_step branch January 2, 2023 10:57
Risheng1128 added a commit to Risheng1128/rv32emu that referenced this pull request Mar 1, 2023
According to sysprog21#95, computed-goto has been replaced by
tail-call optimization (TCO). Therefore, the option
about computed-goto is unnecessary.
2011eric pushed a commit to 2011eric/rv32emu that referenced this pull request Jul 22, 2023
According to sysprog21#95, computed-goto has been replaced by
tail-call optimization (TCO). Therefore, the option
about computed-goto is unnecessary.