
Use TCO of C compiler to speed up emulation #95

Merged: 1 commit, Dec 20, 2022

Conversation

qwe661234
Collaborator

We need to refactor the function emulate into a recursive version to meet the requirement of tail-call optimization (TCO). To achieve this, I add a variable is_tail to the struct rv_insn_t to help us determine whether the basic block has terminated. As a result, we can use this variable to rewrite the function emulate as a self-recursive function.

Running the CoreMark and Dhrystone benchmarks now produces faster results than before.

Contributor

@jserv jserv left a comment


Run clang-format-12 -i src/*.[ch] to indent.
Please read https://github.com/sysprog21/rv32emu/blob/master/CONTRIBUTING.md carefully.

@jserv jserv changed the title Use TCO to accelerate funciton emulate Use TCO of C compiler to speed up emulation Dec 12, 2022
@jserv
Contributor

jserv commented Dec 12, 2022

We need to refactor the function emulate into a recursive version to meet the requirement of tail-call optimization (TCO). To achieve this, I add a variable is_tail to the struct rv_insn_t to help us determine whether the basic block has terminated. As a result, we can use this variable to rewrite the function emulate as a self-recursive function.

Quote from Wikipedia:

code refactoring is the process of restructuring existing computer code—changing the factoring—without changing its external behavior. Refactoring is intended to improve the design, structure, and/or implementation of the software (its non-functional attributes), while preserving its functionality.

"Refactoring" should not be considered because you do change the dispatching and instruction emulation behavior.

@jserv
Contributor

jserv commented Dec 12, 2022

For clang support (the major compiler on macOS), we can explicitly define MUST_TAIL in src/common.h:

#if defined(__has_attribute) && __has_attribute(musttail)
/* Clang requires a special tail recursion attribute to use tail recursion. */
#define MUST_TAIL __attribute__((musttail))
#else
#define MUST_TAIL
#endif

See https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/port_def.inc

@jserv
Contributor

jserv commented Dec 13, 2022

You should provide performance metrics for both clang and gcc, at least running on Ubuntu Linux.

TODO:

  1. Ask @eecheng87 for eMag (Arm-based Workstation) access.
  2. We are concerned about the performance metrics of Clang versus GCC on x86-64 and AArch64 architectures, respectively.
  3. __attribute__((musttail)) must be enabled in clang-based builds to ensure visible benefits.
  4. Build generic code with -O2 (no more aggressive optimization level is allowed at the moment). We may use __attribute__((optimize("O3"))) for certain functions. Be aware of the negative impact on debugging.

Contributor

@jserv jserv left a comment


Clarify the performance metrics:

-----------------------------------------------------------------------
Test environment 3: Ubuntu Linux 20.04 on ThunderX2
Compiler: gcc 9.4.0
CoreMark test result:
Previous: 260.543173 Iterations/Sec
Now: 286.504547 Iterations/Sec
-----------------------------------------------------------------------
Test environment 4: Ubuntu Linux 20.04 on ThunderX2
Compiler: gcc 9.4.0
CoreMark test result:
Previous: 239.773443 Iterations/Sec
Now: 285.154751 Iterations/Sec

We shall check clang vs. gcc. By the way, ThunderX2 is based on an older microarchitecture; Arm64-specific experiments should be carried out on eMag. Drop the ThunderX2-related items.

@jserv
Contributor

jserv commented Dec 13, 2022

Performance metrics

| Microprocessor | Compiler | CoreMark w/ commit 285a988 | CoreMark w/ PR #95 | Speedup |
|----------------|----------|----------------------------|--------------------|---------|
| Core i7-8700   | gcc-9    | 870.317 iter/s             | 920.675 iter/s     | +5.8%   |
| Core i7-8700   | clang-16 | 805.702 iter/s             | 849.445 iter/s     | +5.4%   |
| eMag 8180      | gcc-11   | 311.436 iter/s             | 313.900 iter/s     | +0.8%   |
| eMag 8180      | clang-16 | 273.265 iter/s             | 297.255 iter/s     | +8.8%   |

The experiments should be amended as follows to make the comparisons more self-explanatory.

  • Unify the gcc version: use the latest stable release, e.g., gcc-12. You can install it via the Ubuntu PPA.
  • Unify the clang version: use the latest stable release, e.g., clang-15. You can install prebuilt packages.
  • Instead of running CoreMark once, execute it numerous times and average the results.

That should also help explain the gcc aarch64 build's lack of TCO. For now, change the state of this pull request to "draft" because it is not as effective as computed-goto in terms of the improvement ratio.

@qwe661234 qwe661234 marked this pull request as draft December 14, 2022 05:08
qwe661234 added a commit to qwe661234/rv32emu that referenced this pull request Dec 14, 2022
We need to modify the function emulate into a recursive version to
meet the requirement of tail-call optimization (TCO). To achieve this,
I add a variable is_tail to the struct rv_insn_t to help us determine
whether the basic block has terminated. As a result, we can use this
variable to rewrite the function emulate into a self-recursive
function.

Running the CoreMark benchmark now produces faster results than
before; the test results are shown below.

| Microprocessor | Compiler | CoreMark w/ commit 285a988 | CoreMark w/ PR sysprog21#95 | Speedup |
|----------------|----------|----------------------------|-----------------------------|---------|
| Core i7-8700   | clang-15 | 811.6384112                | 838.7883352                 | +3.3%   |
| Core i7-8700   | gcc-11   | 848.3487534                | 900.1869588                 | +6.1%   |
| eMag 8180      | clang-15 | 272.723566                 | 295.1729862                 | +8.3%   |
| eMag 8180      | gcc-11   | 308.3846342                | 313.7543564                 | +1.7%   |

Previously, when the function emulate terminated, it returned to the
function block_emulate because the calling sequence was rv_step
-> block_emulate -> emulate -> block_emulate -> emulate -> ... .
So, each time the function emulate was called, a stack frame was
created. However, the calling sequence is now rv_step -> emulate ->
emulate -> ..., so the function emulate can reuse the same stack
frame because of TCO. That is, any instruction in a basic block can
execute the function emulate within the same stack frame, saving the
overhead of creating stack frames.
@jserv
Contributor

jserv commented Dec 19, 2022

We can eliminate the trailing rv->csr_cycle++; return true; in the implementations of the RISC-V instructions that can branch. Consider the changes below:

--- a/Makefile
+++ b/Makefile
@@ -5,6 +5,7 @@ OUT ?= build
 BIN := $(OUT)/rv32emu
 
 CFLAGS = -std=gnu99 -O2 -Wall -Wextra
+CFLAGS += -Wno-unused-label
 CFLAGS += -include src/common.h
 
 # Set the default stack pointer
--- a/src/decode.h
+++ b/src/decode.h
@@ -166,6 +166,13 @@ enum {
 #undef _
 };
 
+/* can-branch information for each RISC-V instruction */
+enum {
+#define _(inst, can_branch) __rv_insn_##inst##_canbranch = can_branch,
+    RISCV_INSN_LIST
+#undef _
+};
+
 /* clang-format off */
 /* instruction decode masks */
 enum {
--- a/src/emulate.c
+++ b/src/emulate.c
@@ -259,7 +259,14 @@ static inline bool insn_is_misaligned(uint32_t pc)
     static bool do_##inst(riscv_t *rv UNUSED, const rv_insn_t *ir UNUSED) \
     {                                                                     \
         rv->X[rv_reg_zero] = 0;                                           \
-        code rv->PC += ir->insn_len;                                      \
+        code;                                                             \
+        if (__rv_insn_##inst##_canbranch) {                               \
+            /* can branch */                                              \
+            rv->csr_cycle++;                                              \
+            return true;                                                  \
+        }                                                                 \
+    nextop:                                                               \
+        rv->PC += ir->insn_len;                                           \
         rv->csr_cycle++;                                                  \
         if (ir->tailcall)                                                 \
             return true;

Then, we can rewrite the BEQ implementation, for example:

 /* BEQ: Branch if Equal */
 RVOP(beq, {
     const uint32_t pc = rv->PC;
-    if (rv->X[ir->rs1] == rv->X[ir->rs2]) {
-        rv->PC += ir->imm;
-        /* check instruction misaligned */
-        if (unlikely(insn_is_misaligned(rv->PC))) {
-            rv->compressed = false;
-            rv_except_insn_misaligned(rv, pc);
-            return false;
-        }
-        /* increment the cycles csr */
-        rv->csr_cycle++;
-        /* can branch */
-        rv->csr_cycle++;
-        return true;
+    if (rv->X[ir->rs1] != rv->X[ir->rs2])
+        goto nextop;
+
+    rv->PC += ir->imm;
+    /* check instruction misaligned */
+    if (unlikely(insn_is_misaligned(rv->PC))) {
+        rv->compressed = false;
+        rv_except_insn_misaligned(rv, pc);
+        return false;
     }
 })
 
 /* BNE: Branch if Not Equal */

Code duplication should be avoided at all times. Each RISC-V instruction ought to treat the statement rv->csr_cycle++ as implicit. If CSR support is completely disabled in some configurations, we may even be able to turn off its generation later.

@jserv
Contributor

jserv commented Dec 19, 2022

According to Ampere's product lines, eMAG 8180 is superior to ThunderX2. The git commit message should be amended. Be case-sensitive.

qwe661234 added a commit to qwe661234/rv32emu that referenced this pull request Dec 20, 2022
We adhere to the wasm3 implementation, which separates all instruction
emulations, and organize them into a function table. After doing
performance analysis, we discovered that the emulator took a long time
to calculate the offset into the function table. We therefore alter
struct rv_insn_t so that we can directly assign the instruction
emulation to the IR by adding the member opfunc.

Running the CoreMark benchmark now produces faster results than
before; the test results are shown below.

| Microprocessor | Compiler | CoreMark w/ commit f2da162 | CoreMark w/ PR sysprog21#95 | Speedup |
|----------------|----------|----------------------------|-----------------------------|---------|
| Core i7-8700   | clang-15 | 836.4849530                | 971.9516670                 | +13.9%  |
| Core i7-8700   | gcc-12   | 888.3423808                | 963.3369450                 | +7.8%   |
| eMAG 8180      | clang-15 | 286.0007652                | 335.396515                  | +20.5%  |
| eMAG 8180      | gcc-12   | 259.6389222                | 332.561175                  | +14.0%  |

Previously, we had to calculate the jumping address using a method such
as switch-case, computed-goto, or function table, but this is no longer
necessary.
@qwe661234 qwe661234 requested a review from jserv December 20, 2022 13:22
Contributor

@jserv jserv left a comment


The git commit messages were outdated. Measure based on the latest code changes and rework the descriptions. In particular, potential concerns about TCO should be addressed.

@jserv jserv marked this pull request as ready for review December 20, 2022 13:59
To meet the requirement of tail-call optimization (TCO), we must
convert the function emulate into a recursive version. To accomplish this, we
add a variable tailcall to the struct rv_insn_t to assist us in
determining whether or not the basic block is terminated. As a result,
we can rewrite function emulate into a self-recursive function using
this variable. However, after performing performance analysis, we
discovered that the emulator required a significant amount of time to
calculate the jumping address. As a result, we stick with the wasm3
implementation, which separates all instruction emulations, and modify
struct rv_insn_t so that we can directly assign instruction emulation to
IR by adding member impl.

Running the CoreMark benchmark now produces faster results than
before; the test results are shown below.

| Microprocessor | Compiler | CoreMark w/ commit f2da162 | CoreMark w/ PR sysprog21#95 | Speedup |
|----------------|----------|----------------------------|-----------------------------|---------|
| Core i7-8700   | clang-15 | 836.4849530                | 971.9516670                 | +13.9%  |
| Core i7-8700   | gcc-12   | 888.3423808                | 963.3369450                 | +7.8%   |
| eMAG 8180      | clang-15 | 286.0007652                | 335.396515                  | +20.5%  |
| eMAG 8180      | gcc-12   | 259.6389222                | 332.561175                  | +14.0%  |

Previously, when the function emulate terminated, it returned to the
function block_emulate because the calling sequence was rv_step
-> block_emulate -> emulate -> block_emulate -> emulate -> ... .
As a result, a stack frame was created each time the function emulate
was called. In addition, the jumping address had to be calculated in
the function emulate using a method such as switch-case or
computed-goto. However, because we can now invoke the instruction
emulation directly and the calling sequence is rv_step -> instruction
emulation -> instruction emulation -> ..., the instruction emulation
functions can share the same stack frame due to TCO. That is, any
instruction in a basic block can run its emulation within the same
stack frame, saving the overhead of creating stack frames.
@qwe661234 qwe661234 requested review from jserv and removed request for Risheng1128 December 20, 2022 15:34
@jserv jserv merged commit 81674be into sysprog21:master Dec 20, 2022
jserv pushed a commit that referenced this pull request Dec 20, 2022
To meet the requirement of tail-call optimization (TCO), we must
convert the function emulate into a recursive version. To accomplish this, we
add a variable tailcall to the struct rv_insn_t to assist us in
determining whether or not the basic block is terminated. As a result,
we can rewrite function emulate into a self-recursive function using
this variable. However, after performing performance analysis, we
discovered that the emulator required a significant amount of time to
calculate the jumping address. As a result, we stick with the wasm3
implementation, which separates all instruction emulations, and modify
struct rv_insn_t so that we can directly assign instruction emulation to
IR by adding member impl.

CoreMark results:

| Model        | Compiler | f2da162 | PR #95  | Speedup |
|--------------+----------+---------+---------+---------|
| Core i7-8700 | clang-15 | 836.484 | 971.951 | +13.9%  |
|--------------+----------+---------+---------+---------|
| Core i7-8700 | gcc-12   | 888.342 | 963.336 |  +7.8%  |
|--------------+----------+---------+---------+---------|
| eMAG 8180    | clang-15 | 286.000 | 335.396 | +20.5%  |
|--------------+----------+---------+---------+---------|
| eMAG 8180    | gcc-12   | 259.638 | 332.561 | +14.0%  |

Previously, when function "emulate" terminated, it returned to
function "block_emulate" because the previous calling sequence was
    rv_step ->
        block_emulate ->
            emulate ->
                block_emulate ->
                    emulate ->
                        ...

As a result, a function stack frame was created each time function
"emulate" was invoked. In addition, the jumping address had to be
calculated using a method such as switch-case or computed-goto in
function "emulate". However, because we can now invoke instruction
emulation directly and the current calling route is
    rv_step ->
        instruction emulation ->
            instruction emulation ->
                ...

The instruction emulation can now use the same function stack frame
due to TCO. That is, any instruction in a basic block can run its
emulation within the same stack frame, saving the overhead of creating
function stack frames.
jserv pushed a commit that referenced this pull request Dec 20, 2022
To meet the requirement of tail-call optimization (TCO), we must
convert the function emulate into a recursive version. To accomplish this, we
add a variable tailcall to the struct rv_insn_t to assist us in
determining whether or not the basic block is terminated. As a result,
we can rewrite function emulate into a self-recursive function using
this variable. However, after performing performance analysis, we
discovered that the emulator required a significant amount of time to
calculate the jumping address. As a result, we stick with the wasm3
implementation, which separates all instruction emulations, and modify
struct rv_insn_t so that we can directly assign instruction emulation to
IR by adding member impl.

CoreMark results:

| Model        | Compiler | f2da162 | TCO     | Speedup |
|--------------+----------+---------+---------+---------|
| Core i7-8700 | clang-15 | 836.484 | 971.951 | +13.9%  |
|--------------+----------+---------+---------+---------|
| Core i7-8700 | gcc-12   | 888.342 | 963.336 |  +7.8%  |
|--------------+----------+---------+---------+---------|
| eMAG 8180    | clang-15 | 286.000 | 335.396 | +20.5%  |
|--------------+----------+---------+---------+---------|
| eMAG 8180    | gcc-12   | 259.638 | 332.561 | +14.0%  |

Previously, when function "emulate" terminated, it returned to
function "block_emulate" because the previous calling sequence was
    rv_step ->
        block_emulate ->
            emulate ->
                block_emulate ->
                    emulate ->
                        ...

As a result, a function stack frame was created each time function
"emulate" was invoked. In addition, the jumping address had to be
calculated using a method such as switch-case or computed-goto in
function "emulate". However, because we can now invoke instruction
emulation directly and the current calling route is
    rv_step ->
        instruction emulation ->
            instruction emulation ->
                ...

The instruction emulation can now use the same function stack frame
due to TCO. That is, any instruction in a basic block can run its
emulation within the same stack frame, saving the overhead of creating
function stack frames.
qwe661234 added a commit to qwe661234/rv32emu that referenced this pull request Dec 22, 2022
In the previous implementation, fencei was treated as a branch
instruction, but its value was missing in the new branch list.
As a result, the emulator fails to pass the Zifencei test.

See: sysprog21#95
qwe661234 added a commit to qwe661234/rv32emu that referenced this pull request Dec 22, 2022
In the previous implementation, fencei was treated as a branch
instruction, but its value was missing in the new branch list.
As a result, the emulator fails to pass the Zifencei test.

See: sysprog21#95
@qwe661234 qwe661234 deleted the accelerate_rv_step branch January 2, 2023 10:57
Risheng1128 added a commit to Risheng1128/rv32emu that referenced this pull request Mar 1, 2023
According to sysprog21#95, computed-goto has been replaced by
tail-call optimization (TCO). Therefore, the option
about computed-goto is unnecessary.
2011eric pushed a commit to 2011eric/rv32emu that referenced this pull request Jul 22, 2023
According to sysprog21#95, computed-goto has been replaced by
tail-call optimization (TCO). Therefore, the option
about computed-goto is unnecessary.