This work was made for studying processes of binary translation and JIT compilation.
Our binary translator is a tool to convert binary code for my stack virtual cpu. For creating programs for our cpu you can use our C-like language.
Install and compile:
$ git clone https://github.com/XelerT/binary-translator.git
$ make
To compile with
$ make run_debug
Run:
$ ./jit bit_code.txt
Run tests:
$ make test
We will tell you about some information with was used for creating commands auto encoding.
Command in x86-64 can consists of 5 parts (first line is size in bytes):
------------------------------------------------------------
| 0 - 4 | 1 - 2 | 1 | 1 | 0 - 4 |
| Prefix | Opcode | M.R.R/M | S.I.B | Data |
-------------------------------------------------------------
-
Prefix part is used to encode 64 bit information, such as need to use 8 byte registers, need to use r8-r15 in source or destination.
-
In opcode command is encoded: (xxxxxx|ds). First 6 bits of this byte encode command, s-bit is 1 if we use 64-bit mode of command, d-bit contains information about source and destination:
register -> memory => d = 0 memory -> register => d = 1
-
Mode-Register-Register/Memory (M.R.R/M): (xx|xxx|xxx). First 2 bits show which arguments has command, 11 - if 2 operands are registers, 00 - if destination is memory, 10 - if destination is register and source is immediate number.
-
Scale-Index-Base (S.I.B): (xx|xxx|xxx). First 2 bits indicates the scaling factor of index, next 3 bits is the index register to use, last three is the base register to use.
-
Data field can contain immediate number or address.
For technical details see documentation.
During parsing process array of tokens is created. Each token contains information about each command: command encode for our cpu, (repeating x86 encode rules) mode, need of using r8-r15 registers etc. After this process simple IR is created.
After parsing, array of tokens is translated in x86 commands. Each token is going through some layers of check, folding stack constructions into non-stack logic, for example:
push 5
push rax => add rax, 5
add push rax
───────────────────────────────────────
push 5
pop rbx => mov rbx, 5
We obliged to save stack logic in order to avoid destruction of program logic therefor after math functions we push register into stack.
More encoding details.
Jumps' and call's addresses are fulfilled using 2 step passage.
Our cpu have 2 stacks, the first is for working with numbers and the second contains returning addresses. For solving this problem we emulate our second stack. Register r15 is used as second rsp, at the beginning of buffer
mov qword [r15], (return address)
sub r15, 8
jmp (function)
Before return we need to "pop" return address:
add r15, 8
push qword [r15]
ret
For more details see documentation
After creation of execution buffer we change rights for "text section" and "data section" for execution, read, write using mprotect. Then we cast execution buffer pointer to function pointer and call this "function".