Dynamic Binary Translation

Xiangyan Sun edited this page Sep 11, 2015 · 6 revisions

Foreign Linux comes with a fast dynamic binary translator. It translates Linux binaries on the fly to overcome the difficulties noted in difference between Linux and Windows. Its code is located at "dbt/" in the source tree.

Basic translation loop

The pseudo code is:

for (;;) {
    if (!(translated_pc = find_translation(pc)))
        translated_pc = translate_basic_block(pc);
    execute(translated_pc);
}

The translation process works in units of basic blocks. A basic block is a straight-line sequence of instructions with a branch instruction (jmp, call, etc.) at the tail, but none inside. A translated basic block can be executed directly on the CPU. When the final branch instruction is executed, we need to determine whether the branch target has already been translated. If so, we can simply jump there; otherwise, we have to translate the target basic block first. A call to the dispatch trampoline is appended to every translated basic block, so the outer for loop in the pseudo code above is not needed in the real implementation.

The main basic block translation procedure is located in the function dbt_translate().

The basic translation loop is easy to implement, but without any optimizations it runs extremely slowly. The most time-consuming part of execution is branching, so we mainly focus on optimizing branch instructions.

Branch instructions fall into two categories: direct branches and indirect branches. Indirect branches can be further divided into general indirect branches and return instructions.

Direct branch

If a branch target can be determined at translation time, it is a direct branch. For example, a call to a fixed function is a direct branch. Since there is only one possible target, it is easy to optimize.

When we encounter a direct branch, we first check whether its target is already translated. If so, we can directly encode the translated branch target. Otherwise, we build a trampoline (see dbt_get_direct_trampoline) which looks like:

push $patch_addr
push $target_pc
jmp dbt_find_direct_internal

where patch_addr is a pointer to the branch target immediate in the original direct branch instruction. dbt_find_direct_internal is an assembly routine that saves and restores the program context and then calls dbt_find_next, which translates the missing basic block and patches the translated pc into patch_addr. Therefore, when the branch instruction is reached a second time, it jumps directly to the correct target with no overhead.

Indirect branch

Indirect branch instructions look like call *eax or jmp [edx+ecx*4]. Since the branch target cannot be determined at translation time, it must be dispatched at run time. The efficiency of this dispatch directly affects the performance of a DBT system.

Currently the SIEVE approach is used to handle indirect branches efficiently. This approach uses a hash table pointing to small delegate code fragments called sieves. It works like a normal hash table, but stores code instead of data. When an indirect branch is executed, the destination address is hashed, and control jumps to the first sieve in the hash bucket.

push <dest_pc>
jmp sieve_dispatch_trampoline
...
sieve_dispatch_trampoline:
push ecx
movzx ecx, word ptr [esp+4] ; hash
jmp [sieve_table + ecx*4]

Inside a sieve, we check whether the destination address matches and, if so, jump to the translated target. The jecxz instruction is used instead of cmp/jcc to avoid polluting EFLAGS.

mov ecx, [esp+4]
lea ecx, [ecx - $original_pc]
jecxz match
jmp $next_sieve

match:
pop ecx
lea esp, [esp+4]
jmp $translated_pc

The next_sieve target points to the next sieve in the same hash bucket. The last sieve in the bucket has this target set to sieve_fallback, which falls back to C code.

Return handling

The ret instruction is a special form of indirect branch that is highly predictable. Some DBT systems use the "fast return" approach to minimize its overhead: when a call is executed, the translated return address is pushed onto the stack instead of the original return address, so the ret instruction can be executed natively. The major problem is that this approach is not transparent to the application; features like C++ exception handling will fail if the return addresses on the stack are altered.

In Foreign Linux we apply a technique called "return cache". The return cache is a hash table consisting of mappings from return addresses to translated pcs, with only one entry per hash bucket. When a call instruction is executed, the translated pc of the return address is written to the hash bucket. When a return instruction is executed, it hashes the return address on the stack and blindly jumps to the corresponding hash entry. The cached return address may be incorrect; to handle this case, a small stub is added after each call instruction.

call_postamble:
push ecx
mov ecx, [esp+4]
lea ecx, [ecx-$source_pc]
jecxz match
jmp sieve_fallback

match:
pop ecx
lea esp, [esp+4]
... code after call ...

Here we rely on the fact that a return cache entry can only be written by a call instruction. At initialization, every return cache entry is filled with sieve_fallback.

Call return pair optimization

Modern CPUs have advanced branch prediction mechanisms to improve execution efficiency. One such mechanism is the "return stack buffer", a ring buffer acting as a stack-like cache of return addresses. When a call instruction is executed, the return address is pushed onto the buffer; when a ret is executed, the top of the stack is popped and used as a prediction of the target address. To take advantage of this, call and ret instructions must come in pairs. In traditional dynamic binary translation systems, both instructions are usually implemented with jmp instructions to various trampolines, which completely misses the return stack buffer optimization.

In Foreign Linux, the call and ret instructions are kept intact, with surrounding stub code, to take advantage of the return stack buffer.

CPUID emulation

Since each instruction must be explicitly supported by the translator, we cannot pass the host's CPUID information directly to the application. Consider the scenario where AVX is not supported by flinux but the CPU supports it: using the native CPUID directly would make the application think it can use AVX instructions, which it cannot. Therefore we have to intercept the CPUID instruction and mask out features we do not support.

GS segment register handling

The rationale for this is described in difference between Linux and Windows. Since we cannot safely use the GS segment register on Windows, all instructions using it must be converted into a form that does not.

For general instructions referencing memory with a GS offset, such as gs:[base_reg+index_reg*scale+offset], the instruction is translated as shown below.

mov fs:[scratch], temp_reg
mov temp_reg, fs:[gs_addr]          ; temp_reg = GS address
lea temp_reg, [temp_reg + base_reg] ; skip this if base_reg is empty
<ins> [temp_reg + index_reg*scale+offset] ; GS prefix dropped
mov temp_reg, fs:[scratch]

For special instructions, the handling is similar.