Draft for (complex) hash labeling scheme #151

kito-cheng · 2023-08-30T01:08:51Z

Dump some content from my offlist discussion with @ved-rivos :P

Introduce one more function in asm syntax: %lpad_label("<signature-string>").
e.g. lpad %lpad_label("v(i,i)"), lui t2, %lpad_label("v(i,i)") for void foo(int, int)

The signature string sytax:

SIGNATURE-STRING := RETURN-TYPE '(' ARGS ')'
                  | RETURN-TYPE '(' ')'
RETURN-TYPE := TYPE-NAME

ARGS := ARGS ',' ARG
      | ARG

ARG  := TYPE-NAME

TYPE-NAME is using the mangling name in itanium c++ abi[1] and RISC-V psABI[2].
[1] https://itanium-cxx-abi.github.io/cxx-abi/abi.html#mangling-builtin
[2] https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-elf.adoc#c-name-mangling

The value is the hash result of the signature string, we may start from md5.

void foo(int, int) = %lpad_label("v(i,i)") = low20(md5("v(i,i)")) = low20(3a7a68c073dcdbd6ee282a598c2dc44e) = 0xdc44e

Why md5: no strong reason for using md5, but one reason is it's available on binutils and llvm already.
Why using mangling name defined in itanium c++ abi: Prevent us to define complicate rule for mangling name.

More example:
void bar(): v()
void *memset(void *, int, size_t): Pv(Pv,i,m)

The text was updated successfully, but these errors were encountered:

yetingk · 2023-08-30T16:51:54Z

I have two questions about using function signatures as landing label on function prologue.

Landing label for indirect branches within a function. A simple proposal is also to use the function signature as the landing pad. The benefit is it allows indirect branch to its function entry. But it also increase the risk of indirect calls which are not landed to a function entry.
We may need to have a landing label for the next pc of dual return functions like setjmp.

kito-cheng · 2023-09-14T02:44:02Z

Landing label for indirect branches within a function. A simple proposal is also to use the function signature as the landing pad. The benefit is it allows indirect branch to its function entry. But it also increase the risk of indirect calls which are not landed to a function entry.

A branch within a function is not expose to other module, so IMO that could be implementation defined.

We may need to have a landing label for the next pc of dual return functions like setjmp.

Generally those function are using ret to return function, so they don't need landing pad, however I found x86 has some special function attribute called indirect_return, we might take a look for that to see if we need to define a same one as well.

samitolvanen · 2023-09-21T22:55:28Z

Why md5: no strong reason for using md5, but one reason is it's available on binutils and llvm already.

I would prefer not to add more uses of MD5, even if in this application there aren't any actual security concerns. Since xxHash is also readily available, why not use that instead?

Why using mangling name defined in itanium c++ abi: Prevent us to define complicate rule for mangling name.

This name mangling scheme matches LLVM's existing forward-edge CFI implementations and I agree that it's a reasonable choice.

A simple proposal is also to use the function signature as the landing pad.

If you mean using the same label for the function entry and all the indirect branch targets in the function, I don't think that's acceptable. We should ensure that indirect calls are allowed only to function entry points.

A branch within a function is not expose to other module, so IMO that could be implementation defined.

I would suggest at least providing a recommendation for the indirect branch labeling scheme as well, so we won't end up with implementations that accidentally make CFI weaker.

kito-cheng · 2023-09-22T10:46:07Z

I would prefer not to add more uses of MD5, even if in this application there aren't any actual security concerns. Since xxHash is also readily available, why not use that instead?

Honestly I am not security guy, so that's my first time to heard xxHash, and I am open mind for other algorithm :)

A simple proposal is also to use the function signature as the landing pad.

If you mean using the same label for the function entry and all the indirect branch targets in the function, I don't think that's acceptable. We should ensure that indirect calls are allowed only to function entry points.

A branch within a function is not expose to other module, so IMO that could be implementation defined.

I would suggest at least providing a recommendation for the indirect branch labeling scheme as well, so we won't end up with implementations that accidentally make CFI weaker.

Yeah, sounds good idea.

ved-rivos · 2023-09-22T11:48:07Z

xxHash is not a cryptographic hash. xxHash is faster to compute. MD5 is also used by LLVM for CFI. For this purpose I would consider MD5 and xxHash to be equivalent.

For a branch within a function we could pick the convention of using the :<source_line_number>. That would restrict its scope.

samitolvanen · 2023-09-25T15:49:23Z

xxHash is not a cryptographic hash. xxHash is faster to compute. MD5 is also used by LLVM for CFI. For this purpose I would consider MD5 and xxHash to be equivalent.

The original CFI implementation in Clang uses MD5, but the newer ones (-fsanitize=kcfi and -fsanitize=function) have adopted xxHash:

https://github.com/llvm/llvm-project/blob/main/clang/lib/CodeGen/CodeGenFunction.cpp#L572
https://github.com/llvm/llvm-project/blob/main/clang/lib/CodeGen/CodeGenModule.cpp#L1995

For a branch within a function we could pick the convention of using the :<source_line_number>. That would restrict its scope.

Agreed, that's better. Ideally each branch target should have a unique label, so I would recommend including the file path in addition to the line number, for example.

kito-cheng · 2023-10-02T14:24:32Z

I suspect encoding line number may not work for all situation, one example is GNU C extension support "Labels as Values", but I do agree that should have different label than function, so I think we may add some note about that "NOTE: Targets of local indirect jumps that are not used for jump tables should have a landing pad. Additionally, the labeling value for these targets is recommended to be different from the function label value."

    void *ptr;

    if(...)
        ptr = &&foo;
    else
        ptr = &&bar;

    /* ... */
    goto *ptr;

foo:
    /* ... */

bar:
    /* ... */

[1] https://stackoverflow.com/questions/56316820/what-is-an-indirect-goto-statement
[2] https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html

samitolvanen · 2023-10-03T23:07:35Z

I suspect encoding line number may not work for all situation, one example is GNU C extension support "Labels as Values"

Good point. Basically, all address-taken labels in a function must have the same landing pad label. We don't have to tie the label to the line number if that's confusing. It's enough that the label is unique within the program. E.g., HASH(path + counter).

NOTE: Targets of local indirect jumps that are not used for jump tables should have a landing pad.

Unless the compiler uses a software-guarded branch, the targets must have a landing pad. Do we want to include a recommendation for when it's safe to use a software-guarded branch?

Additionally, the labeling value for these targets is recommended to be different from the function label value.

This doesn't sound strong enough, because setting all indirect branch target labels to the same value would be enough in this case. How about recommending that each indirect branch target must have a landing pad label that's unique within the program, with the exception that all address-taken labels in the same function must have the same landing pad label?

Yangff · 2023-10-04T20:26:27Z

all address-taken labels in the same function must have the same landing pad label

what about the catch block which has a landing pad used by the runtime?

samitolvanen · 2023-10-04T21:13:46Z

what about the catch block which has a landing pad used by the runtime?

Please correct me if I'm wrong, but shouldn't C++ exceptions behave similarly to a function return, which means the exception landing pad won't need an lpad instruction?

Yangff · 2023-10-04T21:34:43Z

what about the catch block which has a landing pad used by the runtime?

Please correct me if I'm wrong, but shouldn't C++ exceptions behave similarly to a function return, which means the exception landing pad won't need an lpad instruction?

My impression is they use forward indirect jump to restore context instead of ret. This might be wrong though..

this vary by the implementation.. and for GCC and llvm it's just a ret.

samitolvanen · 2023-10-04T23:35:48Z

My impression is they use forward indirect jump to restore context instead of ret.

It depends on the architecture, but the RISC-V implementation in LLVM's libunwind seems to use ret at least:

https://github.com/llvm/llvm-project/blob/main/libunwind/src/UnwindRegistersRestore.S#L1159

Are you aware of other implementations that behave differently?

Yangff · 2023-10-05T00:33:47Z

My impression is they use forward indirect jump to restore context instead of ret.

It depends on the architecture, but the RISC-V implementation in LLVM's libunwind seems to use ret at least:

https://github.com/llvm/llvm-project/blob/main/libunwind/src/UnwindRegistersRestore.S#L1159

Are you aware of other implementations that behave differently?

You are right, it depends on the implementation. nongnu libunwind seems to use jr with non ra reg and GCC uses __builtin_eh_return which should be a ret.

With that said, it should be possible to assign a landingpad for it? Because that ret cannot be enforced by sspop since the address is not in stack anyway..

samitolvanen · 2023-10-05T23:30:07Z

Thanks for looking into this. It sounds like we should either require compatible implementations to use a return, or prepare a recommendation for exception handler landing pad labeling as well since there are implementations that use indirect jumps with a non-ra register.

Because that ret cannot be enforced by sspop since the address is not in stack anyway..

That's no different from setjmp etc. though. How much of a concern is this?

Yangff · 2023-10-06T01:12:07Z

setjmp

I think on x86 CET compiler emits an endbr64 after setjmp call so that the longjmp actually uses jmp *%rcx to go back to the context, instead of return.
Security wise it prevents the exploitable arbitrary jump gadget in the program.. but for compatibility this may cause problem..

mylai-mtk · 2024-01-31T02:57:33Z

I have a question about the %lpad_label("<signature-string>") syntax: This works obviously for assembling, but what about disassembling? Do we display the labels as numbers when disassembling, or we have to craft a (standard?) method to map the numbers back to signature strings?

ved-rivos · 2024-01-31T10:29:41Z

The disassembly should only show the number. Since the label is generated by hashing the string and then truncating the hash to 20 bits it will not be possible to revere the process and regenerate the string from the truncated hash.

kito-cheng · 2024-02-01T07:51:09Z

I have a question about the %lpad_label("") syntax: This works obviously for assembling, but what about disassembling? Do we display the labels as numbers when disassembling, or we have to craft a (standard?) method to map the numbers back to signature strings?

It's technical possible* to keep the signature string but I am not sure it's necessary or not?

* e.g. keep that via mapping symbol or customized section to record that.

ved-rivos · 2024-02-01T15:03:46Z

I am not sure its necessary. If we do add it we may want it to be strippable.

mylai-mtk · 2024-02-02T03:20:43Z

I think if we're considering the necessity, this disassembling display rule can be considered "unnecessary", since it does not affect runtime behavior.

If we leave this part unspecified in the spec, I believe the mainstream toolchains would choose their favorable formats when the needs arise, and other late comers may just follow suit. This may lead to some interoperabilty issue in disassemblers, but it should not be a big problem, because they can/should always fallback to number when the string is unavailable.

mylai-mtk · 2024-02-20T09:02:19Z

Question: It looks like that TYPE-NAME is a terminal in the grammar. What if an ARG contains a function prototype, e.g. a function pointer parameter? Do we encode the function prototype recursively, or we just throw it to the Itanium mangler?

gcchri · 2024-02-20T20:09:58Z

From my understanding it seems that recursive encoding will lead to finer label distribution.

sorear · 2024-02-27T20:53:03Z

We should mangle the entire possibly qualified function type, not just the arguments and returns. U9vector_ccFvvE is a different function type from FvvE and confusing them should be caught. I'd rather not make up our own list syntax, as simple as it is, when Itanium provides F types we can use.

Providing the unhashed type in strippable debug information for disassembly sounds like a useful feature. I don't know enough about debug information to say whether that should be a symbol or something in DWARF.

ret does not check for a landing pad because Zicfilp is intended to be used in conjunction with Zicfiss. After sspopchk, ret is a software guarded branch and is treated as such. The CFI attacker model can corrupt the stack information which controls DWARF unwinding as well as the contents of jmp_buf structs, so some form of shadow stack or landing pad protection, but not both, is needed for longjmp and C++ exceptions.

The landing pad approach defines a label (singular, since there is no meaningful type information) to use for longjmp as well as a label (singular, since the unwinder can be corrupted we cannot trust any label it outputs) to use for __builtin_eh_return.

The shadow stack approach requires adding additional return addresses to the shadow stack, and checking in longjmp or __builtin_eh_return if the shadow stack pointer in the return context is inside the current shadow stack and points at the address returned to. For setjmp, there is an additional complication since setjmp calls are not necessarily well nested, see below. A similar approach with a different token could be used for __builtin_eh_return but this may be controversial since it violates the "zero cost on the non-exceptional path" property DWARF advertises (in practice, exceptions are far from zero cost due to lost optimization opportunities, and this approach provides far better protection than a single type of landing pad).

void foo() {
   ...
   r = setjmp(&jb);
   ...
   r = setjmp(&jb);
   ...
}

foo:
   sspush ra
   la t1, 1f
   sspush t1
   la t1, 2f
   sspush t1
   li t1, SETJMP_TOKEN(2)
   sspush t1
   ... create a stack frame ...
   ...
   call setjmp
   ...
   call setjmp
   ...
   ... stack frame teardown ...
   ssrdp t1
   addi t1, t1, 3*(XLEN/8)
   csrw ssp, t1 # Possible ABI issue, see #206
   sspopchk ra
   ret

longjmp pseudocode:
   if (!IS_SETJMP_TOKEN(*ssp)) crash();
   for (int i = 0; i < SETJMP_TOKEN_COUNT(*ssp), i++)
      if (*(ssp+i) == ra) goto good;
   crash();
  good:

mylai-mtk · 2024-03-06T09:00:46Z

Also, we may have to consider:

K&R-style functions, where the function signature is not fixed, and may not even be known at the call site. It can happen that a call to a non-K&R-style function is made through a K&R-style function pointer.
Variadic functions
Class methods where the pointer to class object itself is passed as the first argument --> Do we encode the type of this pointer as the first argument?
Virtual methods --> How do we encode the type of this pointer, both at the caller and callee site? Do we use the base class in which the virtual function is first declared?

sorear · 2024-03-11T03:42:56Z

K&R-style functions

Signature is inferred at the call site on the basis of the promoted actual arguments. This is a general phenomenon; LLVM doesn't even have a concept of calling a function without a signature.

Variadic functions

Represented in the type system and handled fine with a mangled function type, although not with the original proposal.

Class methods where the pointer to class object itself is passed as the first argument

Virtual methods

These can only be indirectly called through a pointer to member function. Use the mangling rules for those.

mylai-mtk · 2024-03-11T06:47:15Z

Signature is inferred at the call site on the basis of the promoted actual arguments.

I agree we can always obtain a function signature when calling a function from the the call site's view. However, from the view of the callee, it assumes the types of the passed arguments at compile time. By calling through K&R-style function pointers, we can pass arguments not matching the callee's assumption, so even if we have a "signature" at the call site, it would not match what's expected by the callee, thus the calculated landing pad labels would not match.

--> I later found out this is mostly like an undefined behavior in the C standard, so I guess we may be able to rule this case out, but an interesting exception in the (C11 draft) standard caught my attention:

If the function is defined with a type that does not include a prototype, and the types of
the arguments after promotion are not compatible with those of the parameters after
promotion, the behavior is undefined, except for the following cases:
— one promoted type is a signed integer type, the other promoted type is the
corresponding unsigned integer type, and the value is representable in both types;

Does this mean that if we distinguish signed and unsigned integer types in label calculation, some weird cases can be made?

Zicfilp has provided two labeling schemes: simple and complex (also known as function signature-based). The simple scheme uses an lpad with a constant 0, which does not require any hashing mechanism. In contrast, the complex labeling scheme computes the MD5 hash from the signature string. Filling up an MD5 hash value is straightforward for compilers, but it is non-trivial work for humans to maintain. Therefore, we have added new assembler modifiers to compute this value. See also: riscv/riscv-cfi#151

kito-cheng · 2024-04-18T14:07:28Z

@sorear

We should mangle the entire possibly qualified function type, not just the arguments and returns. U9vector_ccFvvE is a different function type from FvvE and confusing them should be caught. I'd rather not make up our own list syntax, as simple as it is, when Itanium provides F types we can use.

Spend time on reading C++ Itanium ABI again, and I tend to agree with you that not make our own list syntax if possible...we don't really want to deal with those edge/complex casees in C++.

I am gonna to moving this forward recently since the simple labeling scheme PoC is pass most of GCC testsuite :)

kito-cheng · 2024-04-23T03:01:55Z

psABI: riscv-non-isa/riscv-elf-psabi-doc#434
asm manual: riscv-non-isa/riscv-asm-manual#99

ved-rivos · 2024-06-05T13:13:14Z

Closing this issue as discussion and development is continuing in the psabi and asm manual PR.

ved-rivos mentioned this issue Oct 1, 2023

Linker relaxation #157

Closed

kito-cheng mentioned this issue Apr 18, 2024

Add %lpad_hash for Zicfilp riscv-non-isa/riscv-asm-manual#99

Draft

ved-rivos closed this as completed Jun 5, 2024

kito-cheng mentioned this issue Oct 18, 2024

Introduce new relocation for landing pad riscv-non-isa/riscv-elf-psabi-doc#452

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Draft for (complex) hash labeling scheme #151

Draft for (complex) hash labeling scheme #151

kito-cheng commented Aug 30, 2023

yetingk commented Aug 30, 2023 •

edited

Loading

kito-cheng commented Sep 14, 2023

samitolvanen commented Sep 21, 2023

kito-cheng commented Sep 22, 2023

ved-rivos commented Sep 22, 2023

samitolvanen commented Sep 25, 2023

kito-cheng commented Oct 2, 2023

samitolvanen commented Oct 3, 2023

Yangff commented Oct 4, 2023

samitolvanen commented Oct 4, 2023

Yangff commented Oct 4, 2023 •

edited

Loading

samitolvanen commented Oct 4, 2023

Yangff commented Oct 5, 2023

samitolvanen commented Oct 5, 2023

Yangff commented Oct 6, 2023

mylai-mtk commented Jan 31, 2024 •

edited

Loading

ved-rivos commented Jan 31, 2024

kito-cheng commented Feb 1, 2024 •

edited

Loading

ved-rivos commented Feb 1, 2024

mylai-mtk commented Feb 2, 2024 •

edited

Loading

mylai-mtk commented Feb 20, 2024 •

edited

Loading

gcchri commented Feb 20, 2024

sorear commented Feb 27, 2024

mylai-mtk commented Mar 6, 2024

sorear commented Mar 11, 2024

mylai-mtk commented Mar 11, 2024 •

edited

Loading

kito-cheng commented Apr 18, 2024

kito-cheng commented Apr 23, 2024

ved-rivos commented Jun 5, 2024

Draft for (complex) hash labeling scheme #151

Draft for (complex) hash labeling scheme #151

Comments

kito-cheng commented Aug 30, 2023

yetingk commented Aug 30, 2023 • edited Loading

kito-cheng commented Sep 14, 2023

samitolvanen commented Sep 21, 2023

kito-cheng commented Sep 22, 2023

ved-rivos commented Sep 22, 2023

samitolvanen commented Sep 25, 2023

kito-cheng commented Oct 2, 2023

samitolvanen commented Oct 3, 2023

Yangff commented Oct 4, 2023

samitolvanen commented Oct 4, 2023

Yangff commented Oct 4, 2023 • edited Loading

samitolvanen commented Oct 4, 2023

Yangff commented Oct 5, 2023

samitolvanen commented Oct 5, 2023

Yangff commented Oct 6, 2023

mylai-mtk commented Jan 31, 2024 • edited Loading

ved-rivos commented Jan 31, 2024

kito-cheng commented Feb 1, 2024 • edited Loading

ved-rivos commented Feb 1, 2024

mylai-mtk commented Feb 2, 2024 • edited Loading

mylai-mtk commented Feb 20, 2024 • edited Loading

gcchri commented Feb 20, 2024

sorear commented Feb 27, 2024

mylai-mtk commented Mar 6, 2024

sorear commented Mar 11, 2024

mylai-mtk commented Mar 11, 2024 • edited Loading

kito-cheng commented Apr 18, 2024

kito-cheng commented Apr 23, 2024

ved-rivos commented Jun 5, 2024

yetingk commented Aug 30, 2023 •

edited

Loading

Yangff commented Oct 4, 2023 •

edited

Loading

mylai-mtk commented Jan 31, 2024 •

edited

Loading

kito-cheng commented Feb 1, 2024 •

edited

Loading

mylai-mtk commented Feb 2, 2024 •

edited

Loading

mylai-mtk commented Feb 20, 2024 •

edited

Loading

mylai-mtk commented Mar 11, 2024 •

edited

Loading