Register naming in Capstone 5 has changed for ARM. #2078

Closed
gerph opened this issue Jul 9, 2023 · 9 comments

gerph commented Jul 9, 2023

This isn't so much a bug report as a 'there's a change in behaviour... did you know?' report.

The difference

I have an operating system which uses Capstone as its disassembly system (for reporting faults, etc). The output of the disassembly is used as expectations for the tests. This means that its test output (and, obviously, Capstone's output) must remain the same between runs to ensure that the expectations are met. They started failing once Capstone 5 was released, because the representation of registers has changed for ARM.

Specifically, register 13 on ARM, which was previously reported as sp, is now reported as r13 (when CS_OPT_SYNTAX_NOREGNAME is in force).

This isn't a problem for me per se. I would prefer to see sp as the name of the register, but we can accept r13, although it's not as nice. There isn't a way to rename registers from within the application, so I do not appear to be able to revert to the previous behaviour. I can do a search and replace on the output, but that's a little more expensive.

To be clear about the problem, here is the behaviour of disassembling the instruction LDR r1, [sp, #4] with both capstone 4 and 5:

Capstone 4

charles@laputa ~/projects/RO/pyromaniac (master)> pip install -U 'capstone<5'
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7.
Collecting capstone<5
Installing collected packages: capstone
  Found existing installation: capstone 5.0.0.post1
    Uninstalling capstone-5.0.0.post1:
      Successfully uninstalled capstone-5.0.0.post1
Successfully installed capstone-4.0.2
charles@laputa ~/projects/RO/pyromaniac (master)> ./diss.py -1
cs_version() = (4, 0, 1024)

0x1000:	ldr	r1, [sp, #4]
  op#0: type=1 (ARM_OP_REG)
        reg = 67 (R1)
  op#1: type=3 (ARM_OP_MEM)
        base = 12 (R13)
        index = 0 (Runknown)
        disp = 4
        lshift = 0 (Runknown)

Capstone 5

charles@laputa ~/projects/RO/pyromaniac (master)> ./diss.py -1
cs_version() = (5, 0, 1280)

0x1000:	ldr	r1, [r13, #4]
  op#0: type=1 (ARM_OP_REG)
        reg = 67 (R1)
  op#1: type=3 (ARM_OP_MEM)
        base = 12 (R13)
        index = 0 (Runknown)
        disp = 4
        lshift = 0 (Runknown)

Test program to generate the above output

This is my general disassembly tool for investigating the contents of the capstone output; it's a little wordy, but the important bit is the md.syntax = CS_OPT_SYNTAX_NOREGNAME and that the instruction being decoded is b'\x04\x10\x9d\xe5', (LDR r1,[r13, #4]).

#!/usr/bin/env python

import sys

from capstone import *
import capstone.arm_const

reg_map = [
        capstone.arm_const.ARM_REG_R0,
        capstone.arm_const.ARM_REG_R1,
        capstone.arm_const.ARM_REG_R2,
        capstone.arm_const.ARM_REG_R3,
        capstone.arm_const.ARM_REG_R4,
        capstone.arm_const.ARM_REG_R5,
        capstone.arm_const.ARM_REG_R6,
        capstone.arm_const.ARM_REG_R7,
        capstone.arm_const.ARM_REG_R8,
        capstone.arm_const.ARM_REG_R9,
        capstone.arm_const.ARM_REG_R10,
        capstone.arm_const.ARM_REG_R11,
        capstone.arm_const.ARM_REG_R12,
        capstone.arm_const.ARM_REG_SP,
        capstone.arm_const.ARM_REG_LR,
        capstone.arm_const.ARM_REG_PC,
    ]
inv_reg_map = dict((regval, regnum) for regnum, regval in enumerate(reg_map))

shift_names = {
        capstone.arm_const.ARM_SFT_INVALID: None,
        capstone.arm_const.ARM_SFT_ASR: 'ASR',
        capstone.arm_const.ARM_SFT_ASR_REG: 'ASR',
        capstone.arm_const.ARM_SFT_LSL: 'LSL',
        capstone.arm_const.ARM_SFT_LSL_REG: 'LSL',
        capstone.arm_const.ARM_SFT_LSR: 'LSR',
        capstone.arm_const.ARM_SFT_LSR_REG: 'LSR',
        capstone.arm_const.ARM_SFT_ROR: 'ROR',
        capstone.arm_const.ARM_SFT_ROR_REG: 'ROR',
        capstone.arm_const.ARM_SFT_RRX: 'RRX',
        capstone.arm_const.ARM_SFT_RRX_REG: 'RRX'
    }

optype_names = dict((getattr(capstone.arm_const, optype), optype) for optype in dir(capstone.arm_const) if optype.startswith('ARM_OP_'))

md = Cs(CS_ARCH_ARM, CS_MODE_ARM)
md.detail = True
md.mnemonic_setup(capstone.arm_const.ARM_INS_SVC, "SWI")
# Turn off APCS register naming
md.syntax = capstone.CS_OPT_SYNTAX_NOREGNAME

last_i = None

def show_disasm(code):
    global last_i
    for i in md.disasm(code, 0x1000):
        last_i = i
        print("")
        print("0x%x:\t%s\t%s" %(i.address, i.mnemonic, i.op_str))
        for index, operand in enumerate(i.operands):
            print("  op#%i: type=%i (%s)" % (index, operand.type, optype_names.get(operand.type, 'unknown')))
            if operand.type == capstone.arm_const.ARM_OP_IMM:
                print("        imm = %i" % (operand.imm,))
            if operand.type == capstone.arm_const.ARM_OP_REG:
                print("        reg = %i (R%s)" % (operand.reg, inv_reg_map[operand.reg]))
            if operand.type == capstone.arm_const.ARM_OP_MEM:
                print("        base = %i (R%s)" % (operand.mem.base, inv_reg_map.get(operand.mem.base, 'unknown')))
                print("        index = %i (R%s)" % (operand.mem.index, inv_reg_map.get(operand.mem.index, 'unknown')))
                print("        disp = %i" % (operand.mem.disp,))
                print("        lshift = %i (R%s)" % (operand.mem.lshift, inv_reg_map.get(operand.mem.lshift, 'unknown')))
            if operand.shift.type != capstone.arm_const.ARM_SFT_INVALID:
                if operand.shift.type in (capstone.arm_const.ARM_SFT_LSL,
                                          capstone.arm_const.ARM_SFT_LSR,
                                          capstone.arm_const.ARM_SFT_ASR,
                                          capstone.arm_const.ARM_SFT_ROR):
                    sname = shift_names[operand.shift.type]
                    print("        shift = %s #%i" % (sname, operand.shift.value))
                elif operand.shift.type in (capstone.arm_const.ARM_SFT_LSL_REG,
                                            capstone.arm_const.ARM_SFT_LSR_REG,
                                            capstone.arm_const.ARM_SFT_ASR_REG,
                                            capstone.arm_const.ARM_SFT_ROR_REG):
                    sname = shift_names[operand.shift.type]
                    reg = inv_reg_map[operand.shift.value]
                    print("        shift = %s R%s" % (sname, reg))
                else:
                    print("        shift = type=%i value=%i" % (operand.shift.type, operand.shift.value))

def insn__repr__(self):
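    # Format the instruction bytes as a big-endian hex word; works on Python 2 and 3
    # (str.encode('hex') only exists on Python 2).
    word = ''.join('%02x' % b for b in reversed(bytearray(self.bytes)))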
    return "<{}(word=0x{}, {} operands)>".format(self.__class__.__name__, word, len(self.operands))
capstone.CsInsn.__repr__ = insn__repr__

def armop__repr__(self):
    params = ['type={}'.format(optype_names.get(self.type, 'unknown'))]
    if self.type == capstone.arm_const.ARM_OP_IMM:
        params.append('imm={}'.format(self.imm))
    elif self.type == capstone.arm_const.ARM_OP_REG:
        params.append('reg={}'.format(inv_reg_map[self.reg]))
    elif self.type == capstone.arm_const.ARM_OP_MEM:
        params.append('basereg={}'.format(inv_reg_map.get(self.mem.base, 'unknown')))
        params.append('indexreg={}'.format(inv_reg_map.get(self.mem.index, 'unknown')))
        params.append('displacement={}'.format(self.mem.disp))
        params.append('lshift={}'.format(self.mem.lshift))
    if self.shift.type != capstone.arm_const.ARM_SFT_INVALID:
        if self.shift.type in (capstone.arm_const.ARM_SFT_LSL,
                               capstone.arm_const.ARM_SFT_LSR,
                               capstone.arm_const.ARM_SFT_ASR,
                               capstone.arm_const.ARM_SFT_ROR):
            sname = shift_names[self.shift.type]
            params.append("shift={} #{}".format(sname, self.shift.value))
        else:
            params.append("shift=type{} #{}".format(self.shift.type, self.shift.value))
    return "<{}({})>".format(self.__class__.__name__, ', '.join(params))
capstone.arm.ArmOp.__repr__ = armop__repr__

print("cs_version() = %r" % (cs_version(),))

one_example = False
if len(sys.argv) == 2:
    try:
        one_example = int(sys.argv[1])
    except ValueError:
        sys.exit("Syntax: %s <example-number>" % (sys.argv[0],))

examples = [
        b'\x05\x00\x00\xef', # SWI 5
        b'\x20\x00\x50\xe3', # CMP r0, #&20
        b'\x40\x00\x9f\x05', # LDREQ   r0,[pc,#64]
        b'\x05\x00\x00\x2f', # SWI 5
        b'\x08\x00\x00\xeb', # BL pc+8*4
        b'\xba\x50\x8f\xb2', # ADDLT r5, pc, #186
        b'\x6C\x43\x9f\xE5', # LDR r4, [pc, #&36c]
        b'\x0b\xb0\x97\xe7', # LDR     r11, [r7, r11]
        b'\x04\x00\x5f\xe5', # LDRB r0, [pc, #4]
        b'\x03\x00\x92\xe8', # LDMIA   r2, {r0, r1}
        b'\x03\x00\x92\xd8', # LDMLEIA r2, {r0, r1}
        b'\x00\x18\xa0\xe1', # LSL r1, r0, #&10 => MOV r1, r0, LSL #16
        b'\x21\x18\xa0\xe1', # LSR r1, r1, #&10 => MOV r1, r1, LSR #16
        b'\x26\xc4\xb0\xe1', # LSRS r12, r6, #8 => MOVS r12, r6, LSR #8
        b'\x12\x13\xa0\xe1', # LSL r1, r2, r3   => MOV r1, r2, LSL r3
        b'\x52\x13\xa0\xe1', # ASR r1, r2, r3   => MOV r1, r2, ASR r3
        b'\x62\x10\xa0\xe1', # RRX r1, r2       => MOV r1, r2, RRX
        b'\x53\x30\xeb\xe7', # ?
        b'\x01\x0f\x81\xe2', # ADD r0, r1, #1, #30  => ADD r0, r1, #2
        b'\x1e\x10\x81\x11', # ORRNE r1, r1, r14, LSL r0
        b'\x06\x10\xe0\xe3', # MVN r1,#&6
        b'\x02\x10\x9f\xe7', # LDR r1,[pc,r2]
        b'\x04\x10\x9d\xe5', # LDR r1,[r13, #4]
    ]
if one_example is False:
    for code in examples:
        show_disasm(code)
else:
    code = examples[one_example]
    show_disasm(code)
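
As a cross-check on the claim that b'\x04\x10\x9d\xe5' is LDR r1,[r13, #4], the instruction word can be decoded by hand without Capstone. This is only a minimal sketch: the helper name decode_ldr_str_immediate is made up for illustration, and it handles nothing beyond the immediate-offset form of LDR/STR.

```python
import struct

def decode_ldr_str_immediate(code):
    """Split an ARM LDR/STR (immediate offset) encoding into its fields.

    Hypothetical helper for illustration; not part of Capstone.
    """
    (word,) = struct.unpack('<I', code)  # the 4 bytes form a little-endian word
    # Bits 27..25 are 0b010 for a single data transfer with an immediate offset.
    if (word >> 25) & 0b111 != 0b010:
        raise ValueError("not an immediate-offset LDR/STR encoding")
    return {
        'cond': (word >> 28) & 0xF,           # 0xE means "always"
        'add_offset': bool((word >> 23) & 1), # U bit: offset is added
        'load': bool((word >> 20) & 1),       # L bit: 1 = LDR, 0 = STR
        'rn': (word >> 16) & 0xF,             # base register number
        'rd': (word >> 12) & 0xF,             # destination register number
        'imm12': word & 0xFFF,                # unsigned 12-bit offset
    }

fields = decode_ldr_str_immediate(b'\x04\x10\x9d\xe5')
print("rn=%d rd=%d imm12=%d" % (fields['rn'], fields['rd'], fields['imm12']))
# prints: rn=13 rd=1 imm12=4
```

The base register field is 13 regardless of how it is named, so the difference between the two Capstone versions is purely in the text that gets printed.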

Cause of the change

In v4, register names for CS_OPT_SYNTAX_NOREGNAME were produced by the getRegisterName2 function in ARMGenAsmWriter.inc, which for register id 12 (see above that the base register has the value of 12) returns the string sp:

https://github.com/capstone-engine/capstone/blob/v4/arch/ARM/ARMGenAsmWriter.inc#L8634C1-L8634C26

And in the v5 code, the decoding is performed by the getRegisterName_digit in ARMGenRegisterName_digit.inc, and again we use register id 12 (again the base register number is 12) which has a string r13.

https://github.com/capstone-engine/capstone/blob/v5/arch/ARM/ARMGenRegisterName_digit.inc#L77

Obviously these two files are automatically generated, and arguably the use of r13 when you're not using the register naming schemes is more accurate. However, except for APCS_U, register 13 has always been the stack pointer - I believe under APCS_U the stack pointer was in r12, and unless you're using RISCiX you're not going to care about APCS_U. In all other cases, I believe r13 has the convention of being the stack pointer - and if you're interworking with Thumb, it must be a stack pointer.

Expected behaviour

I expected the behaviour of the output to not change between versions, but it's not a strong expectation, as this is a major version update. It would have been nice if the change in register names had been included in the 5.0 change notes in https://github.com/capstone-engine/capstone/releases - just to be clear that it had updated.

What would be nice would be if it were possible to rename registers dynamically, but I suspect that's not going to be easy.

I intend to include a special case to rename r13 to sp when disassembling, to retain the old behaviour, if capstone 5 is detected, although I'm not convinced myself that this is a good idea in the long term - that's my problem, not yours.
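The special case could look something like the following. This is a rough sketch, not what the project actually ships: the function name restore_v4_register_names and the V4_NAMES table are made up for illustration, and it assumes the text being rewritten is the operand string (i.op_str) rather than the whole line.

```python
import re

# Names Capstone 4 printed for these registers under CS_OPT_SYNTAX_NOREGNAME,
# as described in this report. The table itself is an illustrative assumption.
V4_NAMES = {'r13': 'sp', 'r14': 'lr'}

# Word boundaries stop the pattern from matching inside longer tokens.
_REG_RE = re.compile(r'\b(r13|r14)\b')

def restore_v4_register_names(op_str):
    """Rewrite r13/r14 in an operand string to sp/lr (hypothetical helper)."""
    return _REG_RE.sub(lambda m: V4_NAMES[m.group(1)], op_str)

print(restore_v4_register_names("r1, [r13, #4]"))
# prints: r1, [sp, #4]
```

Because an operand string that already uses sp and lr passes through unchanged, this could be applied unconditionally instead of checking the tuple returned by cs_version() first, at the cost of one regex pass per instruction.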

I just wanted to highlight that there is a change in behaviour and that it was unexpected. It's not necessarily a bug unless you are guaranteeing the output format is unchanging between major releases.

Author

gerph commented Jul 9, 2023

In looking at this further, this also applies to r14 being returned with the string r14, where in Capstone 4 it was returned as lr.

Collaborator

aquynh commented Jul 10, 2023 via email

Contributor

XVilka commented Jul 10, 2023

@gerph just for the record, for ARM there will be more changes in the 6.0 version (if it ever gets merged): #1949

Collaborator

Rot127 commented Jul 11, 2023

These aliases will also return in #1949, where they are currently patched directly into the string. They are only enabled via an option, though, because otherwise the default output would not mimic llvm-objdump.

Author

gerph commented Jul 15, 2023

I'm unsure how to actually generate the tables. According to synctools I need to run things inside tablegen... but when I do, I see problems with the files... for example if I use clang 16 through docker I can do this:

charles@phonewave ~/external/capstone/suite/synctools/tablegen (next)> docker run -it --rm -v $HOME/external/capstone/:/src -w /src silkeh/clang:16
root@6de6340284d4:/src# cd suite/synctools/tablegen/
root@6de6340284d4:/src/suite/synctools/tablegen# ./gen-tablegen-arch.sh /usr/bin/ ARM
Using llvm-tblgen from /usr/bin/
Generating ARMGenInstrInfo.inc
Included from ARM/ARM.td:1059:
ARM/ARMRegisterInfo.td:81:5: error: Field 'CostPerUse' of type 'list<int>' is incompatible with value '1' of type 'int'
let CostPerUse = 1 in {
    ^

(and so on for each CostPerUse assignment)

This appears to be due to https://reviews.llvm.org/D86836 which happened back in 2020, and which requires the 1 in the above to be [1]. However, if I make that change...

root@6de6340284d4:/src/suite/synctools/tablegen# ./gen-tablegen-arch.sh /usr/bin/ ARM
Using llvm-tblgen from /usr/bin/
Generating ARMGenInstrInfo.inc
Included from ARM/ARM.td:1067:
ARM/ARMInstrInfo.td:220:34: error: Value specified for template argument 'AssemblerPredicate:cond' (#0) is of type string; expected type dag: "HasV4TOps"
                                 AssemblerPredicate<"HasV4TOps", "armv4t">;
                                 ^

(and repeat for other files)

So presumably a specific version of llvm-tblgen is needed, but I'm not clear which. If I leave the code unmodified and try building with other versions of clang I still cannot get it to build, even going all the way back to version 10.

When I get to version 9, I get errors like:

root@d6ea28f5fa4e:/src/suite/synctools/tablegen# ./gen-tablegen-arch.sh /usr/bin/ ARM
Using llvm-tblgen from /usr/bin/
Generating ARMGenInstrInfo.inc
Included from ARM/ARM.td:17:
Included from include/llvm/Target/Target.td:15:
include/llvm/IR/Intrinsics.td:20:42: error: Variable not defined: 'false'
class IntrinsicProperty<bit is_default = false> {
                                         ^
Included from ARM/ARM.td:17:
Included from include/llvm/Target/Target.td:15:
include/llvm/IR/Intrinsics.td:30:17: error: Value not specified for template argument #0 (IntrinsicProperty:is_default) of subclass 'IntrinsicProperty'!
def IntrNoMem : IntrinsicProperty;
                ^

which makes me think that I've gone too far back.

So whilst I can see how to change the source files, I cannot see how to make the tables using the table generator code.

The following should be all that's needed:

charles@phonewave ~/external/capstone/suite/synctools/tablegen (next)> git diff
diff --git a/suite/synctools/tablegen/ARM/ARMRegisterInfo-digit.td b/suite/synctools/tablegen/ARM/ARMRegisterInfo-digit.td
index 3076bfc8..368f0631 100644
--- a/suite/synctools/tablegen/ARM/ARMRegisterInfo-digit.td
+++ b/suite/synctools/tablegen/ARM/ARMRegisterInfo-digit.td
@@ -84,8 +84,10 @@ def R9  : ARMReg< 9, "r9">,  DwarfRegNum<[9]>;
 def R10 : ARMReg<10, "r10">, DwarfRegNum<[10]>;
 def R11 : ARMReg<11, "r11">, DwarfRegNum<[11]>;
 def R12 : ARMReg<12, "r12">, DwarfRegNum<[12]>;
-def SP  : ARMReg<13, "r13">,  DwarfRegNum<[13]>;
-def LR  : ARMReg<14, "r14">,  DwarfRegNum<[14]>;
+// R13 and R14 were given names in Capstone 4, even when the CS_OPT_SYNTAX_NOREGNAME option was specified,
+// so here they are given these names as well. See https://github.com/capstone-engine/capstone/issues/2078
+def SP  : ARMReg<13, "sp">,  DwarfRegNum<[13]>;
+def LR  : ARMReg<14, "lr">,  DwarfRegNum<[14]>;
 def PC  : ARMReg<15, "pc">,  DwarfRegNum<[15]>;
 }

... but I cannot work out how to make that get generated into all the tables.

However, I can patch it in the table with a couple of small changes:

charles@phonewave ~/external/capstone/bindings/python (next)> git diff
diff --git a/arch/ARM/ARMGenRegisterName_digit.inc b/arch/ARM/ARMGenRegisterName_digit.inc
index c4501030..ebd6bce6 100644
--- a/arch/ARM/ARMGenRegisterName_digit.inc
+++ b/arch/ARM/ARMGenRegisterName_digit.inc
@@ -74,7 +74,7 @@ static const char *getRegisterName_digit(unsigned RegNo)
   /* 481 */ 'Q', '1', '0', '_', 'Q', '1', '1', '_', 'Q', '1', '2', '_', 'Q', '1', '3', 0,
   /* 497 */ 'd', '1', '3', 0,
   /* 501 */ 'q', '1', '3', 0,
-  /* 505 */ 'r', '1', '3', 0,
+  /* 505 */ 's', 'p', 0, 0,
   /* 509 */ 's', '1', '3', 0,
   /* 513 */ 'D', '1', '7', '_', 'D', '1', '9', '_', 'D', '2', '1', '_', 'D', '2', '3', 0,
   /* 529 */ 'D', '2', '1', '_', 'D', '2', '2', '_', 'D', '2', '3', 0,
@@ -93,7 +93,7 @@ static const char *getRegisterName_digit(unsigned RegNo)
   /* 625 */ 'Q', '1', '1', '_', 'Q', '1', '2', '_', 'Q', '1', '3', '_', 'Q', '1', '4', 0,
   /* 641 */ 'd', '1', '4', 0,
   /* 645 */ 'q', '1', '4', 0,
-  /* 649 */ 'r', '1', '4', 0,
+  /* 649 */ 'l', 'r', 0, 0,
   /* 653 */ 's', '1', '4', 0,
   /* 657 */ 'D', '1', '8', '_', 'D', '2', '0', '_', 'D', '2', '2', '_', 'D', '2', '4', 0,
   /* 673 */ 'D', '2', '1', '_', 'D', '2', '2', '_', 'D', '2', '3', '_', 'D', '2', '4', 0,

Obviously fixing it properly in the tables would be better, but this appears to do the job for my test program
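
The reason the patch pads each replacement with an extra 0 can be sketched in a few lines. This assumes, going only by the /* 505 */ style offset comments in the diff, that names are looked up by byte offset into one packed table of NUL-terminated strings; the lookup helper and the four-entry fragment below are made up for illustration and are not the real Capstone code.

```python
def lookup(table, base, offset):
    """Return the NUL-terminated name starting at `offset` (hypothetical helper).

    `base` is the offset of the first byte of this fragment in the full table.
    """
    start = offset - base
    end = table.index(b'\0', start)
    return table[start:end].decode('ascii')

# Fragment covering offsets 497..512 of the table, as shown in the diff.
BASE = 497
table = bytearray(b'd13\0q13\0r13\0s13\0')

# Replace the 4 bytes at offset 505 with a same-sized entry, so every later
# offset (509 and beyond) still points at the start of its own string.
patched = bytearray(table)
patched[505 - BASE:509 - BASE] = b'sp\0\0'

print(lookup(table, BASE, 505), lookup(patched, BASE, 505), lookup(patched, BASE, 509))
# prints: r13 sp s13
```

Writing 's', 'p', 0 without the padding byte would shift every later entry by one, which is why regenerating the tables from the .td file would be the cleaner fix.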

Collaborator

Rot127 commented Jul 18, 2023

@gerph If you need this within v5 just fix it by hand and open a PR. From v6 on we should (hopefully) only use auto-sync for updates (see this doc)

Contributor

XVilka commented Jul 21, 2023

@gerph as there are plans to make a patch release in the 5.x series, please send a PR fixing these directly.

gerph added a commit to gerph/capstone that referenced this issue Jul 22, 2023
In v5 the register naming for ARM registers changed. This was not a
documented behaviour, and may not be desirable.

This change makes the register names in v5 the same as they were in
v4 - that is that when NOREGNAME is used, R13 and R14 are known as
`sp` and `lr`. They had been changed to `r13` and `r14`.

See capstone-engine#2078 for more
details.
Author

gerph commented Jul 22, 2023

PR created: #2108

Contributor

XVilka commented Jul 25, 2023

@gerph @kabeor as PR was merged this can be closed, I suppose

kabeor closed this as completed Jul 25, 2023