Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: Test against kernel 5.18 #668

Merged
merged 4 commits into from
Jul 20, 2022
Merged

CI: Test against kernel 5.18 #668

merged 4 commits into from
Jul 20, 2022

Conversation

ti-mo
Copy link
Collaborator

@ti-mo ti-mo commented May 10, 2022

After cilium/ci-kernels#24, now test the library against kernel 5.17.

@lmb @joamaki There's one remaining failing selftest:

--- FAIL: TestLibBPFCompat (0.01s)
    --- FAIL: TestLibBPFCompat/netif_receive_skb.linked3.o (0.01s)
        elf_reader_test.go:665: Error during loading: program trace_netif_receive_skb: apply CO-RE relocations: apply fixup target_type_id=67->16253: invalid immediate 73, expected 67 (fixup: target_type_id=67->16253)

Could this be a bug on our side? Nothing was changed to netif_receive_skb.c in over a year, so not sure why this broke now.

fyi @rgo3


@lmb
Copy link
Collaborator

lmb commented May 10, 2022

How urgent is this?

@lmb
Copy link
Collaborator

lmb commented May 11, 2022

target_type_id=67->16253 means we expect to find type id 67 in Instruction.Constant. Instead we find invalid immediate 73.

We can dump the BTF to figure out what types this refers to:

$ bpftool btf dump file netif_receive_skb.linked3.o format raw | grep -F '[67]'
[67] STRUCT 'bpf_insn' size=8 vlen=5
$ bpftool btf dump file netif_receive_skb.linked3.o format raw | grep -F '[73]'
[73] STRUCT 'btf_enum' size=8 vlen=2

We think the instruction should reference a struct bpf_insn but the compiler encoded struct btf_enum. After adding some advanced printf debugging:

$ ./run-tests.sh 5.17 -run TestLibBPFCompat/netif_receive_skb.linked3.o . | grep -A1 -B1 302
offset 18 target_type_id 19 (from wire)
offset 302 target_type_id 67 (from wire)
offset 359 target_type_id 2 (from wire)
--
offset 18 target_type_id of Struct:"sk_buff" (0) local type id 19 (from ext info)
offset 302 target_type_id of Struct:"bpf_insn" (0) local type id 67 (from ext info)
offset 359 target_type_id of Int:"int" (0) local type id 2 (from ext info)
--
offset 18 LdImmDW dst: r2 imm: 19 target_type_id of Struct:"sk_buff" (0) (into metadata)
offset 302 LdImmDW dst: r3 imm: 73 target_type_id of Struct:"bpf_insn" (0) (into metadata)
offset 359 LdImmDW dst: r3 imm: 2 target_type_id of Int:"int" (0) (into metadata)
--
offset 18 LdImmDW dst: r2 imm: 19 target_type_id of Struct:"sk_buff" (0) (from metadata)
offset 302 LdImmDW dst: r3 imm: 73 target_type_id of Struct:"bpf_insn" (0) (from metadata)
offset 359 LdImmDW dst: r3 imm: 2 target_type_id of Int:"int" (0) (from metadata)
--
offset 4258 LdImmDW dst: r3 imm: 83 target_type_id of Struct:"__sk_buff" (0) (from metadata)
offset 302 LdImmDW dst: r3 imm: 73 target_type_id of Struct:"bpf_insn" (0) (apply)
--- FAIL: TestLibBPFCompat (0.02s)

From the last line we know that applying the fixup at offset 302 fails. The BTF relocation (from wire) encodes target type 67 for that offset, which bpftool agrees is struct bpf_insn. As soon as we write it into metadata we can see that the instruction at offset 302 disagrees, the immediate is 73. We can verify that the 73 comes from the ELF, not from a bug in the library:

$ llvm-objdump-12 --section=tp_btf/netif_receive_skb -D netif_receive_skb.linked3.o  | grep ' 302:'     302:	18 03 00 00 49 00 00 00 00 00 00 00 00 00 00 00	r3 = 73 ll

At this point it looks like the encoded instructions and the encoded CO-RE relocations disagree. Maybe this is related to which tool we're using to do the linking? An interesting experiment would be if @ti-mo rebuilt the 5.15 selftests on his machine, and then run the unit test against that. If that changes the outcome its a tooling issue, if not it might be a bug in 5.17.

Another thing to try is running the same example via libbpf and looking at the output it generates for this particular test case.

Diff for my quick hack:

diff --git a/internal/btf/ext_info.go b/internal/btf/ext_info.go
index 2c0e1af..e1c9996 100644
--- a/internal/btf/ext_info.go
+++ b/internal/btf/ext_info.go
@@ -124,6 +124,7 @@ func (ei *ExtInfos) Assign(insns asm.Instructions, section string) {
 		}
 
 		if len(reloInfos) > 0 && reloInfos[0].offset == iter.Offset {
+			fmt.Println("offset", iter.Offset, iter.Ins, reloInfos[0].relo, "(into metadata)")
 			iter.Ins.Metadata.Set(coreRelocationMeta{}, reloInfos[0].relo)
 			reloInfos = reloInfos[1:]
 		}
@@ -610,6 +611,10 @@ type CORERelocation struct {
 	kind     coreKind
 }
 
+func (cr *CORERelocation) String() string {
+	return fmt.Sprintf("%s of %s (%s)", cr.kind, cr.typ, cr.accessor)
+}
+
 func CORERelocationMetadata(ins *asm.Instruction) *CORERelocation {
 	relo, _ := ins.Metadata.Get(coreRelocationMeta{}).(*CORERelocation)
 	return relo
@@ -653,6 +658,7 @@ func newRelocationInfos(brs []bpfCORERelo, ts types, strings *stringTable) ([]co
 		if err != nil {
 			return nil, fmt.Errorf("offset %d: %w", br.InsnOff, err)
 		}
+		fmt.Println("offset", relo.offset, relo.relo, "local type id", br.TypeID, "(from ext info)")
 		rs = append(rs, *relo)
 	}
 	sort.Slice(rs, func(i, j int) bool {
@@ -713,7 +719,7 @@ func parseCOREReloRecords(r io.Reader, bo binary.ByteOrder, recordSize uint32, r
 		// ELF tracks offset in bytes, the kernel expects raw BPF instructions.
 		// Convert as early as possible.
 		relo.InsnOff /= asm.InstructionSize
-
+		fmt.Println("offset", relo.InsnOff, relo.Kind, relo.TypeID, "(from wire)")
 		out = append(out, relo)
 	}
 
diff --git a/linker.go b/linker.go
index 60cb7a6..723077c 100644
--- a/linker.go
+++ b/linker.go
@@ -117,11 +117,14 @@ func findReferences(progs map[string]*ProgramSpec) error {
 func applyRelocations(insns asm.Instructions, local, target *btf.Spec) error {
 	var relos []*btf.CORERelocation
 	var reloInsns []*asm.Instruction
+	var reloOffsets []asm.RawInstructionOffset
 	iter := insns.Iterate()
 	for iter.Next() {
 		if relo := btf.CORERelocationMetadata(iter.Ins); relo != nil {
+			fmt.Println("offset", iter.Offset, iter.Ins, relo, "(from metadata)")
 			relos = append(relos, relo)
 			reloInsns = append(reloInsns, iter.Ins)
+			reloOffsets = append(reloOffsets, iter.Offset)
 		}
 	}
 
@@ -144,6 +147,7 @@ func applyRelocations(insns asm.Instructions, local, target *btf.Spec) error {
 
 	for i, fixup := range fixups {
 		if err := fixup.Apply(reloInsns[i]); err != nil {
+			fmt.Println("offset", reloOffsets[i], reloInsns[i], relos[i], "(apply)")
 			return fmt.Errorf("apply fixup %s: %w", &fixup, err)
 		}
 	}

@ti-mo
Copy link
Collaborator Author

ti-mo commented May 16, 2022

@lmb thanks for the insights! Will take a closer look in a bit.

@lmb
Copy link
Collaborator

lmb commented Jun 15, 2022

I re-ran the tests against 5.17 that was built with ci-kernels-builder and the result stays the same!

@lmb lmb force-pushed the tb/ci-kernel-517 branch 3 times, most recently from c3f46b1 to f4ed59b Compare June 15, 2022 17:14
@lmb lmb changed the title CI: Test against kernel 5.17 CI: Test against kernel 5.18 Jun 15, 2022
@lmb lmb force-pushed the tb/ci-kernel-517 branch from f4ed59b to 56a0334 Compare June 15, 2022 17:16
@lmb
Copy link
Collaborator

lmb commented Jun 15, 2022

Updated this for 5.18, the error still persists. Also tried updating pahole in the build container to https://packages.debian.org/bullseye-backports/pahole but that didn't change things either.

@ti-mo
Copy link
Collaborator Author

ti-mo commented Jul 20, 2022

Wow, really interesting failure on go1.17:

Testing on 5.18
--- FAIL: TestLibBPFCompat (0.13s)
    --- FAIL: TestLibBPFCompat/strncmp_bench.linked3.o (0.01s)
        elf_reader_test.go:707: Can't read strncmp_bench.linked3.o: file /run/input/bpf/strncmp_bench.linked3.o: load data sections: data section .bss: can't get contents: unexpected EOF

This CL shipped in Go 1.18: https://go-review.googlesource.com/c/go/+/375216. Since .bss is a 'virtual' section, it's not actually allocated in the ELF (nothing shocking), it always shares an offset with another section, e.g.:

  [ 5] .bss              NOBITS          0000000000000000 001428 001008 00  WA  0   0  8
  [ 6] license           PROGBITS        0000000000000000 001428 000004 00  WA  0   0  1

(offsets are hex)

Since strncmp_bench.linked3.c allocates an empty 4k array, the size of .bss in that binary is set to 0x1008 (4104 bytes) and at an offset of 0x1428 (5160 bytes), this causes the stdlib to read past the end of the file. (7272 bytes big in 5.18 selftests) Opening the .bss section of this binary actually gets you a reader into the license section, although only .bss-length amount of bytes can be read.

In conclusion, in Go 1.17 and earlier, reading NOBITS sections actually gets you the bytes of another section that sits at the same offset. We should probably mimic this behaviour in the lib as well; there is no reason to try and read NOBITS sections to begin with. Maybe we don't even need to allocate + populate their MapSpec.Contents, but I'd need to refresh my memory on the relo handling.

@lmb WDYT?

@ti-mo
Copy link
Collaborator Author

ti-mo commented Jul 20, 2022

Turns out they do get loaded into the kernel and even get read from bytecode, although all at index 0. Seems wasteful to create large maps for no reason, but that's for the compiler to decide.

For now, this should fix it: #740.

rgo3 and others added 4 commits July 20, 2022 14:33
Signed-off-by: Robin Gögge <r.goegge@gmail.com>
Signed-off-by: Robin Gögge <r.goegge@gmail.com>
5.18 added two new BTF kinds TYPE_TAG and DECL_TAG which we
don't support at the moment.

See cilium#713
Trying to load the ELF gives the following error:

    elf_reader_test.go:665: Error during loading: program trace_netif_receive_skb:
    apply CO-RE relocations: apply fixup target_type_id=67->16253:
    invalid immediate 73, expected 67 (fixup: target_type_id=67->16253)

After some cursory digging this doesn't seem to be a bug in the library
but maybe one in either pahole or clang.

See cilium#739
@lmb lmb force-pushed the tb/ci-kernel-517 branch from 0571ab3 to 791558a Compare July 20, 2022 14:34
@ti-mo ti-mo merged commit b18a814 into cilium:master Jul 20, 2022
@ti-mo ti-mo deleted the tb/ci-kernel-517 branch July 20, 2022 14:47
@lmb
Copy link
Collaborator

lmb commented Jul 20, 2022

Finally!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants