tools/virtiostat: add virtiostat tool #3270
[buildbot, test this please]
@yonghong-song
I tried to build an ubuntu-1710 environment to reproduce this issue, but the apt source is broken. Could you give me a hint as to why this test fails?
It could be a timing issue: the networking packet had not gone through yet when the socket was closed. Let me try again. It should not be related to your patch.
[buildbot, test this please]
tools/virtiostat.py
Outdated
#include <bcc/proto.h>

#define VIRTIO_MAX_SGS 128
#define SG_MAX 18
Could you add some comments explaining where these constants come from?
tools/virtiostat.py
Outdated
static void record(struct virtqueue *vq,
                   struct scatterlist **sgs,
                   unsigned int out_sgs,
                   unsigned int in_sgs)
There is no need for four separate parameter lines; use two lines with two parameters each.
                   unsigned int in_sgs,
                   void *data,
                   gfp_t gfp)

The extra blank line is not needed.
tools/virtiostat.py
Outdated
                   unsigned int out_sgs,
                   unsigned int in_sgs,
                   void *data,
                   gfp_t gfp)
There is no need for one parameter per line.
tools/virtiostat.py
Outdated
{
    unsigned int _out_sgs = PT_REGS_PARM3(ctx);
    unsigned int _in_sgs = PT_REGS_PARM4(ctx);
Why do you need the two lines above to get _out_sgs and _in_sgs? They are in the argument list, so you should be able to use out_sgs and in_sgs directly.
                   gfp_t gfp)
{
    record(vq, &sg, 0, 1);

There is no need for the extra blank line.
tools/virtiostat.py
Outdated
else:
    print("--------", end="\n")

print("%16s %12s %12s %12s %12s %16s %16s" % ("Driver", "Device", "VQ Name", "In SGs", "Out SGs", "In BW", "Out BW"), end="\n")
This is a long line; could you break it into two lines?
tools/virtiostat.py
Outdated
print("%16s %12s %12s %12s %12s %16s %16s" % ("Driver", "Device", "VQ Name", "In SGs", "Out SGs", "In BW", "Out BW"), end="\n")
stats = b.get_table("stats")
for k, v in sorted(stats.items(), key=lambda vs: vs[1].dev):
    print("%16s %12s %12s %12d %12d %16d %16d" % (v.driver, v.dev, v.vqname, v.in_sgs, v.out_sgs, v.in_bw, v.out_bw), end="\n")
Could you break this into two lines as well?
tools/virtiostat_example.txt
Outdated
--------
          Driver       Device      VQ Name       In SGs      Out SGs            In BW           Out BW
      virtio_net      virtio0     output.2            0            9                0              832
      virtio_net      virtio0     output.3            0           92                0            13053
Let us ensure the default output has at most 80 characters per line. Here we exceed that.
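The two formatting requests above (split the long print call, keep default output within 80 columns) can be sketched together. This is only an illustration: the column widths are assumptions chosen so that a line totals 80 characters, not necessarily what the merged tool uses, and `Row` is a hypothetical stand-in for the values read from the BPF `stats` table.

```python
from collections import namedtuple

# Hypothetical stand-in for one entry of the BPF "stats" table.
Row = namedtuple("Row", "driver dev vqname in_sgs out_sgs in_bw out_bw")

# Assumed widths: 14+8+10+7+7+14+14 plus 6 separating spaces = 80 columns.
HEADER_FMT = "%14s %8s %10s %7s %7s %14s %14s"
ROW_FMT = "%14s %8s %10s %7d %7d %14d %14d"


def format_header():
    # The argument tuple is split across two lines instead of one long line.
    return HEADER_FMT % ("Driver", "Device", "VQ Name", "In SGs",
                         "Out SGs", "In BW", "Out BW")


def format_row(v):
    return ROW_FMT % (v.driver, v.dev, v.vqname, v.in_sgs,
                      v.out_sgs, v.in_bw, v.out_bw)


def render(rows):
    # Sort by device name, as the reviewed loop does.
    lines = [format_header()]
    lines += [format_row(v) for v in sorted(rows, key=lambda v: v.dev)]
    return lines


rows = [Row("virtio_net", "virtio0", "output.2", 0, 9, 0, 832),
        Row("virtio_net", "virtio0", "output.3", 0, 92, 0, 13053)]
for line in render(rows):
    print(line)
```

Keeping the format strings in named constants also makes it easy to check the total width whenever a column is added.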
tools/virtiostat_example.txt
Outdated
This program traces virtio devices to analyze the IO operations and
throughput.
Maybe you can explain in more detail why these statistics are useful for finding or debugging a real issue. A real use-case justification is needed for a tool to be included in bcc tools.
@yonghong-song
I added some comments in the code, and I'm not sure whether I hit a bug...
@pizhenwei thanks for reporting. Indeed, the incorrect param value is caused by a bug in the bcc rewriter. I have just put up a fix, #3275. When I try to run the tool, I hit the following verifier bug:
I am using llvm12 and llvm13 (trunk) and both have issues. Let me dig into this more. I suspect this is related to the following code
which the verifier might not handle effectively.
I did some analysis of the verifier failure. It is caused by the following code:
Insn 100 saves r8, but r8 has a more conservative value at insn 91 than r1 does (suppose insn 96 evaluates to true). Such a conservative value makes the verifier think the loop can execute up to u64_max times, which causes the verification failure. We do have a pass in llvm12/13 intended to prevent such optimizations. I will look further into how to prevent this transformation in llvm. It looks like you are using an earlier version of llvm that does not do the above transformation.
Cool tool, thanks! The man page is missing a description of the columns. A later addition could be a -x mode for extended stats, to include average latency (similar to iostat).
I tested this on llvm-3.8 and it worked fine. I also tested on llvm-12 and reproduced this verification failure.
tools/virtiostat.py
Outdated
    }
}

int trace_virtqueue_add_sgs(struct pt_regs *ctx, struct virtqueue *_vq,
Please change the _vq argument to vq so that it is consistent with the rest of the arguments.
tools/virtiostat.py
Outdated
{
    /* NOTICE: to make sure your bcc is built with commit
       b2317868af8c6b81a5b5065237589743f7a1168d */
    record(_vq, sgs, out_sgs, in_sgs);
_vq => vq
tools/virtiostat.py
Outdated
{
    /* NOTICE: to make sure your bcc is built with commit
       b2317868af8c6b81a5b5065237589743f7a1168d */
Please remove the above comment. This commit will be merged on top of the above commit, so there is no need to mention it.
@pizhenwei please also incorporate the following changes in your next revision. llvm and kernel work to address these issues is ongoing. Thanks!
[buildbot, ok to test]
Please consider addressing the following comments from @brendangregg
I tried using it, and as an iostat-like tool I found that the options weren't what I expected. Instead of:
I'd consider:
i.e.
for consistency with the other stat tools. Generally I'd reserve the tool arguments for the most common use case, and then use options (like -T) for everything else. That saves typing. I think the most common use case here would be setting the interval and count, so we can drop the -i and -d. (Note that this switches a duration to a count.)
I ran
Add some comment in description about
Let's make sure the basic version works fine first. Implementing -x mode requires recording the request start and completion times (which means tracing another function), so I plan to implement it in another PR.
if (out_sgs)
    out_bw = count_len(sgs, out_sgs);
if (in_sgs)
    in_bw = count_len(sgs + out_sgs, in_sgs);
The above code failed with llvm12 in my environment. Could you try my original suggestion in your environment?
If it works, let us use that. We can revisit once we have a fix in llvm or the verifier.
Also, it looks like count_len for in_bw is different from the previous version. I checked the kernel source code, and this version seems correct.
The above code failed with llvm12 in my environment. Could you try my original suggestion in your environment?
If it works, let us use that. We can revisit once we have a fix in llvm or the verifier.
I tested this with LLVM version 12.0.0 and it worked fine, but I have still modified it as you suggested.
LLVM (http://llvm.org/):
LLVM version 12.0.0
Optimized build.
Default target: x86_64-pc-linux-gnu
Host CPU: skylake
Also, it looks like count_len for in_bw is different from the previous version. I checked the kernel source code, and this version seems correct.
Yes, this version is correct.
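To make the in_bw/out_bw split concrete, here is a pure-Python model of what the reviewed snippet computes. It is a sketch under assumptions, not the tool's BPF code: each scatterlist is modelled as a plain list of entry lengths, the constants reuse the values from the earlier snippet, and treating SG_MAX as a per-list entry cap is an assumption. The point it shows is that the device-writable lists start right after the out_sgs device-readable ones, which is why the second call uses sgs + out_sgs.

```python
VIRTIO_MAX_SGS = 128  # cap on scatterlists walked (value from the reviewed snippet)
SG_MAX = 18           # assumed cap on entries walked per scatterlist


def count_len(sgs, num):
    """Sum entry lengths over the first `num` scatterlists, with fixed caps."""
    total = 0
    for i in range(min(num, VIRTIO_MAX_SGS, len(sgs))):
        for length in sgs[i][:SG_MAX]:
            total += length
    return total


def record(sgs, out_sgs, in_sgs):
    """Return (in_bw, out_bw) for one request, mirroring the snippet above."""
    out_bw = count_len(sgs, out_sgs) if out_sgs else 0
    # Writable ("in") lists follow the out_sgs readable ("out") lists.
    in_bw = count_len(sgs[out_sgs:], in_sgs) if in_sgs else 0
    return in_bw, out_bw


# Two readable lists (12 and 1500 + 64 bytes) followed by one writable list.
print(record([[12], [1500, 64], [4096]], out_sgs=2, in_sgs=1))
```

The fixed caps stand in for the bounded loops a BPF program needs; in Python they only limit how much of the input is counted.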
From https://github.com/iovisor/bcc/blob/master/CONTRIBUTING-SCRIPTS.md:
Other tools use -i -d when there is a different common case. E.g., funccount(8), where the common case is the function you want to trace, leaving interval/duration for the -i -d switches.
BTW, command line tools, on both Unix and Windows, frequently use this convention. For example, this is not their usage:
They have dropped the switch for the common case.
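The convention described here (positional arguments for the common case, switches for the rest) can be sketched with argparse. This is a hypothetical parser, not the tool's actual option handling: the default interval of 3 seconds and the meaning given to -T (print a timestamp) are assumptions made for the example.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Sketch of iostat-style arguments (hypothetical)")
    # Common case as optional positionals: no -i/-d switches to type.
    parser.add_argument("interval", nargs="?", type=int, default=3,
                        help="output interval, in seconds (assumed default)")
    parser.add_argument("count", nargs="?", type=int, default=None,
                        help="number of outputs (default: until interrupted)")
    # Less common behaviour stays behind an option.
    parser.add_argument("-T", "--timestamp", action="store_true",
                        help="show a timestamp on output (assumed meaning)")
    return parser


args = build_parser().parse_args(["1", "5"])
print(args.interval, args.count, args.timestamp)
```

With this layout both arguments can be omitted, and a single number is read as the interval.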
Add a new tool virtiostat to trace VIRTIO devices IO statistics.

Although we have already had iostat (to show block device statistics) and iftop (to show network device statistics), other devices of the VIRTIO family (e.g. console, balloon, GPU and so on) also need tools for each type. virtiostat works in the virtio lower layer, so it can trace all of these devices.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Force pushed as you suggested. Thanks a lot.
LGTM. Thanks!
Thanks!
The Select insn in BPF is expensive as BPF backend needs to resolve with conditionals. This patch set the getCmpSelInstrCost() to SCEVCheapExpansionBudget for Select insn to prevent some Select insn related optimizations.

This change is motivated during bcc code review for iovisor/bcc#3270 where IndVarSimplifyPass eventually caused generating the following asm code:
  ;   for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) {
      14:       16 05 40 00 00 00 00 00 if w5 == 0 goto +64 <LBB0_6>
      15:       bc 51 00 00 00 00 00 00 w1 = w5
      16:       04 01 00 00 ff ff ff ff w1 += -1
      17:       67 05 00 00 20 00 00 00 r5 <<= 32
      18:       77 05 00 00 20 00 00 00 r5 >>= 32
      19:       a6 01 01 00 05 00 00 00 if w1 < 5 goto +1 <LBB0_4>
      20:       b7 05 00 00 06 00 00 00 r5 = 6

  00000000000000a8 <LBB0_4>:
      21:       b7 02 00 00 00 00 00 00 r2 = 0
      22:       b7 01 00 00 00 00 00 00 r1 = 0
  ;   for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) {
      23:       7b 1a e0 ff 00 00 00 00 *(u64 *)(r10 - 32) = r1
      24:       7b 5a c0 ff 00 00 00 00 *(u64 *)(r10 - 64) = r5

Note that insn #15 has w1 = w5 and w1 is refined later but r5(w5) is eventually saved on stack at insn #24 for later use. This cause later verifier failures.

With this change, IndVarSimplifyPass won't do the above transformation any more.

Differential Revision: https://reviews.llvm.org/D97479
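The note about insn #15 and insn #24 can be illustrated with a toy model. This is only a sketch of per-register (lo, hi) range tracking in which no relation is kept between registers; it is a deliberate simplification, not the kernel verifier's actual algorithm, and the function name is hypothetical. It follows insns 14 to 19 along the path where the branch at insn 19 is taken.

```python
U32_MAX = 2**32 - 1


def ranges_on_taken_branch():
    """Toy per-register (lo, hi) tracking along insns 14-19, branch taken."""
    w5 = (1, U32_MAX)            # insn 14 fell through, so w5 != 0
    w1 = w5                      # insn 15: w1 = w5 (a copy; no link is kept)
    w1 = (w1[0] - 1, w1[1] - 1)  # insn 16: w1 += -1
    w1 = (w1[0], min(w1[1], 4))  # insn 19 taken: w1 < 5
    return w1, w5


w1, w5 = ranges_on_taken_branch()
print("w1:", w1)  # narrowed by the comparison
print("w5:", w5)  # never narrowed, yet r5 is what insn 24 spills
```

In this model the comparison narrows only w1, so the value later spilled from r5 still carries the wide range, which matches the "less refined value" the commit message describes.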
The original bcc pull request iovisor/bcc#3270 exposed a verifier failure with Clang 12/13 while Clang 4 works fine. Further investigation exposed two issues.
  Issue 1: LLVM may generate code which uses less refined value. The issue is fixed in llvm patch https://reviews.llvm.org/D97479
  Issue 2: Spills with initial value 0 are marked as precise which makes later state pruning less effective. This is my rough initial analysis and further investigation is needed to find how to improve verifier pruning in such cases.

With the above llvm patch, for the new loop6.c test, which has smaller loop bound compared to original test, I got
  $ test_progs -s -n 10/16
  ...
  stack depth 64
  processed 405099 insns (limit 1000000) max_states_per_insn 92 total_states 8866 peak_states 889 mark_read 6
  #10/16 loop6.o:OK

Use the original loop bound, i.e., commenting out "#define WORKAROUND", I got
  $ test_progs -s -n 10/16
  ...
  BPF program is too large. Processed 1000001 insn
  stack depth 64
  processed 1000001 insns (limit 1000000) max_states_per_insn 91 total_states 23176 peak_states 5069 mark_read 6
  ...
  #10/16 loop6.o:FAIL

The purpose of this patch is to provide a regression test for the above llvm fix and also provide a test case for further analyzing the verifier pruning issue.

Cc: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
The original bcc pull request iovisor/bcc#3270 exposed a verifier failure with Clang 12/13 while Clang 4 works fine. Further investigation exposed two issues:
  Issue 1: LLVM may generate code which uses less refined value. The issue is fixed in LLVM patch: https://reviews.llvm.org/D97479
  Issue 2: Spills with initial value 0 are marked as precise which makes later state pruning less effective. This is my rough initial analysis and further investigation is needed to find how to improve verifier pruning in such cases.

With the above LLVM patch, for the new loop6.c test, which has smaller loop bound compared to original test, I got:
  $ test_progs -s -n 10/16
  ...
  stack depth 64
  processed 390735 insns (limit 1000000) max_states_per_insn 87 total_states 8658 peak_states 964 mark_read 6
  #10/16 loop6.o:OK

Use the original loop bound, i.e., commenting out "#define WORKAROUND", I got:
  $ test_progs -s -n 10/16
  ...
  BPF program is too large. Processed 1000001 insn
  stack depth 64
  processed 1000001 insns (limit 1000000) max_states_per_insn 91 total_states 23176 peak_states 5069 mark_read 6
  ...
  #10/16 loop6.o:FAIL

The purpose of this patch is to provide a regression test for the above LLVM fix and also provide a test case for further analyzing the verifier pruning issue.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Zhenwei Pi <pizhenwei@bytedance.com>
Link: https://lore.kernel.org/bpf/20210226223810.236472-1-yhs@fb.com
Add a new tool, virtiostat, to trace I/O statistics of VIRTIO devices.
Although we already have iostat (for block device statistics) and
iftop (for network device statistics), the other devices in the VIRTIO
family (e.g., console, balloon, GPU and so on) would each need a tool
of their own. virtiostat works at the virtio lower layer, so it can
trace all of these devices.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>