Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tools/virtiostat: add virtiostat tool #3270

Merged
merged 1 commit into from
Feb 22, 2021
Merged

tools/virtiostat: add virtiostat tool #3270

merged 1 commit into from
Feb 22, 2021

Conversation

pizhenwei
Copy link
Contributor

Add a new tool virtiostat to trace VIRTIO devices IO statistics.

Although we have already had iostat(to show block device statistics),
iftop(to show network device statistics), other devices of VIRTIO
family(Ex, console, balloon, GPU and so on) also need tools for each
type. virtiostat works in virtio lower layer, and it could trace all
the devices.

Signed-off-by: zhenwei pi pizhenwei@bytedance.com

@yonghong-song
Copy link
Collaborator

[buildbot, test this please]

@pizhenwei
Copy link
Contributor Author

[buildbot, test this please]

@yonghong-song
Hi, testbot reports an error on ubuntu1710 (and only on ubuntu1710)

8: FAIL: test_jumps (__main__.TestBPFSocket)
8: ----------------------------------------------------------------------
8: Traceback (most recent call last):
8:   File "/home/iovisor/jenkins/workspace/bcc-pr/label/ubuntu1710/tests/python/test_call1.py", line 43, in test_jumps
8:     self.assertGreater(self.stats[c_int(S_IP)].value, 0)
8: AssertionError: 0L not greater than 0

I tried to build ubuntu-1710 environment to reproduce this issue, but apt source has been broken. Could you give me a hint why this test fails?

@yonghong-song
Copy link
Collaborator

I could be a timing issue. networking packet did not go through yet and then socket is closed. Let me try again. It should not be related to your patch.

@yonghong-song
Copy link
Collaborator

[buildbot, test this please]

#include <bcc/proto.h>

#define VIRTIO_MAX_SGS 128
#define SG_MAX 18
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments where these constants come from?

static void record(struct virtqueue *vq,
struct scatterlist **sgs,
unsigned int out_sgs,
unsigned int in_sgs)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need in four different lines for parameters, you have two lines with each line two parameters.

unsigned int in_sgs,
void *data,
gfp_t gfp)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra line is not needed.

unsigned int out_sgs,
unsigned int in_sgs,
void *data,
gfp_t gfp)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need one parameter per line.


{
unsigned int _out_sgs = PT_REGS_PARM3(ctx);
unsigned int _in_sgs = PT_REGS_PARM4(ctx);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why you need the above two lines to get _out_sgs and _in_sgs? They are in argument list and you should be able to use out_sgs and in_sgs directly?

gfp_t gfp)
{
record(vq, &sg, 0, 1);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for extra line

else:
print("--------", end="\n")

print("%16s %12s %12s %12s %12s %16s %16s" % ("Driver", "Device", "VQ Name", "In SGs", "Out SGs", "In BW", "Out BW"), end="\n")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

long line, break into two lines?

print("%16s %12s %12s %12s %12s %16s %16s" % ("Driver", "Device", "VQ Name", "In SGs", "Out SGs", "In BW", "Out BW"), end="\n")
stats = b.get_table("stats")
for k, v in sorted(stats.items(), key=lambda vs: vs[1].dev):
print("%16s %12s %12s %12d %12d %16d %16d" % (v.driver, v.dev, v.vqname, v.in_sgs, v.out_sgs, v.in_bw, v.out_bw), end="\n")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

break into two lines?

--------
Driver Device VQ Name In SGs Out SGs In BW Out BW
virtio_net virtio0 output.2 0 9 0 832
virtio_net virtio0 output.3 0 92 0 13053
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let us ensure default output has at most 80 char's per line. Here we exceeded it.



This program traces virtio devices to analyze the IO operations and
throughput.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you can give more explanation why these statistics are useful to find/debug a real issue. Real use case justification is needed to be included in bcc tools.

@pizhenwei
Copy link
Contributor Author

@yonghong-song
Refine codes/docs as you suggested except

    unsigned int _out_sgs = PT_REGS_PARM3(ctx);
    unsigned int _in_sgs = PT_REGS_PARM4(ctx);

Add some comments in code, and I'm not sure if I hit a bug....
Two cases I have tested:
1, get out_sgs from args and PT_REGS_PARMx, and generate perf event. there is a probability that getting different value.
2, building a kmod and also get different value.

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/uaccess.h>
#include <linux/scatterlist.h>

static int kp_virtqueue_add_sgs_handler(struct kprobe *p, struct pt_regs *regs)
{
	struct scatterlist *sg = (struct scatterlist *)regs->si;
	int out = (int)regs->dx;
	int in = (int)regs->cx;

	printk("sg %p, out %d, in %d\n", sg, out, in);

	return 0;
}

static struct kprobe kp = {
	.symbol_name	= "virtqueue_add_sgs",
	.pre_handler = kp_virtqueue_add_sgs_handler,
};

static int __init kprobe_init(void)
{
	int ret;

	ret = register_kprobe(&kp);
	if (ret < 0) {
		pr_err("register_kprobe failed, returned %d\n", ret);
		return ret;
	}

	pr_info("[probe-unlink]Planted kprobe at %p\n", kp.addr);
	return 0;
}

static void __exit kprobe_exit(void)
{
	unregister_kprobe(&kp);
	pr_info("[probe-unregistered]kprobe at %p unregistered\n", kp.addr);
}

module_init(kprobe_init)
module_exit(kprobe_exit)
MODULE_LICENSE("GPL");
MODULE_AUTHOR("pizhenwei pizhenwei@bytedance.com");

@yonghong-song
Copy link
Collaborator

@pizhenwei thanks for reporting. Indeed, the incorrect param value is caused by a bug in bcc rewriter. I have just put a fix #3275.

When I try to run, I hit the following verifier bug:

-bash-4.4$ sudo ./virtiostat.py 
bpf: Argument list too long. Program  too large (402 insns), at most 4096 insns

Traceback (most recent call last):
  File "./virtiostat.py", line 184, in <module>
    b.attach_kprobe(event="virtqueue_add_sgs", fn_name="trace_virtqueue_add_sgs")
  File "/usr/lib/python2.7/site-packages/bcc/__init__.py", line 679, in attach_kprobe
    fn = self.load_func(fn_name, BPF.KPROBE)
  File "/usr/lib/python2.7/site-packages/bcc/__init__.py", line 412, in load_func
    (func_name, errstr))
Exception: Failed to load BPF program trace_virtqueue_add_sgs: Argument list too long
-bash-4.4$

I am using llvm12 and llvm13(trunk) and both have issues. Let me dig more on this. I suspect this is related to the following code

    for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) {
        for (n = 0, sgp = sgs[i]; sgp && (n < SG_MAX); sgp = __sg_next(sgp)) {
            length += sgp->length;
            n++;
        }
    }

where verifier might not handle effectively.

@yonghong-song
Copy link
Collaborator

I did some analysis on verifier failure. The reason is due to the following code:

; for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) { // Line  60
  90:   15 08 15 00 00 00 00 00 if r8 == 0 goto +21
  91:   bf 81 00 00 00 00 00 00 r1 = r8
  92:   07 01 00 00 ff ff ff ff r1 += -1
  93:   67 01 00 00 20 00 00 00 r1 <<= 32
  94:   77 01 00 00 20 00 00 00 r1 >>= 32
  95:   b7 02 00 00 05 00 00 00 r2 = 5
  96:   2d 12 01 00 00 00 00 00 if r2 > r1 goto +1
  97:   b7 08 00 00 06 00 00 00 r8 = 6
  98:   b7 02 00 00 00 00 00 00 r2 = 0
  99:   b7 09 00 00 00 00 00 00 r9 = 0
 100:   7b 8a 68 ff 00 00 00 00 *(u64 *)(r10 - 152) = r8
 101:   05 00 35 00 00 00 00 00 goto +53

Insn 100 saves r8, but r8 has conservative value at insn 91 (suppose insn 96 evaluates true) compared to r1. Such a conservative value will make verifier think the loop can execute up to u64_max which will cause verification failure.

We do have a pass in llvm12/13 intended to prevent such optimization. I will study more with llvm on how to prevent this transformation. Looks like you are using an early version of llvm which didn't do the above transformation..

@brendangregg
Copy link
Member

Cool tool, thanks!

The man page is missing a description of the columns.

A later addition could be a -x mode for extended stats, to include average latency (similar to iostat).

@pizhenwei
Copy link
Contributor Author

I did some analysis on verifier failure. The reason is due to the following code:

; for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) { // Line  60
  90:   15 08 15 00 00 00 00 00 if r8 == 0 goto +21
  91:   bf 81 00 00 00 00 00 00 r1 = r8
  92:   07 01 00 00 ff ff ff ff r1 += -1
  93:   67 01 00 00 20 00 00 00 r1 <<= 32
  94:   77 01 00 00 20 00 00 00 r1 >>= 32
  95:   b7 02 00 00 05 00 00 00 r2 = 5
  96:   2d 12 01 00 00 00 00 00 if r2 > r1 goto +1
  97:   b7 08 00 00 06 00 00 00 r8 = 6
  98:   b7 02 00 00 00 00 00 00 r2 = 0
  99:   b7 09 00 00 00 00 00 00 r9 = 0
 100:   7b 8a 68 ff 00 00 00 00 *(u64 *)(r10 - 152) = r8
 101:   05 00 35 00 00 00 00 00 goto +53

Insn 100 saves r8, but r8 has conservative value at insn 91 (suppose insn 96 evaluates true) compared to r1. Such a conservative value will make verifier think the loop can execute up to u64_max which will cause verification failure.

We do have a pass in llvm12/13 intended to prevent such optimization. I will study more with llvm on how to prevent this transformation. Looks like you are using an early version of llvm which didn't do the above transformation..

I tested this on llvm-3.8, it worked fine. And tested on llvm-12, also reproduced this verification failure.

}
}

int trace_virtqueue_add_sgs(struct pt_regs *ctx, struct virtqueue *_vq,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please change _vq to vq, so argument, so it can be consistent with rest of arguments.

{
/* NOTICE: to make sure your bcc is built with commit
b2317868af8c6b81a5b5065237589743f7a1168d */
record(_vq, sgs, out_sgs, in_sgs);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_vq => vq


{
/* NOTICE: to make sure your bcc is built with commit
b2317868af8c6b81a5b5065237589743f7a1168d */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the above comment. This commit will be merged on top of the above commit, so no need to mention that.

@yonghong-song
Copy link
Collaborator

@pizhenwei please also incorporate the following changes in your next revision. llvm and kernel work to address these issues are ongoing. Thanks!

diff --git a/tools/virtiostat.py b/tools/virtiostat.py
index a6a38bd1..1b5fbccf 100755
--- a/tools/virtiostat.py
+++ b/tools/virtiostat.py
@@ -95,6 +95,28 @@ static u64 count_len(struct scatterlist **sgs, unsigned int num)
             length += sgp->length;
             n++;
         }
+
+        /* IndVarSimplifyPass with clang 12 may cause verifier failure:
+         *   ; for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) { // Line  60
+         *   90:   15 08 15 00 00 00 00 00 if r8 == 0 goto +21
+         *   91:   bf 81 00 00 00 00 00 00 r1 = r8
+         *   92:   07 01 00 00 ff ff ff ff r1 += -1
+         *   93:   67 01 00 00 20 00 00 00 r1 <<= 32
+         *   94:   77 01 00 00 20 00 00 00 r1 >>= 32
+         *   95:   b7 02 00 00 05 00 00 00 r2 = 5
+         *   96:   2d 12 01 00 00 00 00 00 if r2 > r1 goto +1
+         *   97:   b7 08 00 00 06 00 00 00 r8 = 6
+         *   98:   b7 02 00 00 00 00 00 00 r2 = 0
+         *   99:   b7 09 00 00 00 00 00 00 r9 = 0
+         *  100:   7b 8a 68 ff 00 00 00 00 *(u64 *)(r10 - 152) = r8
+         *  101:   05 00 35 00 00 00 00 00 goto +53
+         * Note that r1 is refined by r8 is saved to stack for later use.
+         * This will give verifier u64_max loop bound and eventually cause
+         * verification failure. Workaround with the below asm code.
+         */
+#if __clang_major__ >= 7
+        asm volatile("" : "=r"(i) : "0"(i));
+#endif
     }
 
     return length;
@@ -107,6 +129,13 @@ static void record(struct virtqueue *vq, struct scatterlist **sgs,
     virtio_stat_t *vs;
     u64 key = (u64)vq;
+    /* Workaround: separate two count_len() calls, one here and the
+     * other below. Otherwise, compiler may generate some spills which
+     * harms verifier pruning. This happens in llvm12, but not llvm4.
+     * Below code works on both cases.
+     */
+    u64 in_bw = count_len(sgs, in_sgs);
+
     vs = stats.lookup(&key);
     if (!vs) {
         bpf_probe_read_kernel(newvs.driver, sizeof(newvs.driver), vq->vdev->dev.driver->name);
@@ -114,13 +143,13 @@ static void record(struct virtqueue *vq, struct scatterlist **sgs,
         bpf_probe_read_kernel(newvs.vqname, sizeof(newvs.vqname), vq->name);
         newvs.in_sgs = in_sgs;
         newvs.out_sgs = out_sgs;
-        newvs.in_bw = count_len(sgs, in_sgs);
+        newvs.in_bw = in_bw;
         newvs.out_bw = count_len(sgs, out_sgs);
         stats.update(&key, &newvs);
     } else {
         vs->in_sgs += in_sgs;
         vs->out_sgs += out_sgs;
-        vs->in_bw += count_len(sgs, in_sgs);
+        vs->in_bw += in_bw;
         vs->out_bw += count_len(sgs, out_sgs);
     }
 }

@yonghong-song
Copy link
Collaborator

[buildbot, ok to test]

@yonghong-song
Copy link
Collaborator

Please consider to address the following comments from @brendangregg

  • The man page is missing a description of the columns.
  • A later addition could be a -x mode for extended stats, to include average latency (similar to iostat).

@brendangregg
Copy link
Member

I tried using it, and as an iostat-like tool I found the options weren't what I expected. Instead of:

virtiostat
virtiostat -i 2
virtiostat -i 2 -d 10

I'd consider:

virtiostat
virtiostat 2
virtiostat 2 5

ie

virtiostat [interval [count]]

for consistency with the other stat tools.

Generally I'd reserve the tool arguments for the most common use case, and then options (like -T) for everything else. That saves typing. I think the most common use case here would be setting the interval and count, so we can drop the -i and -d. (note that this switches a duration to a count.)

@pizhenwei
Copy link
Contributor Author

I tried using it, and as an iostat-like tool I found the options weren't what I expected. Instead of:

virtiostat
virtiostat -i 2
virtiostat -i 2 -d 10

I'd consider:

virtiostat
virtiostat 2
virtiostat 2 5

ie

virtiostat [interval [count]]

for consistency with the other stat tools.

Generally I'd reserve the tool arguments for the most common use case, and then options (like -T) for everything else. That saves typing. I think the most common use case here would be setting the interval and count, so we can drop the -i and -d. (note that this switches a duration to a count.)

I ran grep "parser.add_argument" *.py | grep grep -E "interval|duration" in tools directory, and noticed that a lot of tools use -i/-d.
Making virtiostat look like iostat may break the hobby of the bcc tools.

@pizhenwei
Copy link
Contributor Author

Please consider to address the following comments from @brendangregg

  • The man page is missing a description of the columns.

Add some comment in description about SGs and BW.

  • A later addition could be a -x mode for extended stats, to include average latency (similar to iostat).

Let's make sure the basic version work fine firstly. To implement -x mode needs record the request starting time and done time (to track another function). So in my plan, I'd like to implement this in another PR.

if (out_sgs)
out_bw = count_len(sgs, out_sgs);
if (in_sgs)
in_bw = count_len(sgs + out_sgs, in_sgs);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above code failed with llvm12 in my environment. Could you try my original suggestion in your environment?
if it works, let us use that. We can revisit once we got some fix in llvm or verifier.

Also, it looks count_len for in_bw is different from previous version. I checked the kernel source code. This version seems correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above code failed with llvm12 in my environment. Could you try my original suggestion in your environment?
if it works, let us use that. We can revisit once we got some fix in llvm or verifier.

I tested this with LLVM version 12.0.0, it worked fine .... But I still modify it as your suggestion.

LLVM (http://llvm.org/):
  LLVM version 12.0.0
  
  Optimized build.
  Default target: x86_64-pc-linux-gnu
  Host CPU: skylake

Also, it looks count_len for in_bw is different from previous version. I checked the kernel source code. This version seems correct.

Yes, this version is correct.

@brendangregg
Copy link
Member

I ran grep "parser.add_argument" *.py | grep grep -E "interval|duration" in tools directory, and noticed that a lot of tools use -i/-d.
Making virtiostat look like iostat may break the hobby of the bcc tools.

From https://github.com/iovisor/bcc/blob/master/CONTRIBUTING-SCRIPTS.md:

Consider command line options. Should it have -p for filtering on a PID? -T for timestamps? -i for interval? See other tools for examples, and copy the style: the usage message should list example usage at the end. Remember to keep the tool doing one thing and doing it well. Also, if there's one option that seems to be the common case, perhaps it should just be the first argument and not need a switch (no -X). A special case of this is *stat tools, like iostat/vmstat/etc, where the convention is [interval [count]].

Other tools use -i -d when there is a different common case. E.g., funccount(8), where the common case is the function you want to trace, leaving interval/duration for the -i -d switches.

@brendangregg
Copy link
Member

BTW, command line tools, on both Unix and Windows, frequently use this convention. For example, this is not their usage:

ls -l -file myfile.txt
cd -d c:\Windows

They have dropped the switch for the common case.

Add a new tool virtiostat to trace VIRTIO devices IO statistics.

Although we have already had iostat(to show block device statistics),
iftop(to show network device statistics), other devices of VIRTIO
family(Ex, console, balloon, GPU and so on) also need tools for each
type. virtiostat works in virtio lower layer, and it could trace all
the devices.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
@pizhenwei
Copy link
Contributor Author

I ran grep "parser.add_argument" *.py | grep grep -E "interval|duration" in tools directory, and noticed that a lot of tools use -i/-d.
Making virtiostat look like iostat may break the hobby of the bcc tools.

From https://github.com/iovisor/bcc/blob/master/CONTRIBUTING-SCRIPTS.md:

Consider command line options. Should it have -p for filtering on a PID? -T for timestamps? -i for interval? See other tools for examples, and copy the style: the usage message should list example usage at the end. Remember to keep the tool doing one thing and doing it well. Also, if there's one option that seems to be the common case, perhaps it should just be the first argument and not need a switch (no -X). A special case of this is *stat tools, like iostat/vmstat/etc, where the convention is [interval [count]].

Other tools use -i -d when there is a different common case. E.g., funccount(8), where the common case is the function you want to trace, leaving interval/duration for the -i -d switches.

Force pushed as you suggested. Thanks a lot.

usage: virtiostat.py [-h] [-T] [-D] [interval] [count]

@yonghong-song
Copy link
Collaborator

LGTM. Thanks!

@yonghong-song yonghong-song merged commit 8614895 into iovisor:master Feb 22, 2021
@brendangregg
Copy link
Member

Thanks!

yonghong-song added a commit to llvm/llvm-project that referenced this pull request Feb 25, 2021
The Select insn in BPF is expensive as BPF backend
needs to resolve with conditionals.  This patch set
the getCmpSelInstrCost() to SCEVCheapExpansionBudget
for Select insn to prevent some Select insn related
optimizations.

This change is motivated during bcc code review for
   iovisor/bcc#3270
where IndVarSimplifyPass eventually caused generating
the following asm code:
  ;       for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) {
      14:       16 05 40 00 00 00 00 00 if w5 == 0 goto +64 <LBB0_6>
      15:       bc 51 00 00 00 00 00 00 w1 = w5
      16:       04 01 00 00 ff ff ff ff w1 += -1
      17:       67 05 00 00 20 00 00 00 r5 <<= 32
      18:       77 05 00 00 20 00 00 00 r5 >>= 32
      19:       a6 01 01 00 05 00 00 00 if w1 < 5 goto +1 <LBB0_4>
      20:       b7 05 00 00 06 00 00 00 r5 = 6
  00000000000000a8 <LBB0_4>:
      21:       b7 02 00 00 00 00 00 00 r2 = 0
      22:       b7 01 00 00 00 00 00 00 r1 = 0
  ;       for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) {
      23:       7b 1a e0 ff 00 00 00 00 *(u64 *)(r10 - 32) = r1
      24:       7b 5a c0 ff 00 00 00 00 *(u64 *)(r10 - 64) = r5
Note that insn #15 has w1 = w5 and w1 is refined later but r5(w5) is
eventually saved on stack at insn #24 for later use. This cause
later verifier failures.

With this change, IndVarSimplifyPass won't do the above
transformation any more.

Differential Revision: https://reviews.llvm.org/D97479
fengguang pushed a commit to 0day-ci/linux that referenced this pull request Feb 26, 2021
The orignal bcc pull request
  iovisor/bcc#3270
exposed a verifier failure with Clang 12/13 while
Clang 4 works fine. Further investigation exposed two issues.
  Issue 1: LLVM may generate code which uses less refined
     value. The issue is fixed in llvm patch
     https://reviews.llvm.org/D97479
  Issue 2: Spills with initial value 0 are marked as precise
     which makes later state pruning less effective.
     This is my rough initial analysis and further investigation
     is needed to find how to improve verifier pruning
     in such cases.

With the above llvm patch, for the new loop6.c test, which has
smaller loop bound compared to original test, I got
  $ test_progs -s -n 10/16
  ...
  stack depth 64
  processed 405099 insns (limit 1000000) max_states_per_insn 92
      total_states 8866 peak_states 889 mark_read 6
  torvalds#10/16 loop6.o:OK

Use the original loop bound, i.e., commenting out "#define WORKAROUND",
I got
  $ test_progs -s -n 10/16
  ...
  BPF program is too large. Processed 1000001 insn
  stack depth 64
  processed 1000001 insns (limit 1000000) max_states_per_insn 91
      total_states 23176 peak_states 5069 mark_read 6
  ...
  torvalds#10/16 loop6.o:FAIL

The purpose of this patch is to provide a regression
test for the above llvm fix and also provide a test
case for further analyzing the verifier pruning issue.

Cc: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
kernel-patches-bot pushed a commit to kernel-patches/bpf that referenced this pull request Feb 26, 2021
The orignal bcc pull request
  iovisor/bcc#3270
exposed a verifier failure with Clang 12/13 while
Clang 4 works fine. Further investigation exposed two issues.
  Issue 1: LLVM may generate code which uses less refined
     value. The issue is fixed in llvm patch
     https://reviews.llvm.org/D97479
  Issue 2: Spills with initial value 0 are marked as precise
     which makes later state pruning less effective.
     This is my rough initial analysis and further investigation
     is needed to find how to improve verifier pruning
     in such cases.

With the above llvm patch, for the new loop6.c test, which has
smaller loop bound compared to original test, I got
  $ test_progs -s -n 10/16
  ...
  stack depth 64
  processed 405099 insns (limit 1000000) max_states_per_insn 92
      total_states 8866 peak_states 889 mark_read 6
  #10/16 loop6.o:OK

Use the original loop bound, i.e., commenting out "#define WORKAROUND",
I got
  $ test_progs -s -n 10/16
  ...
  BPF program is too large. Processed 1000001 insn
  stack depth 64
  processed 1000001 insns (limit 1000000) max_states_per_insn 91
      total_states 23176 peak_states 5069 mark_read 6
  ...
  #10/16 loop6.o:FAIL

The purpose of this patch is to provide a regression
test for the above llvm fix and also provide a test
case for further analyzing the verifier pruning issue.

Cc: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
kernel-patches-bot pushed a commit to kernel-patches/bpf that referenced this pull request Feb 26, 2021
The orignal bcc pull request
  iovisor/bcc#3270
exposed a verifier failure with Clang 12/13 while
Clang 4 works fine. Further investigation exposed two issues.
  Issue 1: LLVM may generate code which uses less refined
     value. The issue is fixed in llvm patch
     https://reviews.llvm.org/D97479
  Issue 2: Spills with initial value 0 are marked as precise
     which makes later state pruning less effective.
     This is my rough initial analysis and further investigation
     is needed to find how to improve verifier pruning
     in such cases.

With the above llvm patch, for the new loop6.c test, which has
smaller loop bound compared to original test, I got
  $ test_progs -s -n 10/16
  ...
  stack depth 64
  processed 405099 insns (limit 1000000) max_states_per_insn 92
      total_states 8866 peak_states 889 mark_read 6
  #10/16 loop6.o:OK

Use the original loop bound, i.e., commenting out "#define WORKAROUND",
I got
  $ test_progs -s -n 10/16
  ...
  BPF program is too large. Processed 1000001 insn
  stack depth 64
  processed 1000001 insns (limit 1000000) max_states_per_insn 91
      total_states 23176 peak_states 5069 mark_read 6
  ...
  #10/16 loop6.o:FAIL

The purpose of this patch is to provide a regression
test for the above llvm fix and also provide a test
case for further analyzing the verifier pruning issue.

Cc: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
kernel-patches-bot pushed a commit to kernel-patches/bpf that referenced this pull request Feb 26, 2021
The orignal bcc pull request
  iovisor/bcc#3270
exposed a verifier failure with Clang 12/13 while
Clang 4 works fine. Further investigation exposed two issues.
  Issue 1: LLVM may generate code which uses less refined
     value. The issue is fixed in llvm patch
     https://reviews.llvm.org/D97479
  Issue 2: Spills with initial value 0 are marked as precise
     which makes later state pruning less effective.
     This is my rough initial analysis and further investigation
     is needed to find how to improve verifier pruning
     in such cases.

With the above llvm patch, for the new loop6.c test, which has
smaller loop bound compared to original test, I got
  $ test_progs -s -n 10/16
  ...
  stack depth 64
  processed 405099 insns (limit 1000000) max_states_per_insn 92
      total_states 8866 peak_states 889 mark_read 6
  #10/16 loop6.o:OK

Use the original loop bound, i.e., commenting out "#define WORKAROUND",
I got
  $ test_progs -s -n 10/16
  ...
  BPF program is too large. Processed 1000001 insn
  stack depth 64
  processed 1000001 insns (limit 1000000) max_states_per_insn 91
      total_states 23176 peak_states 5069 mark_read 6
  ...
  #10/16 loop6.o:FAIL

The purpose of this patch is to provide a regression
test for the above llvm fix and also provide a test
case for further analyzing the verifier pruning issue.

Cc: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
kernel-patches-bot pushed a commit to kernel-patches/bpf that referenced this pull request Feb 26, 2021
The orignal bcc pull request
  iovisor/bcc#3270
exposed a verifier failure with Clang 12/13 while
Clang 4 works fine. Further investigation exposed two issues.
  Issue 1: LLVM may generate code which uses less refined
     value. The issue is fixed in llvm patch
     https://reviews.llvm.org/D97479
  Issue 2: Spills with initial value 0 are marked as precise
     which makes later state pruning less effective.
     This is my rough initial analysis and further investigation
     is needed to find how to improve verifier pruning
     in such cases.

With the above llvm patch, for the new loop6.c test, which has
smaller loop bound compared to original test, I got
  $ test_progs -s -n 10/16
  ...
  stack depth 64
  processed 405099 insns (limit 1000000) max_states_per_insn 92
      total_states 8866 peak_states 889 mark_read 6
  #10/16 loop6.o:OK

Use the original loop bound, i.e., commenting out "#define WORKAROUND",
I got
  $ test_progs -s -n 10/16
  ...
  BPF program is too large. Processed 1000001 insn
  stack depth 64
  processed 1000001 insns (limit 1000000) max_states_per_insn 91
      total_states 23176 peak_states 5069 mark_read 6
  ...
  #10/16 loop6.o:FAIL

The purpose of this patch is to provide a regression
test for the above llvm fix and also provide a test
case for further analyzing the verifier pruning issue.

Cc: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
kernel-patches-bot pushed a commit to kernel-patches/bpf that referenced this pull request Feb 26, 2021
The orignal bcc pull request
  iovisor/bcc#3270
exposed a verifier failure with Clang 12/13 while
Clang 4 works fine. Further investigation exposed two issues.
  Issue 1: LLVM may generate code which uses less refined
     value. The issue is fixed in llvm patch
     https://reviews.llvm.org/D97479
  Issue 2: Spills with initial value 0 are marked as precise
     which makes later state pruning less effective.
     This is my rough initial analysis and further investigation
     is needed to find how to improve verifier pruning
     in such cases.

With the above llvm patch, for the new loop6.c test, which has
smaller loop bound compared to original test, I got
  $ test_progs -s -n 10/16
  ...
  stack depth 64
  processed 405099 insns (limit 1000000) max_states_per_insn 92
      total_states 8866 peak_states 889 mark_read 6
  #10/16 loop6.o:OK

Use the original loop bound, i.e., commenting out "#define WORKAROUND",
I got
  $ test_progs -s -n 10/16
  ...
  BPF program is too large. Processed 1000001 insn
  stack depth 64
  processed 1000001 insns (limit 1000000) max_states_per_insn 91
      total_states 23176 peak_states 5069 mark_read 6
  ...
  #10/16 loop6.o:FAIL

The purpose of this patch is to provide a regression
test for the above llvm fix and also provide a test
case for further analyzing the verifier pruning issue.

Cc: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
kernel-patches-bot pushed a commit to kernel-patches/bpf that referenced this pull request Feb 26, 2021
The orignal bcc pull request
  iovisor/bcc#3270
exposed a verifier failure with Clang 12/13 while
Clang 4 works fine. Further investigation exposed two issues.
  Issue 1: LLVM may generate code which uses less refined
     value. The issue is fixed in llvm patch
     https://reviews.llvm.org/D97479
  Issue 2: Spills with initial value 0 are marked as precise
     which makes later state pruning less effective.
     This is my rough initial analysis and further investigation
     is needed to find how to improve verifier pruning
     in such cases.

With the above llvm patch, for the new loop6.c test, which has
smaller loop bound compared to original test, I got
  $ test_progs -s -n 10/16
  ...
  stack depth 64
  processed 405099 insns (limit 1000000) max_states_per_insn 92
      total_states 8866 peak_states 889 mark_read 6
  #10/16 loop6.o:OK

Use the original loop bound, i.e., commenting out "#define WORKAROUND",
I got
  $ test_progs -s -n 10/16
  ...
  BPF program is too large. Processed 1000001 insn
  stack depth 64
  processed 1000001 insns (limit 1000000) max_states_per_insn 91
      total_states 23176 peak_states 5069 mark_read 6
  ...
  #10/16 loop6.o:FAIL

The purpose of this patch is to provide a regression
test for the above llvm fix and also provide a test
case for further analyzing the verifier pruning issue.

Cc: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
kernel-patches-bot pushed a commit to kernel-patches/bpf that referenced this pull request Feb 26, 2021
The orignal bcc pull request
  iovisor/bcc#3270
exposed a verifier failure with Clang 12/13 while
Clang 4 works fine. Further investigation exposed two issues.
  Issue 1: LLVM may generate code which uses less refined
     value. The issue is fixed in llvm patch
     https://reviews.llvm.org/D97479
  Issue 2: Spills with initial value 0 are marked as precise
     which makes later state pruning less effective.
     This is my rough initial analysis and further investigation
     is needed to find how to improve verifier pruning
     in such cases.

With the above llvm patch, for the new loop6.c test, which has
smaller loop bound compared to original test, I got
  $ test_progs -s -n 10/16
  ...
  stack depth 64
  processed 405099 insns (limit 1000000) max_states_per_insn 92
      total_states 8866 peak_states 889 mark_read 6
  #10/16 loop6.o:OK

Use the original loop bound, i.e., commenting out "#define WORKAROUND",
I got
  $ test_progs -s -n 10/16
  ...
  BPF program is too large. Processed 1000001 insn
  stack depth 64
  processed 1000001 insns (limit 1000000) max_states_per_insn 91
      total_states 23176 peak_states 5069 mark_read 6
  ...
  #10/16 loop6.o:FAIL

The purpose of this patch is to provide a regression
test for the above llvm fix and also provide a test
case for further analyzing the verifier pruning issue.

Cc: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
kernel-patches-bot pushed a commit to kernel-patches/bpf that referenced this pull request Feb 26, 2021
The original bcc pull request
  iovisor/bcc#3270
exposed a verifier failure with Clang 12/13 while
Clang 4 works fine. Further investigation exposed two issues.
  Issue 1: LLVM may generate code which uses less refined
     value. The issue is fixed in llvm patch
     https://reviews.llvm.org/D97479
  Issue 2: Spills with initial value 0 are marked as precise
     which makes later state pruning less effective.
     This is my rough initial analysis and further investigation
     is needed to find how to improve verifier pruning
     in such cases.

With the above llvm patch, for the new loop6.c test, which has
smaller loop bound compared to original test, I got
  $ test_progs -s -n 10/16
  ...
  stack depth 64
  processed 390735 insns (limit 1000000) max_states_per_insn 87
      total_states 8658 peak_states 964 mark_read 6
  #10/16 loop6.o:OK

Use the original loop bound, i.e., commenting out "#define WORKAROUND",
I got
  $ test_progs -s -n 10/16
  ...
  BPF program is too large. Processed 1000001 insn
  stack depth 64
  processed 1000001 insns (limit 1000000) max_states_per_insn 91
      total_states 23176 peak_states 5069 mark_read 6
  ...
  #10/16 loop6.o:FAIL

The purpose of this patch is to provide a regression
test for the above llvm fix and also provide a test
case for further analyzing the verifier pruning issue.

Cc: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
fengguang pushed a commit to 0day-ci/linux that referenced this pull request Feb 26, 2021
The original bcc pull request
  iovisor/bcc#3270
exposed a verifier failure with Clang 12/13 while
Clang 4 works fine. Further investigation exposed two issues.
  Issue 1: LLVM may generate code which uses less refined
     value. The issue is fixed in llvm patch
     https://reviews.llvm.org/D97479
  Issue 2: Spills with initial value 0 are marked as precise
     which makes later state pruning less effective.
     This is my rough initial analysis and further investigation
     is needed to find how to improve verifier pruning
     in such cases.

With the above llvm patch, for the new loop6.c test, which has
smaller loop bound compared to original test, I got
  $ test_progs -s -n 10/16
  ...
  stack depth 64
  processed 390735 insns (limit 1000000) max_states_per_insn 87
      total_states 8658 peak_states 964 mark_read 6
  torvalds#10/16 loop6.o:OK

Use the original loop bound, i.e., commenting out "#define WORKAROUND",
I got
  $ test_progs -s -n 10/16
  ...
  BPF program is too large. Processed 1000001 insn
  stack depth 64
  processed 1000001 insns (limit 1000000) max_states_per_insn 91
      total_states 23176 peak_states 5069 mark_read 6
  ...
  torvalds#10/16 loop6.o:FAIL

The purpose of this patch is to provide a regression
test for the above llvm fix and also provide a test
case for further analyzing the verifier pruning issue.

Cc: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
kernel-patches-bot pushed a commit to kernel-patches/bpf that referenced this pull request Mar 3, 2021
The original bcc pull request
  iovisor/bcc#3270
exposed a verifier failure with Clang 12/13 while
Clang 4 works fine. Further investigation exposed two issues.
  Issue 1: LLVM may generate code which uses less refined
     value. The issue is fixed in llvm patch
     https://reviews.llvm.org/D97479
  Issue 2: Spills with initial value 0 are marked as precise
     which makes later state pruning less effective.
     This is my rough initial analysis and further investigation
     is needed to find how to improve verifier pruning
     in such cases.

With the above llvm patch, for the new loop6.c test, which has
smaller loop bound compared to original test, I got
  $ test_progs -s -n 10/16
  ...
  stack depth 64
  processed 390735 insns (limit 1000000) max_states_per_insn 87
      total_states 8658 peak_states 964 mark_read 6
  #10/16 loop6.o:OK

Use the original loop bound, i.e., commenting out "#define WORKAROUND",
I got
  $ test_progs -s -n 10/16
  ...
  BPF program is too large. Processed 1000001 insn
  stack depth 64
  processed 1000001 insns (limit 1000000) max_states_per_insn 91
      total_states 23176 peak_states 5069 mark_read 6
  ...
  #10/16 loop6.o:FAIL

The purpose of this patch is to provide a regression
test for the above llvm fix and also provide a test
case for further analyzing the verifier pruning issue.

Cc: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
kernel-patches-bot pushed a commit to kernel-patches/bpf that referenced this pull request Mar 4, 2021
The original bcc pull request
  iovisor/bcc#3270
exposed a verifier failure with Clang 12/13 while
Clang 4 works fine. Further investigation exposed two issues.
  Issue 1: LLVM may generate code which uses less refined
     value. The issue is fixed in llvm patch
     https://reviews.llvm.org/D97479
  Issue 2: Spills with initial value 0 are marked as precise
     which makes later state pruning less effective.
     This is my rough initial analysis and further investigation
     is needed to find how to improve verifier pruning
     in such cases.

With the above llvm patch, for the new loop6.c test, which has
smaller loop bound compared to original test, I got
  $ test_progs -s -n 10/16
  ...
  stack depth 64
  processed 390735 insns (limit 1000000) max_states_per_insn 87
      total_states 8658 peak_states 964 mark_read 6
  #10/16 loop6.o:OK

Use the original loop bound, i.e., commenting out "#define WORKAROUND",
I got
  $ test_progs -s -n 10/16
  ...
  BPF program is too large. Processed 1000001 insn
  stack depth 64
  processed 1000001 insns (limit 1000000) max_states_per_insn 91
      total_states 23176 peak_states 5069 mark_read 6
  ...
  #10/16 loop6.o:FAIL

The purpose of this patch is to provide a regression
test for the above llvm fix and also provide a test
case for further analyzing the verifier pruning issue.

Cc: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
kernel-patches-bot pushed a commit to kernel-patches/bpf that referenced this pull request Mar 4, 2021
The original bcc pull request iovisor/bcc#3270 exposed
a verifier failure with Clang 12/13 while Clang 4 works fine.

Further investigation exposed two issues:

  Issue 1: LLVM may generate code which uses less refined value. The issue is
           fixed in LLVM patch: https://reviews.llvm.org/D97479

  Issue 2: Spills with initial value 0 are marked as precise which makes later
           state pruning less effective. This is my rough initial analysis and
           further investigation is needed to find how to improve verifier
           pruning in such cases.

With the above LLVM patch, for the new loop6.c test, which has smaller loop
bound compared to original test, I got:

  $ test_progs -s -n 10/16
  ...
  stack depth 64
  processed 390735 insns (limit 1000000) max_states_per_insn 87
      total_states 8658 peak_states 964 mark_read 6
  #10/16 loop6.o:OK

Use the original loop bound, i.e., commenting out "#define WORKAROUND", I got:

  $ test_progs -s -n 10/16
  ...
  BPF program is too large. Processed 1000001 insn
  stack depth 64
  processed 1000001 insns (limit 1000000) max_states_per_insn 91
      total_states 23176 peak_states 5069 mark_read 6
  ...
  #10/16 loop6.o:FAIL

The purpose of this patch is to provide a regression test for the above LLVM fix
and also provide a test case for further analyzing the verifier pruning issue.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Zhenwei Pi <pizhenwei@bytedance.com>
Link: https://lore.kernel.org/bpf/20210226223810.236472-1-yhs@fb.com
morehouse pushed a commit to morehouse/llvm-project that referenced this pull request Mar 4, 2021
The Select insn in BPF is expensive as BPF backend
needs to resolve with conditionals.  This patch set
the getCmpSelInstrCost() to SCEVCheapExpansionBudget
for Select insn to prevent some Select insn related
optimizations.

This change is motivated during bcc code review for
   iovisor/bcc#3270
where IndVarSimplifyPass eventually caused generating
the following asm code:
  ;       for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) {
      14:       16 05 40 00 00 00 00 00 if w5 == 0 goto +64 <LBB0_6>
      15:       bc 51 00 00 00 00 00 00 w1 = w5
      16:       04 01 00 00 ff ff ff ff w1 += -1
      17:       67 05 00 00 20 00 00 00 r5 <<= 32
      18:       77 05 00 00 20 00 00 00 r5 >>= 32
      19:       a6 01 01 00 05 00 00 00 if w1 < 5 goto +1 <LBB0_4>
      20:       b7 05 00 00 06 00 00 00 r5 = 6
  00000000000000a8 <LBB0_4>:
      21:       b7 02 00 00 00 00 00 00 r2 = 0
      22:       b7 01 00 00 00 00 00 00 r1 = 0
  ;       for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) {
      23:       7b 1a e0 ff 00 00 00 00 *(u64 *)(r10 - 32) = r1
      24:       7b 5a c0 ff 00 00 00 00 *(u64 *)(r10 - 64) = r5
Note that insn llvm#15 has w1 = w5 and w1 is refined later but r5(w5) is
eventually saved on stack at insn llvm#24 for later use. This cause
later verifier failures.

With this change, IndVarSimplifyPass won't do the above
transformation any more.

Differential Revision: https://reviews.llvm.org/D97479
LorenzoBianconi pushed a commit to LorenzoBianconi/bpf-next that referenced this pull request Mar 8, 2021
The original bcc pull request iovisor/bcc#3270 exposed
a verifier failure with Clang 12/13 while Clang 4 works fine.

Further investigation exposed two issues:

  Issue 1: LLVM may generate code which uses less refined value. The issue is
           fixed in LLVM patch: https://reviews.llvm.org/D97479

  Issue 2: Spills with initial value 0 are marked as precise which makes later
           state pruning less effective. This is my rough initial analysis and
           further investigation is needed to find how to improve verifier
           pruning in such cases.

With the above LLVM patch, for the new loop6.c test, which has smaller loop
bound compared to original test, I got:

  $ test_progs -s -n 10/16
  ...
  stack depth 64
  processed 390735 insns (limit 1000000) max_states_per_insn 87
      total_states 8658 peak_states 964 mark_read 6
  #10/16 loop6.o:OK

Use the original loop bound, i.e., commenting out "#define WORKAROUND", I got:

  $ test_progs -s -n 10/16
  ...
  BPF program is too large. Processed 1000001 insn
  stack depth 64
  processed 1000001 insns (limit 1000000) max_states_per_insn 91
      total_states 23176 peak_states 5069 mark_read 6
  ...
  #10/16 loop6.o:FAIL

The purpose of this patch is to provide a regression test for the above LLVM fix
and also provide a test case for further analyzing the verifier pruning issue.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Zhenwei Pi <pizhenwei@bytedance.com>
Link: https://lore.kernel.org/bpf/20210226223810.236472-1-yhs@fb.com
tstellar pushed a commit to llvm/llvm-project that referenced this pull request May 3, 2021
The Select insn in BPF is expensive as BPF backend
needs to resolve with conditionals.  This patch set
the getCmpSelInstrCost() to SCEVCheapExpansionBudget
for Select insn to prevent some Select insn related
optimizations.

This change is motivated during bcc code review for
   iovisor/bcc#3270
where IndVarSimplifyPass eventually caused generating
the following asm code:
  ;       for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) {
      14:       16 05 40 00 00 00 00 00 if w5 == 0 goto +64 <LBB0_6>
      15:       bc 51 00 00 00 00 00 00 w1 = w5
      16:       04 01 00 00 ff ff ff ff w1 += -1
      17:       67 05 00 00 20 00 00 00 r5 <<= 32
      18:       77 05 00 00 20 00 00 00 r5 >>= 32
      19:       a6 01 01 00 05 00 00 00 if w1 < 5 goto +1 <LBB0_4>
      20:       b7 05 00 00 06 00 00 00 r5 = 6
  00000000000000a8 <LBB0_4>:
      21:       b7 02 00 00 00 00 00 00 r2 = 0
      22:       b7 01 00 00 00 00 00 00 r1 = 0
  ;       for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) {
      23:       7b 1a e0 ff 00 00 00 00 *(u64 *)(r10 - 32) = r1
      24:       7b 5a c0 ff 00 00 00 00 *(u64 *)(r10 - 64) = r5
Note that insn #15 has w1 = w5 and w1 is refined later but r5(w5) is
eventually saved on stack at insn #24 for later use. This cause
later verifier failures.

With this change, IndVarSimplifyPass won't do the above
transformation any more.

Differential Revision: https://reviews.llvm.org/D97479

(cherry picked from commit 1959ead)
s194604 pushed a commit to s194604/llvm-playground that referenced this pull request Mar 6, 2022
The Select insn in BPF is expensive as BPF backend
needs to resolve with conditionals.  This patch set
the getCmpSelInstrCost() to SCEVCheapExpansionBudget
for Select insn to prevent some Select insn related
optimizations.

This change is motivated during bcc code review for
   iovisor/bcc#3270
where IndVarSimplifyPass eventually caused generating
the following asm code:
  ;       for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) {
      14:       16 05 40 00 00 00 00 00 if w5 == 0 goto +64 <LBB0_6>
      15:       bc 51 00 00 00 00 00 00 w1 = w5
      16:       04 01 00 00 ff ff ff ff w1 += -1
      17:       67 05 00 00 20 00 00 00 r5 <<= 32
      18:       77 05 00 00 20 00 00 00 r5 >>= 32
      19:       a6 01 01 00 05 00 00 00 if w1 < 5 goto +1 <LBB0_4>
      20:       b7 05 00 00 06 00 00 00 r5 = 6
  00000000000000a8 <LBB0_4>:
      21:       b7 02 00 00 00 00 00 00 r2 = 0
      22:       b7 01 00 00 00 00 00 00 r1 = 0
  ;       for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) {
      23:       7b 1a e0 ff 00 00 00 00 *(u64 *)(r10 - 32) = r1
      24:       7b 5a c0 ff 00 00 00 00 *(u64 *)(r10 - 64) = r5
Note that insn #15 has w1 = w5 and w1 is refined later but r5(w5) is
eventually saved on stack at insn #24 for later use. This cause
later verifier failures.

With this change, IndVarSimplifyPass won't do the above
transformation any more.

Differential Revision: https://reviews.llvm.org/D97479

(cherry picked from commit 1959ead)
bryanpkc pushed a commit to Huawei-CPLLab/classic-flang-llvm-project that referenced this pull request Mar 12, 2022
The Select insn in BPF is expensive as BPF backend
needs to resolve with conditionals.  This patch set
the getCmpSelInstrCost() to SCEVCheapExpansionBudget
for Select insn to prevent some Select insn related
optimizations.

This change is motivated during bcc code review for
   iovisor/bcc#3270
where IndVarSimplifyPass eventually caused generating
the following asm code:
  ;       for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) {
      14:       16 05 40 00 00 00 00 00 if w5 == 0 goto +64 <LBB0_6>
      15:       bc 51 00 00 00 00 00 00 w1 = w5
      16:       04 01 00 00 ff ff ff ff w1 += -1
      17:       67 05 00 00 20 00 00 00 r5 <<= 32
      18:       77 05 00 00 20 00 00 00 r5 >>= 32
      19:       a6 01 01 00 05 00 00 00 if w1 < 5 goto +1 <LBB0_4>
      20:       b7 05 00 00 06 00 00 00 r5 = 6
  00000000000000a8 <LBB0_4>:
      21:       b7 02 00 00 00 00 00 00 r2 = 0
      22:       b7 01 00 00 00 00 00 00 r1 = 0
  ;       for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) {
      23:       7b 1a e0 ff 00 00 00 00 *(u64 *)(r10 - 32) = r1
      24:       7b 5a c0 ff 00 00 00 00 *(u64 *)(r10 - 64) = r5
Note that insn flang-compiler#15 has w1 = w5 and w1 is refined later but r5(w5) is
eventually saved on stack at insn flang-compiler#24 for later use. This cause
later verifier failures.

With this change, IndVarSimplifyPass won't do the above
transformation any more.

Differential Revision: https://reviews.llvm.org/D97479

(cherry picked from commit 1959ead)
bryanpkc pushed a commit to flang-compiler/classic-flang-llvm-project that referenced this pull request Apr 20, 2022
The Select insn in BPF is expensive as BPF backend
needs to resolve with conditionals.  This patch set
the getCmpSelInstrCost() to SCEVCheapExpansionBudget
for Select insn to prevent some Select insn related
optimizations.

This change is motivated during bcc code review for
   iovisor/bcc#3270
where IndVarSimplifyPass eventually caused generating
the following asm code:
  ;       for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) {
      14:       16 05 40 00 00 00 00 00 if w5 == 0 goto +64 <LBB0_6>
      15:       bc 51 00 00 00 00 00 00 w1 = w5
      16:       04 01 00 00 ff ff ff ff w1 += -1
      17:       67 05 00 00 20 00 00 00 r5 <<= 32
      18:       77 05 00 00 20 00 00 00 r5 >>= 32
      19:       a6 01 01 00 05 00 00 00 if w1 < 5 goto +1 <LBB0_4>
      20:       b7 05 00 00 06 00 00 00 r5 = 6
  00000000000000a8 <LBB0_4>:
      21:       b7 02 00 00 00 00 00 00 r2 = 0
      22:       b7 01 00 00 00 00 00 00 r1 = 0
  ;       for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) {
      23:       7b 1a e0 ff 00 00 00 00 *(u64 *)(r10 - 32) = r1
      24:       7b 5a c0 ff 00 00 00 00 *(u64 *)(r10 - 64) = r5
Note that insn #15 has w1 = w5 and w1 is refined later but r5(w5) is
eventually saved on stack at insn #24 for later use. This cause
later verifier failures.

With this change, IndVarSimplifyPass won't do the above
transformation any more.

Differential Revision: https://reviews.llvm.org/D97479

(cherry picked from commit 1959ead)
ajohnson-uoregon pushed a commit to ajohnson-uoregon/clang-rewrite-only that referenced this pull request Jul 17, 2022
The Select insn in BPF is expensive as BPF backend
needs to resolve with conditionals.  This patch set
the getCmpSelInstrCost() to SCEVCheapExpansionBudget
for Select insn to prevent some Select insn related
optimizations.

This change is motivated during bcc code review for
   iovisor/bcc#3270
where IndVarSimplifyPass eventually caused generating
the following asm code:
  ;       for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) {
      14:       16 05 40 00 00 00 00 00 if w5 == 0 goto +64 <LBB0_6>
      15:       bc 51 00 00 00 00 00 00 w1 = w5
      16:       04 01 00 00 ff ff ff ff w1 += -1
      17:       67 05 00 00 20 00 00 00 r5 <<= 32
      18:       77 05 00 00 20 00 00 00 r5 >>= 32
      19:       a6 01 01 00 05 00 00 00 if w1 < 5 goto +1 <LBB0_4>
      20:       b7 05 00 00 06 00 00 00 r5 = 6
  00000000000000a8 <LBB0_4>:
      21:       b7 02 00 00 00 00 00 00 r2 = 0
      22:       b7 01 00 00 00 00 00 00 r1 = 0
  ;       for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) {
      23:       7b 1a e0 ff 00 00 00 00 *(u64 *)(r10 - 32) = r1
      24:       7b 5a c0 ff 00 00 00 00 *(u64 *)(r10 - 64) = r5
Note that insn #15 has w1 = w5 and w1 is refined later but r5(w5) is
eventually saved on stack at insn #24 for later use. This cause
later verifier failures.

With this change, IndVarSimplifyPass won't do the above
transformation any more.

Differential Revision: https://reviews.llvm.org/D97479

(cherry picked from commit 0bc0084)
@pizhenwei pizhenwei deleted the virtiostat branch September 29, 2022 03:11
mem-frob pushed a commit to draperlaboratory/hope-llvm-project that referenced this pull request Oct 7, 2022
The Select insn in BPF is expensive as BPF backend
needs to resolve with conditionals.  This patch set
the getCmpSelInstrCost() to SCEVCheapExpansionBudget
for Select insn to prevent some Select insn related
optimizations.

This change is motivated during bcc code review for
   iovisor/bcc#3270
where IndVarSimplifyPass eventually caused generating
the following asm code:
  ;       for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) {
      14:       16 05 40 00 00 00 00 00 if w5 == 0 goto +64 <LBB0_6>
      15:       bc 51 00 00 00 00 00 00 w1 = w5
      16:       04 01 00 00 ff ff ff ff w1 += -1
      17:       67 05 00 00 20 00 00 00 r5 <<= 32
      18:       77 05 00 00 20 00 00 00 r5 >>= 32
      19:       a6 01 01 00 05 00 00 00 if w1 < 5 goto +1 <LBB0_4>
      20:       b7 05 00 00 06 00 00 00 r5 = 6
  00000000000000a8 <LBB0_4>:
      21:       b7 02 00 00 00 00 00 00 r2 = 0
      22:       b7 01 00 00 00 00 00 00 r1 = 0
  ;       for (i = 0; (i < VIRTIO_MAX_SGS) && (i < num); i++) {
      23:       7b 1a e0 ff 00 00 00 00 *(u64 *)(r10 - 32) = r1
      24:       7b 5a c0 ff 00 00 00 00 *(u64 *)(r10 - 64) = r5
Note that insn #15 has w1 = w5 and w1 is refined later but r5(w5) is
eventually saved on stack at insn #24 for later use. This cause
later verifier failures.

With this change, IndVarSimplifyPass won't do the above
transformation any more.

Differential Revision: https://reviews.llvm.org/D97479
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants