[Bug] [llvm] [ir] Complex multilevel structs. Consecutive indices for all snodes Draft #2248
Conversation
Hi @KLozes, thanks for proposing this change! It's not a trivial change, so I will list my understanding and thoughts here. The trailing bits feature was introduced back in #1558. It adds two things:

```python
blk = ti.root.dense(ti.i, 2)
a = blk.dense(ti.i, 4)
b = blk.dense(ti.j, 4)
```

I can see how these would result in some confusion, such as #1682. @yuanming-hu probably knows better about the use case.
Right. But this was a bug where Taichi didn't properly enforce
That's the question to discuss...
Right, this does seem more intuitive. But one property Taichi needs to keep is that, when accessing some intermediary SNode with the leaf SNode's index, it will go to the same ancestor (if the leaf SNodes are from the same ancestor). In your example, ...

Appendix: some notes on taichi/tests/python/test_trailing_bits.py, lines 8 to 16 in 0035ec0.
Hi @k-ye, thanks for the discussion. My idea here is that the trailing bits concept no longer exists. Every field will have the indices it would have if it were the only field placed in the whole program. So the structure in the trailing bits test

```python
x = ti.field(ti.f32)
z = ti.field(ti.f32)
block = ti.root.pointer(ti.i, 8)
block.dense(ti.i, 32).place(x)
block.dense(ti.i, 16).place(z)
```

is now a valid data structure. The x field will have indices [0, 1, ..., 255] and z will have indices [0, 1, ..., 127]. The key is to compute the 'start' value on-the-fly during access lowering and demote_dense_struct_for, so that we can correctly access and loop over any field, no matter what fields are placed on the same tree. See lower_access.cpp in my first commit, 8e622d8. (Sorry, couldn't figure out how to get a nice viewable code link.) This data structure is also still valid:

```python
blk = ti.root.dense(ti.i, 2)
a = blk.dense(ti.i, 4)
b = blk.dense(ti.j, 4)
```

It was also possible to get rid of the 'start' value in listgen by using block indices instead of the 'block corner leaf cell' like before. This just took a small change to the refine_coordinates function.

As for accessing intermediate snodes: they can still be accessed by dividing the leaf index by the block size. So if we wanted to deactivate cells 8 through 11, we just want to deactivate the pointer with index 2, so we could call blk.deactivate(9 // 4) or blk.deactivate(2). You should be able to pull this PR and play around with it yourself. The only thing that isn't working yet is block local storage.
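To make the consecutive-index scheme concrete, here is a minimal sketch (the layout is illustrative, and it uses `ti.deactivate` with the block-level index as described above; the thread's `blk.deactivate` spelling may differ):

```python
import taichi as ti

ti.init(arch=ti.cpu)

x = ti.field(ti.f32)
blk = ti.root.pointer(ti.i, 8)
blk.dense(ti.i, 4).place(x)  # under this PR, x has consecutive indices [0, 32)

@ti.kernel
def deactivate_cell_block(i: ti.i32):
    # Leaf cells i..i+3 share one pointer cell, so dividing the leaf
    # index by the block size (4) gives the pointer index to deactivate.
    ti.deactivate(blk, [i // 4])
```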
Right, this PR basically reverts what #1558 introduced. IIRC, it was added mainly to support this use case:

```python
parent = ...  # some SNode
parent.dense(ti.ijk, 8).place(blk)
parent.dynamic(ti.l, 256).place(lst)

for I in ti.grouped(blk):
    # Every 8x8x8 dense `blk` cells share the same dynamic `lst`.
    # So we can use `I` to access the sibling `lst` cell.
```

This pattern is now used in the taichi MPM solvers. (TODO: Find a real example.) Without trailing bits, sharing the same index `I` between `blk` and `lst` would no longer work. I think what we probably need here is some API to help achieve this need. As of this PR, because of the existing usage, it seems that a more reasonable approach would be to first think about what kind of API we need to satisfy the MPM solver's need, migrate the solvers to that API, and then deprecate the trailing bits.
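For illustration, a hedged sketch of what this pattern might look like without trailing bits, where the kernel rescales the index by hand (the layout mirrors the snippet above; `parent` is made concrete and the appended payload is a placeholder):

```python
import taichi as ti

ti.init(arch=ti.cpu)

blk = ti.field(ti.f32)
lst = ti.field(ti.i32)
parent = ti.root.dense(ti.ijk, 4)      # illustrative stand-in for "some SNode"
parent.dense(ti.ijk, 8).place(blk)
lst_snode = parent.dynamic(ti.l, 256)  # keep a handle for ti.append
lst_snode.place(lst)

@ti.kernel
def scatter():
    for I in ti.grouped(blk):
        # Without trailing bits, blk's index spans the full dense extent
        # while lst has one cell per 8x8x8 block, so divide it down:
        J = I // 8
        ti.append(lst_snode, J, 0)  # payload value is a placeholder
```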
LGTM! We might also want to update the code in other repos such as https://github.com/taichi-dev/taichi_elements to make sure that code is not broken.
```diff
 int num_indices = (int)ptr->indices.size();
 for (int i = 0; i < num_indices; i++) {
   auto diff = irpass::analysis::value_diff_loop_index(ptr->indices[i],
                                                       for_stmt, i);
-  if (diff.linear_related()) {
+  if (diff.related_() and diff.coeff > 0) {
```
nit :)

```diff
-  if (diff.related_() and diff.coeff > 0) {
+  if (diff.related_() && diff.coeff > 0) {
```
This is great work, thank you so much for the update! However, I still have to point out a few things:

- We need to first add the index rescaling API, so that we don't manually compute that in the kernel. We will also need to check in this API and update other repos before accepting the changes here.
- How do we handle the parent coordinates if the leaf SNode has offsets? Do they start from 0, or from the rescaled offset?
- Did we verify that BLS still works? I.e., is the compiler still able to infer the BLS storage size and convert between the global index and the block index? I think one possible approach is to add some C++ unit tests around initialize_scratch_pad() and make_block_local(). Given that these two functions are so large, it would be great if the implementation of these two passes could be broken down into smaller, testable functions (possibly in a series of PRs).
```cpp
get_runtime_function(leaf_block->refine_coordinates_func_name());
// ...
create_call(refine, {parent_coordinates, block_corner_coordinates,
                     tlctx->get_constant(0)});
```
Why is this 0?
The block corner has the same coordinates as the leaf element of the first (0th) loop_index, and it is the same for any loop_index. The refine function here basically just left-shifts the parent coordinates by num_bits.
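A minimal pure-Python sketch of the refinement just described (the function name is illustrative; num_bits is log2 of the block size along one axis):

```python
def refine_coordinate(parent_coord: int, num_bits: int, loop_index: int) -> int:
    # Left-shifting the parent coordinate by num_bits yields the block
    # corner (the 0th leaf element); adding the loop index selects a
    # particular leaf element within the block.
    return (parent_coord << num_bits) + loop_index

# Parent cell 3 of a 4-wide block (num_bits = 2): the corner is 12.
assert refine_coordinate(3, 2, 0) == 12
assert refine_coordinate(3, 2, 1) == 13
```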
```cpp
auto new_coordinates = create_entry_block_alloca(physical_coordinate_ty);
// ...
create_call(refine, {parent_coordinates, new_coordinates,
                     builder->CreateLoad(loop_index)});
// ...
if (stmt->snode->type == SNodeType::bit_array && stmt->snode->parent) {
```
Could you explain why this is needed?
One more refine step was needed for bit_arrays to make the loop coordinates non-consecutive, since the operation is going to be vectorized over multiple coordinates per loop_index.

For example, for a 1-D block of 64 leaf elements with a vectorization width of 32, only two loop coordinates are needed. Without this extra refine step, their coordinates would be a consecutive 0, 1 instead of a non-consecutive 0, 32.

This step was not needed before, since coordinates were always non-consecutive.
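A small numeric sketch of that example (plain Python, just to spell out the arithmetic):

```python
# 64 leaf elements vectorized 32 at a time -> only two loop indices.
num_leaf_elements = 64
vector_width = 32
num_loop_indices = num_leaf_elements // vector_width  # 2

# Without the extra refine step, the loop coordinates stay consecutive:
assert list(range(num_loop_indices)) == [0, 1]
# The extra refine step scales them by the vectorization width:
assert [i * vector_width for i in range(num_loop_indices)] == [0, 32]
```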
Thanks! I've learned a ton too!
Alright, I'll start thinking about an API.
I am not sure about this, but the offset tests seem to be working fine. One just has to make sure that offsets are applied to blocks correctly on the front end when necessary. See the changes to bls_test_template.py:
I did verify that the tests in test_bls.py and test_bls_assume_in_range.py work, and they look pretty extensive. I also plan to add some more BLS tests with the rescaling API.
Thanks! After taking another look, I'm not entirely sure if this is correct in the first place:

taichi/tests/python/bls_test_template.py, lines 170 to 177 in 939e1d3

The question is: do we really need to subtract the grid offset here? @xumingkuan and I had a discussion to go over how offsets are handled in Taichi right now. Here's the summary (see taichi/taichi/ir/frontend_ir.cpp, lines 196 to 207, and taichi/taichi/transforms/lower_ast.cpp, lines 302 to 306, in f34203f):
A concrete example:

```python
x = ...place(..., offset=(-16,))  # ``x`` has an offset of -16
y = ...place(...)                 # ``y`` has no offset

for I in x:
    x[I]
    y[I]
```

is lowered so that the offset is subtracted back out in each access, and the loop index itself is substituted:

```python
for I in x:
    x[I - (-16)]  # ``x`` has an offset of -16, so its GlobalPtrExpression's indices are subtracted by (-16)
    y[I]          # ``y`` does not have any offset, so its GlobalPtrExpression's indices are not offset

for (I - 16) in x:       # ``I`` still starts from 0
    x[(I - 16) - (-16)]  # i.e., x[I]
    y[I - 16]
```

So, when we are accessing a leaf SNode with offset, the accessing indices should still begin from 0. It's just that the compiler replaces the loop index. Therefore I don't think it makes sense to subtract the offset here:

```python
import taichi as ti

ti.init(arch=ti.cpu, print_ir=True)

y = ti.field(ti.i32)
offset = (-2,)
grid = ti.root.dense(ti.i, 4)
dyn = grid.dynamic(ti.j, 4)
dyn.place(y, offset=offset + (0,))

@ti.kernel
def foo():
    ti.append(dyn, [-2], 0)

foo()
```

The final IR is shown below: ...
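To spell out the offset substitution described above, here is a tiny plain-Python arithmetic check (illustrative only):

```python
offset = -16
shape = 3
for I in range(shape):            # internal loop index, starts from 0
    user_index = I + offset       # the user-visible index: -16, -15, -14
    access = user_index - offset  # the GlobalPtrExpression subtracts the offset
    assert access == I            # so leaf accesses still begin at 0
```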
Unfortunately, I don't think these tests passing means much (sorry, I didn't mean to blame). IIUC, if the compiler cannot deduce the BLS size, it will simply not use the shared memory. So the test would suffer slower performance, but there won't be any correctness issue. Therefore a better approach would be to test the BLS-related passes in the C++ layer. Let me think about how to refactor the code so that we can achieve that :-)
Also, could you break this PR down into several smaller ones, so that we can make progress much quicker :-) ? E.g.: ...
Hi @k-ye,

I see what you mean! The grid_offset definitely should not be required to make this work. This seems to be a bug in how offsets are handled when flattening the append function call.
Ahh. Thanks for picking up this task!
Sure! I can try. I think it will be best to keep 2 and 3 together, since the diffrange multiplication was needed to make the BLS tests work after changing the indices to consecutive. 1 and 4 also go together. I actually put trailing bits back in, since attaching trailing bits to the "place" snode is a good way to implement the index re-scaling. So I can split this PR into 2.
Hmm, I thought that "3. remove the 'start' property" was independent cleanup? I think this PR serves as a great baseline to demonstrate our end goal. What we need is just to figure out the path to upstream it piece by piece, without breaking the existing usages or introducing new bugs. Personally I find these ...
Ah right! I used to think that we needed to follow these steps: ...

But so long as the trailing bits are not removed, using the API will produce the wrong result...
Yes, I suppose actually removing the 'start' property from the extractor was just cleanup, since I now compute 'start' on-the-fly when it is needed. So it is no longer a property of an snode.
So, I should be more clear here. I'm using trailing bits to implement this API on the backend. I have totally replaced the old way of computing trailing bits. Now trailing bits are attached to the place snode to account for the difference in bits between an snode and its assigned "rescaling_snode". See my new commit, abbc4b9. This API doesn't quite work yet. I believe lower_access and alias analysis work; I still have some work to do on demote_dense_struct_for and listgen. Currently, the StructForStmt and OffloadStmt classes do not have access to the leaf (place) snode to grab the trailing_bits from. The 'snode' member is actually the leaf_block, so I added a new 'leaf_snode' member to them. But for some reason this causes the async tests to fail. Any idea why?
Thanks! I think I can split it into 3 fresh PRs, but they need to be pulled in this order: ...
Thanks for the update!
Hmm, doesn't this work against the original goal of dropping the trailing bits feature completely? In my (maybe oversimplified) thought, ...
@k-ye

So yes, I want to drop automatically computing trailing bits based on sibling snode shapes. Instead, trailing bits would be non-zero only when the rescaling API is called. So with this PR, the rescale_indices_to() function in

```python
a = ti.field(ti.f32)
b = ti.field(ti.f32)
c = ti.field(ti.f32)
d = ti.field(ti.f32)

blk = ti.root.pointer(ti.ij, 128)
blk.dense(ti.ij, 4).place(a)
blk.dense(ti.ij, 4).place(b)
blk.dynamic(ti.l, 128).place(c)
blk.dynamic(ti.l, 128).place(d)

c.rescale_indices_to(a)
d.rescale_indices_to(b)
```

would give the place(c) and place(d) snodes trailing bits to rescale their ij indices from [0, 1, 2, ..., 128) to [0, 4, 8, ..., 512), which would lead to the same behavior we see in Taichi before this PR: all struct-fors and accesses of c and d will use indices [0, 4, 8, ..., 512).

My issue with the per-kernel rescaling that you proposed in another comment is that you can only rescale one field in each kernel. I think that would cause problems if there are two fields that need to be rescaled, like the example above.
Yep! This is basically how I'm computing the trailing bits when this API is called. The trailing bits are just the difference in the total number of bits from root to leaf. The two snodes actually don't even have to be siblings or have the same ijkl indices.
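A hedged pure-Python sketch of that bit accounting (the helper is illustrative, not the actual implementation):

```python
def trailing_bits(field_bits: int, rescaling_bits: int) -> int:
    # Trailing bits = difference in the total number of bits accumulated
    # along the root-to-leaf paths of the field and its rescaling snode.
    return rescaling_bits - field_bits

# In the example above (one axis): `a` has 7 bits from pointer(128) plus
# 2 bits from dense(4) = 9 bits, while `c` only has the pointer's 7 bits,
# so place(c) gets 2 trailing bits and its ij indices become multiples of 4.
assert trailing_bits(7, 9) == 2
```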
Hi @KLozes, thank you for being an awesome Taichi contributor! We really appreciate your contribution. We have a Taichi cup souvenir for every Taichi developer. Given that now we have a company around Taichi, we are finally able to ship the souvenir across the world (previously only within mainland China, unfortunately, due to COVID and lack of ¥). If you don't mind, could you share your address with yuanming@taichi.graphics and linan@taichi.graphics? We will then FedEx the cup to you :-) (Sent the message on LinkedIn too a few days ago. You can reply there if you'd like to.) (Sorry about the off-topic post here!)
Hi @yuanming-hu No problem! I've learned a ton and had fun contributing to Taichi! Thanks, I'd love to have one of those souvenirs :). Sorry about missing your message on LinkedIn!
Thanks! Maybe it's better to list what we want to fix (#2248 (comment)): ...

This means the field index is still non-consecutive, so Taichi still has to keep the implementation to support that :-(

Hmm, I guess you mean that we can keep the field index rescaled in all the kernels iterating over it? We should be able to rescale the index for any number of fields inside one kernel :-)

```python
for I in a:
    Ib = ti.rescale_index(a, b, I)  # Ib = I // 4
    # use `Ib` to access `b`
    Ic = ti.rescale_index(a, c, I)  # Ic = I // 4
    # use `Ic` to access `c`
```

Also, the fact that we had to introduce a ...
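For context, a fuller sketch of that per-kernel rescaling (assuming `ti.rescale_index` behaves as in the snippet above; the layout and kernel are illustrative):

```python
import taichi as ti

ti.init(arch=ti.cpu)

a = ti.field(ti.f32)
b = ti.field(ti.f32)
blk = ti.root.pointer(ti.i, 8)
blk.dense(ti.i, 4).place(a)  # a's indices: [0, 32)
blk.place(b)                 # one `b` cell per 4-wide block

@ti.kernel
def accumulate():
    for I in ti.grouped(a):
        Ib = ti.rescale_index(a, b, I)  # Ib = I // 4 under this layout
        b[Ib] += a[I]
```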
Ahh, I was thinking we wanted an API to optionally maintain the old non-consecutive index behavior, and that can only be done with trailing bits. But if you don't think that is necessary, I can certainly implement this (hopefully) simpler API! Looks like it should only require Python code changes too.
Sounds great, thank you!!
Implemented in #2327.
Related issue: #2177

I think this is a pretty important PR. It opens the door for much more complicated structures!
This PR addresses the peculiar behavior seen in listgen and in accessing fields with more complex multilevel structs; see the issue above. As @k-ye discovered (thanks!), it is because we assume the first child of an snode when we compute the 'start' member of the extractors.
So I've done away with the 'start' member entirely; the start values are now calculated on-the-fly where they are needed, in the demote_dense_struct_for and lower_access IR passes. This was necessary because the list of start values is unique to the leaf-to-root path of the field being accessed.
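A rough pure-Python sketch of what computing 'start' on-the-fly means here (illustrative only; the real logic lives in those C++ passes):

```python
def starts_along_path(bits_per_snode):
    # bits_per_snode: bits contributed to one axis by each SNode on the
    # leaf-to-root path, leaf first, e.g. [2, 3] for dense(4) under
    # pointer(8). The 'start' of each level is the number of index bits
    # contributed by the levels below it.
    starts, acc = [], 0
    for bits in bits_per_snode:
        starts.append(acc)
        acc += bits
    return starts

assert starts_along_path([2, 3]) == [0, 2]
```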
I was also able to get rid of the trailing_bits member, since every field can now be accessed independently of the shape of other fields, which I think is more intuitive.
This PR also changes the definition of indices for parents of leaf snodes. For example, the structure ... used to have the indices below; I believe 0, 1, 2, or 3 could be used to access the first element of dense0. With this PR, dense0 has unique indices. See the changes to test_sparse_deactivate.py.
If someone can figure out a fix for BLS, that would be really helpful. My attempt did not work, but I think it is on the right track.