
[Bug] [llvm] [ir] Complex multilevel structs. Consecutive indices for all snodes Draft #2248

Closed
wants to merge 33 commits

Conversation

KLozes
Collaborator

@KLozes KLozes commented Apr 6, 2021

Related issue = #2177

I think this is a pretty important PR. It opens the door to much more complicated structures!

This PR addresses the peculiar behavior seen in listgen and in accessing fields with more complex multilevel structs (see the issue above). As @k-ye discovered (thanks!), it is because we assume the first child of an snode when we compute the 'start' member of the extractors.

So I've done away with the 'start' member entirely; the start values are now calculated on the fly where they are needed, in the demote_struct_for and lower_access IR passes. This was necessary because the list of start values is unique to the leaf-to-root path of the field being accessed.

I also was able to get rid of the trailing_bits member, since every field can now be accessed independently of the shape of other fields, which I think is more intuitive.

This PR also changes the definition of indices for parents of leaf snodes. For example, the structure

root.dense(ti.i, 4).dense(ti.i, 4).place(x)

used to have the indices below. I believe 0, 1, 2, or 3 could be used to access the first element of dense0.

dense1:  0 1 2 3   4 5 6 7   8 9 10 11  12 13 14 15 

dense0:  0         4         8          12

root:    0

With this PR, dense0 has unique indices. See the changes to test_sparse_deactivate.py:

dense1:  0 1 2 3   4 5 6 7   8 9 10 11  12 13 14 15 

dense0:  0         1         2           3

root:    0

If someone can figure out a fix for BLS, that would be really helpful. My attempt did not work, but I think it is on the right track.



@KLozes KLozes added the help wanted and error-prone labels Apr 6, 2021
@KLozes KLozes requested review from yuanming-hu and k-ye April 6, 2021 05:18
@KLozes KLozes changed the title [Bug] [LLVM] [IR] [Lang] complex multilevel structs DRAFT [Bug][LLVM][IR][Lang] complex multilevel structs DRAFT Apr 6, 2021
@KLozes KLozes changed the title [Bug][LLVM][IR][Lang] complex multilevel structs DRAFT [Bug] [llvm] [ir] complex multilevel structs DRAFT Apr 6, 2021
@KLozes KLozes changed the title [Bug] [llvm] [ir] complex multilevel structs DRAFT [Bug] [llvm] [ir] Complex multilevel structs DRAFT Apr 6, 2021
@k-ye
Member

k-ye commented Apr 7, 2021

Hi @KLozes ,

Thanks for proposing this change! It's not a trivial change so I will list my understanding and thoughts here.

The trailing bits feature was introduced back in #1558. It adds two things:

  1. It requires that the shapes of all the children at a given dimension (index) are the same. This invalidates this kind of usage:
    # Here the numbers of bits of x and z are inconsistent,
    # which leads to the RuntimeError below.
    block.dense(ti.i, 32).place(x)
    block.dense(ti.i, 16).place(z)
  2. Shapes from other indices of other children are "attached" to a child SNode. Effectively, the other children's shapes span the space of a given child. This is done in the following top-down pass:

    std::function<void(SNode &)> top_down = [&](SNode &s) {
      for (auto &c : s.ch) {
        if (s.type != SNodeType::root)
          for (int i = 0; i < taichi_max_num_indices; i++) {
            c->extractors[i].trailing_bits =
                s.extractors[i].trailing_bits - c->extractors[i].num_bits;
          }
        top_down(*c);
      }
    };
    top_down(snode);

    Say we have
blk = ti.root.dense(ti.i, 2)
a = blk.dense(ti.i, 4)
b = blk.dense(ti.j, 4)

The i-th dimension of a gets attached to b, so b's shape is (2x4, 4) = (8, 4).
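
A minimal sketch of that attachment (assuming the pre-PR trailing-bits behavior described above; the field name `val` is made up for illustration):

import taichi as ti

ti.init()

val = ti.field(ti.f32)

blk = ti.root.dense(ti.i, 2)
a = blk.dense(ti.i, 4)   # only present as a sibling contributing its i-extent
b = blk.dense(ti.j, 4)
b.place(val)

# With trailing bits, a's i-extent (2 * 4 = 8) spans b as well,
# so the field placed under b reports shape (8, 4) instead of (2, 4).
assert val.shape == (8, 4)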

I can see how these would result in some confusion, such as #1682. @yuanming-hu probably knows more about the use case.

it is because we assume the first child of an snode when we compute the 'start' member of the extractors.

Right. But this was a bug where Taichi didn't properly enforce 1. That is, if Taichi fixed this bug, the example in #2177 would result in Taichi throwing an "Inconsistent trailing bit configuration" error. As you've pointed out, whether to keep this feature is another question.

I also was able to get rid of the trailing_bits member, since every field can now be accessed independently of the shape of other fields.

That's the question to discuss...

This PR also changes the definition of indices for parents of leaf snodes.

Right, this does seem more intuitive. But one property Taichi needs to keep is that, when accessing some intermediary SNode with the leaf SNode's index, it will go to the same ancestor (if the leaf SNodes are from the same ancestor). In your example, dense0[0], dense0[1], dense0[2], dense0[3] will all point to the same cell. (We can see if there is a way to remove the necessity of this property to make the implementation more intuitive)


Appendix: Some notes on IndexExtractor for the future me... This struct records the information about a particular dimension (index).

  • num_bits: Number of bits required to store the index. E.g., if a SNode is of shape (ti.ij, (4, 8)), then it needs 2 bits for the i-th dimension and 3 bits for the j-th.
  • num_elements: Max number of elements (not padded to a power of two)
  • start: This is the starting bit index of the snode for a coordinate value at the given dimension. It characterizes the relationship between a parent and a child snode: parent.start = child.start + child.num_bits. For example, if there are two 1D snodes a and b, where b = a.dense(ti.i, 8), we have a.start = b.start + 3 for the i-th dimension. When accessing b[15], Taichi first goes through a with (15 >> 3) = 1.
    int inferred = ch->extractors[i].start + ch->extractors[i].num_bits;
    if (ch_id == 0) {
      snode.extractors[i].start = inferred;
  • acc_offset: Accumulated offset. This is the starting bit index within a flattened loop index value. It describes the relationship of the bit offsets between different dimensions for a given snode: extractor[i].acc_offset = extractor[i + 1].acc_offset + extractor[i + 1].num_bits. For example, for a 3D snode a of (ti.ijk, (2, 4, 8)) and a flattened loop index li, bits [0, 3) in li are for the k-th dim, bits [3, 5) are for the j-th, and bits [5, 6) the i-th. (See the small decoding sketch at the end of this appendix.)
    https://github.com/taichi-dev/taichi/blob/0035ec061a000e77f8507e6240ed8be6c8d59eb8/taichi/struct/struct.cpp#L68-L78
  • trailing_bits: This is quite similar to start, i.e. parent.trailing_bits = child.trailing_bits + child.num_bits. However, after this bottom-up compute, it goes through another top-down pass.
    std::function<void(SNode &)> top_down = [&](SNode &s) {
      for (auto &c : s.ch) {
        if (s.type != SNodeType::root)
          for (int i = 0; i < taichi_max_num_indices; i++) {
            c->extractors[i].trailing_bits =
                s.extractors[i].trailing_bits - c->extractors[i].num_bits;
          }
        top_down(*c);
      }
    };
    top_down(snode);
    This is useful when a parent has multiple children that have used different dimensions. The effect can be seen from this test case:

x = ti.field(ti.f32)
y = ti.field(ti.f32)
block = ti.root.pointer(ti.i, 8)
block.dense(ti.i, 32).place(x)
# Here every 32 ti.i share the same dense node of 16 y along ti.j.
block.dense(ti.j, 16).place(y)
assert y.shape == (256, 16)
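
To make the acc_offset bit layout above concrete, here is a tiny plain-Python sketch (not Taichi internals; the helper name is made up) that decodes a flattened loop index for the (ti.ijk, (2, 4, 8)) example:

def decode_flattened(li):
    # Bit layout from the acc_offset example: bits [0, 3) -> k (size 8),
    # bits [3, 5) -> j (size 4), bits [5, 6) -> i (size 2).
    k = li & 0b111
    j = (li >> 3) & 0b11
    i = (li >> 5) & 0b1
    return i, j, k

# 0b101011 packs i = 1, j = 1, k = 3 into a single flattened index.
assert decode_flattened(0b101011) == (1, 1, 3)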

@KLozes
Collaborator Author

KLozes commented Apr 7, 2021

Hi @k-ye

Thanks for the discussion.

So my idea here is that the trailing bits concept no longer exists. Every field will have the indices it would have if it were the only field placed in the whole program. So the structure in the trailing bits test

    x = ti.field(ti.f32)
    z = ti.field(ti.f32)

    block = ti.root.pointer(ti.i, 8)
    block.dense(ti.i, 32).place(x)
    block.dense(ti.i, 16).place(z)

is now a valid data structure. The x field will have indices [0, 1, ..., 255] and z will have indices [0, 1, ..., 127]. The key is to compute the 'start' value on the fly during access lowering and demote_dense_struct_for, so that we can correctly access and loop over any field, no matter what other fields are placed on the same tree. See lower_access.cpp in my first commit, 8e622d8. (Sorry, I couldn't figure out how to get a nice viewable code link.)

This data structure is also still valid

blk = ti.root.dense(ti.i, 2)
a = blk.dense(ti.i, 4)
b = blk.dense(ti.j, 4)

and b will no longer inherit any dimensions from a. So a will be 1D with shape (8), and b will be 2D with shape (2, 4). This is possible thanks to the on-the-fly 'start' value computation.
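
A hedged sketch of what that would look like with this branch applied (field names are made up; the shapes are taken from the description above):

import taichi as ti

ti.init()

xa = ti.field(ti.f32)
xb = ti.field(ti.f32)

blk = ti.root.dense(ti.i, 2)
a = blk.dense(ti.i, 4)
b = blk.dense(ti.j, 4)
a.place(xa)
b.place(xb)

# With this PR, b no longer inherits a's i-extent.
assert xa.shape == (8,)      # 2 * 4 along ti.i
assert xb.shape == (2, 4)    # ti.i from blk, ti.j from b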

It was also possible to get rid of the 'start' value in listgen by using block indices instead of the 'block corner leaf cell' like before. This just took a small change to the refine_coordinates function.


blk = root.pointer(ti.i, 4)
blk.dense(ti.i, 4).place(x)

old listgen

dense:  0 1 2 3   4 5 6 7   8 9 10 11  12 13 14 15 

pointer:  0         4         8          12

root:    0


new listgen

dense:  0 1 2 3   4 5 6 7   8 9 10 11  12 13 14 15 

pointer:  0         1         2           3

root:    0

As for accessing intermediate snodes, they can still be accessed by dividing by the block size. For example, if we wanted to deactivate cells 8 through 11, we just want to deactivate the pointer with index 2, so we could call blk.deactivate(9 // 4) or blk.deactivate(2).

You should be able to pull this PR and play around with it yourself. The only thing that isn't working yet is block local storage.

@k-ye
Member

k-ye commented Apr 9, 2021

Right, this PR basically reverts what #1558 has introduced. IIRC, it was added mainly to support this use case:

parent = # some SNode
parent.dense(ti.ijk, 8).place(blk)
parent.dynamic(ti.l, 256).place(lst)

for I in ti.grouped(blk):
  # Every 8x8x8 dense `blk` cells share the same dynamic `lst`. 
  # So we can use `I` to access the sibling `lst` cell. 

This pattern is now used in the taichi MPM solvers. (TODO: Find a real example)

Without trailing bits, sharing the same lst cell among the 8x8x8 blk cells requires one to manually divide the index by (8, 8, 8), which is kind of repetitive. Ideally, we should only need to define the shape of the SNodes once in the program. In other words, this violates the design goal that the computation is decoupled from the data structure.

I think what we probably need here is an API to meet this need. For example, rescale_index(index, from_snode, to_snode) returns index * shape(to_snode) / shape(from_snode). This API may need additional preconditions, e.g. one snode must be an ancestor of the other, etc. Hopefully this strikes a balance between keeping Taichi's hierarchical data structure intuitive and satisfying the needs of the MPM solvers.
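
To make the semantics concrete, here is a minimal shape-based sketch of the idea (plain Python; the proposed API would take SNodes rather than shapes, and nothing here is an existing Taichi function):

def rescale_index(index, from_shape, to_shape):
    # index * shape(to_snode) / shape(from_snode), applied per dimension.
    return tuple(i * t // f for i, f, t in zip(index, from_shape, to_shape))

# E.g. mapping a fine-grid index down to an 8x-coarser sibling, assuming a
# hypothetical 64^3 fine grid: every 8x8x8 block maps to one coarse cell.
assert rescale_index((13, 5, 20), (64, 64, 64), (8, 8, 8)) == (1, 0, 2)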

As for this PR, because of the existing usage, it seems that a more reasonable approach would be to first think about what kind of API we need to satisfy the MPM solvers' needs, migrate the solvers to that API, and then deprecate the trailing bits.

@KLozes KLozes removed the error-prone and help wanted labels Apr 12, 2021
Contributor

@xumingkuan xumingkuan left a comment


LGTM! We might also update the code in other repos such as https://github.com/taichi-dev/taichi_elements to make sure that code is not broken.

int num_indices = (int)ptr->indices.size();
for (int i = 0; i < num_indices; i++) {
  auto diff = irpass::analysis::value_diff_loop_index(ptr->indices[i],
                                                      for_stmt, i);
  if (diff.linear_related()) {
    if (diff.related_() and diff.coeff > 0) {
Contributor


nit :)

Suggested change
- if (diff.related_() and diff.coeff > 0) {
+ if (diff.related_() && diff.coeff > 0) {

Member

@k-ye k-ye left a comment


This is great work, thank you so much for the update! However, I still have to point out a few things:

  1. We need to first add the index rescaling API, so that we don't manually compute that in the kernel. We will also need to check in this API to update other repos before accepting the changes here.
  2. How do we handle the parent coordinates if the leaf SNode has offsets? Does it start from 0, or the rescaled offset?
  3. Did we verify that the BLS still works? I.e., the compiler is still able to infer the BLS storage size and convert the global index <-> block index. I think one possible approach is to add some CPP unit tests around initialize_scratch_pad() and make_block_local(). Given that these two functions are so large, it will be great if the implementation of these two passes can be broken down into smaller, testable functions (possibly in a series of PRs)

get_runtime_function(leaf_block->refine_coordinates_func_name());

create_call(refine, {parent_coordinates, block_corner_coordinates,
                     tlctx->get_constant(0)});
Member


Why is this 0?

Collaborator Author


The block corner has the same coordinates as the leaf element of the first (0th) loop_index, and it is the same for any loop_index. The refine function here basically just left-shifts the parent coordinates by num_bits.

auto new_coordinates = create_entry_block_alloca(physical_coordinate_ty);

create_call(refine, {parent_coordinates, new_coordinates,
                     builder->CreateLoad(loop_index)});

if (stmt->snode->type == SNodeType::bit_array && stmt->snode->parent) {
Member


Could you explain why this is needed?

Collaborator Author


One more refine step was needed for bit_arrays to make the loop coordinates non-consecutive, since the operation is going to be vectorized over multiple coordinates per loop_index.
For example, for a 1-D block of 64 leaf elements and a vectorization width of 32, only two loop coordinates are needed. Without this extra refine step, their coordinates would be a consecutive 0, 1 instead of the non-consecutive 0, 32.
This step was also not needed before, since coordinates were always non-consecutive.
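
A rough illustration of that arithmetic (plain Python, assuming a power-of-two vectorization width; this is not the actual codegen):

num_leaves = 64
vec_width = 32                                # coordinates handled per loop_index
num_loop_indices = num_leaves // vec_width    # only 2 loop indices are needed

# The extra refine step left-shifts each loop index by log2(vec_width) = 5 bits,
# so the two loop indices address coordinates 0 and 32 rather than 0 and 1.
coords = [li << 5 for li in range(num_loop_indices)]
assert coords == [0, 32]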

@KLozes
Collaborator Author

KLozes commented Apr 22, 2021

This is great work, thank you so much for the update! However, I still have to point out a few things:

Thanks! I've learned a ton too!

  1. We need to first add the index rescaling API, so that we don't manually compute that in the kernel. We will also need to check in this API to update other repos before accepting the changes here.

Alright, I'll start thinking about an API.

  2. How do we handle the parent coordinates if the leaf SNode has offsets? Does it start from 0, or the rescaled offset?

I am not sure about this, but the offset tests seem to be working fine. One just has to make sure that offsets are applied to blocks correctly on the front end when necessary. See the changes to bls_test_template.py:
f1e5819#diff-2ea5f113062438816406fbdb85d44d0b44fde412eed818ea1c06c9584c887277R138-R154

  3. Did we verify that the BLS still works? I.e., the compiler is still able to infer the BLS storage size and convert the global index <-> block index. I think one possible approach is to add some CPP unit tests around initialize_scratch_pad() and make_block_local(). Given that these two functions are so large, it will be great if the implementation of these two passes can be broken down into smaller, testable functions (possibly in a series of PRs)

I did verify that the tests in test_bls.py and test_bls_assume_in_range.py work, and they look pretty extensive. I also plan to add some more BLS tests with the rescaling API.

@k-ye
Member

k-ye commented Apr 24, 2021

I am not sure about this, but the offset tests seem to be working fine. One just has to make sure that offsets are applied to blocks correctly on the front end when necessary. See the changes to bls_test_template.py:
f1e5819#diff-2ea5f113062438816406fbdb85d44d0b44fde412eed818ea1c06c9584c887277R138-R154

Thanks! After taking another look, I'm not entirely sure if this is correct in the first place:

for i in x:
    # It is important to ensure insert and p2g uses the exact same way to compute the base
    # coordinates. Otherwise there might be coordinate mismatch due to float-point errors.
    base = ti.Vector([
        int(ti.floor(x[i][0] * N) - grid_offset[0]),
        int(ti.floor(x[i][1] * N) - grid_offset[1])
    ])
    ti.append(pid.parent(), base, i)

The question is: do we really need to subtract the offset when computing base?


@xumingkuan and I had a discussion to go over how offsets are handled in Taichi right now. Here's the summary:

For a place SNode with offset, the offset is useful only when we iterate it using a struct-for.

  1. When flattening a GlobalPtrExpression (e.g. x[i, j]), the offsets are first subtracted from the indices:

if (bool(var) && var.is<GlobalVariableExpression>()) {
  snode = var.cast<GlobalVariableExpression>()->snode;
  offsets = snode->index_offsets;
}
for (int i = 0; i < (int)indices.size(); i++) {
  indices.exprs[i]->flatten(ctx);
  Stmt *ind = indices.exprs[i]->stmt;
  if (!offsets.empty()) {
    // Subtract offsets from indices so that new indices are
    // within [0, +inf)
    auto offset = ctx->push_back<ConstStmt>(TypedConstant(offsets[i]));
    ind = ctx->push_back<BinaryOpStmt>(BinaryOpType::sub, ind, offset);

  2. Later, the loop index of the struct-for is transformed into BinOp(+, loop_index, offset) in lower_ast():

if ((int)offsets.size() > i && offsets[i] != 0) {
  auto offset_const =
      new_statements.push_back<ConstStmt>(TypedConstant(offsets[i]));
  auto result = new_statements.push_back<BinaryOpStmt>(
      BinaryOpType::add, loop_index, offset_const);

A concrete example:

x = ...place(..., offset=(-16,))  # ``x`` has an offset of -16
y = ...place(...)  # ``y`` has no offset

for I in x:
  x[I]
  y[I]

GlobalPtrExpression::flatten() --->

for I in x:
  x[I - (-16)]  # ``x`` has an offset of -16, so its GlobalPtrExpression's indices are subtracted by (-16)
  y[I]  # ``y`` does not have any offset, so its GlobalPtrExpression's indices are not offseted

lower_ast() --->

for (I - 16) in x:  # ``I`` still starts from 0
  x[(I - 16) - (-16)]  # i.e., x[I]
  y[I - 16]

So, when we are accessing a leaf SNode with offset, the accessing indices should still begin from 0. It's just that the compiler replaces the loop index I statement with a binary op of I + offset.


Therefore I don't think it makes sense to subtract the offset in that test case. To further support this point, consider this simpler example:

import taichi as ti

ti.init(arch=ti.cpu, print_ir=True)

y = ti.field(ti.i32)

offset = (-2,)

grid = ti.root.dense(ti.i, 4)
dyn = grid.dynamic(ti.j, 4)
dyn.place(y, offset=offset +(0,))

@ti.kernel
def foo():
  ti.append(dyn, [-2], 0)

foo()

The final IR is shown below:

kernel {
  $0 = offloaded  
  body {
    <i32> $1 = const [0]
    <*gen> $2 = get root
    <*gen> $3 = [S0root][root]::lookup($2, $1) activate = false
    <*gen> $4 = get child [S0root->S1dense] $3
    <i32> $5 = const [2]
    <*gen> $6 = [S1dense][dense]::lookup($4, $5) activate = false
    <*gen> $7 = get child [S1dense->S2dynamic] $6
    <i32> $8 = append [S2dynamic] ptr = $7, val = $1
  }
}

$5 = const [2] is what [-2] gets translated to, which should have been a const of 0 (the offset is -2). Moreover, a different offset here actually doesn't make a difference, and the const 2 is simply derived from -2 & 3:

    <i32> $1 = const [-2]
    <i32> $2 = const [0]
    <i32> $3 = shuffle $1[0]  # same as $1
    ...
    <i32> $8 = const [0]
    <i32> $9 = bit_sar $3 $8  # same as $1
    <i32> $10 = const [3]
    <i32> $11 = bit_and $9 $10  <-- equiv. to ($1 & $10) = (-2 & 3) = 2

I did verify that the tests in test_bls.py and test_bls_assume_in_range.py work, and they look pretty extensive.

Unfortunately, I don't think that these tests passing would mean anything (sorry, I didn't mean to blame). IIUC, if the compiler cannot deduce the BLS size, it would simply not use the shared memory. So the test would suffer slower performance, but there won't be any correctness issue. Therefore a better approach would be to test the BLS-related passes in the C++ layer. Let me think about how to refactor the code so that we can achieve that :-)

@k-ye
Member

k-ye commented Apr 26, 2021

Also, could you break down this PR into several smaller ones, so that we can make progress much quicker :-) ? E.g.

  1. rescale_index API
  2. changes to deal with DiffRange multiplication and scratch pad initialisation
  3. changes to remove start
  4. changes to remove the trailing bits (might need further breaking down)

@KLozes
Collaborator Author

KLozes commented Apr 26, 2021

Hi @k-ye

$5 = const [2] is what [-2] gets translated to, which should have been a const of 0 (the offset is -2). Moreover, a different offset here actually doesn't make a difference, and the const 2 is simply derived from -2 & 3

I see what you mean! The grid_offset definitely should not be required to make this work. This seems to be a bug in how offsets are handled when flattening the append function call.

Unfortunately, I don't think that these tests passing would mean anything (sorry, I didn't mean to blame). IIUC, if the compiler cannot deduce the BLS size, it would simply not use the shared memory. So the test would suffer slower performance, but there won't be any correctness issue. Therefore a better approach would be to test the BLS-related passes in the C++ layer. Let me think about how to refactor the code so that we can achieve that :-)

Ahh. Thanks for picking up this task!

@KLozes
Collaborator Author

KLozes commented Apr 26, 2021

Also, could you break down this PR into several smaller ones, so that we can make progress much quicker? E.g.
1. rescale_index API
2. changes to deal with DiffRange multiplication and scratch pad initialisation
3. changes to remove start
4. changes to remove the trailing bits (might need further breaking down)

Sure! I can try.

I think it will be best to keep 2 and 3 together, since the DiffRange multiplication was needed to make BLS tests work after changing indices to consecutive. 1 and 4 also go together. I actually put trailing bits back in, since attaching trailing bits to the "place" snode is a good way to implement the index re-scaling.

So I can split this PR into two:

  1. Consecutive indices for all snodes
  2. API for index re-scaling

@k-ye
Member

k-ye commented Apr 26, 2021

I think it will be best to keep 2 and 3 together, since the DiffRange multiplication was needed to make BLS tests work after changing indices to consecutive.

Hmm, I thought that "3. remove start" was purely a cleanup that is not strongly tied to other changes here. Maybe I missed something. Still, in general I prefer breaking down a big PR containing a series of related changes into individual ones. It should save both of us some effort discussing back and forth :-)

I think this PR serves as a great baseline to demonstrate our end goal. What we need is just to figure out the path to upstream it piece by piece, without breaking the existing usages or introducing new bugs. Personally I find these git tips useful:

1 and 4 also go together. I actually put trailing bits back in, since attaching trailing bits to the "place" snode is a good way to implement the index re-scaling.

Ah right! I used to think that we need to follow these steps:

  1. add the rescaling index API to taichi
  2. migrate the existing code using the trailing bits feature to this new API
  3. remove the trailing bits from taichi

But so long as the trailing bits are not removed, using the API will produce the wrong result.

@KLozes
Collaborator Author

KLozes commented Apr 26, 2021

Hmm, I thought that "3. remove start" was purely a cleanup that is not strongly tied to other changes here

Yes, I suppose actually removing the 'start' property from the extractor was just cleanup, since I now compute 'start' on the fly when it is needed. So it is no longer a property of an snode.

Ah right! I used to think that we need to follow these steps:

  1. add the rescaling index API to taichi
  2. migrate the existing code using the trailing bits feature to this new API
  3. remove the trailing bits from taichi

But so long as the trailing bits are not removed, using the API will produce the wrong result.

So, I should be more clear here. I'm using trailing bits to implement this API on the backend. I have totally replaced the old way of computing trailing bits. Now trailing bits are attached to the place snode to account for the difference in bits between an snode and its assigned "rescaling_snode". See my new commit, abbc4b9.
Maybe I should rename trailing_bits to rescaling_bits?

This API doesn't quite work yet. I believe lower_access and alias analysis work.

I still have some work to do on demote_dense_struct_for and listgen. Currently, the StructForStmt and OffloadStmt classes do not have access to the leaf (place) snode to grab the trailing_bits from. The 'snode' member is actually the leaf_block. So I added a new 'leaf_snode' member to them. But for some reason this causes async tests to fail. Any idea why?

@KLozes KLozes changed the title [Bug] [llvm] [ir] Complex multilevel structs. Consecutive indices for all snodes [Bug] [llvm] [ir] Complex multilevel structs. Consecutive indices for all snodes Draft Apr 26, 2021
@KLozes
Collaborator Author

KLozes commented Apr 26, 2021

I think this PR serves as a great baseline to demonstrate our end goal. What we need is just to figure out the path to upstream it piece by piece, without breaking the existing usages or to introduce new bugs. Personally I find these git tips useful:

Thanks!

I think I can split it into 3 fresh PRs. But they need to be pulled in this order:

  1. diffrange multiplication
  2. consecutive indices.
  3. rescaling API

@k-ye
Member

k-ye commented Apr 27, 2021

Thanks for the update!

I'm using trailing bits to implement this API on the backend. I have totally replaced the old way of computing trailing bits. Now trailing bits are attached to the place snode to account for the difference in bits between an snode and its assigned "rescaling_snode".

Hmm, doesn't this work against the original goal to drop the trailing bits feature completely? In my (maybe oversimplified) thought, rescale_index(from_snode, to_snode, index) can simply look at the shape of the two snodes, then divide one by the other to get the rescaling coefficient. (We may need additional preconditions on this method, such as that the indices used by these two SNodes must be the same, i.e., both using ti.ijk, and that they must be siblings, etc.)

@KLozes
Collaborator Author

KLozes commented Apr 27, 2021

@k-ye
Thanks for the discussion!

Hmm, doesn't this work against the original goal to drop the trailing bits feature completely?

So yes, I want to drop automatically computing trailing bits based on sibling snode shapes. Instead, trailing bits would be non-zero only when the rescaling API is called. So with this PR, the rescale_indices_to() calls below

a = ti.field(ti.f32)
b = ti.field(ti.f32)
c = ti.field(ti.f32)
d = ti.field(ti.f32)

blk = ti.root.pointer(ti.ij, 128)
blk.dense(ti.ij, 4).place(a)
blk.dense(ti.ij, 4).place(b)
blk.dynamic(ti.l, 128).place(c)
blk.dynamic(ti.l, 128).place(d)

c.rescale_indices_to(a)
d.rescale_indices_to(b)

would give the place(c) and place(d) trailing bits to rescale their ij indices from [0,1,2...,128) to [0,4,8,...512). This would lead to the same behavior we see in Taichi before this PR: all struct-fors and accesses of c and d will use indices [0,4,8, ...512).

My issue with the per-kernel rescaling that you proposed in another comment is that you can only rescale one field in each kernel. I think this would cause problems if there are two fields that need to be rescaled, like in the example above.

In my (maybe oversimplified) thought, rescale_index(from_snode, to_snode, index) can simply look at the shape of the two snodes, then divide one by the other to get the rescaling coefficient.

Yep! That's basically how I'm computing the trailing bits when this API is called. The trailing bits are just the difference in the total number of bits from root to leaf. The two snodes actually don't even have to be siblings or have the same ijkl indices.

@yuanming-hu
Member

yuanming-hu commented Apr 28, 2021

Hi @KLozes, thank you for being an awesome Taichi contributor! We really appreciate your contribution. We have a Taichi cup souvenir for every Taichi developer. Given that now we have a company around Taichi, we are finally able to ship the souvenir across the world (previously only within mainland China, unfortunately, due to COVID and lack of ¥). If you don't mind, could you share your address with yuanming@taichi.graphics and linan@taichi.graphics? We will then FedEx the cup to you :-)

(Sent the message on LinkedIn too a few days ago. You can reply there if you'd like to.)

(image: Taichi cup souvenir)

(Sorry about the off-topic post here!)

@KLozes
Collaborator Author

KLozes commented Apr 28, 2021

Hi @yuanming-hu

No problem! I've learned a ton and had fun contributing to Taichi! Thanks, I'd love to have one of those souvenirs :). Sorry about missing your message on LinkedIn!

@k-ye
Member

k-ye commented Apr 28, 2021

Thanks! Maybe it's better to list what we want to fix (#2248 (comment)):

  1. Remove the trailing bits, otherwise it complicates how a SNode's shape is computed
  2. Make indexing consecutive.

would give the place(c) and place(d) trailing bits to rescale their ij indices from [0,1,2...,128) to [0,4,8,...512).

This means the field index is still non-consecutive, so taichi still has to keep the implementation to support that :-(

My issue with the per-kernel rescaling that you proposed in another comment is that you can only rescale one field in each kernel.

Hmm, I guess you mean that we can keep the field index rescaled in all the kernels iterating over it? We should be able to rescale the index for any number of fields inside one kernel :-)

for I in a:
  Ib = ti.rescale_index(a, b, I) # Ib = I // 4
  # use `Ib` to access `b`
  Ic = ti.rescale_index(a, c, I)  # Ic = I // 4
  # use `Ic` to access `c`

Also, the fact that we had to introduce a rescale_snode in SNode itself means we're still paying the price of changing a pretty generic data structure to support a narrow use case. As mentioned before, I hope that we can agree that this complexity should be scoped into a special API, rather than making it part of the SNode data structure.

@KLozes
Collaborator Author

KLozes commented Apr 29, 2021

@k-ye

Ahh, I was thinking we wanted an API to optionally maintain the old non-consecutive index behavior. And that can only be done with trailing bits. But if you don't think that is necessary, I can certainly implement this (hopefully) simpler API! Looks like it should only require python code changes too.

@k-ye
Member

k-ye commented Apr 29, 2021

I can certainly implement this (hopefully) simpler API! Looks like it should only require python code changes too.

Sounds great, thank you!!

@KLozes
Collaborator Author

KLozes commented May 18, 2021

implemented in #2327

@KLozes KLozes closed this May 18, 2021
@KLozes KLozes deleted the multilevel branch June 23, 2021 20:32