[Bug] [llvm] [ir] Complex multilevel structs. Consecutive indices for all snodes Draft #2248
Conversation
Hi @KLozes, thanks for proposing this change! It's not a trivial change, so I will list my understanding and thoughts here. The trailing bits feature was introduced back in #1558. It adds two things:

```python
blk = ti.root.dense(ti.i, 2)
a = blk.dense(ti.i, 4)
b = blk.dense(ti.j, 4)
```

I can see how these would result in some confusion, such as #1682. @yuanming-hu probably knows better about the use case.
Right. But this was a bug where Taichi didn't properly enforce
That's the question to discuss...
Right, this does seem more intuitive. But one property Taichi needs to keep is that, when accessing some intermediary SNode with the leaf SNode's index, it will go to the same ancestor (if the leaf SNodes are from the same ancestor). In your example, ...

Appendix: some notes on taichi/tests/python/test_trailing_bits.py, lines 8 to 16 in 0035ec0.
Hi @k-ye, thanks for the discussion. My idea here is that the trailing bits concept no longer exists. Every field will have the indices it would have if it were the only field placed in the whole program. So the structure in the trailing bits test

```python
x = ti.field(ti.f32)
z = ti.field(ti.f32)
block = ti.root.pointer(ti.i, 8)
block.dense(ti.i, 32).place(x)
block.dense(ti.i, 16).place(z)
```

is now a valid data structure. The x field will have indices [0, 1, ..., 255] and z will have indices [0, 1, ..., 127]. The key is to compute the 'start' value on-the-fly during access lowering and demote_dense_struct_for, so that we can correctly access and loop over any field, no matter what fields are placed on the same tree. See lower_access.cpp in my first commit, 8e622d8. (Sorry, couldn't figure out how to get a nice viewable code link.) This data structure is also still valid:

```python
blk = ti.root.dense(ti.i, 2)
a = blk.dense(ti.i, 4)
b = blk.dense(ti.j, 4)
```

It was also possible to get rid of the 'start' value in listgen by using block indices instead of the 'block corner leaf cell' like before. This just took a small change to the refine_coordinates function.

As for accessing intermediate snodes: they can still be accessed by dividing the leaf index by the block size. So if we wanted to deactivate cells 8 through 11, we just want to deactivate the pointer with index 2, so we could call blk.deactivate(9 // 4) or blk.deactivate(2). You should be able to pull this PR and play around with it yourself. The only thing that isn't working yet is block local storage.
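To make the consecutive-index scheme concrete, here is a minimal sketch (the layout is illustrative, and it uses `ti.deactivate` with the block-level index as described above; the thread's `blk.deactivate` spelling may differ):

```python
import taichi as ti

ti.init(arch=ti.cpu)

x = ti.field(ti.f32)
blk = ti.root.pointer(ti.i, 8)
blk.dense(ti.i, 4).place(x)  # under this PR, x has consecutive indices [0, 32)

@ti.kernel
def deactivate_cell_block(i: ti.i32):
    # Leaf cells i..i+3 share one pointer cell, so dividing the leaf
    # index by the block size (4) gives the pointer index to deactivate.
    ti.deactivate(blk, [i // 4])
```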
Right, this PR basically reverts what #1558 introduced. IIRC, it was added mainly to support this use case:

```python
parent = ...  # some SNode
parent.dense(ti.ijk, 8).place(blk)
parent.dynamic(ti.l, 256).place(lst)

for I in ti.grouped(blk):
    # Every 8x8x8 dense `blk` cells share the same dynamic `lst`.
    # So we can use `I` to access the sibling `lst` cell.
```

This pattern is now used in the taichi MPM solvers. (TODO: Find a real example.) Without trailing bits, sharing the same index `I` between `blk` and `lst` would no longer work. I think what we probably need here is some API to help achieve this need. As of this PR, because of the existing usage, it seems that a more reasonable approach would be to first think about what kind of API we need to satisfy the MPM solver's need, migrate the solvers to that API, and then deprecate the trailing bits.
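For illustration, a hedged sketch of what this pattern might look like without trailing bits, where the kernel rescales the index by hand (the layout mirrors the snippet above; `parent` is made concrete and the appended payload is a placeholder):

```python
import taichi as ti

ti.init(arch=ti.cpu)

blk = ti.field(ti.f32)
lst = ti.field(ti.i32)
parent = ti.root.dense(ti.ijk, 4)      # illustrative stand-in for "some SNode"
parent.dense(ti.ijk, 8).place(blk)
lst_snode = parent.dynamic(ti.l, 256)  # keep a handle for ti.append
lst_snode.place(lst)

@ti.kernel
def scatter():
    for I in ti.grouped(blk):
        # Without trailing bits, blk's index spans the full dense extent
        # while lst has one cell per 8x8x8 block, so divide it down:
        J = I // 8
        ti.append(lst_snode, J, 0)  # payload value is a placeholder
```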
LGTM! We might also want to update the code in other repos such as https://github.com/taichi-dev/taichi_elements to make sure that code is not broken.
```diff
 int num_indices = (int)ptr->indices.size();
 for (int i = 0; i < num_indices; i++) {
   auto diff = irpass::analysis::value_diff_loop_index(ptr->indices[i],
                                                       for_stmt, i);
-  if (diff.linear_related()) {
+  if (diff.related_() and diff.coeff > 0) {
```
nit :)

```diff
-  if (diff.related_() and diff.coeff > 0) {
+  if (diff.related_() && diff.coeff > 0) {
```
This is great work, thank you so much for the update! However, I still have to point out a few things:

- We need to first add the index rescaling API, so that we don't manually compute that in the kernel. We will also need to check in this API and update other repos before accepting the changes here.
- How do we handle the parent coordinates if the leaf SNode has offsets? Do they start from 0, or from the rescaled offset?
- Did we verify that BLS still works? I.e., is the compiler still able to infer the BLS storage size and convert between the global index and the block index? I think one possible approach is to add some C++ unit tests around initialize_scratch_pad() and make_block_local(). Given that these two functions are so large, it would be great if the implementation of these two passes could be broken down into smaller, testable functions (possibly in a series of PRs).
```cpp
get_runtime_function(leaf_block->refine_coordinates_func_name());
// ...
create_call(refine, {parent_coordinates, block_corner_coordinates,
                     tlctx->get_constant(0)});
```
Why is this 0?
The block corner has the same coordinates as the leaf element of the first (0th) loop_index, and it is the same for any loop_index. The refine function here basically just left-shifts the parent coordinates by num_bits.
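A minimal pure-Python sketch of the refinement just described (the function name is illustrative; num_bits is log2 of the block size along one axis):

```python
def refine_coordinate(parent_coord: int, num_bits: int, loop_index: int) -> int:
    # Left-shifting the parent coordinate by num_bits yields the block
    # corner (the 0th leaf element); adding the loop index selects a
    # particular leaf element within the block.
    return (parent_coord << num_bits) + loop_index

# Parent cell 3 of a 4-wide block (num_bits = 2): the corner is 12.
assert refine_coordinate(3, 2, 0) == 12
assert refine_coordinate(3, 2, 1) == 13
```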
```cpp
auto new_coordinates = create_entry_block_alloca(physical_coordinate_ty);
// ...
create_call(refine, {parent_coordinates, new_coordinates,
                     builder->CreateLoad(loop_index)});
// ...
if (stmt->snode->type == SNodeType::bit_array && stmt->snode->parent) {
```
Could you explain why this is needed?
One more refine step was needed for bit_arrays to make the loop coordinates non-consecutive, since the operation is going to be vectorized over multiple coordinates per loop_index.

For example, for a 1-D block of 64 leaf elements with a vectorization width of 32, only two loop coordinates are needed. Without this extra refine step, their coordinates would be a consecutive 0, 1 instead of a non-consecutive 0, 32.

This step was not needed before, since coordinates were always non-consecutive.
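A small numeric sketch of that example (plain Python, just to spell out the arithmetic):

```python
# 64 leaf elements vectorized 32 at a time -> only two loop indices.
num_leaf_elements = 64
vector_width = 32
num_loop_indices = num_leaf_elements // vector_width  # 2

# Without the extra refine step, the loop coordinates stay consecutive:
assert list(range(num_loop_indices)) == [0, 1]
# The extra refine step scales them by the vectorization width:
assert [i * vector_width for i in range(num_loop_indices)] == [0, 32]
```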
Thanks! I've learned a ton too!
Alright, I'll start thinking about an API.
I am not sure about this, but the offset tests seem to be working fine. One just has to make sure that offsets are applied to blocks correctly on the front end when necessary. See the changes to bls_test_template.py:
I did verify that the tests in test_bls.py and test_bls_assume_in_range.py work, and they look pretty extensive. I also plan to add some more BLS tests with the rescaling API.
Thanks! After taking another look, I'm not entirely sure if this is correct in the first place:

taichi/tests/python/bls_test_template.py, lines 170 to 177 in 939e1d3

The question is: do we really need to subtract the grid offset here? @xumingkuan and I had a discussion to go over how offsets are handled in Taichi right now. Here's the summary (see taichi/taichi/ir/frontend_ir.cpp, lines 196 to 207, and taichi/taichi/transforms/lower_ast.cpp, lines 302 to 306, in f34203f):
A concrete example:

```python
x = ...place(..., offset=(-16,))  # ``x`` has an offset of -16
y = ...place(...)                 # ``y`` has no offset

for I in x:
    x[I]
    y[I]
```

is lowered so that the offset is subtracted back out in each access, and the loop index itself is substituted:

```python
for I in x:
    x[I - (-16)]  # ``x`` has an offset of -16, so its GlobalPtrExpression's indices are subtracted by (-16)
    y[I]          # ``y`` does not have any offset, so its GlobalPtrExpression's indices are not offset

for (I - 16) in x:       # ``I`` still starts from 0
    x[(I - 16) - (-16)]  # i.e., x[I]
    y[I - 16]
```

So, when we are accessing a leaf SNode with offset, the accessing indices should still begin from 0. It's just that the compiler replaces the loop index. Therefore I don't think it makes sense to subtract the offset here:

```python
import taichi as ti

ti.init(arch=ti.cpu, print_ir=True)

y = ti.field(ti.i32)
offset = (-2,)
grid = ti.root.dense(ti.i, 4)
dyn = grid.dynamic(ti.j, 4)
dyn.place(y, offset=offset + (0,))

@ti.kernel
def foo():
    ti.append(dyn, [-2], 0)

foo()
```

The final IR is shown below: ...
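To spell out the offset substitution described above, here is a tiny plain-Python arithmetic check (illustrative only):

```python
offset = -16
shape = 3
for I in range(shape):            # internal loop index, starts from 0
    user_index = I + offset       # the user-visible index: -16, -15, -14
    access = user_index - offset  # the GlobalPtrExpression subtracts the offset
    assert access == I            # so leaf accesses still begin at 0
```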
Unfortunately, I don't think these tests passing means much (sorry, I didn't mean to blame). IIUC, if the compiler cannot deduce the BLS size, it will simply not use the shared memory. So the test would suffer slower performance, but there won't be any correctness issue. Therefore a better approach would be to test the BLS-related passes in the C++ layer. Let me think about how to refactor the code so that we can achieve that :-)
Also, could you break this PR down into several smaller ones, so that we can make progress much quicker :-) ? E.g.: ...
Hi @k-ye,

I see what you mean! The grid_offset definitely should not be required to make this work. This seems to be a bug in how offsets are handled when flattening the append function call.
Ahh. Thanks for picking up this task!
Sure! I can try. I think it will be best to keep 2 and 3 together, since the diffrange multiplication was needed to make the BLS tests work after changing the indices to consecutive. 1 and 4 also go together. I actually put trailing bits back in, since attaching trailing bits to the "place" snode is a good way to implement the index re-scaling. So I can split this PR into 2.
Hmm, I thought that "3. remove the 'start' property" was independent cleanup? I think this PR serves as a great baseline to demonstrate our end goal. What we need is just to figure out the path to upstream it piece by piece, without breaking the existing usages or introducing new bugs. Personally I find these ...
Ah right! I used to think that we needed to follow these steps: ...

But so long as the trailing bits are not removed, using the API will produce the wrong result...
Yes, I suppose actually removing the 'start' property from the extractor was just cleanup, since I now compute 'start' on-the-fly when it is needed. So it is no longer a property of an snode.
So, I should be more clear here. I'm using trailing bits to implement this API on the backend. I have totally replaced the old way of computing trailing bits. Now trailing bits are attached to the place snode to account for the difference in bits between an snode and its assigned "rescaling_snode". See my new commit, abbc4b9. This API doesn't quite work yet. I believe lower_access and alias analysis work; I still have some work to do on demote_dense_struct_for and listgen. Currently, the StructForStmt and OffloadStmt classes do not have access to the leaf (place) snode to grab the trailing_bits from. The 'snode' member is actually the leaf_block, so I added a new 'leaf_snode' member to them. But for some reason this causes the async tests to fail. Any idea why?
Thanks! I think I can split it into 3 fresh PRs, but they need to be pulled in this order: ...
Thanks for the update!
Hmm, doesn't this work against the original goal of dropping the trailing bits feature completely? In my (maybe oversimplified) thought, ...
@k-ye

So yes, I want to drop automatically computing trailing bits based on sibling snode shapes. Instead, trailing bits would be non-zero only when the rescaling API is called. So with this PR, the rescale_indices_to() function in

```python
a = ti.field(ti.f32)
b = ti.field(ti.f32)
c = ti.field(ti.f32)
d = ti.field(ti.f32)

blk = ti.root.pointer(ti.ij, 128)
blk.dense(ti.ij, 4).place(a)
blk.dense(ti.ij, 4).place(b)
blk.dynamic(ti.l, 128).place(c)
blk.dynamic(ti.l, 128).place(d)

c.rescale_indices_to(a)
d.rescale_indices_to(b)
```

would give the place(c) and place(d) snodes trailing bits to rescale their ij indices from [0, 1, 2, ..., 128) to [0, 4, 8, ..., 512), which would lead to the same behavior we see in Taichi before this PR: all struct-fors and accesses of c and d will use indices [0, 4, 8, ..., 512).

My issue with the per-kernel rescaling that you proposed in another comment is that you can only rescale one field in each kernel. I think that would cause problems if there are two fields that need to be rescaled, like the example above.
Yep! This is basically how I'm computing the trailing bits when this API is called. The trailing bits are just the difference in the total number of bits from root to leaf. The two snodes actually don't even have to be siblings or have the same ijkl indices.
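A hedged pure-Python sketch of that bit accounting (the helper is illustrative, not the actual implementation):

```python
def trailing_bits(field_bits: int, rescaling_bits: int) -> int:
    # Trailing bits = difference in the total number of bits accumulated
    # along the root-to-leaf paths of the field and its rescaling snode.
    return rescaling_bits - field_bits

# In the example above (one axis): `a` has 7 bits from pointer(128) plus
# 2 bits from dense(4) = 9 bits, while `c` only has the pointer's 7 bits,
# so place(c) gets 2 trailing bits and its ij indices become multiples of 4.
assert trailing_bits(7, 9) == 2
```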
Hi @KLozes, thank you for being an awesome Taichi contributor! We really appreciate your contribution. We have a Taichi cup souvenir for every Taichi developer. Given that now we have a company around Taichi, we are finally able to ship the souvenir across the world (previously only within mainland China, unfortunately, due to COVID and lack of ¥). If you don't mind, could you share your address with yuanming@taichi.graphics and linan@taichi.graphics? We will then FedEx the cup to you :-) (Sent the message on LinkedIn too a few days ago. You can reply there if you'd like to.) (Sorry about the off-topic post here!)
Hi @yuanming-hu No problem! I've learned a ton and had fun contributing to Taichi! Thanks, I'd love to have one of those souvenirs :). Sorry about missing your message on LinkedIn!
Thanks! Maybe it's better to list what we want to fix (#2248 (comment)): ...

This means the field index is still non-consecutive, so Taichi still has to keep the implementation to support that :-(

Hmm, I guess you mean that we can keep the field index rescaled in all the kernels iterating over it? We should be able to rescale the index for any number of fields inside one kernel :-)

```python
for I in a:
    Ib = ti.rescale_index(a, b, I)  # Ib = I // 4
    # use `Ib` to access `b`
    Ic = ti.rescale_index(a, c, I)  # Ic = I // 4
    # use `Ic` to access `c`
```

Also, the fact that we had to introduce a ...
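For context, a fuller sketch of that per-kernel rescaling (assuming `ti.rescale_index` behaves as in the snippet above; the layout and kernel are illustrative):

```python
import taichi as ti

ti.init(arch=ti.cpu)

a = ti.field(ti.f32)
b = ti.field(ti.f32)
blk = ti.root.pointer(ti.i, 8)
blk.dense(ti.i, 4).place(a)  # a's indices: [0, 32)
blk.place(b)                 # one `b` cell per 4-wide block

@ti.kernel
def accumulate():
    for I in ti.grouped(a):
        Ib = ti.rescale_index(a, b, I)  # Ib = I // 4 under this layout
        b[Ib] += a[I]
```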
Ahh, I was thinking we wanted an API to optionally maintain the old non-consecutive index behavior, and that can only be done with trailing bits. But if you don't think that is necessary, I can certainly implement this (hopefully) simpler API! Looks like it should only require Python code changes too.
Sounds great, thank you!!
Implemented in #2327.
Related issue: #2177

I think this is a pretty important PR. It opens the door for much more complicated structures!
This PR addresses the peculiar behavior seen in listgen and in accessing fields with more complex multilevel structs; see the issue above. As @k-ye discovered (thanks!), it is because we assume the first child of an snode when we compute the 'start' member of the extractors.
So I've done away with the 'start' member entirely; the start values are now calculated on-the-fly where they are needed, in the demote_dense_struct_for and lower_access IR passes. This was necessary because the list of start values is unique to the leaf-to-root path of the field being accessed.
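A rough pure-Python sketch of what computing 'start' on-the-fly means here (illustrative only; the real logic lives in those C++ passes):

```python
def starts_along_path(bits_per_snode):
    # bits_per_snode: bits contributed to one axis by each SNode on the
    # leaf-to-root path, leaf first, e.g. [2, 3] for dense(4) under
    # pointer(8). The 'start' of each level is the number of index bits
    # contributed by the levels below it.
    starts, acc = [], 0
    for bits in bits_per_snode:
        starts.append(acc)
        acc += bits
    return starts

assert starts_along_path([2, 3]) == [0, 2]
```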
I was also able to get rid of the trailing_bits member, since every field can now be accessed independently of the shape of other fields, which I think is more intuitive.
This PR also changes the definition of indices for parents of leaf snodes. For example, the structure ... used to have the indices below; I believe 0, 1, 2, or 3 could be used to access the first element of dense0. With this PR, dense0 has unique indices. See the changes to test_sparse_deactivate.py.
If someone can figure out a fix for BLS, that would be really helpful. My attempt did not work, but I think it is on the right track.