
[Lang] [IR] Kernel scalar return support (ArgStoreStmt -> KernelReturnStmt) #917

Merged: archibate merged 25 commits into taichi-dev:master on May 8, 2020

Conversation

@archibate (Collaborator) commented May 3, 2020

Related issue = #909

Currently I'm still using arg[0] to store the return value; ideally we want an extra buffer in the context for return values (which would help multi-return).
That's fine for other backends, but I'd rather not do it for OpenGL: GL limits the number of buffers to 10, and we already use up to 13~17 if all buffers are enabled.
So if a single kernel uses an external array, arguments, global temporaries, and 64-bit integers, GL fails. We rely on kernels being simple, and fortunately no test covers such a complicated combination yet.
I will try to combine them into one buffer so the arg buffer stays r/w: the first part write-only (ret) and the second part read-only (arg).

To test:

import taichi as ti

ti.init(arch=ti.opengl, print_preprocessed=True)

@ti.kernel
def func(t: ti.i32):
    return 233


res = func(666)
print(res)


@archibate (Collaborator Author) commented May 3, 2020

@yuanming-hu Things to consider:

  1. Do we type-hint the return type, or is it determined by the rhs of the last return?
  2. What about diff kernels? I'm not very knowledgeable about difftaichi... sorry.

@archibate (Collaborator Author) commented May 4, 2020

template <typename T, typename G>
T taichi_union_cast_with_different_sizes(G g) {
  union {
    T t;
    G g;
  } u;
  u.g = g;
  return u.t;
}

What if sizeof(G) < sizeof(T)? Will the higher part be left uninitialized rather than zero-initialized?
What if it's a signed type? Will its sign bit be extended correctly? (e.g. 1011 -> 11111011)
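
For what it's worth, here is a small sketch (in Python, with struct) of the sign-extension worry above; the values and little-endian layout are only for illustration:

import struct

# Mimic a naive union copy of a 32-bit signed value into a 64-bit slot:
# the low 4 bytes hold the value, the high 4 bytes are left as zero.
g = -5
raw = struct.pack('<iI', g, 0)   # low word = g, high word = 0
t = struct.unpack('<q', raw)[0]  # reinterpret as a signed 64-bit integer
print(t)  # 4294967291, not -5 -- the sign bit was not extended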

@archibate archibate marked this pull request as ready for review May 4, 2020 04:33
@archibate (Collaborator Author)

How do I transfer ret_type from transformer.py to kernel.py?

@yuanming-hu (Member)

Good questions.

  1. Do we type-hint the return type, or is it determined by the rhs of the last return?

We should type-hint @ti.kernels.

  2. What about diff kernels? I'm not very knowledgeable about difftaichi... sorry.

Let's simply disallow return types for diffTaichi kernels.

How do I transfer ret_type from transformer.py to kernel.py?

Not sure if I clearly understand your question but I believe you can figure out a solution on your own.

@archibate (Collaborator Author) left a comment

NEXT: separate ret and arg.

@@ -145,10 +169,42 @@ void Kernel::set_arg_int(int i, int64 d) {
}
}

int64 Kernel::get_arg_int(int i) { // TODO: will consider uint?
@archibate (Collaborator Author):

Does unsigned/signed extension matter for get? I just copied your set_arg_int.

Member:

That should be fine.

@@ -1646,6 +1646,22 @@ class FuncCallStmt : public Stmt {
DEFINE_ACCEPT
};

class KernelReturnStmt : public Stmt {
public:
Stmt *value;
@archibate (Collaborator Author):

Let's put multi-return in another PR.

Comment on lines 505 to 507
emit("_args_{}_[0] = {};", // TD: correct idx, another buf
"i64",//data_type_short_name(stmt->element_type()),
stmt->value->short_name());
@archibate (Collaborator Author):

Too bad, still using arg[0] for the return value. We want ret[0] instead.
Also, please update the context and get_ret_int.

@@ -179,6 +179,9 @@ def reset(self):
self.compiled_functions = self.runtime.compiled_grad_functions

def extract_arguments(self):
self.arguments.append(i64) # TODO: rettype
@archibate (Collaborator Author):

Suggestion: rename self.arguments -> self.argument_types (fixing a broken window).

@archibate (Collaborator Author)

Dirty-hacked args/ret into one buffer for now; to test:

import taichi as ti

ti.init(arch=ti.opengl, print_preprocessed=True)

@ti.kernel
def func(a: ti.i64) -> ti.i64:
    return a * 2


res = func(233)
print(res) # 466

@archibate (Collaborator Author)

How do I inspect a function's return annotation with the inspect module?
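
One possible way, just as a sketch using the standard inspect module (the example function here is made up):

import inspect

def func(a: int) -> float:
    return a * 2.0

sig = inspect.signature(func)
ret = sig.return_annotation          # <class 'float'>
if ret is inspect.Signature.empty:   # no '-> ...' annotation was given
    ret = None
print(ret)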

@archibate (Collaborator Author)

NEXT: figure out how to define element_type() for KernelReturnStmt. Sleeping now so I don't miss class; continuing tomorrow.

@archibate archibate requested a review from yuanming-hu May 6, 2020 07:59
@archibate (Collaborator Author)

Done with LLVM here, now check out:

import taichi as ti

ti.init(arch=ti.x64, print_preprocessed=True, print_ir=True)

@ti.kernel
def func(a: ti.i64, b: ti.f64) -> ti.f64:
    return a * b

res = func(100, 2.333)
print(res)

@yuanming-hu yuanming-hu requested a review from xumingkuan May 7, 2020 03:38
@k-ye (Member) left a comment

I think I need to add Metal support for this once it's in?

BTW, am I correct in saying that this PR is using args[0] as the return value? Could this be problematic if the kernel takes in arguments?

@archibate (Collaborator Author)

I think I need to add Metal support for this once it's in?

Yes, would you like to do it in this PR or not?

BTW, am I correct in saying that this PR is using args[0] as the return value?

Yes, it's args[0] for metal and opengl, and result_buffer[0] for llvm.

problematic if the kernel takes in arguments?

Not a problem since KernelReturnStmt will always be the last statement.

@k-ye (Member) commented May 7, 2020

Yes, would you like to do it in this PR or not?

I can take care of that in another PR, but thanks!

it's args[0] for metal and opengl, and result_buffer[0] for llvm. Not a problem since KernelReturnStmt will always be the last statement.

OK. I see that you have a TODO to consider using a result buffer for OpenGL as well. I guess it's probably cleaner if we either:

  1. Follow the TODO and separate return args from input args; or
  2. Keep something similar to the existing design, where args and return values share the same buffer but use different indices.

WDYT? (I'm not suggesting to do this here, just a goal in the future)

@archibate archibate changed the title [IR] [Refactor] Add KernelReturnStmt [Lang] [IR] [Refactor] Add KernelReturnStmt to support return in kernel May 7, 2020
@archibate archibate changed the title [Lang] [IR] [Refactor] Add KernelReturnStmt to support return in kernel [Lang] [IR] Kernel return support: remove ArgStoreStmt, add KernelReturnStmt May 7, 2020
@archibate (Collaborator Author)

I prefer 2, although LLVM is already on 1.
First, implementing 1 can be hard due to the OpenGL API limitations mentioned in the PR description, no matter our philosophy.
Second, having one buffer read-only and another write-only sounds like a waste. Also, they won't influence each other, as demonstrated. Why not just combine them? Having different indices/typings is already good enough. So I suggest LLVM & Metal use 2 too.
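
A rough sketch of what option 2 could look like (purely illustrative Python, not the actual Taichi layout; the slot count and helper names are made up):

# One shared buffer: return slots first (write-only), argument slots after
# (read-only), distinguished only by index ranges.
NUM_RET_SLOTS = 1  # assumed: slot 0 reserved for the scalar return value

def ret_slot(i):
    return i                  # return value i sits at the front

def arg_slot(i):
    return NUM_RET_SLOTS + i  # argument i follows the return slots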

@yuanming-hu (Member) left a comment

Finalized my pass. Looks great in general except for a few nits...

} else if (dt == DataType::i64) {
return (float64)fetch_result<int64>(i);
} else if (dt == DataType::i8) {
return (float64)fetch_result<int8>(i);
Member:

Sounds good. I trust your decision.

}

template <typename T>
static T fetch_result(int i) // TODO: move to Program::fetch_result
Member:

It would be great to move fetch_result_uint64 and this to Program in this PR. The modification would not be too big I guess, since you already need a few get_current_program calls in fetch_result_uint64.

taichi/ir/snode.cpp
uint64 SNode::fetch_reader_result() {
uint64 ret;
auto arch = get_current_program().config.arch;
if (arch == Arch::cuda) {
// TODO: refactor
// XXX: what about unified memory?
Member:

Suggested change
// XXX: what about unified memory?
// We use a `memcpy_device_to_host` call here even if we have unified memory. This simplifies code. Also note that a unified memory (4KB) page fault is rather expensive for reading 4-8 bytes.

}
auto &rets = current_kernel->rets;
TI_ASSERT(rets.size() >= 1);
auto ret = rets[0]; // TODO: stmt->ret_id?
Member:

Suggested change
auto ret = rets[0]; // TODO: stmt->ret_id?
// TODO: Support cases when stmt->ret_id other than 0
TI_ASSERT(stmt->ret_id == 0);
auto ret = rets[0];

Member:

It's better to let unsupported cases fail loudly.

@archibate (Collaborator Author):

Sorry but we don't have ret_id yet. Will add in another PR.

@yuanming-hu (Member)

I prefer 2, although LLVM is already on 1.
First, implementing 1 can be hard due to the OpenGL API limitations mentioned in the PR description, no matter our philosophy.
Second, having one buffer read-only and another write-only sounds like a waste. Also, they won't influence each other, as demonstrated. Why not just combine them? Having different indices/typings is already good enough. So I suggest LLVM & Metal use 2 too.

I think either is fine, and for each backend it's better to choose the most suitable design. I understand that on OpenGL the number of buffers is limited, so it makes sense to go with 2. For LLVM, going with 1 will make things easier.

@k-ye (Member) commented May 7, 2020

Yeah, I think this is a per-backend design choice. Metal is already mixing args and return values in one buffer, so I think it's easier to do 2. Thanks for the confirmation!

@archibate (Collaborator Author)

Thanks for the valuable discussion! Nice to know we can have flexibility for each backend.
Also, I'd add three further considerations for after this PR is merged: ti.ext_arr()? ti.template()? ti.Matrix?

@archibate archibate changed the title [Lang] [IR] Kernel return support: remove ArgStoreStmt, add KernelReturnStmt [Lang] [IR] Kernel scalar return support (ArgStoreStmt -> KernelReturnStmt) May 7, 2020
@yuanming-hu (Member)

Also, I'd add three further considerations for after this PR is merged: ti.ext_arr()? ti.template()? ti.Matrix?

  • ti.ext_arr() currently abuses the argument buffer to store the pointer and array sizes. This will need refactoring in the future for more clarity, though it's not urgent.
  • ti.template() is in charge of template instantiation and is handled entirely in the frontend. I wouldn't worry too much about this.
  • ti.Matrix clearly needs more consideration. I'm not sure how the return statement should be designed for this struct.

@archibate (Collaborator Author) commented May 7, 2020

Having a hard time writing tests:

import taichi as ti
from taichi import approx

def _test_binary_func_ret(dt1, dt2, dt3):
    ti.init(print_preprocessed=True)

    x = ti.var(ti.i32, ())

    @ti.kernel
    def func(a: dt1, b: dt2) -> dt3:
        # dummy = dt3 # uncomment to pass
        return a * b

    if ti.core.is_integral(dt1):
        xs = list(range(4))
    else:
        xs = [0.2, 0.4, 0.8, 1.0]

    if ti.core.is_integral(dt2):
        ys = list(range(4))
    else:
        ys = [0.2, 0.4, 0.8, 1.0]

    for x, y in zip(xs, ys):
        assert func(x, y) == approx(x * y)


_test_binary_func_ret(ti.i32, ti.f32, ti.f32)
[Taichi] mode=development
[Taichi] preparing sandbox at /tmp/taichi-qcgkx5tj
[Taichi] sandbox prepared
[Taichi] <dev mode>, supported archs: [cpu, cuda, opengl], commit 2762aeef, python 3.8.2
Before preprocessing:
@ti.kernel
def func(a: dt1, b: dt2) ->dt3:
    return a * b

After preprocessing:
def func():
  ti.decl_scalar_ret(dt3)
  a = ti.decl_scalar_arg(dt1)
  b = ti.decl_scalar_arg(dt2)
  ti.core.create_kernel_return(ti.cast(ti.Expr(a * b), dt3).ptr)

Traceback (most recent call last):
  File "tst.my.py", line 27, in <module>
    _test_binary_func_ret(ti.i32, ti.f32, ti.f32)
  File "tst.my.py", line 24, in _test_binary_func_ret
    assert func(x, y) == approx(x * y)
  File "/home/bate/Develop/taichi/python/taichi/lang/kernel.py", line 497, in wrapped
    return primal(*args, **kwargs)
  File "/home/bate/Develop/taichi/python/taichi/lang/kernel.py", line 427, in __call__
    self.materialize(key=key, args=args, arg_features=arg_features)
  File "/home/bate/Develop/taichi/python/taichi/lang/kernel.py", line 307, in materialize
    taichi_kernel = taichi_kernel.define(taichi_ast_generator)
  File "/home/bate/Develop/taichi/python/taichi/lang/kernel.py", line 304, in taichi_ast_generator
    compiled()
  File "tst.my.py", line 18, in func
    if ti.core.is_integral(dt2):
NameError: name 'dt3' is not defined

I checked func.__globals__; why do dt1 and dt2 work but dt3 doesn't?

@archibate (Collaborator Author)

LLVM is OK; OpenGL has many other failures. Will fix tomorrow.

@yuanming-hu (Member)

I checked func.__globals__; why do dt1 and dt2 work but dt3 doesn't?

Maybe try inserting the return type into globals? We did that for the argument types:

for i, arg in enumerate(func_body.args.args):
    anno = arg.annotation
    if isinstance(anno, ast.Name):
        global_vars[anno.id] = self.arguments[i]
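
Something analogous might work for the return annotation; this is just a sketch, and the name self.return_type is an assumption, not actual Taichi code:

# Hypothetical analogue for the return annotation:
anno = func_body.returns
if isinstance(anno, ast.Name):
    global_vars[anno.id] = self.return_type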

@archibate archibate requested a review from yuanming-hu May 8, 2020 03:52
@yuanming-hu (Member) left a comment

Looks great! Thank you! Feel free to merge after CI passes! (pls ignore my previous message)

@archibate (Collaborator Author)

@k-ye I'm merging this now, feel free to add another PR for metal!

@archibate archibate merged commit bbddaa1 into taichi-dev:master May 8, 2020
archibate added a commit to archibate/taichi that referenced this pull request May 8, 2020
archibate added a commit to archibate/taichi that referenced this pull request May 8, 2020
archibate added a commit that referenced this pull request May 11, 2020