Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenGL backend (stage 2.2.2) #545

Merged
merged 48 commits into from
Mar 15, 2020
Merged

OpenGL backend (stage 2.2.2) #545

merged 48 commits into from
Mar 15, 2020

Conversation

archibate
Copy link
Collaborator

Related issue id = #492

@archibate archibate force-pushed the g2 branch 5 times, most recently from d51ba05 to 2f9cd96 Compare February 28, 2020 13:49
@archibate
Copy link
Collaborator Author

@archibate requested a review from @yuanming-hu

@archibate
Copy link
Collaborator Author

Any idea?

taichi/codegen/codegen_opengl.cpp Outdated Show resolved Hide resolved
taichi/codegen/codegen_opengl.cpp Outdated Show resolved Hide resolved
@yuanming-hu
Copy link
Member

Sorry about the delay. I was busy adding CUDA support for Windows.

@archibate
Copy link
Collaborator Author

Did you see any potential bug in this stage? I always get a material[i] = 2438xxxx7 (sounds INT_MAX) in mpm128.py after all done...

@archibate archibate requested a review from yuanming-hu March 1, 2020 11:27
@archibate
Copy link
Collaborator Author

Go for ast now..

Copy link
Member

@yuanming-hu yuanming-hu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thanks.

@yuanming-hu
Copy link
Member

Did you see any potential bug in this stage? I always get a material[i] = 2438xxxx7 (sounds INT_MAX) in mpm128.py after all done...

Would it be possible to pinpoint which stage (SNode writer, kernel print, SNodeReader etc) caused the INT_MAX result?

@archibate
Copy link
Collaborator Author

archibate commented Mar 2, 2020

Did you see any potential bug in this stage? I always get a material[i] = 2438xxxx7 (sounds INT_MAX) in mpm128.py after all done...

Would it be possible to pinpoint which stage (SNode writer, kernel print, SNodeReader etc) caused the INT_MAX result?

Can't, since mpm128 can't be run until the last stage. And this error isn't a SIGABRT or python-scope Exception, I can't trace its origin very well...
But I guess it might be related to range for or floordiv:

@ti.kernel
def reset():
  group_size = n_particles // 3
  for i in range(n_particles):
    x[i] = [ti.random() * 0.2 + 0.3 + 0.10 * (i // group_size), ti.random() * 0.2 + 0.05 + 0.32 * (i // group_size)]
    material[i] = i // group_size # problem is very likely to be raised from here
    # ...

I also found that sometimes it crashes, sometimes it don't. And randomly but not so random that the fault are usually connected on time... (sorry for my bad English :) just like: FFFFFF...FFF....F..FFFFF.F (layout be time order) Also seems related to shell restart...(?)

This may due to we didn't zero-initialize a variable somewhere that it just depend on initial stack value. Could you help me test this on MSVC? Windows like to initialize stack as 0xcc on startup, so if you get constant error FFFFFFF there, then the explaination may be true...

@archibate
Copy link
Collaborator Author

archibate commented Mar 2, 2020

(Remember to test non-GL build and think twice before merge)

@archibate archibate requested a review from yuanming-hu March 2, 2020 16:24
@yuanming-hu
Copy link
Member

yuanming-hu commented Mar 2, 2020

A seemingly stupid yet extremely effective way to debug issues like these: simplify the program as much as you can while keeping the bug reproducible. Ultimately you will likely get a Python file within ~10 lines of code to trigger the bug. This will help you narrow down the range of possible sources of bugs. I've been using this technique very frequently to debug mysterious compiler bugs...

@archibate
Copy link
Collaborator Author

archibate commented Mar 4, 2020

Go bed now, do we need a stage 3 or just done everything here (like minor bug fix / feature add)?

@yuanming-hu
Copy link
Member

Thansk for your hard work!

As long as we can pass the corresponding tests, then we are good :-) The initial release should also show some performance boost compared with the CPU version. We can leave further performance optimizations to the future versions.

@archibate
Copy link
Collaborator Author

Stage 3 merged! Check if test pass...

@archibate
Copy link
Collaborator Author

0:44(6): warning: identifier `func_c19_0__grad0' uses reserved `__' string

Have no idea why __ is reserved...

@yuanming-hu
Copy link
Member

0:44(6): warning: identifier `func_c19_0__grad0' uses reserved `__' string

Have no idea why __ is reserved...

The double underscore is weird anyway - let's remove the first/second underscore then.

@yuanming-hu
Copy link
Member

yuanming-hu commented Mar 5, 2020

kernel_name = "{}_c{}_{}_{}".format(self.func.__name__, self.kernel_counter, key[1], grad_suffix)

@yuanming-hu
Copy link
Member

The format template should be {}_c{}_{}{} instead of {}_c{}_{}_{} :-)

@archibate
Copy link
Collaborator Author

I think here is done if all tests passed.
Also want to compare archibate:opengl with g2 to make sure that no commits are lost.

@archibate
Copy link
Collaborator Author

archibate commented Mar 6, 2020

How to make spdlog submodule happy?
'spdlog/fmt/bundled/color.h' file not found after cloned spdlog to there.

@archibate
Copy link
Collaborator Author

archibate commented Mar 7, 2020

A seemingly stupid yet extremely effective way to debug issues like these: simplify the program as much as you can while keeping the bug reproducible. Ultimately you will likely get a Python file within ~10 lines of code to trigger the bug. This will help you narrow down the range of possible sources of bugs. I've been using this technique very frequently to debug mysterious compiler bugs...

When I move group_size outside, the fault no longer (never) occur. So it may be related to multi-offload support or GlobalTmpStmt.

Also found if I remove any of x[i], v[i]... initialization, this fault no longer (never) occur. Super puzzled.
Fault still occur if x[i] = [2, 3].

@archibate
Copy link
Collaborator Author

archibate commented Mar 14, 2020

TODO: do good test and break-downs on this branch with master. Don't merge me yet. See you trr.

@archibate
Copy link
Collaborator Author

can't merge anymore. almost forget everything did in gl.

@archibate
Copy link
Collaborator Author

all test passing..

@yuanming-hu
Copy link
Member

all test passing..

Great!! Could you merge your latest commit in this PR with the current master branch?

@archibate
Copy link
Collaborator Author

Just take all from g2 and drop all from master, hope this works..

@archibate
Copy link
Collaborator Author

archibate commented Mar 15, 2020

Didn't work. Lot of errors generated.

@archibate
Copy link
Collaborator Author

Giving up now.

@archibate
Copy link
Collaborator Author

/home/bate/Develop/taichi/taichi/codegen/codegen_opengl.cpp:933:21: error: C++ requires a type specifier for all declarations
  KernelGen codegen(kernel_, kernel_name_, struct_compiled_, global_tmps_buffer_size_);
                    ^
/home/bate/Develop/taichi/taichi/codegen/codegen_opengl.cpp:933:30: error: C++ requires a type specifier for all declarations
  KernelGen codegen(kernel_, kernel_name_, struct_compiled_, global_tmps_buffer_size_);
                             ^
/home/bate/Develop/taichi/taichi/codegen/codegen_opengl.cpp:933:44: error: C++ requires a type specifier for all declarations
  KernelGen codegen(kernel_, kernel_name_, struct_compiled_, global_tmps_buffer_size_);
                                           ^
/home/bate/Develop/taichi/taichi/codegen/codegen_opengl.cpp:933:62: error: unknown type name 'global_tmps_buffer_size_'
  KernelGen codegen(kernel_, kernel_name_, struct_compiled_, global_tmps_buffer_size_);
                                                             ^
/home/bate/Develop/taichi/taichi/codegen/codegen_opengl.cpp:934:3: error: unknown type name 'codegen'
  codegen.run(*prog_->snode_root);
  ^
/home/bate/Develop/taichi/taichi/codegen/codegen_opengl.cpp:934:10: error: expected member name or ';' after declaration specifiers
  codegen.run(*prog_->snode_root);
  ~~~~~~~^
/home/bate/Develop/taichi/taichi/codegen/codegen_opengl.cpp:935:3: error: 'auto' not allowed in non-static struct member
  auto compiled = codegen.get_compiled_program();
  ^~~~
/home/bate/Develop/taichi/taichi/codegen/codegen_opengl.cpp:936:3: error: expected member name or ';' after declaration specifiers
  return [compiled](Context &ctx) {
  ^
/home/bate/Develop/taichi/taichi/codegen/codegen_opengl.cpp:942:2: error: expected ';' after struct
}
 ^
/home/bate/Develop/taichi/taichi/codegen/codegen_opengl.cpp:149:8: warning: private field 'has_rand_' is not used [-Wunused-private-field]
  bool has_rand_{false};

Now I'm completely lose control with these codes... Have no idea which stage is this on... The order is... all broken... what's more the good news is there still some unmerged commits in the old unstaged PR.. from the lesson learnt from GL: DON'T BREAK DOWN PRS ANYMORE! Only open-review-and-merge-in-time this method start from beginning can have benefits instead of wasting time :(

@yuanming-hu
Copy link
Member

How about merging 664e1bd with master? It seems to me that if you do that then the only file with conflict is codegen_opengl.cpp. Does fixing the merge conflicts, in that case, solve your compilation issues?

@archibate
Copy link
Collaborator Author

archibate commented Mar 15, 2020

Does fixing the merge conflicts, in that case, solve your compilation issues?

Didn't. The problem is, I cherry-picked commits in different timeline. And when merge, the timeline is broken, can't find a good order to make them all happy.

@archibate
Copy link
Collaborator Author

How about merging 664e1bd with master? It seems to me that if you do that then the only file with conflict is codegen_opengl.cpp.

True, only 1 file conflict, but 18 conflicts with 100+ lines.

@yuanming-hu
Copy link
Member

How about merging 664e1bd with master? It seems to me that if you do that then the only file with conflict is codegen_opengl.cpp.

True, only 1 file conflict, but 18 conflicts with 100+ lines.

I see. Fixing that file seems the only way we can go now, unfortunately. Sorry about the trouble. Let's keep future PRs smaller to avoid related issues.

@archibate
Copy link
Collaborator Author

archibate commented Mar 15, 2020

How about merging 664e1bd with master? It seems to me that if you do that then the only file with conflict is codegen_opengl.cpp.

True, only 1 file conflict, but 18 conflicts with 100+ lines.

I see. Fixing that file seems the only way we can go now, unfortunately. Sorry about the trouble. Let's keep future PRs smaller to avoid related issues.

Yeah you don't want to write any code about GL until all stages are merged thanks to the merge-into-master method.
No problem if we merge archibate:g1, archibate:g2 into archibate:opengl, I have write access to that branch, staging safely without requesting yuanming-hu, after all done do some final check and merge opengl into master, while small PRs are still trackable in/from archibate:opengl. I heard lots of great github repos are using this way to modularize development.

@archibate
Copy link
Collaborator Author

archibate commented Mar 15, 2020

Got many F in test now.. I need to re-make these wandering commits that fix tests again.. why not go back and use plan B.. Happy coding is becoming happy merging now..

@archibate
Copy link
Collaborator Author

archibate commented Mar 15, 2020

Fixed, seems due to approx typo fix, need 1e-3.
Re-running ti test again, hopefully could done this.

@yuanming-hu
Copy link
Member

yuanming-hu commented Mar 15, 2020

I understand things are getting slightly more troublesome than expected.

No problem if we merge archibate:g1, archibate:g2 into archibate:opengl, I have write access to that branch, staging safely without requesting yuanming-hu, after all done do some final check and merge opengl into master, while small PRs are still trackable in/from archibate:opengl. I heard lots of great github repos are using this way to modularize development.

I feel like merging a huge archibate:opengl branch as a PR, in the end, will still lead to some merge conflicts.

Having smaller PRs into master gradually will

  • Allow reviewers to provide early feedback
  • Since PR is small,
    • Review can be done quickly
    • Reviewers' job is easier and bugs can be more easily spotted
    • Potential merge conflicts are minimized (since small PR merge is quick, master has not changed too much by the time the PR is approaved)
    • master history is at a good granularity (not too big or too small)

What do you think?

@archibate
Copy link
Collaborator Author

I see, fault was I didn't even wait for your review/merge and keep pushing into opengl, causing the PR huge and breakless, i.e. it should be small PRs at the first place, rather than breaking down an existing huge PR.

@yuanming-hu
Copy link
Member

Yeah, I should have let you know the PR is getting large, at an earlier time...
I think we both learned something from this :-) Thanks for your hard work and patience.

I'll wait for CI to pass - meanwhile, do we still need taichi/perf.h? We already have the TI_PROFILER macro.

@archibate
Copy link
Collaborator Author

I T P A S S E D !
Merge when you think it's time for it.
BTW, can you make ti test only test for ti.opengl not for ti.x64? Can save lots of time for alter backend developers.

@yuanming-hu
Copy link
Member

yuanming-hu commented Mar 15, 2020

I T P A S S E D !
Merge when you think it's time for it.

Awesome!!

BTW, can you make ti test only test for ti.opengl not for ti.x64? Can save lots of time for alter backend developers.

Great idea!! What do you think should be a good CLI for it? Free free to open up a stand-alone issue for this. (This PR now has 50+ conversations, LOL)

Btw, currently python/taichi/main.py for the ti command is slightly messy. We should consider using argparse instead of series of if-elif, if that makes sense.

@yuanming-hu
Copy link
Member

Finally, we can merge this in! Thank you so much for your contribution. It's absolutely a giant step :-)

Going to sleep now. It's almost 3 am here. See you tmr.

@yuanming-hu yuanming-hu merged commit f5bdf09 into taichi-dev:master Mar 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants