
multi thread crash #68

Closed
songmaotian opened this issue Apr 22, 2016 · 21 comments

Comments

@songmaotian

songmaotian commented Apr 22, 2016

Hello,
It seems libblis is not thread safe, I have a gemm invocation

static inline void matmul(cv::Mat &c, cv::Mat &a, cv::Mat b)
{
        float   alphap = 1.0;
        //since beta is zero, we don't need to init c to zero
        float   betap = 0.0;

        cntx_t cntx;
        bli_gemm_cntx_init(&cntx);

        bli_sgemm(BLIS_NO_TRANSPOSE, BLIS_NO_TRANSPOSE, a.rows, b.cols, a.cols,
                &alphap,
                (float *)a.data, a.cols, 1,
                (float *)b.data, b.cols, 1,
                &betap,
                (float *)c.data, b.cols, 1, &cntx);
        bli_gemm_cntx_finalize(&cntx);
}

and several threads invoke matmul concurrently; it then crashes as follows, with the parameter p changed to 0. If I change the program so that only one thread invokes matmul, everything works fine.
#0  0x0000000000545abc in bli_spackm_6xk_ref (conja=BLIS_NO_CONJUGATE, n=25, kappa=0x7fffec000dc0, a=0x7fffb8091540, inca=25, lda=1, p=0x0, ldp=6) at frame/1m/packm/ukernels/bli_packm_cxk_ref.c:414
#1  0x00000000004ee461 in bli_spackm_cxk (conja=BLIS_NO_CONJUGATE, panel_dim=6, panel_len=25, kappa=0x7fffec000dc0, a=0x7fffb8091540, inca=25, lda=1, p=0x0, ldp=6, cntx=0x7fffd247c190) at frame/1m/packm/bli_packm_cxk.c:216
#2  0x00000000004c4806 in bli_spackm_struc_cxk (strucc=BLIS_GENERAL, diagoffc=0, diagc=BLIS_NONUNIT_DIAG, uploc=BLIS_DENSE, conjc=BLIS_NO_CONJUGATE, schema=BLIS_PACKED_COL_PANELS, invdiag=0, m_panel=25, n_panel=6, m_panel_max=25, n_panel_max=6, kappa=0x7fffec000dc0, c=0x7fffb8091540, rs_c=1, cs_c=25, p=0x0, rs_p=6, cs_p=1, is_p=1, cntx=0x7fffd247c190) at frame/1m/packm/bli_packm_struc_cxk.c:255
#3  0x00000000004bbbe6 in bli_spackm_blk_var1 (strucc=BLIS_GENERAL, diagoffc=0, diagc=BLIS_NONUNIT_DIAG, uploc=BLIS_DENSE, transc=BLIS_NO_TRANSPOSE, schema=BLIS_PACKED_COL_PANELS, invdiag=0, revifup=0, reviflo=0, m=25, n=1936, m_max=25, n_max=1938, kappa=0x7fffec000dc0, c=0x7fffb8091540, rs_c=1, cs_c=25, p=0x0, rs_p=6, cs_p=1, is_p=1, pd_p=6, ps_p=150, packm_ker=0x4c46f9 <bli_spackm_struc_cxk>, cntx=0x7fffd247c190, thread=0x7fffb8007600) at frame/1m/packm/bli_packm_blk_var1.c:668
#4  0x00000000004bb133 in bli_packm_blk_var1 (c=0x7fffd2479e50, p=0x7fffd24798e0, cntx=0x7fffd247c190, t=0x7fffb8007600) at frame/1m/packm/bli_packm_blk_var1.c:234
#5  0x00000000004aed11 in bli_packm_int (a=0x7fffd2479e50, p=0x7fffd24798e0, cntx=0x7fffd247c190, cntl=0x7fffec002c00, thread=0x7fffb8007600) at frame/1m/packm/bli_packm_int.c:125
#6  0x00000000004b23ca in bli_gemm_blk_var1f (a=0x7fffd2479d80, b=0x7fffd2479e50, c=0x7fffd2479f20, cntx=0x7fffd247c190, cntl=0x7fffec002d00, thread=0x7fffb8007660) at frame/3/gemm/bli_gemm_blk_var1f.c:79
#7  0x00000000004488b2 in bli_gemm_int (alpha=0x7c66a0 <BLIS_ONE>, a=0x7fffd247a160, b=0x7fffd247a230, beta=0x7c66a0 <BLIS_ONE>, c=0x7fffd247a090, cntx=0x7fffd247c190, cntl=0x7fffec002d00, thread=0x7fffb8007660) at frame/3/gemm/bli_gemm_int.c:154
#8  0x00000000004b304b in bli_gemm_blk_var3f (a=0x7fffd247a530, b=0x7fffd247a600, c=0x7fffd247a6d0, cntx=0x7fffd247c190, cntl=0x7fffec002da0, thread=0x7fffb80c0ae0) at frame/3/gemm/bli_gemm_blk_var3f.c:121
#9  0x00000000004488b2 in bli_gemm_int (alpha=0x7c66a0 <BLIS_ONE>, a=0x7fffd247a840, b=0x7fffd247a910, beta=0x7c66a0 <BLIS_ONE>, c=0x7fffd247a9e0, cntx=0x7fffd247c190, cntl=0x7fffec002da0, thread=0x7fffb80c0ae0) at frame/3/gemm/bli_gemm_int.c:154
#10 0x00000000004b2b28 in bli_gemm_blk_var2f (a=0x7fffd247ace0, b=0x7fffd247adb0, c=0x7fffd247ae80, cntx=0x7fffd247c190, cntl=0x7fffec002e40, thread=0x7fffb80c0ca0) at frame/3/gemm/bli_gemm_blk_var2f.c:123
#11 0x00000000004488b2 in bli_gemm_int (alpha=0x7fffd247bcf0, a=0x7fffd247b0c0, b=0x7fffd247b190, beta=0x7fffd247bf60, c=0x7fffd247b260, cntx=0x7fffd247c190, cntl=0x7fffec002e40, thread=0x7fffb80c0ca0) at frame/3/gemm/bli_gemm_int.c:154
#12 0x0000000000423d60 in bli_level3_thread_decorator (n_threads=1, func=0x447f75 <bli_gemm_int>, alpha=0x7fffd247bcf0, a=0x7fffd247b0c0, b=0x7fffd247b190, beta=0x7fffd247bf60, c=0x7fffd247b260, cntx=0x7fffd247c190, cntl=0x7fffec002e40, thread=0x7fffb80070c0) at frame/base/bli_threading.c:92
#13 0x0000000000447f5a in bli_gemm_front (alpha=0x7fffd247bcf0, a=0x7fffd247bdc0, b=0x7fffd247be90, beta=0x7fffd247bf60, c=0x7fffd247c030, cntx=0x7fffd247c190, cntl=0x7fffec002e40) at frame/3/gemm/bli_gemm_front.c:86
#14 0x0000000000429be5 in bli_gemmnat (alpha=0x7fffd247bcf0, a=0x7fffd247bdc0, b=0x7fffd247be90, beta=0x7fffd247bf60, c=0x7fffd247c030, cntx=0x7fffd247c190) at frame/ind/oapi/bli_l3_nat_oapi.c:80
#15 0x000000000049242b in bli_gemmind (alpha=0x7fffd247bcf0, a=0x7fffd247bdc0, b=0x7fffd247be90, beta=0x7fffd247bf60, c=0x7fffd247c030, cntx=0x7fffd247c190) at frame/ind/oapi/bli_l3_ind_oapi.c:59
#16 0x000000000044701e in bli_gemm_ex (alpha=0x7fffd247bcf0, a=0x7fffd247bdc0, b=0x7fffd247be90, beta=0x7fffd247bf60, c=0x7fffd247c030, cntx=0x7fffd247c190) at frame/3/bli_l3_oapi.c:74
#17 0x0000000000419e3c in bli_sgemm (transa=BLIS_NO_TRANSPOSE, transb=BLIS_NO_TRANSPOSE, m=1936, n=16, k=25, alpha=0x7fffd247c188, a=0x7fffb8091540, rs_a=25, cs_a=1, b=0x7fffb8003790, rs_b=16, cs_b=1, beta=0x7fffd247c18c, c=0x7fffb8040e30, rs_c=16, cs_c=1, cntx=0x7fffd247c190) at frame/3/bli_l3_tapi.c:93
@songmaotian
Author

songmaotian commented Apr 22, 2016

The problem is in bli_gemm_blk_var1f, when initializing b_pack_s:
bli_packm_init( b, &b_pack_s,
cntx, cntl_sub_packm_b( cntl ) );

After init, the buffer is 0. The following patch seems to work for me, though sorry, I haven't figured out the reason.

diff --git a/frame/base/bli_mem.c b/frame/base/bli_mem.c
index a199130..ead98dc 100644
--- a/frame/base/bli_mem.c
+++ b/frame/base/bli_mem.c
@@ -58,7 +58,8 @@ void bli_mem_acquire_m( siz_t req_size,
 	// Make sure the API is initialized.
 	bli_mem_init();
 
-	if ( buf_type == BLIS_BUFFER_FOR_GEN_USE )
+	//if ( buf_type == BLIS_BUFFER_FOR_GEN_USE )
+	if (1)
 	{
 		// For general-use buffer requests, such as those used by level-2
 		// operations, using bli_malloc() is sufficient.
@@ -159,7 +160,8 @@ void bli_mem_release( mem_t* mem )
 	// Extract the buffer type so we know what kind of memory was allocated.
 	buf_type = bli_mem_buf_type( mem );
 
-	if ( buf_type == BLIS_BUFFER_FOR_GEN_USE )
+	//if ( buf_type == BLIS_BUFFER_FOR_GEN_USE )
+	if (1)
 	{
 		void* buf_sys = bli_mem_buf_sys( mem );
 
@@ -243,7 +245,8 @@ siz_t bli_mem_pool_size( packbuf_t buf_type )
 {
 	siz_t r_val;
 
-	if ( buf_type == BLIS_BUFFER_FOR_GEN_USE )
+	//if ( buf_type == BLIS_BUFFER_FOR_GEN_USE )
+	if (1)
 	{
 		// We don't (yet) track the amount of general-purpose
 		// memory that is currently allocated.
@fgvanzee
Member

@songmaotian Thanks for this bug report. It appears that your patch disables use of the internal memory allocator in favor of a regular malloc()-style allocation.

I noticed that in the code snippet you gave in your first post, you did not show any call to bli_init(). Are you calling this function prior to using any other BLIS functions?

@songmaotian
Author

That's true, I disabled the mempool case. I compared the results and they are correct, so I didn't dig into the problem any deeper; sorry for that.
Yes, I have called bli_init.

@fgvanzee
Member

@songmaotian I'll have to look at this more closely to figure out why the memory allocator is not thread-safe. (It is supposed to be thread-safe.)

@tlrmchlsmth
Member

BLIS is only thread-safe right now if you compile BLIS with the same threading model you use to multithread.
So if I compile BLIS with OpenMP, and then call multiple BLIS dgemms from multiple OpenMP threads, everything will be fine.
But if I compile BLIS with OpenMP, and then call multiple BLIS dgemms from multiple pthreads, things will break.

@songmaotian What threading model are you using, and how are you compiling BLIS?


@songmaotian
Author

Right now I am using version a30ccbc.
In the top dir there's a bli_config.h; it says:

38 #if 0
39 #define BLIS_ENABLE_PTHREADS
40 #endif
41
42 #if 0
43 #define BLIS_ENABLE_OPENMP
44 #endif

It seems I haven't enabled multithreading in the lib (it's the same for x86 (auto) and armv8).

@devinamatthews
Member

Use something like './configure -t pthreads'.


@songmaotian
Author

@fgvanzee Thanks for that, and thanks for all your great work. I have been so busy that I had no time to dig deeper; sorry for that.

@fgvanzee
Member

fgvanzee commented Apr 27, 2016

@tlrmchlsmth @devinamatthews Thanks for chiming in, guys.
@songmaotian Glad we could help.

@fgvanzee
Member

@songmaotian I would like to close this issue. Could you confirm that enabling multithreading in BLIS solved your problem?

@songmaotian
Author

I am using it on Android; "./configure -t pthreads" can't be built, since pthread barriers aren't implemented on Android.

@jeffhammond
Member

@songmaotian Pthread barrier is an extension that isn't supported on Mac either.

This is one workaround. You might try it. Since copying random code from the Internet is a licensing risk, someone should write a clean-sheet implementation for BLIS.

The other option is to write a barrier using GCC or C11 atomics. GCC atomics means the intrinsics that are supported by many compilers, including Clang, Intel, IBM, and Cray (partially - see my OpenPA fork for exceptions). However, such a barrier is likely to spin-wait, which will perform terribly in the presence of oversubscription, whereas the Pthread-based approaches are likely to not do this.

@devinamatthews
Member

devinamatthews commented May 26, 2016 via email

@jeffhammond
Member

@devinamatthews

@jedbrown has asserted many times that the Pthread barrier sucks, and I believe him, at least in the presence of general-purpose operating systems.

One part of the definition that may interfere with performance is the following:

A thread that has blocked on a barrier shall not prevent any unblocked thread that is eligible to use the same processing resources from eventually making forward progress in its execution. Eligibility for processing resources shall be determined by the scheduling policy.

This forward progress requirement means that pthread_barrier_wait likely requires a context switch.

In theory, if one has an operating system that does not oversubscribe the hardware threads, forward progress may be satisfied without a context switch, but I'm not aware of e.g. Blue Gene optimizing for this.

I assume that BLIS does not use an OpenMP barrier because it is both a memory and execution barrier, and you only need the latter. Is that correct?

@devinamatthews
Member

We use a spin-barrier. We can't use #omp barrier because we need to block on sub-communicators.


@jedbrown
Contributor

@jeffhammond Beyond oversubscription, pthread barrier's problem is that threads don't have identity; the barrier completes after any count threads call pthread_barrier_wait. In practice, this results in an O(p) algorithm instead of O(log p). With a thread communicator, you can use a scalable barrier algorithm because you have identity. ConcurrencyKit offers scalable barriers by having a "subscribe" operation that gives participants their identity when they later participate in the barrier. Subscription is O(p), but the barrier itself is O(log p).

@tlrmchlsmth
Member

OK it sounds like we should just change it so that BLIS never uses pthread barriers unless (maybe) someone requests them specifically at configure time. Any objections? @fgvanzee?

@fgvanzee
Member

@tlrmchlsmth Without objection. On matters such as these, I defer to those with more expertise.

@jeffhammond
Member

+1 to no pthread barriers by default. They might be useful for functional testing in the oversubscribed case, although other pthread constructs may also be able to solve that case.


@tlrmchlsmth
Member

OK there are no pthread barriers by default anymore.

I merged in #81 while I was at it.

@devinamatthews
Member

AFAICT this is "fixed" now.
