gpu: generic: sycl: lnorm Intel GPU precision issues #2071

kala855 · 2024-09-02T13:46:25Z

Description

When used in Intel PVC, the layer normalization SYCL kernel implementation faces some precision issues in variance computation.

create: --lnorm --engine=gpu --inplace=true 30x300
run: --lnorm --engine=gpu --inplace=true 30x300
[   0][VAR][0] exp_f32:   0.0202823 exp:   0.0202823 got:   0.0202823 diff:1.86265e-09 rdiff:9.1836e-08
[   1][VAR][1] exp_f32:   0.0804944 exp:   0.0804944 got:   0.0804944 diff:7.45058e-09 rdiff:9.25603e-08
[   2][VAR][2] exp_f32:    0.327062 exp:    0.327062 got:    0.327062 diff:2.98023e-08 rdiff:9.11213e-08
[   6][VAR][6] exp_f32:     82.4779 exp:     82.4779 got:     82.4779 diff:7.62939e-06 rdiff:9.25023e-08
[  17][VAR][17] exp_f32:      1.3052 exp:      1.3052 got:      1.3052 diff:1.19209e-07 rdiff:9.1334e-08
[  22][VAR][22] exp_f32:   0.0829537 exp:   0.0829537 got:   0.0829537 diff:7.45058e-09 rdiff:8.98161e-08
[  23][VAR][23] exp_f32:    0.324289 exp:    0.324289 got:    0.324289 diff:2.98023e-08 rdiff:9.19006e-08
[  29][VAR][29] exp_f32:   0.0822034 exp:   0.0822034 got:   0.0822034 diff:7.45058e-09 rdiff:9.0636e-08
[COMPARE_STATS][VAR]: trh=0 err_max_diff:7.62939e-06 err_max_rdiff:9.25603e-08 all_max_diff:7.62939e-06 all_max_rdiff:9.25603e-08
0:FAILED (errors:8 total:9060) __REPRO: --lnorm --engine=gpu --inplace=true 30x300

[  14][VAR][14] exp_f32: 2.68618e-05 exp: 2.68618e-05 got: 2.68618e-05 diff:1.81899e-12 rdiff:6.77165e-08
[COMPARE_STATS][VAR]: trh=0 err_max_diff:1.81899e-12 err_max_rdiff:6.77165e-08 all_max_diff:1.81899e-12 all_max_rdiff:6.77165e-08
8471:FAILED (errors:1 total:75) __REPRO: --lnorm --engine=gpu --dt=f32:bf16 --tag=axb --stat_tag=abx --flags=CH 15x3_n"lnorm_ci_0d:0"

The previous are just a couple of failing examples.

As a proposal, the variance threshold is modified to pass the failing tests.

kala855 · 2024-09-02T13:55:53Z

To give a little bit of additional context. I found that in this line. The division is making v_variance slightly different from the reference that is computed in benchdnn (there is a thr=0 for these cases). Maybe that is happening because of some compiler optimization.

t4c1

Looks like an improvement. However I would just set the threshold to 5e-7 for all the cases, as 0 does not make sense as soon as floating point calculations are involved.

mgouicem · 2024-09-03T10:16:42Z

Do you know what is the default setting for fp_config::correctly_rounded_divide_sqrt on this device?
We typically disable fast-reciprocal, I wonder if we should do something similar for sycl kernels

vpirogov · 2024-09-03T16:37:54Z

@kala855, benchdnn is very sensitive to numerical issues by design. So in cases like this it's important to understand where the difference is coming from before considering the threshold change.

kala855 · 2024-09-13T08:10:26Z

Do you know what is the default setting for fp_config::correctly_rounded_divide_sqrt on this device? We typically disable fast-reciprocal, I wonder if we should do something similar for sycl kernels

We have been checking carefully these days to see what is happening here. Doing a comparison between the assembly generated by icpx on the OCL and SYCL versions of the implementations we found that:

The OCL version generates an instruction math.invm which is IEEE standard compliant and is used to compute the division.
The SYCL version generates an instruction math.inv which is not IEEE standard compliant and is the instruction that generates the precision issues.
We were looking at different compiler flags to try to generate the same behavior on the SYCL side but was not possible.
We are thinking that the problem is due to some compiler bug that does not take into account the flags to generate the expected code. (-fno-fast-math, -ffp-model=precise, etc.)
Info about math.inv and math.invm instructions could be found here pages 24 and 26 respectively.

Any suggestions or feedback will be more than welcome. Thanks.

@mgouicem @t4c1 @sgeor255

mgouicem · 2024-09-16T08:41:07Z

Thanks for checking. I would suggest:

to open an issue against the sycl compiler with your findings,
add a condition on that threshold increase on DNNL_WITH_SYCL
add a comment to your patch so that we can revert the threshold as the issue is resolved on sycl compiler side.

kala855 · 2024-09-30T11:20:37Z

Thanks for checking. I would suggest:

to open an issue against the sycl compiler with your findings,

add a condition on that threshold increase on DNNL_WITH_SYCL

add a comment to your patch so that we can revert the threshold as the issue is resolved on sycl compiler side.

Hi @mgouicem I did what you mentioned to try to get the SYCL lnorm benchdnn tests working. The issue was opened and a workaround was pushed in this PR. Thanks for your feedback.

t4c1

LGTM, but I still believe it would be better to have nonzero threshold for all cases.

kala855 · 2024-10-09T13:53:12Z

We found that SYCL_PROGRAM_COMPILE_OPTIONS="-cl-fp32-correctly-rounded-divide-sqrt" environment variable strengthens the requirements for floating-point division, making it correctly-rounded. I made some changes to set the variable and unset it just in the operators that need it. Batch normalization was failing because of the same issue. The last commit also fixes the problem in bnorm.

t4c1

LGTM. Maybe we could mention in the comment before the option that this is specifically required for Intel devices.

mgouicem · 2024-10-10T07:56:30Z

src/gpu/generic/sycl/ref_batch_normalization.cpp

@@ -84,9 +84,13 @@ status_t ref_batch_normalization_fwd_t::init(impl::engine_t *engine) {
                = ::sycl::get_kernel_id<batch_normalization_fwd_kernel_vec_t>();
        CHECK(create_kernel(engine, kid, &kernel_));
    } else {
+        // Enabling the IEEE div compliant implementation
+        setenv("SYCL_PROGRAM_COMPILE_OPTIONS",


I believe this is not thread-safe to use setenv (e.g. another concurrent call to the same primitive might unset it after this one sets it).

mgouicem · 2024-10-10T07:56:57Z

src/gpu/generic/sycl/ref_batch_normalization.cpp

        const auto kid = ::sycl::get_kernel_id<
                batch_normalization_fwd_kernel_vec_t1>();
        CHECK(create_kernel(engine, kid, &kernel_));
+        unsetenv("SYCL_PROGRAM_COMPILE_OPTIONS");


Also, here it does not restore the env var to its original value.

mgouicem · 2024-10-10T07:58:24Z

We found that SYCL_PROGRAM_COMPILE_OPTIONS="-cl-fp32-correctly-rounded-divide-sqrt" environment variable strengthens the requirements for floating-point division, making it correctly-rounded. I made some changes to set the variable and unset it just in the operators that need it. Batch normalization was failing because of the same issue. The last commit also fixes the problem in bnorm.

Is there a programmatic way to control the same knob (that does not involve envvar)? It is typically not thread-safe to modify environment variables and this should be avoided as much as possible IMHO.

dzarukin · 2024-10-10T15:28:41Z

tests/benchdnn/lnorm/lnorm.cpp

+                       || (kind == VAR && prb->dir & FLAG_FWD))
+                            ? 5e-7
+                            : 0);
+#else


Is this still needed with the new option given to the kernel?

mgouicem

Current PR relies on non thread-safe environment variable setting.

kala855 requested a review from a team as a code owner September 2, 2024 13:46

kala855 requested a review from a team September 2, 2024 13:49

kala855 changed the title ~~gpu: generic: sycl: lnorm: Intel GPU precision issues~~ gpu: generic: sycl: lnorm Intel GPU precision issues Sep 2, 2024

t4c1 approved these changes Sep 3, 2024

View reviewed changes

sgeor255 approved these changes Sep 3, 2024

View reviewed changes

kala855 requested review from t4c1 and sgeor255 September 13, 2024 08:10

kala855 force-pushed the kala855/lnorm-var branch 2 times, most recently from afef1ae to 030a62b Compare September 30, 2024 11:15

t4c1 approved these changes Oct 4, 2024

View reviewed changes

mgouicem approved these changes Oct 4, 2024

View reviewed changes

mgouicem added platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel platform:gpu-generic Codeowner: @oneapi-src/onednn-gpu-generic labels Oct 9, 2024

gpu: generic: sycl: lnorm and bnorm Intel GPU precision issues

db382e0

kala855 force-pushed the kala855/lnorm-var branch from 030a62b to db382e0 Compare October 9, 2024 13:48

kala855 requested review from t4c1 and mgouicem October 9, 2024 13:49

github-actions bot removed the platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel label Oct 9, 2024

t4c1 reviewed Oct 10, 2024

View reviewed changes

mgouicem reviewed Oct 10, 2024

View reviewed changes

gpu: generic: sycl: lnorm: Intel GPU precision issues

b6b1355

dzarukin reviewed Oct 10, 2024

View reviewed changes

mgouicem requested changes Oct 23, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gpu: generic: sycl: lnorm Intel GPU precision issues #2071

gpu: generic: sycl: lnorm Intel GPU precision issues #2071

kala855 commented Sep 2, 2024

kala855 commented Sep 2, 2024

t4c1 left a comment

mgouicem commented Sep 3, 2024

vpirogov commented Sep 3, 2024

kala855 commented Sep 13, 2024

mgouicem commented Sep 16, 2024

kala855 commented Sep 30, 2024

t4c1 left a comment

kala855 commented Oct 9, 2024

t4c1 left a comment

mgouicem Oct 10, 2024

mgouicem Oct 10, 2024

mgouicem commented Oct 10, 2024

dzarukin Oct 10, 2024

mgouicem left a comment

gpu: generic: sycl: lnorm Intel GPU precision issues #2071

Are you sure you want to change the base?

gpu: generic: sycl: lnorm Intel GPU precision issues #2071

Conversation

kala855 commented Sep 2, 2024

Description

kala855 commented Sep 2, 2024

t4c1 left a comment

Choose a reason for hiding this comment

mgouicem commented Sep 3, 2024

vpirogov commented Sep 3, 2024

kala855 commented Sep 13, 2024

mgouicem commented Sep 16, 2024

kala855 commented Sep 30, 2024

t4c1 left a comment

Choose a reason for hiding this comment

kala855 commented Oct 9, 2024

t4c1 left a comment

Choose a reason for hiding this comment

mgouicem Oct 10, 2024

Choose a reason for hiding this comment

mgouicem Oct 10, 2024

Choose a reason for hiding this comment

mgouicem commented Oct 10, 2024

dzarukin Oct 10, 2024

Choose a reason for hiding this comment

mgouicem left a comment

Choose a reason for hiding this comment