Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gpu: generic: sycl: lnorm Intel GPU precision issues #2071

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

kala855
Copy link
Contributor

@kala855 kala855 commented Sep 2, 2024

Description

When used in Intel PVC, the layer normalization SYCL kernel implementation faces some precision issues in variance computation.

create: --lnorm --engine=gpu --inplace=true 30x300
run: --lnorm --engine=gpu --inplace=true 30x300
[   0][VAR][0] exp_f32:   0.0202823 exp:   0.0202823 got:   0.0202823 diff:1.86265e-09 rdiff:9.1836e-08
[   1][VAR][1] exp_f32:   0.0804944 exp:   0.0804944 got:   0.0804944 diff:7.45058e-09 rdiff:9.25603e-08
[   2][VAR][2] exp_f32:    0.327062 exp:    0.327062 got:    0.327062 diff:2.98023e-08 rdiff:9.11213e-08
[   6][VAR][6] exp_f32:     82.4779 exp:     82.4779 got:     82.4779 diff:7.62939e-06 rdiff:9.25023e-08
[  17][VAR][17] exp_f32:      1.3052 exp:      1.3052 got:      1.3052 diff:1.19209e-07 rdiff:9.1334e-08
[  22][VAR][22] exp_f32:   0.0829537 exp:   0.0829537 got:   0.0829537 diff:7.45058e-09 rdiff:8.98161e-08
[  23][VAR][23] exp_f32:    0.324289 exp:    0.324289 got:    0.324289 diff:2.98023e-08 rdiff:9.19006e-08
[  29][VAR][29] exp_f32:   0.0822034 exp:   0.0822034 got:   0.0822034 diff:7.45058e-09 rdiff:9.0636e-08
[COMPARE_STATS][VAR]: trh=0 err_max_diff:7.62939e-06 err_max_rdiff:9.25603e-08 all_max_diff:7.62939e-06 all_max_rdiff:9.25603e-08
0:FAILED (errors:8 total:9060) __REPRO: --lnorm --engine=gpu --inplace=true 30x300
[  14][VAR][14] exp_f32: 2.68618e-05 exp: 2.68618e-05 got: 2.68618e-05 diff:1.81899e-12 rdiff:6.77165e-08
[COMPARE_STATS][VAR]: trh=0 err_max_diff:1.81899e-12 err_max_rdiff:6.77165e-08 all_max_diff:1.81899e-12 all_max_rdiff:6.77165e-08
8471:FAILED (errors:1 total:75) __REPRO: --lnorm --engine=gpu --dt=f32:bf16 --tag=axb --stat_tag=abx --flags=CH 15x3_n"lnorm_ci_0d:0"

The previous are just a couple of failing examples.

As a proposal, the variance threshold is modified to pass the failing tests.

@kala855 kala855 requested a review from a team as a code owner September 2, 2024 13:46
@kala855 kala855 requested a review from a team September 2, 2024 13:49
@kala855
Copy link
Contributor Author

kala855 commented Sep 2, 2024

To give a little bit of additional context. I found that in this line. The division is making v_variance slightly different from the reference that is computed in benchdnn (there is a thr=0 for these cases). Maybe that is happening because of some compiler optimization.

@kala855 kala855 changed the title gpu: generic: sycl: lnorm: Intel GPU precision issues gpu: generic: sycl: lnorm Intel GPU precision issues Sep 2, 2024
Copy link
Contributor

@t4c1 t4c1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like an improvement. However I would just set the threshold to 5e-7 for all the cases, as 0 does not make sense as soon as floating point calculations are involved.

@mgouicem
Copy link
Contributor

mgouicem commented Sep 3, 2024

Do you know what is the default setting for fp_config::correctly_rounded_divide_sqrt on this device?
We typically disable fast-reciprocal, I wonder if we should do something similar for sycl kernels

@vpirogov
Copy link
Member

vpirogov commented Sep 3, 2024

@kala855, benchdnn is very sensitive to numerical issues by design. So in cases like this it's important to understand where the difference is coming from before considering the threshold change.

@kala855
Copy link
Contributor Author

kala855 commented Sep 13, 2024

Do you know what is the default setting for fp_config::correctly_rounded_divide_sqrt on this device? We typically disable fast-reciprocal, I wonder if we should do something similar for sycl kernels

We have been checking carefully these days to see what is happening here. Doing a comparison between the assembly generated by icpx on the OCL and SYCL versions of the implementations we found that:

  1. The OCL version generates an instruction math.invm which is IEEE standard compliant and is used to compute the division.
  2. The SYCL version generates an instruction math.inv which is not IEEE standard compliant and is the instruction that generates the precision issues.
  3. We were looking at different compiler flags to try to generate the same behavior on the SYCL side but was not possible.
  4. We are thinking that the problem is due to some compiler bug that does not take into account the flags to generate the expected code. (-fno-fast-math, -ffp-model=precise, etc.)
  5. Info about math.inv and math.invm instructions could be found here pages 24 and 26 respectively.

Any suggestions or feedback will be more than welcome. Thanks.

@mgouicem @t4c1 @sgeor255

@mgouicem
Copy link
Contributor

Thanks for checking. I would suggest:

  • to open an issue against the sycl compiler with your findings,
  • add a condition on that threshold increase on DNNL_WITH_SYCL
  • add a comment to your patch so that we can revert the threshold as the issue is resolved on sycl compiler side.

@kala855 kala855 force-pushed the kala855/lnorm-var branch 2 times, most recently from afef1ae to 030a62b Compare September 30, 2024 11:15
@kala855
Copy link
Contributor Author

kala855 commented Sep 30, 2024

Thanks for checking. I would suggest:

  • to open an issue against the sycl compiler with your findings,
  • add a condition on that threshold increase on DNNL_WITH_SYCL
  • add a comment to your patch so that we can revert the threshold as the issue is resolved on sycl compiler side.

Hi @mgouicem I did what you mentioned to try to get the SYCL lnorm benchdnn tests working. The issue was opened and a workaround was pushed in this PR. Thanks for your feedback.

Copy link
Contributor

@t4c1 t4c1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I still believe it would be better to have nonzero threshold for all cases.

@mgouicem mgouicem added platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel platform:gpu-generic Codeowner: @oneapi-src/onednn-gpu-generic labels Oct 9, 2024
@kala855
Copy link
Contributor Author

kala855 commented Oct 9, 2024

We found that SYCL_PROGRAM_COMPILE_OPTIONS="-cl-fp32-correctly-rounded-divide-sqrt" environment variable strengthens the requirements for floating-point division, making it correctly-rounded. I made some changes to set the variable and unset it just in the operators that need it. Batch normalization was failing because of the same issue. The last commit also fixes the problem in bnorm.

@github-actions github-actions bot removed the platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel label Oct 9, 2024
Copy link
Contributor

@t4c1 t4c1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Maybe we could mention in the comment before the option that this is specifically required for Intel devices.

@@ -84,9 +84,13 @@ status_t ref_batch_normalization_fwd_t::init(impl::engine_t *engine) {
= ::sycl::get_kernel_id<batch_normalization_fwd_kernel_vec_t>();
CHECK(create_kernel(engine, kid, &kernel_));
} else {
// Enabling the IEEE div compliant implementation
setenv("SYCL_PROGRAM_COMPILE_OPTIONS",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is not thread-safe to use setenv (e.g. another concurrent call to the same primitive might unset it after this one sets it).

const auto kid = ::sycl::get_kernel_id<
batch_normalization_fwd_kernel_vec_t1>();
CHECK(create_kernel(engine, kid, &kernel_));
unsetenv("SYCL_PROGRAM_COMPILE_OPTIONS");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, here it does not restore the env var to its original value.

@mgouicem
Copy link
Contributor

We found that SYCL_PROGRAM_COMPILE_OPTIONS="-cl-fp32-correctly-rounded-divide-sqrt" environment variable strengthens the requirements for floating-point division, making it correctly-rounded. I made some changes to set the variable and unset it just in the operators that need it. Batch normalization was failing because of the same issue. The last commit also fixes the problem in bnorm.

Is there a programmatic way to control the same knob (that does not involve envvar)? It is typically not thread-safe to modify environment variables and this should be avoided as much as possible IMHO.

|| (kind == VAR && prb->dir & FLAG_FWD))
? 5e-7
: 0);
#else
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still needed with the new option given to the kernel?

Copy link
Contributor

@mgouicem mgouicem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current PR relies on non thread-safe environment variable setting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
platform:gpu-generic Codeowner: @oneapi-src/onednn-gpu-generic
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants