[AMDGPU] Adapt new lowering sequence for fdiv16 #109295

Open · wants to merge 1 commit into main

Conversation

@shiltian (Contributor) commented Sep 19, 2024

The current lowering of fdiv16 can generate an incorrectly rounded result in some cases. The new sequence, provided by the HW team, is shown below in C++.

half fdiv(half a, half b) {
  float a32 = float(a);
  float b32 = float(b);
  float r32 = 1.0f / b32;
  float q32 = a32 * r32;
  float e32 = -b32 * q32 + a32;
  q32 = e32 * r32 + q32;
  e32 = -b32 * q32 + a32;
  float tmp = e32 * r32;
  uint32_t tmp32 = std::bit_cast<uint32_t>(tmp);
  tmp32 = tmp32 & 0xff800000;
  tmp = std::bit_cast<float>(tmp32);
  q32 = tmp + q32;
  half q16 = half(q32);
  q16 = div_fixup_f16(q16, b, a);
  return q16;
}
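
For reference, 0xff800000 is the sign-and-exponent mask of an IEEE-754 binary32 value (it is also the bit pattern of -infinity), so the masking step keeps only the sign and the power-of-two magnitude of the correction term tmp and discards its mantissa. A minimal stand-alone sketch of just that bit trick (my illustration, not part of the patch):

#include <bit>
#include <cstdint>
#include <limits>

// 0xff800000 = sign bit | all-ones exponent | zero mantissa, i.e. -infinity.
static_assert(std::bit_cast<uint32_t>(-std::numeric_limits<float>::infinity()) ==
              0xff800000u);

// Keep only the sign and exponent of x: a normal value rounds down in
// magnitude to a power of two; denormals collapse to +/-0.
float keep_sign_and_exponent(float x) {
  return std::bit_cast<float>(std::bit_cast<uint32_t>(x) & 0xff800000u);
}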

Fixes SWDEV-47760.


@llvmbot (Collaborator) commented Sep 19, 2024

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Shilei Tian (shiltian)

Changes

The current lowering of fdiv16 can generate an incorrectly rounded result in some cases.

Fixes SWDEV-47760.


Patch is 459.85 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/109295.diff

9 Files Affected:

  • (added) 8925731.diff (+4542)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp (+10-6)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+18-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fdiv.f16.ll (+1028-624)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/frem.ll (+62-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-fdiv.mir (+342-48)
  • (modified) llvm/test/CodeGen/AMDGPU/fdiv.f16.ll (+44-10)
  • (modified) llvm/test/CodeGen/AMDGPU/fold-int-pow2-with-fmul-or-fdiv.ll (+57-10)
  • (modified) llvm/test/CodeGen/AMDGPU/frem.ll (+499-171)
diff --git a/8925731.diff b/8925731.diff
new file mode 100644
index 00000000000000..a1f9fa92350375
--- /dev/null
+++ b/8925731.diff
@@ -0,0 +1,4542 @@
+From 8925731b320f6afbbd6a7d0b1b3520f616688aa5 Mon Sep 17 00:00:00 2001
+From: Shilei Tian <shilei.tian@amd.com>
+Date: Tue, 10 Sep 2024 15:38:37 -0400
+Subject: [PATCH] SWDEV-477608 - Adapt new lowering of fdiv16
+
+The current lowering of fdiv16 can generate incorrectly rounded result in some
+cases.
+
+Change-Id: I302365dbd9ceedfc364a25f7806798072b3b14af
+---
+
+diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+index e657f66..86d488e 100644
+--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
++++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+@@ -4902,14 +4902,18 @@
+ 
+   auto LHSExt = B.buildFPExt(S32, LHS, Flags);
+   auto RHSExt = B.buildFPExt(S32, RHS, Flags);
+-
+-  auto RCP = B.buildIntrinsic(Intrinsic::amdgcn_rcp, {S32})
++  auto NegRHSExt = B.buildFNeg(S32, RHSExt);
++  auto Rcp = B.buildIntrinsic(Intrinsic::amdgcn_rcp, {S32})
+                  .addUse(RHSExt.getReg(0))
+                  .setMIFlags(Flags);
+-
+-  auto QUOT = B.buildFMul(S32, LHSExt, RCP, Flags);
+-  auto RDst = B.buildFPTrunc(S16, QUOT, Flags);
+-
++  auto Quot = B.buildFMul(S32, LHSExt, Rcp);
++  auto Err = B.buildFMA(S32, NegRHSExt, Quot, LHSExt);
++  Quot = B.buildFMA(S32, Err, Rcp, Quot);
++  Err = B.buildFMA(S32, NegRHSExt, Quot, LHSExt);
++  auto Tmp = B.buildFMul(S32, Err, Rcp);
++  Tmp = B.buildAnd(S32, Tmp, B.buildConstant(S32, 0xff800000));
++  Quot = B.buildFAdd(S32, Tmp, Quot);
++  auto RDst = B.buildFPTrunc(S16, Quot, Flags);
+   B.buildIntrinsic(Intrinsic::amdgcn_div_fixup, Res)
+       .addUse(RDst.getReg(0))
+       .addUse(RHS)
+diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+index fd35314..ece40a3 100644
+--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
++++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+@@ -10816,19 +10816,25 @@
+     return FastLowered;
+ 
+   SDLoc SL(Op);
+-  SDValue Src0 = Op.getOperand(0);
+-  SDValue Src1 = Op.getOperand(1);
+-
+-  SDValue CvtSrc0 = DAG.getNode(ISD::FP_EXTEND, SL, MVT::f32, Src0);
+-  SDValue CvtSrc1 = DAG.getNode(ISD::FP_EXTEND, SL, MVT::f32, Src1);
+-
+-  SDValue RcpSrc1 = DAG.getNode(AMDGPUISD::RCP, SL, MVT::f32, CvtSrc1);
+-  SDValue Quot = DAG.getNode(ISD::FMUL, SL, MVT::f32, CvtSrc0, RcpSrc1);
+-
++  SDValue LHS = Op.getOperand(0);
++  SDValue RHS = Op.getOperand(1);
++  SDValue LHSExt = DAG.getNode(ISD::FP_EXTEND, SL, MVT::f32, LHS);
++  SDValue RHSExt = DAG.getNode(ISD::FP_EXTEND, SL, MVT::f32, RHS);
++  SDValue NegRHSExt =DAG.getNode(ISD::FNEG, SL, MVT::f32, RHSExt);
++  SDValue Rcp = DAG.getNode(AMDGPUISD::RCP, SL, MVT::f32, RHSExt);
++  SDValue Quot = DAG.getNode(ISD::FMUL, SL, MVT::f32, LHSExt, Rcp);
++  SDValue Err = DAG.getNode(ISD::FMA, SL, MVT::f32, NegRHSExt, Quot, LHSExt);
++  Quot = DAG.getNode(ISD::FMA, SL, MVT::f32, Err, Rcp, Quot);
++  Err = DAG.getNode(ISD::FMA, SL, MVT::f32, NegRHSExt, Quot, LHSExt);
++  SDValue Tmp = DAG.getNode(ISD::FMUL, SL, MVT::f32, Err, Rcp);
++  SDValue TmpCast = DAG.getNode(ISD::BITCAST, SL, MVT::i32, Tmp);
++  TmpCast = DAG.getNode(ISD::AND, SL, MVT::i32, TmpCast,
++                    DAG.getTargetConstant(0xff800000, SL, MVT::i32));
++  Tmp = DAG.getNode(ISD::BITCAST, SL, MVT::f32, TmpCast);
++  Quot = DAG.getNode(ISD::FADD, SL, MVT::f32, Tmp, Quot);
+   SDValue FPRoundFlag = DAG.getTargetConstant(0, SL, MVT::i32);
+-  SDValue BestQuot = DAG.getNode(ISD::FP_ROUND, SL, MVT::f16, Quot, FPRoundFlag);
+-
+-  return DAG.getNode(AMDGPUISD::DIV_FIXUP, SL, MVT::f16, BestQuot, Src1, Src0);
++  SDValue RDst = DAG.getNode(ISD::FP_ROUND, SL, MVT::f16, Quot, FPRoundFlag);
++  return DAG.getNode(AMDGPUISD::DIV_FIXUP, SL, MVT::f16, RDst, RHS, LHS);
+ }
+ 
+ // Faster 2.5 ULP division that does not support denormals.
+diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/fdiv.f16.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/fdiv.f16.ll
+index 89cd18a..c224621 100644
+--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/fdiv.f16.ll
++++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/fdiv.f16.ll
+@@ -57,43 +57,37 @@
+ ; GFX6-FLUSH-NEXT:    v_cvt_f16_f32_e32 v0, v0
+ ; GFX6-FLUSH-NEXT:    s_setpc_b64 s[30:31]
+ ;
+-; GFX8-LABEL: v_fdiv_f16:
+-; GFX8:       ; %bb.0:
+-; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+-; GFX8-NEXT:    v_cvt_f32_f16_e32 v2, v1
+-; GFX8-NEXT:    v_cvt_f32_f16_e32 v3, v0
+-; GFX8-NEXT:    v_rcp_f32_e32 v2, v2
+-; GFX8-NEXT:    v_mul_f32_e32 v2, v3, v2
+-; GFX8-NEXT:    v_cvt_f16_f32_e32 v2, v2
+-; GFX8-NEXT:    v_div_fixup_f16 v0, v2, v1, v0
+-; GFX8-NEXT:    s_setpc_b64 s[30:31]
+-;
+-; GFX9-IEEE-LABEL: v_fdiv_f16:
+-; GFX9-IEEE:       ; %bb.0:
+-; GFX9-IEEE-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+-; GFX9-IEEE-NEXT:    v_cvt_f32_f16_e32 v2, v1
+-; GFX9-IEEE-NEXT:    v_cvt_f32_f16_e32 v3, v0
+-; GFX9-IEEE-NEXT:    v_rcp_f32_e32 v2, v2
+-; GFX9-IEEE-NEXT:    v_mul_f32_e32 v2, v3, v2
+-; GFX9-IEEE-NEXT:    v_cvt_f16_f32_e32 v2, v2
+-; GFX9-IEEE-NEXT:    v_div_fixup_f16 v0, v2, v1, v0
+-; GFX9-IEEE-NEXT:    s_setpc_b64 s[30:31]
+-;
+-; GFX9-FLUSH-LABEL: v_fdiv_f16:
+-; GFX9-FLUSH:       ; %bb.0:
+-; GFX9-FLUSH-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+-; GFX9-FLUSH-NEXT:    v_cvt_f32_f16_e32 v2, v1
+-; GFX9-FLUSH-NEXT:    v_rcp_f32_e32 v2, v2
+-; GFX9-FLUSH-NEXT:    v_mad_mixlo_f16 v2, v0, v2, 0 op_sel_hi:[1,0,0]
+-; GFX9-FLUSH-NEXT:    v_div_fixup_f16 v0, v2, v1, v0
+-; GFX9-FLUSH-NEXT:    s_setpc_b64 s[30:31]
++; GFX89-LABEL: v_fdiv_f16:
++; GFX89:       ; %bb.0:
++; GFX89-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
++; GFX89-NEXT:    v_cvt_f32_f16_e32 v2, v1
++; GFX89-NEXT:    v_cvt_f32_f16_e32 v3, v0
++; GFX89-NEXT:    v_rcp_f32_e32 v4, v2
++; GFX89-NEXT:    v_mul_f32_e32 v5, v3, v4
++; GFX89-NEXT:    v_fma_f32 v6, -v2, v5, v3
++; GFX89-NEXT:    v_fma_f32 v5, v6, v4, v5
++; GFX89-NEXT:    v_fma_f32 v2, -v2, v5, v3
++; GFX89-NEXT:    v_mul_f32_e32 v2, v2, v4
++; GFX89-NEXT:    v_and_b32_e32 v2, 0xff800000, v2
++; GFX89-NEXT:    v_add_f32_e32 v2, v2, v5
++; GFX89-NEXT:    v_cvt_f16_f32_e32 v2, v2
++; GFX89-NEXT:    v_div_fixup_f16 v0, v2, v1, v0
++; GFX89-NEXT:    s_setpc_b64 s[30:31]
+ ;
+ ; GFX10-LABEL: v_fdiv_f16:
+ ; GFX10:       ; %bb.0:
+ ; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+ ; GFX10-NEXT:    v_cvt_f32_f16_e32 v2, v1
++; GFX10-NEXT:    v_cvt_f32_f16_e32 v3, v0
+ ; GFX10-NEXT:    v_rcp_f32_e32 v2, v2
+-; GFX10-NEXT:    v_fma_mixlo_f16 v2, v0, v2, 0 op_sel_hi:[1,0,0]
++; GFX10-NEXT:    v_mul_f32_e32 v3, v3, v2
++; GFX10-NEXT:    v_fma_mix_f32 v4, -v1, v3, v0 op_sel_hi:[1,0,1]
++; GFX10-NEXT:    v_fmac_f32_e32 v3, v4, v2
++; GFX10-NEXT:    v_fma_mix_f32 v4, -v1, v3, v0 op_sel_hi:[1,0,1]
++; GFX10-NEXT:    v_mul_f32_e32 v2, v4, v2
++; GFX10-NEXT:    v_and_b32_e32 v2, 0xff800000, v2
++; GFX10-NEXT:    v_add_f32_e32 v2, v2, v3
++; GFX10-NEXT:    v_cvt_f16_f32_e32 v2, v2
+ ; GFX10-NEXT:    v_div_fixup_f16 v0, v2, v1, v0
+ ; GFX10-NEXT:    s_setpc_b64 s[30:31]
+ ;
+@@ -101,9 +95,17 @@
+ ; GFX11:       ; %bb.0:
+ ; GFX11-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+ ; GFX11-NEXT:    v_cvt_f32_f16_e32 v2, v1
++; GFX11-NEXT:    v_cvt_f32_f16_e32 v3, v0
+ ; GFX11-NEXT:    v_rcp_f32_e32 v2, v2
+ ; GFX11-NEXT:    s_waitcnt_depctr 0xfff
+-; GFX11-NEXT:    v_fma_mixlo_f16 v2, v0, v2, 0 op_sel_hi:[1,0,0]
++; GFX11-NEXT:    v_mul_f32_e32 v3, v3, v2
++; GFX11-NEXT:    v_fma_mix_f32 v4, -v1, v3, v0 op_sel_hi:[1,0,1]
++; GFX11-NEXT:    v_fmac_f32_e32 v3, v4, v2
++; GFX11-NEXT:    v_fma_mix_f32 v4, -v1, v3, v0 op_sel_hi:[1,0,1]
++; GFX11-NEXT:    v_mul_f32_e32 v2, v4, v2
++; GFX11-NEXT:    v_and_b32_e32 v2, 0xff800000, v2
++; GFX11-NEXT:    v_add_f32_e32 v2, v2, v3
++; GFX11-NEXT:    v_cvt_f16_f32_e32 v2, v2
+ ; GFX11-NEXT:    v_div_fixup_f16 v0, v2, v1, v0
+ ; GFX11-NEXT:    s_setpc_b64 s[30:31]
+   %fdiv = fdiv half %a, %b
+@@ -188,43 +190,37 @@
+ ; GFX6-FLUSH-NEXT:    v_cvt_f16_f32_e32 v0, v0
+ ; GFX6-FLUSH-NEXT:    s_setpc_b64 s[30:31]
+ ;
+-; GFX8-LABEL: v_fdiv_f16_ulp25:
+-; GFX8:       ; %bb.0:
+-; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+-; GFX8-NEXT:    v_cvt_f32_f16_e32 v2, v1
+-; GFX8-NEXT:    v_cvt_f32_f16_e32 v3, v0
+-; GFX8-NEXT:    v_rcp_f32_e32 v2, v2
+-; GFX8-NEXT:    v_mul_f32_e32 v2, v3, v2
+-; GFX8-NEXT:    v_cvt_f16_f32_e32 v2, v2
+-; GFX8-NEXT:    v_div_fixup_f16 v0, v2, v1, v0
+-; GFX8-NEXT:    s_setpc_b64 s[30:31]
+-;
+-; GFX9-IEEE-LABEL: v_fdiv_f16_ulp25:
+-; GFX9-IEEE:       ; %bb.0:
+-; GFX9-IEEE-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+-; GFX9-IEEE-NEXT:    v_cvt_f32_f16_e32 v2, v1
+-; GFX9-IEEE-NEXT:    v_cvt_f32_f16_e32 v3, v0
+-; GFX9-IEEE-NEXT:    v_rcp_f32_e32 v2, v2
+-; GFX9-IEEE-NEXT:    v_mul_f32_e32 v2, v3, v2
+-; GFX9-IEEE-NEXT:    v_cvt_f16_f32_e32 v2, v2
+-; GFX9-IEEE-NEXT:    v_div_fixup_f16 v0, v2, v1, v0
+-; GFX9-IEEE-NEXT:    s_setpc_b64 s[30:31]
+-;
+-; GFX9-FLUSH-LABEL: v_fdiv_f16_ulp25:
+-; GFX9-FLUSH:       ; %bb.0:
+-; GFX9-FLUSH-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+-; GFX9-FLUSH-NEXT:    v_cvt_f32_f16_e32 v2, v1
+-; GFX9-FLUSH-NEXT:    v_rcp_f32_e32 v2, v2
+-; GFX9-FLUSH-NEXT:    v_mad_mixlo_f16 v2, v0, v2, 0 op_sel_hi:[1,0,0]
+-; GFX9-FLUSH-NEXT:    v_div_fixup_f16 v0, v2, v1, v0
+-; GFX9-FLUSH-NEXT:    s_setpc_b64 s[30:31]
++; GFX89-LABEL: v_fdiv_f16_ulp25:
++; GFX89:       ; %bb.0:
++; GFX89-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
++; GFX89-NEXT:    v_cvt_f32_f16_e32 v2, v1
++; GFX89-NEXT:    v_cvt_f32_f16_e32 v3, v0
++; GFX89-NEXT:    v_rcp_f32_e32 v4, v2
++; GFX89-NEXT:    v_mul_f32_e32 v5, v3, v4
++; GFX89-NEXT:    v_fma_f32 v6, -v2, v5, v3
++; GFX89-NEXT:    v_fma_f32 v5, v6, v4, v5
++; GFX89-NEXT:    v_fma_f32 v2, -v2, v5, v3
++; GFX89-NEXT:    v_mul_f32_e32 v2, v2, v4
++; GFX89-NEXT:    v_and_b32_e32 v2, 0xff800000, v2
++; GFX89-NEXT:    v_add_f32_e32 v2, v2, v5
++; GFX89-NEXT:    v_cvt_f16_f32_e32 v2, v2
++; GFX89-NEXT:    v_div_fixup_f16 v0, v2, v1, v0
++; GFX89-NEXT:    s_setpc_b64 s[30:31]
+ ;
+ ; GFX10-LABEL: v_fdiv_f16_ulp25:
+ ; GFX10:       ; %bb.0:
+ ; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+ ; GFX10-NEXT:    v_cvt_f32_f16_e32 v2, v1
++; GFX10-NEXT:    v_cvt_f32_f16_e32 v3, v0
+ ; GFX10-NEXT:    v_rcp_f32_e32 v2, v2
+-; GFX10-NEXT:    v_fma_mixlo_f16 v2, v0, v2, 0 op_sel_hi:[1,0,0]
++; GFX10-NEXT:    v_mul_f32_e32 v3, v3, v2
++; GFX10-NEXT:    v_fma_mix_f32 v4, -v1, v3, v0 op_sel_hi:[1,0,1]
++; GFX10-NEXT:    v_fmac_f32_e32 v3, v4, v2
++; GFX10-NEXT:    v_fma_mix_f32 v4, -v1, v3, v0 op_sel_hi:[1,0,1]
++; GFX10-NEXT:    v_mul_f32_e32 v2, v4, v2
++; GFX10-NEXT:    v_and_b32_e32 v2, 0xff800000, v2
++; GFX10-NEXT:    v_add_f32_e32 v2, v2, v3
++; GFX10-NEXT:    v_cvt_f16_f32_e32 v2, v2
+ ; GFX10-NEXT:    v_div_fixup_f16 v0, v2, v1, v0
+ ; GFX10-NEXT:    s_setpc_b64 s[30:31]
+ ;
+@@ -232,9 +228,17 @@
+ ; GFX11:       ; %bb.0:
+ ; GFX11-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+ ; GFX11-NEXT:    v_cvt_f32_f16_e32 v2, v1
++; GFX11-NEXT:    v_cvt_f32_f16_e32 v3, v0
+ ; GFX11-NEXT:    v_rcp_f32_e32 v2, v2
+ ; GFX11-NEXT:    s_waitcnt_depctr 0xfff
+-; GFX11-NEXT:    v_fma_mixlo_f16 v2, v0, v2, 0 op_sel_hi:[1,0,0]
++; GFX11-NEXT:    v_mul_f32_e32 v3, v3, v2
++; GFX11-NEXT:    v_fma_mix_f32 v4, -v1, v3, v0 op_sel_hi:[1,0,1]
++; GFX11-NEXT:    v_fmac_f32_e32 v3, v4, v2
++; GFX11-NEXT:    v_fma_mix_f32 v4, -v1, v3, v0 op_sel_hi:[1,0,1]
++; GFX11-NEXT:    v_mul_f32_e32 v2, v4, v2
++; GFX11-NEXT:    v_and_b32_e32 v2, 0xff800000, v2
++; GFX11-NEXT:    v_add_f32_e32 v2, v2, v3
++; GFX11-NEXT:    v_cvt_f16_f32_e32 v2, v2
+ ; GFX11-NEXT:    v_div_fixup_f16 v0, v2, v1, v0
+ ; GFX11-NEXT:    s_setpc_b64 s[30:31]
+   %fdiv = fdiv half %a, %b
+@@ -673,59 +677,67 @@
+ ; GFX8-LABEL: v_fdiv_v2f16:
+ ; GFX8:       ; %bb.0:
+ ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+-; GFX8-NEXT:    v_lshrrev_b32_e32 v4, 16, v1
+-; GFX8-NEXT:    v_cvt_f32_f16_e32 v3, v1
+-; GFX8-NEXT:    v_cvt_f32_f16_e32 v5, v4
+-; GFX8-NEXT:    v_lshrrev_b32_e32 v2, 16, v0
+-; GFX8-NEXT:    v_cvt_f32_f16_e32 v6, v0
+-; GFX8-NEXT:    v_rcp_f32_e32 v3, v3
+-; GFX8-NEXT:    v_cvt_f32_f16_e32 v7, v2
+-; GFX8-NEXT:    v_rcp_f32_e32 v5, v5
+-; GFX8-NEXT:    v_mul_f32_e32 v3, v6, v3
+-; GFX8-NEXT:    v_cvt_f16_f32_e32 v3, v3
+-; GFX8-NEXT:    v_mul_f32_e32 v5, v7, v5
+-; GFX8-NEXT:    v_cvt_f16_f32_e32 v5, v5
+-; GFX8-NEXT:    v_div_fixup_f16 v0, v3, v1, v0
+-; GFX8-NEXT:    v_div_fixup_f16 v1, v5, v4, v2
++; GFX8-NEXT:    v_cvt_f32_f16_e32 v2, v1
++; GFX8-NEXT:    v_cvt_f32_f16_e32 v4, v0
++; GFX8-NEXT:    v_lshrrev_b32_e32 v6, 16, v1
++; GFX8-NEXT:    v_cvt_f32_f16_e32 v8, v6
++; GFX8-NEXT:    v_rcp_f32_e32 v5, v2
++; GFX8-NEXT:    v_lshrrev_b32_e32 v3, 16, v0
++; GFX8-NEXT:    v_cvt_f32_f16_e32 v7, v3
++; GFX8-NEXT:    v_mul_f32_e32 v9, v4, v5
++; GFX8-NEXT:    v_fma_f32 v10, -v2, v9, v4
++; GFX8-NEXT:    v_fma_f32 v9, v10, v5, v9
++; GFX8-NEXT:    v_fma_f32 v2, -v2, v9, v4
++; GFX8-NEXT:    v_rcp_f32_e32 v4, v8
++; GFX8-NEXT:    v_mul_f32_e32 v2, v2, v5
++; GFX8-NEXT:    v_and_b32_e32 v2, 0xff800000, v2
++; GFX8-NEXT:    v_add_f32_e32 v2, v2, v9
++; GFX8-NEXT:    v_mul_f32_e32 v5, v7, v4
++; GFX8-NEXT:    v_fma_f32 v9, -v8, v5, v7
++; GFX8-NEXT:    v_fma_f32 v5, v9, v4, v5
++; GFX8-NEXT:    v_fma_f32 v7, -v8, v5, v7
++; GFX8-NEXT:    v_mul_f32_e32 v4, v7, v4
++; GFX8-NEXT:    v_and_b32_e32 v4, 0xff800000, v4
++; GFX8-NEXT:    v_add_f32_e32 v4, v4, v5
++; GFX8-NEXT:    v_cvt_f16_f32_e32 v2, v2
++; GFX8-NEXT:    v_cvt_f16_f32_e32 v4, v4
++; GFX8-NEXT:    v_div_fixup_f16 v0, v2, v1, v0
++; GFX8-NEXT:    v_div_fixup_f16 v1, v4, v6, v3
+ ; GFX8-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+ ; GFX8-NEXT:    v_or_b32_e32 v0, v0, v1
+ ; GFX8-NEXT:    s_setpc_b64 s[30:31]
+ ;
+-; GFX9-IEEE-LABEL: v_fdiv_v2f16:
+-; GFX9-IEEE:       ; %bb.0:
+-; GFX9-IEEE-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+-; GFX9-IEEE-NEXT:    v_lshrrev_b32_e32 v4, 16, v1
+-; GFX9-IEEE-NEXT:    v_cvt_f32_f16_e32 v3, v1
+-; GFX9-IEEE-NEXT:    v_cvt_f32_f16_e32 v5, v4
+-; GFX9-IEEE-NEXT:    v_lshrrev_b32_e32 v2, 16, v0
+-; GFX9-IEEE-NEXT:    v_cvt_f32_f16_e32 v6, v0
+-; GFX9-IEEE-NEXT:    v_rcp_f32_e32 v3, v3
+-; GFX9-IEEE-NEXT:    v_cvt_f32_f16_e32 v7, v2
+-; GFX9-IEEE-NEXT:    v_rcp_f32_e32 v5, v5
+-; GFX9-IEEE-NEXT:    v_mul_f32_e32 v3, v6, v3
+-; GFX9-IEEE-NEXT:    v_cvt_f16_f32_e32 v3, v3
+-; GFX9-IEEE-NEXT:    v_mul_f32_e32 v5, v7, v5
+-; GFX9-IEEE-NEXT:    v_cvt_f16_f32_e32 v5, v5
+-; GFX9-IEEE-NEXT:    v_div_fixup_f16 v0, v3, v1, v0
+-; GFX9-IEEE-NEXT:    v_div_fixup_f16 v1, v5, v4, v2
+-; GFX9-IEEE-NEXT:    v_pack_b32_f16 v0, v0, v1
+-; GFX9-IEEE-NEXT:    s_setpc_b64 s[30:31]
+-;
+-; GFX9-FLUSH-LABEL: v_fdiv_v2f16:
+-; GFX9-FLUSH:       ; %bb.0:
+-; GFX9-FLUSH-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+-; GFX9-FLUSH-NEXT:    v_cvt_f32_f16_e32 v2, v1
+-; GFX9-FLUSH-NEXT:    v_lshrrev_b32_e32 v3, 16, v1
+-; GFX9-FLUSH-NEXT:    v_cvt_f32_f16_e32 v4, v3
+-; GFX9-FLUSH-NEXT:    v_lshrrev_b32_e32 v5, 16, v0
+-; GFX9-FLUSH-NEXT:    v_rcp_f32_e32 v2, v2
+-; GFX9-FLUSH-NEXT:    v_rcp_f32_e32 v4, v4
+-; GFX9-FLUSH-NEXT:    v_mad_mixlo_f16 v2, v0, v2, 0 op_sel_hi:[1,0,0]
+-; GFX9-FLUSH-NEXT:    v_div_fixup_f16 v1, v2, v1, v0
+-; GFX9-FLUSH-NEXT:    v_mad_mixlo_f16 v0, v0, v4, 0 op_sel:[1,0,0] op_sel_hi:[1,0,0]
+-; GFX9-FLUSH-NEXT:    v_div_fixup_f16 v0, v0, v3, v5
+-; GFX9-FLUSH-NEXT:    v_pack_b32_f16 v0, v1, v0
+-; GFX9-FLUSH-NEXT:    s_setpc_b64 s[30:31]
++; GFX9-LABEL: v_fdiv_v2f16:
++; GFX9:       ; %bb.0:
++; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
++; GFX9-NEXT:    v_cvt_f32_f16_e32 v2, v1
++; GFX9-NEXT:    v_cvt_f32_f16_e32 v4, v0
++; GFX9-NEXT:    v_lshrrev_b32_e32 v6, 16, v1
++; GFX9-NEXT:    v_cvt_f32_f16_e32 v8, v6
++; GFX9-NEXT:    v_rcp_f32_e32 v5, v2
++; GFX9-NEXT:    v_lshrrev_b32_e32 v3, 16, v0
++; GFX9-NEXT:    v_cvt_f32_f16_e32 v7, v3
++; GFX9-NEXT:    v_mul_f32_e32 v9, v4, v5
++; GFX9-NEXT:    v_fma_f32 v10, -v2, v9, v4
++; GFX9-NEXT:    v_fma_f32 v9, v10, v5, v9
++; GFX9-NEXT:    v_fma_f32 v2, -v2, v9, v4
++; GFX9-NEXT:    v_rcp_f32_e32 v4, v8
++; GFX9-NEXT:    v_mul_f32_e32 v2, v2, v5
++; GFX9-NEXT:    v_and_b32_e32 v2, 0xff800000, v2
++; GFX9-NEXT:    v_add_f32_e32 v2, v2, v9
++; GFX9-NEXT:    v_mul_f32_e32 v5, v7, v4
++; GFX9-NEXT:    v_fma_f32 v9, -v8, v5, v7
++; GFX9-NEXT:    v_fma_f32 v5, v9, v4, v5
++; GFX9-NEXT:    v_fma_f32 v7, -v8, v5, v7
++; GFX9-NEXT:    v_mul_f32_e32 v4, v7, v4
++; GFX9-NEXT:    v_and_b32_e32 v4, 0xff800000, v4
++; GFX9-NEXT:    v_add_f32_e32 v4, v4, v5
++; GFX9-NEXT:    v_cvt_f16_f32_e32 v2, v2
++; GFX9-NEXT:    v_cvt_f16_f32_e32 v4, v4
++; GFX9-NEXT:    v_div_fixup_f16 v0, v2, v1, v0
++; GFX9-NEXT:    v_div_fixup_f16 v1, v4, v6, v3
++; GFX9-NEXT:    v_pack_b32_f16 v0, v0, v1
++; GFX9-NEXT:    s_setpc_b64 s[30:31]
+ ;
+ ; GFX10-LABEL: v_fdiv_v2f16:
+ ; GFX10:       ; %bb.0:
+@@ -733,11 +745,27 @@
+ ; GFX10-NEXT:    v_lshrrev_b32_e32 v2, 16, v1
+ ; GFX10-NEXT:    v_cvt_f32_f16_e32 v3, v1
+ ; GFX10-NEXT:    v_lshrrev_b32_e32 v5, 16, v0
++; GFX10-NEXT:    v_cvt_f32_f16_e32 v6, v0
+ ; GFX10-NEXT:    v_cvt_f32_f16_e32 v4, v2
+ ; GFX10-NEXT:    v_rcp_f32_e32 v3, v3
++; GFX10-NEXT:    v_cvt_f32_f16_e32 v7, v5
+ ; GFX10-NEXT:    v_rcp_f32_e32 v4, v4
+-; GFX10-NEXT:    v_fma_mixlo_f16 v3, v0, v3, 0 op_sel_hi:[1,0,0]
+-; GFX10-NEXT:    v_fma_mixlo_f16 v4, v0, v4, 0 op_sel:[1,0,0] op_sel_hi:[1,0,0]
++; GFX10-NEXT:    v_mul_f32_e32 v6, v6, v3
++; GFX10-NEXT:    v_mul_f32_e32 v7, v7, v4
++; GFX10-NEXT:    v_fma_mix_f32 v8, -v1, v6, v0 op_sel_hi:[1,0,1]
++; GFX10-NEXT:    v_fma_mix_f32 v9, -v1, v7, v0 op_sel:[1,0,1] op_sel_hi:[1,0,1]
++; GFX10-NEXT:    v_fmac_f32_e32 v6, v8, v3
++; GFX10-NEXT:    v_fmac_f32_e32 v7, v9, v4
++; GFX10-NEXT:    v_fma_mix_f32 v8, -v1, v6, v0 op_sel_hi:[1,0,1]
++; GFX10-NEXT:    v_fma_mix_f32 v9, -v1, v7, v0 op_sel:[1,0,1] op_sel_hi:[1,0,1]
++; GFX10-NEXT:    v_mul_f32_e32 v3, v8, v3
++; GFX10-NEXT:    v_mul_f32_e32 v4, v9, v4
++; GFX10-NEXT:    v_and_b32_e32 v3, 0xff800000, v3
++; GFX10-NEXT:    v_and_b32_e32 v4, 0xff800000, v4
++; GFX10-NEXT:    v_add_f32_e32 v3, v3, v6
++; GFX10-NEXT:    v_add_f32_e32 v4, v4, v7
++; GFX10-NEXT:    v_cvt_f16_f32_e32 v3, v3
++; GFX10-NEXT:    v_cvt_f16_f32_e32 v4, v4
+ ; GFX10-NEXT:    v_div_fixup_f16 v0, v3, v1, v0
+ ; GFX10-NEXT:    v_div_fixup_f16 v1, v4, v2, v5
+ ; GFX10-NEXT:    v_pack_b32_f16 v0, v0, v1
+@@ -749,12 +777,24 @@
+ ; GFX11-NEXT:    v_lshrrev_b32_e32 v2, 16, v1
+ ; GFX11-NEXT:    v_cvt_f32_f16_e32 v3, v1
+ ; GFX11-NEXT:    v_lshrrev_b32_e32 v5, 16, v0
++; GFX11-NEXT:    v_cvt_f32_f16_e32 v6, v0
+ ; GFX11-NEXT:    v_cvt_f32_f16_e32 v4, v2
+ ; GFX11-NEXT:    v_rcp_f32_e32 v3, v3
++; GFX11-NEXT:    v_cvt_f32_f16_e32 v7, v5
+ ; GFX11-NEXT:    v_rcp_f32_e32 v4, v4
+ ; GFX11-NEXT:    s_waitcnt_depctr 0xfff
+-; GFX11-NEXT:    v_fma_mixlo_f16 v3, v0, v3, 0 op_sel_hi:[1,0,0]
+-; GFX11-NEXT:    v_fma_mixlo_f16 v4, v0, v4, 0 op_sel:[1,0,0] op_sel_hi:[1,0,0]
++; GFX11-NEXT:    v_dual_mul_f32 v6, v6, v3 :: v_dual_mul_f32 v7, v7, v4
++; GFX11-NEXT:    v_fma_mix_f32 v8, -v1, v6, v0 op_sel_hi:[1,0,1]
++; GFX11-NEXT:    v_fma_mix_f32 v9, -v1, v7, v0 op_sel:[1,0,1] op_sel_hi:[1,0,1]
++; GFX11-NEXT:    v_dual_fmac_f32 v6, v8, v3 :: v_dual_fmac_f32 v7, v9, v4
++; GFX11-NEXT:    v_fma_mix_f32 v8, -v1, v6, v0 op_sel_hi:[1,0,1]
++; GFX11-NEXT:    v_fma_mix_f32 v9, -v1, v7, v0 op_sel:[1,0,1] op_sel_hi:[1,0,1]
++; GFX11-NEXT:    v_dual_mul_f32 v3, v8, v3 :: v_dual_mul_f32 v4, v9, v4
++; GFX11-NEXT:    v_and_b32_e32 v3, 0xff800000, v3
++; GFX11-NEXT:    v_dual_add_f32 v3, v3, v6 :: v_dual_and_b32 v4, 0xff800000, v4
++; GFX11-NEXT:    v_add_f32_e32 v4, v4, v7
++; GFX11-NEXT:    v_cvt_f16_f32_e32 v3, v3
++; GFX11-NEXT:    v_cvt_f16_f32_e32 v4, v4
+ ; GFX11-NEXT:    v_div_fixup_f16 v0, v3, v1, v0
+ ; GFX11-NEXT:    v_div_fixup_f16 v1, v4, v2, v5
+ ; GFX11-NEXT:    v_pack_b32_f16 v0, v0, v1
+@@ -900,59 +940,67 @@
+ ; GFX8-LABEL: v_fdiv_v2f16_ulp25:
+ ; GFX8:       ; %bb.0:
+ ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+-; GFX8-NEXT:    v_lshrrev_b32_e32 v4, 16, v1
+-; GFX8-NEXT:    v_cvt_f32_f16_e32 v3, v1
+-; GFX8-NEXT:    v_cvt_f32_f16_e32 v5, v4
+-; GFX8-NEXT:    v_lshrrev_b32_e32 v2, 16, v0
+-; GFX8-NEXT:    v_cvt_f32_f16_e32 v6, v0
+-; GFX8-NEXT:    v_rcp_f32_e32 v3, v3
+-; GFX8-NEXT:    v_cvt_f32_f16_e32 v7, v2
+-; GFX8-NEXT:    v_rcp_f32_e32 v5, v5
+-; GFX8-NEXT:    v_mul_f32_e32 v3, v6, v3
+-; GFX8-NEXT:    v_cvt_f16_f32_e32 v3, v3
+-; GFX...
[truncated]

@shiltian force-pushed the users/shiltian/fp16-correctly-rounded-sequence branch from fa5cc64 to 0b9f3f5 on September 19, 2024 14:59

github-actions bot commented Sep 19, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@jayfoad (Contributor) commented Sep 19, 2024

Can you explain the algorithm? Or at least add a comment showing what the algorithm does in pseudocode, so I don't have to work it out from all the getNode calls?

@shiltian (Contributor, Author) commented Sep 19, 2024

> Can you explain the algorithm? Or at least add a comment showing what the algorithm does in pseudocode, so I don't have to work it out from all the getNode calls?

Done. Please refer to the ticket (as well as the Gerrit PR attached to it) for more details.

@jayfoad (Contributor) commented Sep 19, 2024

I meant as a comment in the code.

@shiltian force-pushed the users/shiltian/fp16-correctly-rounded-sequence branch from 0b9f3f5 to f8e0d13 on September 19, 2024 15:47
@shiltian (Contributor, Author)

> I meant as a comment in the code.

Done.

The current lowering of fdiv16 can generate an incorrectly rounded result in some cases.

Fixes SWDEV-47760.
@shiltian force-pushed the users/shiltian/fp16-correctly-rounded-sequence branch from f8e0d13 to 0ef2672 on September 19, 2024 20:12
SDValue Tmp = DAG.getNode(ISD::FMUL, SL, MVT::f32, Err, Rcp);
SDValue TmpCast = DAG.getNode(ISD::BITCAST, SL, MVT::i32, Tmp);
TmpCast = DAG.getNode(ISD::AND, SL, MVT::i32, TmpCast,
DAG.getTargetConstant(0xff800000, SL, MVT::i32));

Review comment (Contributor):

This is not a TargetConstant and should be a regular Constant
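
If I read that suggestion correctly, the change would be to materialize the mask with getConstant instead of getTargetConstant, e.g. (a sketch of the suggested fix, not the final code):

  TmpCast = DAG.getNode(ISD::AND, SL, MVT::i32, TmpCast,
                        DAG.getConstant(0xff800000, SL, MVT::i32));

TargetConstant nodes are reserved for operands that must survive selection as immediates; an ordinary operand that still goes through legalization and selection is built with getConstant.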

Err = DAG.getNode(ISD::FMA, SL, MVT::f32, NegRHSExt, Quot, LHSExt);
SDValue Tmp = DAG.getNode(ISD::FMUL, SL, MVT::f32, Err, Rcp);
SDValue TmpCast = DAG.getNode(ISD::BITCAST, SL, MVT::i32, Tmp);
TmpCast = DAG.getNode(ISD::AND, SL, MVT::i32, TmpCast,

Review comment (Contributor):

I still don't understand this part, it's an AND by -infinity, so it's extracting the significand. Is it possible to use frexp_mant instead?
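
For what it's worth, here is how the masked value of a normal float relates to a frexp-style decomposition, sketched with the standard-library frexp/ldexp (my illustration, valid for normal non-zero inputs; whether the amdgcn frexp intrinsics would map onto the sequence more cheaply than the AND is a separate question):

#include <bit>
#include <cmath>
#include <cstdint>

// bits(x) & 0xff800000 keeps only the sign and exponent of x ...
float mask_with_neg_inf_bits(float x) {
  return std::bit_cast<float>(std::bit_cast<uint32_t>(x) & 0xff800000u);
}

// ... which, for a normal x with |x| = m * 2^exp and m in [0.5, 1),
// equals sign(x) * 2^(exp - 1).
float via_frexp(float x) {
  int exp;
  std::frexp(x, &exp);
  return std::ldexp(std::copysign(0.5f, x), exp);
}

// Example: both return 4.0f for an input of 6.5f (1.625 * 2^2).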

@jayfoad (Contributor) commented Sep 20, 2024

> [AMDGPU] Adapt new lowering sequence for fdiv16

Did you mean "adopt"?
