The assembly emitted for some arm neon fma intrinsics contains function calls #1212

Closed
hkratz opened this issue Sep 8, 2021 · 1 comment · Fixed by #1219

Comments

@hkratz
Contributor

hkratz commented Sep 8, 2021

vfma_n_f32, vfmaq_n_f32, vfms_n_f32 and vfmsq_n_f32 are affected.

From a CI run with the inlining check enabled (https://github.com/rust-lang/stdarch/runs/3547875318):

---- core_arch::arm_shared::neon::generated::assert_vfma_n_f32_vfma stdout ----
disassembly for stdarch_test_shim_vfma_n_f32_vfma: 
	 0: push {fp, lr}
	 1: vpush {d8-d9}
	 2: sub sp, sp, #8
	 3: ldr r0, [pc, #56] ; 395e0 <stdarch_test_shim_vfma_n_f32_vfma+0x4c>
	 4: vorr d9, d0, d0
	 5: ldr r1, [pc, #52] ; 395e4 <stdarch_test_shim_vfma_n_f32_vfma+0x50>
	 6: vmov.f32 s0, s4
	 7: add r0, pc, r0
	 8: vorr d8, d1, d1
	 9: ldr r1, [pc, r1]
	10: str r0, [r1]
	11: mov r0, sp
	12: bl 34030 <_ZN9core_arch9core_arch10arm_shared4neon10vdup_n_f3217h6b11480dbcf9ce6fE>
	13: vldr d16, [sp]
	14: vfma.f32 d9, d8, d16
	15: vorr d0, d9, d9
	16: add sp, sp, #8
	17: vpop {d8-d9}
	18: pop {fp, pc}
	19: .word 0x0010780f
	20: .word 0x001be9d8
thread 'core_arch::arm_shared::neon::generated::assert_vfma_n_f32_vfma' panicked at 'instruction found, but the disassembly contains subroutine call instructions, which hint that inlining failed', crates/stdarch-test/src/lib.rs:177:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- core_arch::arm_shared::neon::generated::assert_vfmaq_n_f32_vfma stdout ----
disassembly for stdarch_test_shim_vfmaq_n_f32_vfma: 
	 0: push {r4, lr}
	 1: vpush {d8-d11}
	 2: sub sp, sp, #16
	 3: ldr r0, [pc, #60] ; 3958c <stdarch_test_shim_vfmaq_n_f32_vfma+0x50>
	 4: vorr q5, q0, q0
	 5: ldr r1, [pc, #56] ; 39590 <stdarch_test_shim_vfmaq_n_f32_vfma+0x54>
	 6: vmov.f32 s0, s8
	 7: add r0, pc, r0
	 8: mov r4, sp
	 9: vorr q4, q1, q1
	10: ldr r1, [pc, r1]
	11: str r0, [r1]
	12: mov r0, r4
	13: bl 34024 <_ZN9core_arch9core_arch10arm_shared4neon11vdupq_n_f3217h03e9053c684c2898E>
	14: vld1.64 {d16-d17}, [r4]
	15: vfma.f32 q5, q4, q8
	16: vorr q0, q5, q5
	17: add sp, sp, #16
	18: vpop {d8-d11}
	19: pop {r4, pc}
	20: .word 0x0010780d
	21: .word 0x001bea2c
thread 'core_arch::arm_shared::neon::generated::assert_vfmaq_n_f32_vfma' panicked at 'instruction found, but the disassembly contains subroutine call instructions, which hint that inlining failed', crates/stdarch-test/src/lib.rs:177:9

---- core_arch::arm_shared::neon::generated::assert_vfms_n_f32_vfms stdout ----
disassembly for stdarch_test_shim_vfms_n_f32_vfms: 
	 0: push {fp, lr}
	 1: vpush {d8-d9}
	 2: sub sp, sp, #8
	 3: ldr r0, [pc, #56] ; 394ec <stdarch_test_shim_vfms_n_f32_vfms+0x4c>
	 4: vorr d9, d0, d0
	 5: ldr r1, [pc, #52] ; 394f0 <stdarch_test_shim_vfms_n_f32_vfms+0x50>
	 6: vmov.f32 s0, s4
	 7: add r0, pc, r0
	 8: vorr d8, d1, d1
	 9: ldr r1, [pc, r1]
	10: str r0, [r1]
	11: mov r0, sp
	12: bl 34030 <_ZN9core_arch9core_arch10arm_shared4neon10vdup_n_f3217h6b11480dbcf9ce6fE>
	13: vldr d16, [sp]
	14: vfms.f32 d9, d8, d16
	15: vorr d0, d9, d9
	16: add sp, sp, #8
	17: vpop {d8-d9}
	18: pop {fp, pc}
	19: .word 0x001077c9
	20: .word 0x001beacc
thread 'core_arch::arm_shared::neon::generated::assert_vfms_n_f32_vfms' panicked at 'instruction found, but the disassembly contains subroutine call instructions, which hint that inlining failed', crates/stdarch-test/src/lib.rs:177:9

---- core_arch::arm_shared::neon::generated::assert_vfmsq_n_f32_vfms stdout ----
disassembly for stdarch_test_shim_vfmsq_n_f32_vfms: 
	 0: push {r4, lr}
	 1: vpush {d8-d11}
	 2: sub sp, sp, #16
	 3: ldr r0, [pc, #60] ; 39498 <stdarch_test_shim_vfmsq_n_f32_vfms+0x50>
	 4: vorr q5, q0, q0
	 5: ldr r1, [pc, #56] ; 3949c <stdarch_test_shim_vfmsq_n_f32_vfms+0x54>
	 6: vmov.f32 s0, s8
	 7: add r0, pc, r0
	 8: mov r4, sp
	 9: vorr q4, q1, q1
	10: ldr r1, [pc, r1]
	11: str r0, [r1]
	12: mov r0, r4
	13: bl 34024 <_ZN9core_arch9core_arch10arm_shared4neon11vdupq_n_f3217h03e9053c684c2898E>
	14: vld1.64 {d16-d17}, [r4]
	15: vfms.f32 q5, q4, q8
	16: vorr q0, q5, q5
	17: add sp, sp, #16
	18: vpop {d8-d11}
	19: pop {r4, pc}
	20: .word 0x001077c7
	21: .word 0x001beb20
thread 'core_arch::arm_shared::neon::generated::assert_vfmsq_n_f32_vfms' panicked at 'instruction found, but the disassembly contains subroutine call instructions, which hint that inlining failed', crates/stdarch-test/src/lib.rs:177:9

cc @SparrowLii

@hkratz
Contributor Author

hkratz commented Sep 9, 2021

The problem is that the VFMA functions have target_feature(enable = "fp-armv8,v8") while the functions they call, vdup_n_f32 and vdupq_n_f32, have target_feature(enable = "v7"). Because the feature sets differ, the calls are not inlined. Using private _v8 variants of those functions causes them to be inlined.

Before I submit a PR though... Isn't VFMA already supported with vfp4+neon? Should we use v7,vfp4 instead of fp-armv8 and v8?
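
For readers unfamiliar with the `_n` intrinsics, here is a minimal scalar sketch of the call structure visible in the disassembly above: the `_n` wrapper broadcasts the scalar (the `bl ...vdup_n_f32...` call) and then performs the fused multiply-add. The type `F32x2` and every `*_model` function name are hypothetical stand-ins made up for illustration; this is not the stdarch source, and it uses no target_feature attributes, so it only models the lane arithmetic, not the inlining behaviour.

```rust
// Hypothetical scalar stand-in for a 2-lane float32x2_t (illustration only).
type F32x2 = [f32; 2];

// Models vdup_n_f32: broadcast one scalar into every lane.
// In the issue, this is the helper that ends up as a `bl` call.
#[inline]
fn dup_n_model(n: f32) -> F32x2 {
    [n; 2]
}

// Models vfma_f32: lane-wise fused a + b * c.
#[inline]
fn fma_model(a: F32x2, b: F32x2, c: F32x2) -> F32x2 {
    [b[0].mul_add(c[0], a[0]), b[1].mul_add(c[1], a[1])]
}

// Models vfms_f32: lane-wise fused a - b * c.
#[inline]
fn fms_model(a: F32x2, b: F32x2, c: F32x2) -> F32x2 {
    [(-b[0]).mul_add(c[0], a[0]), (-b[1]).mul_add(c[1], a[1])]
}

// Models vfma_n_f32: broadcast `n`, then fused multiply-add.
fn fma_n_model(a: F32x2, b: F32x2, n: f32) -> F32x2 {
    fma_model(a, b, dup_n_model(n))
}

// Models vfms_n_f32: broadcast `n`, then fused multiply-subtract.
fn fms_n_model(a: F32x2, b: F32x2, n: f32) -> F32x2 {
    fms_model(a, b, dup_n_model(n))
}

fn main() {
    let a = [1.0_f32, 2.0];
    let b = [3.0_f32, 4.0];
    println!("{:?}", fma_n_model(a, b, 2.0));
    println!("{:?}", fms_n_model(a, b, 2.0));
}
```

In the real intrinsics, the wrapper and the broadcast helper each carry their own target_feature attribute, which is where the mismatch described above comes in.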
