Optimized conversions between `Half` and `Single`. #81632

MineCake147E · 2023-02-04T09:03:28Z

Replaced both public static explicit operator Half(float value) and public static explicit operator float(Half value) with new algorithm.
~~I have no idea how to properly test these codes on my PC. Build always fails, saying that multiple files are missing.~~
I was wrong. CMake hasn't been PATHed properly, and it was guided to a wrong path of cl.exe.
It passed the test System.Tests.HalfTests.
If merged, closes #69667.

Closes dotnet#69667.

ghost · 2023-02-04T09:03:39Z

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

Issue Details

Replaced both public static explicit operator Half(float value) and public static explicit operator float(Half value) with new algorithm.
I have no idea how to properly test these codes on my PC. Build always fails, saying that multiple files are missing.
If merged, closes #69667.

Author:	MineCake147E
Assignees:	-
Labels:	`area-System.Numerics`
Milestone:	-

dnfadmin · 2023-02-04T09:03:43Z

All CLA requirements met.

tannergooding · 2023-02-04T15:39:45Z

Thanks for the contribution!

This really needs perf numbers to be provided as well as comments elaborating what the code is doing. SIMD code is not necessarily self explanatory and any future readers of the code will not want to spend 20-30 minutes deciphering each step ;)

I also expect this can be simplified down quite a bit. The vast majority of what's being done is basic arithmetic and potentially something we can recognize some patterns around and simply improve the JIT so that the general case is faster and we don't need specialized intrinsic paths. Providing a disassembly diff between the SIMD vs fallback path would help in seeing where the gaps exist

src/libraries/System.Private.CoreLib/src/System/Half.cs

Added some comments describing algorithm.

…plicit operator Half(float value)`

tannergooding · 2023-02-09T13:52:55Z

I already provided performance numbers in #69667 (comment) in "My proposal for new software fallback converting..." section.

Can you also post current/updated numbers to this PR. It is important we have up to date information with regards to the benefit vs complexity.

Ideally this would be provided in identically laid out before/after tables or even better using the BDN comparison feature so that seeing the change is trivial. Preferably also using clear names such as SimpleLoopBase, and SimpleLoopDiff to clearly disambiguate.

There is quite a bit of extra complexity here, with the general code being much harder to read/reason about and some of the up front cost could be removed in other ways (e.g. using float.IsFinite(value) instead of directly comparing the exponent to handle NaN/Infinity and using simple comparisons or APIs like float.IsNegative rather than directly handling the sign)

If there are some simple bottlenecks in the existing code, it would be preferred to try to optimize those in the JIT as well and with the new cmov support that went in, that might be feasible.

tannergooding · 2023-02-09T14:22:11Z

I have no idea how to properly test these codes on my PC. Build always fails, saying that multiple files are missing.

Can you provide more information as to the failures you're seeing and the process you're following? The general workflow docs for building and testing are here: https://github.com/dotnet/runtime/tree/main/docs/workflow

danmoseley · 2023-02-16T05:24:39Z

@MineCake147E Do docs in the dotnet/performance repo help?

MineCake147E · 2023-02-16T09:48:49Z

@danmoseley Thank you for your information!
I will take a look at it.

dje-dev · 2023-02-24T00:59:11Z

Lack of fast (ideally hardware accelerated) FP16 (Half) conversions have been one of the largest limitations in .NET in recent years for my machine learning work. Neural net weights files often consist of hundreds of millions or billions of parameters in fp32 that need to be converted to execute in FP16 on GPU. It's slow (bandwidth limited) to upload in fp32and then convert on GPU. Hence the need to do this quickly in .NET. As I understand from Tanner there is no hardware support planned for the near future..

Two reasons to consider accepting this PR:

speedup is large. On .NET 8 preview 1 the conversion to Half is more than 2x faster compared conversion in Half.cs.
this is easy to test exhaustively - there is only a single scalar input to the function, and for the HalfToFloat conversion there are only 65536 inputs that can be exhaustively prove correct.

tannergooding · 2023-02-27T16:02:38Z

Is benchmarking with Private Runtime Build even necessary in the first place? I didn't touch anything of CLR.

Yes, because you need to compare "before" vs "after" and the "after" only exists in a local build of the dotnet/runtime (e.g. includes this PR).

Even if it's required, should I add a new benchmark for this specific performance measurement in dotnet/performance repository? I couldn't find any benchmark for Half conversions.

Yes. We generally want most perf driven changes to have a corresponding benchmark so performance can be tracked over time.

speedup is large. On .NET 8 preview 1 the conversion to Half is more than 2x faster compared conversion in Half.cs.

Yes, the speedup is large but it comes at the cost of a lot of complexity and if we actually care about performance then we'd likely want to implement this using FP16 or F16C instead, where it can be done in ~16 or fewer cycles.

This software path would then remain for hardware without F16C support (which requires older than ~2011 for AMD and ~2013 for Intel).

My biggest concern is then the complexity around it and I'm very much interested in whether the same "tricks" can be handled another way. The current perf difference is largely due to the branching the current implementation has and, particularly with the cmov support, I expect we could make this significantly cheaper/simpler for the "fallback path" without hurting readability/maintainability.

src/libraries/System.Private.CoreLib/src/System/Half.cs

* Turned some magic numbers into constants

…-conversion-half-float

MineCake147E · 2023-05-25T04:40:47Z

are existing tests sufficient do you need to add more at the same time?

If any needs for exhaustive tests exist, I'll do it.
FYI, exhaustive test for float->Half conversion is done practically within my local environment.
By utilizing multiple cores, we can theoretically test for all possible float values, if we had any way to compare with the current implementation.

tannergooding · 2023-05-25T16:46:56Z

We do not need "exhaustive" tests covering every possible conversion. We only need to test a few common cases and the known interesting scenarios.

We should have tests covering most of those already.

MineCake147E · 2023-05-26T09:16:59Z

Here's the latest number measured in benchmarks proposed in dotnet/performance#2950 .

Comparison

summary:
better: 6, geomean: 2.664
total diff: 6

No Slower results for the provided threshold = 1% and noise filter = 0.3 ns.

Faster	base/diff	Base Median (ns)	Diff Median (ns)	Modality
System.Tests.Perf_Half.SingleToHalf(value: 6.097555E-05)	4.53	7.94	1.75
System.Tests.Perf_Half.SingleToHalf(value: 12345)	3.21	5.59	1.74
System.Tests.Perf_Half.HalfToSingle(value: 6.1E-05)	2.69	2.83	1.05
System.Tests.Perf_Half.SingleToHalf(value: 65520)	2.43	4.26	1.75
System.Tests.Perf_Half.HalfToSingle(value: NaN)	2.15	2.26	1.05	several?
System.Tests.Perf_Half.HalfToSingle(value: 12344)	1.75	2.26	1.29

Before

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 10 (10.0.19045.2965)
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=8.0.100-preview.4.23260.5
  [Host]     : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2
  Job-IQKHZZ : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1

Method	value	Mean	Error	StdDev	Median	Min	Max	Allocated
HalfToSingle	12344	2.268 ns	0.0614 ns	0.0513 ns	2.264 ns	2.177 ns	2.337 ns	-
SingleToHalf	NaN	1.580 ns	0.0509 ns	0.0451 ns	1.580 ns	1.497 ns	1.656 ns	-
SingleToHalf	6.097555E-05	7.954 ns	0.1442 ns	0.1204 ns	7.937 ns	7.770 ns	8.174 ns	-
SingleToHalf	12345	5.607 ns	0.1451 ns	0.1212 ns	5.592 ns	5.380 ns	5.841 ns	-
HalfToSingle	6.1E-05	2.823 ns	0.0668 ns	0.0624 ns	2.835 ns	2.719 ns	2.956 ns	-
SingleToHalf	65520	4.261 ns	0.0727 ns	0.0644 ns	4.259 ns	4.144 ns	4.388 ns	-
HalfToSingle	NaN	2.268 ns	0.0600 ns	0.0562 ns	2.256 ns	2.199 ns	2.390 ns	-

After

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 10 (10.0.19045.2965)
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=8.0.100-preview.4.23260.5
  [Host]     : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2
  Job-UTBGNX : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true  Toolchain=CoreRun  
IterationTime=250.0000 ms  MaxIterationCount=20  MinIterationCount=15  
WarmupCount=1

Method	value	Mean	Error	StdDev	Median	Min	Max	Allocated
HalfToSingle	12344	1.295 ns	0.0448 ns	0.0397 ns	1.294 ns	1.2403 ns	1.363 ns	-
SingleToHalf	NaN	1.893 ns	0.1261 ns	0.1402 ns	1.894 ns	1.7108 ns	2.223 ns	-
SingleToHalf	6.097555E-05	1.760 ns	0.0529 ns	0.0469 ns	1.752 ns	1.6933 ns	1.856 ns	-
SingleToHalf	12345	1.749 ns	0.0372 ns	0.0348 ns	1.742 ns	1.6993 ns	1.816 ns	-
HalfToSingle	6.1E-05	1.061 ns	0.0495 ns	0.0463 ns	1.055 ns	1.0077 ns	1.161 ns	-
SingleToHalf	65520	1.752 ns	0.0386 ns	0.0361 ns	1.752 ns	1.7014 ns	1.825 ns	-
HalfToSingle	NaN	1.056 ns	0.0459 ns	0.0407 ns	1.049 ns	0.9778 ns	1.118 ns	-

…-conversion-half-float

adamsitnik

I am not going to lie, I am not an expert in numerics. But since both Half and Single are relatively small, I've written a small app that emits all possible inputs for casting and compares the old vs new result. It's available here. Note: I know it could be optimized, I just did not want to introduce bug.

The tests have passed, the new code produces exactly the same output for the same input.

Moreover, I've locally synced your branch and removed the aggressive inlining attribute (please see my comment) and benchmarked it by using the benchmarks from dotnet/performance#2950

For all cases except of NaN, the performance has improved. For NaNs the regression is acceptable (it's very small and I would not expect NaN to be a common input)

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22621.1848)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=8.0.100-preview.4.23259.14
  [Host]     : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2

LaunchCount=3

Method	Job	value	Mean	Ratio
HalfToSingle	PR	12344	1.8440 ns	1.01
HalfToSingle	main	12344	1.8201 ns	1.00

SingleToHalf	PR	NaN	1.3870 ns	1.51
SingleToHalf	main	NaN	0.9199 ns	1.00

SingleToHalf	PR	6.097555E-05	1.3737 ns	0.37
SingleToHalf	main	6.097555E-05	3.7275 ns	1.00

SingleToHalf	PR	12345	1.3651 ns	0.53
SingleToHalf	main	12345	2.6000 ns	1.00

HalfToSingle	PR	6.1E-05	1.7831 ns	0.79
HalfToSingle	main	6.1E-05	2.2691 ns	1.00

SingleToHalf	PR	65520	1.3917 ns	0.67
SingleToHalf	main	65520	2.0886 ns	1.00

HalfToSingle	PR	NaN	1.8133 ns	1.14
HalfToSingle	main	NaN	1.5873 ns	1.00

@MineCake147E thank you for your contribution! As soon as you apply my suggestion I am going to approve and merge the PR.

src/libraries/System.Private.CoreLib/src/System/Half.cs

Co-authored-by: Adam Sitnik <adam.sitnik@gmail.com>

adamsitnik

LGTM, again thank you for your contribution @MineCake147E !

MineCake147E · 2023-07-07T13:11:03Z

I am wondering why the NaN cases have regressed more than I expected in Zen2.
Is it due to differences in microarchitecture?

adamsitnik · 2023-07-07T14:59:18Z

I am wondering why the NaN cases have regressed more than I expected in Zen2.

I don't know and I am sorry because I currently don't have the free cycles to run a dedicated investigation for that.

Optimized conversions between Half and Single.

0190f96

Closes dotnet#69667.

dotnet-issue-labeler bot added the area-System.Numerics label Feb 4, 2023

ghost added the community-contribution Indicates that the PR has been added by a community member label Feb 4, 2023

build-analysis bot mentioned this pull request Feb 4, 2023

Test failure Loader\\classloader\\DictionaryExpansion\\DictionaryExpansion\\DictionaryExpansion.cmd #75244

Closed

This comment was marked as outdated.

Sign in to view

jkotas reviewed Feb 5, 2023

View reviewed changes

src/libraries/System.Private.CoreLib/src/System/Half.cs Outdated Show resolved Hide resolved

Updated explicit operator Half(float value).

f36ab2b

Added some comments describing algorithm.

build-analysis bot mentioned this pull request Feb 5, 2023

Tracking issue for CI build timeouts #76454

Closed

Removed [MethodImpl(MethodImplOptions.AggressiveInlining)] from `ex…

b3b8eee

…plicit operator Half(float value)`

MineCake147E marked this pull request as ready for review February 9, 2023 03:49

MineCake147E requested a review from jkotas February 9, 2023 04:04

jkotas requested review from tannergooding and dakersnar and removed request for jkotas February 9, 2023 05:12

MineCake147E marked this pull request as draft February 9, 2023 21:16

This comment was marked as resolved.

Sign in to view

This comment was marked as outdated.

Sign in to view

This comment was marked as resolved.

Sign in to view

tannergooding reviewed Feb 27, 2023

View reviewed changes

src/libraries/System.Private.CoreLib/src/System/Half.cs Outdated Show resolved Hide resolved

tannergooding reviewed Feb 27, 2023

View reviewed changes

src/libraries/System.Private.CoreLib/src/System/Half.cs Outdated Show resolved Hide resolved

build-analysis bot mentioned this pull request May 17, 2023

Test failure tracing/runtimeeventsource/nativeruntimeeventsource/nativeruntimeeventsource.sh #86233

Closed

MineCake147E added 2 commits May 19, 2023 18:20

Added explanation of explicit operator float

ddd6880

Removed error causing whitespaces at the end of lines

dc30370

This was referenced May 20, 2023

Failed USB connection via port 54050, error 61, in tvOS arm64 Release AllSubsets_Mono #82637

Open

Build ios-arm64 Release AllSubsets_Mono failures dotnet/arcade#13625

Closed

danmoseley reviewed May 20, 2023

View reviewed changes

src/libraries/System.Private.CoreLib/src/System/Half.cs Show resolved Hide resolved

danmoseley reviewed May 21, 2023

View reviewed changes

src/libraries/System.Private.CoreLib/src/System/Half.cs Outdated Show resolved Hide resolved

+ Added explanation of explicit operator Half(float value)

d874357

* Turned some magic numbers into constants

MineCake147E marked this pull request as ready for review May 24, 2023 09:47

MineCake147E changed the title ~~[WIP] Optimized conversions between Half and Single.~~ Optimized conversions between Half and Single. May 24, 2023

Merge branch 'main' of https://github.com/dotnet/runtime into improve…

7e2fa18

…-conversion-half-float

MineCake147E added 2 commits May 26, 2023 10:51

Fixed misinformation in comments

05f7100

Optimized Single->Half conversion with subnormal result

858499c

build-analysis bot mentioned this pull request May 26, 2023

Assert failure in GC/API/NoGCRegion/Callback_Svr test #86612

Closed

Merge branch 'main' of https://github.com/dotnet/runtime into improve…

5dd0fec

…-conversion-half-float

adamsitnik reviewed Jul 7, 2023

View reviewed changes

src/libraries/System.Private.CoreLib/src/System/Half.cs Outdated Show resolved Hide resolved

adamsitnik added tenet-performance Performance related issue needs-author-action An issue or pull request that requires more info or actions from the author. labels Jul 7, 2023

Update src/libraries/System.Private.CoreLib/src/System/Half.cs

3e5be81

Co-authored-by: Adam Sitnik <adam.sitnik@gmail.com>

ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Jul 7, 2023

adamsitnik approved these changes Jul 7, 2023

View reviewed changes

adamsitnik self-assigned this Jul 7, 2023

adamsitnik merged commit 5a03596 into dotnet:main Jul 7, 2023
166 checks passed

adamsitnik added this to the 8.0.0 milestone Jul 7, 2023

ghost locked as resolved and limited conversation to collaborators Aug 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimized conversions between `Half` and `Single`. #81632

Optimized conversions between `Half` and `Single`. #81632

MineCake147E commented Feb 4, 2023 •

edited

Loading

ghost commented Feb 4, 2023

dnfadmin commented Feb 4, 2023 •

edited

Loading

tannergooding commented Feb 4, 2023 •

edited

Loading

This comment was marked as outdated.

tannergooding commented Feb 9, 2023

tannergooding commented Feb 9, 2023

This comment was marked as resolved.

This comment was marked as outdated.

This comment was marked as resolved.

danmoseley commented Feb 16, 2023

MineCake147E commented Feb 16, 2023

dje-dev commented Feb 24, 2023

This comment was marked as resolved.

tannergooding commented Feb 27, 2023

MineCake147E commented May 25, 2023 •

edited

Loading

tannergooding commented May 25, 2023

MineCake147E commented May 26, 2023 •

edited

Loading

adamsitnik left a comment

adamsitnik left a comment

MineCake147E commented Jul 7, 2023

adamsitnik commented Jul 7, 2023

Optimized conversions between Half and Single. #81632

Optimized conversions between Half and Single. #81632

Conversation

MineCake147E commented Feb 4, 2023 • edited Loading

ghost commented Feb 4, 2023

dnfadmin commented Feb 4, 2023 • edited Loading

tannergooding commented Feb 4, 2023 • edited Loading

This comment was marked as outdated.

tannergooding commented Feb 9, 2023

tannergooding commented Feb 9, 2023

This comment was marked as resolved.

This comment was marked as outdated.

This comment was marked as resolved.

danmoseley commented Feb 16, 2023

MineCake147E commented Feb 16, 2023

dje-dev commented Feb 24, 2023

This comment was marked as resolved.

tannergooding commented Feb 27, 2023

MineCake147E commented May 25, 2023 • edited Loading

tannergooding commented May 25, 2023

MineCake147E commented May 26, 2023 • edited Loading

Comparison

Before

After

adamsitnik left a comment

Choose a reason for hiding this comment

adamsitnik left a comment

Choose a reason for hiding this comment

MineCake147E commented Jul 7, 2023

adamsitnik commented Jul 7, 2023

Optimized conversions between `Half` and `Single`. #81632

Optimized conversions between `Half` and `Single`. #81632

MineCake147E commented Feb 4, 2023 •

edited

Loading

dnfadmin commented Feb 4, 2023 •

edited

Loading

tannergooding commented Feb 4, 2023 •

edited

Loading

MineCake147E commented May 25, 2023 •

edited

Loading

MineCake147E commented May 26, 2023 •

edited

Loading