Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimized conversions between Half and Single. #81632

Merged
merged 19 commits into from
Jul 7, 2023

Conversation

MineCake147E
Copy link
Contributor

@MineCake147E MineCake147E commented Feb 4, 2023

Replaced both public static explicit operator Half(float value) and public static explicit operator float(Half value) with new algorithm.
I have no idea how to properly test these codes on my PC. Build always fails, saying that multiple files are missing.
I was wrong. CMake hasn't been PATHed properly, and it was guided to a wrong path of cl.exe.
It passed the test System.Tests.HalfTests.
If merged, closes #69667.

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Feb 4, 2023
@ghost
Copy link

ghost commented Feb 4, 2023

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

Issue Details

Replaced both public static explicit operator Half(float value) and public static explicit operator float(Half value) with new algorithm.
I have no idea how to properly test these codes on my PC. Build always fails, saying that multiple files are missing.
If merged, closes #69667.

Author: MineCake147E
Assignees: -
Labels:

area-System.Numerics

Milestone: -

@dnfadmin
Copy link

dnfadmin commented Feb 4, 2023

CLA assistant check
All CLA requirements met.

@tannergooding
Copy link
Member

tannergooding commented Feb 4, 2023

Thanks for the contribution!

This really needs perf numbers to be provided as well as comments elaborating what the code is doing. SIMD code is not necessarily self explanatory and any future readers of the code will not want to spend 20-30 minutes deciphering each step ;)

I also expect this can be simplified down quite a bit. The vast majority of what's being done is basic arithmetic and potentially something we can recognize some patterns around and simply improve the JIT so that the general case is faster and we don't need specialized intrinsic paths. Providing a disassembly diff between the SIMD vs fallback path would help in seeing where the gaps exist

@MineCake147E

This comment was marked as outdated.

Added some comments describing algorithm.
@MineCake147E MineCake147E marked this pull request as ready for review February 9, 2023 03:49
@jkotas jkotas requested review from tannergooding and dakersnar and removed request for jkotas February 9, 2023 05:12
@tannergooding
Copy link
Member

I already provided performance numbers in #69667 (comment) in "My proposal for new software fallback converting..." section.

Can you also post current/updated numbers to this PR. It is important we have up to date information with regards to the benefit vs complexity.

Ideally this would be provided in identically laid out before/after tables or even better using the BDN comparison feature so that seeing the change is trivial. Preferably also using clear names such as SimpleLoopBase, and SimpleLoopDiff to clearly disambiguate.

There is quite a bit of extra complexity here, with the general code being much harder to read/reason about and some of the up front cost could be removed in other ways (e.g. using float.IsFinite(value) instead of directly comparing the exponent to handle NaN/Infinity and using simple comparisons or APIs like float.IsNegative rather than directly handling the sign)

If there are some simple bottlenecks in the existing code, it would be preferred to try to optimize those in the JIT as well and with the new cmov support that went in, that might be feasible.

@tannergooding
Copy link
Member

I have no idea how to properly test these codes on my PC. Build always fails, saying that multiple files are missing.

Can you provide more information as to the failures you're seeing and the process you're following? The general workflow docs for building and testing are here: https://github.com/dotnet/runtime/tree/main/docs/workflow

@MineCake147E MineCake147E marked this pull request as draft February 9, 2023 21:16
@MineCake147E

This comment was marked as resolved.

@MineCake147E

This comment was marked as outdated.

@MineCake147E

This comment was marked as resolved.

@danmoseley
Copy link
Member

@MineCake147E Do docs in the dotnet/performance repo help?

@MineCake147E
Copy link
Contributor Author

@danmoseley Thank you for your information!
I will take a look at it.

@dje-dev
Copy link

dje-dev commented Feb 24, 2023

Lack of fast (ideally hardware accelerated) FP16 (Half) conversions have been one of the largest limitations in .NET in recent years for my machine learning work. Neural net weights files often consist of hundreds of millions or billions of parameters in fp32 that need to be converted to execute in FP16 on GPU. It's slow (bandwidth limited) to upload in fp32and then convert on GPU. Hence the need to do this quickly in .NET. As I understand from Tanner there is no hardware support planned for the near future..

Two reasons to consider accepting this PR:

  • speedup is large. On .NET 8 preview 1 the conversion to Half is more than 2x faster compared conversion in Half.cs.
  • this is easy to test exhaustively - there is only a single scalar input to the function, and for the HalfToFloat conversion there are only 65536 inputs that can be exhaustively prove correct.

@MineCake147E

This comment was marked as resolved.

@tannergooding
Copy link
Member

Is benchmarking with Private Runtime Build even necessary in the first place? I didn't touch anything of CLR.

Yes, because you need to compare "before" vs "after" and the "after" only exists in a local build of the dotnet/runtime (e.g. includes this PR).

Even if it's required, should I add a new benchmark for this specific performance measurement in dotnet/performance repository? I couldn't find any benchmark for Half conversions.

Yes. We generally want most perf driven changes to have a corresponding benchmark so performance can be tracked over time.

speedup is large. On .NET 8 preview 1 the conversion to Half is more than 2x faster compared conversion in Half.cs.

Yes, the speedup is large but it comes at the cost of a lot of complexity and if we actually care about performance then we'd likely want to implement this using FP16 or F16C instead, where it can be done in ~16 or fewer cycles.

This software path would then remain for hardware without F16C support (which requires older than ~2011 for AMD and ~2013 for Intel).

My biggest concern is then the complexity around it and I'm very much interested in whether the same "tricks" can be handled another way. The current perf difference is largely due to the branching the current implementation has and, particularly with the cmov support, I expect we could make this significantly cheaper/simpler for the "fallback path" without hurting readability/maintainability.

* Turned some magic numbers into constants
@MineCake147E MineCake147E marked this pull request as ready for review May 24, 2023 09:47
@MineCake147E MineCake147E changed the title [WIP] Optimized conversions between Half and Single. Optimized conversions between Half and Single. May 24, 2023
@MineCake147E
Copy link
Contributor Author

MineCake147E commented May 25, 2023

are existing tests sufficient do you need to add more at the same time?

If any needs for exhaustive tests exist, I'll do it.
FYI, exhaustive test for float->Half conversion is done practically within my local environment.
By utilizing multiple cores, we can theoretically test for all possible float values, if we had any way to compare with the current implementation.

@tannergooding
Copy link
Member

We do not need "exhaustive" tests covering every possible conversion. We only need to test a few common cases and the known interesting scenarios.

We should have tests covering most of those already.

@MineCake147E
Copy link
Contributor Author

MineCake147E commented May 26, 2023

Here's the latest number measured in benchmarks proposed in dotnet/performance#2950 .

Comparison

summary:
better: 6, geomean: 2.664
total diff: 6

No Slower results for the provided threshold = 1% and noise filter = 0.3 ns.

Faster base/diff Base Median (ns) Diff Median (ns) Modality
System.Tests.Perf_Half.SingleToHalf(value: 6.097555E-05) 4.53 7.94 1.75
System.Tests.Perf_Half.SingleToHalf(value: 12345) 3.21 5.59 1.74
System.Tests.Perf_Half.HalfToSingle(value: 6.1E-05) 2.69 2.83 1.05
System.Tests.Perf_Half.SingleToHalf(value: 65520) 2.43 4.26 1.75
System.Tests.Perf_Half.HalfToSingle(value: NaN) 2.15 2.26 1.05 several?
System.Tests.Perf_Half.HalfToSingle(value: 12344) 1.75 2.26 1.29

Before

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 10 (10.0.19045.2965)
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=8.0.100-preview.4.23260.5
  [Host]     : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2
  Job-IQKHZZ : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1  
Method value Mean Error StdDev Median Min Max Allocated
HalfToSingle 12344 2.268 ns 0.0614 ns 0.0513 ns 2.264 ns 2.177 ns 2.337 ns -
SingleToHalf NaN 1.580 ns 0.0509 ns 0.0451 ns 1.580 ns 1.497 ns 1.656 ns -
SingleToHalf 6.097555E-05 7.954 ns 0.1442 ns 0.1204 ns 7.937 ns 7.770 ns 8.174 ns -
SingleToHalf 12345 5.607 ns 0.1451 ns 0.1212 ns 5.592 ns 5.380 ns 5.841 ns -
HalfToSingle 6.1E-05 2.823 ns 0.0668 ns 0.0624 ns 2.835 ns 2.719 ns 2.956 ns -
SingleToHalf 65520 4.261 ns 0.0727 ns 0.0644 ns 4.259 ns 4.144 ns 4.388 ns -
HalfToSingle NaN 2.268 ns 0.0600 ns 0.0562 ns 2.256 ns 2.199 ns 2.390 ns -

After

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 10 (10.0.19045.2965)
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=8.0.100-preview.4.23260.5
  [Host]     : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2
  Job-UTBGNX : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true  Toolchain=CoreRun  
IterationTime=250.0000 ms  MaxIterationCount=20  MinIterationCount=15  
WarmupCount=1  
Method value Mean Error StdDev Median Min Max Allocated
HalfToSingle 12344 1.295 ns 0.0448 ns 0.0397 ns 1.294 ns 1.2403 ns 1.363 ns -
SingleToHalf NaN 1.893 ns 0.1261 ns 0.1402 ns 1.894 ns 1.7108 ns 2.223 ns -
SingleToHalf 6.097555E-05 1.760 ns 0.0529 ns 0.0469 ns 1.752 ns 1.6933 ns 1.856 ns -
SingleToHalf 12345 1.749 ns 0.0372 ns 0.0348 ns 1.742 ns 1.6993 ns 1.816 ns -
HalfToSingle 6.1E-05 1.061 ns 0.0495 ns 0.0463 ns 1.055 ns 1.0077 ns 1.161 ns -
SingleToHalf 65520 1.752 ns 0.0386 ns 0.0361 ns 1.752 ns 1.7014 ns 1.825 ns -
HalfToSingle NaN 1.056 ns 0.0459 ns 0.0407 ns 1.049 ns 0.9778 ns 1.118 ns -

Copy link
Member

@adamsitnik adamsitnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not going to lie, I am not an expert in numerics. But since both Half and Single are relatively small, I've written a small app that emits all possible inputs for casting and compares the old vs new result. It's available here. Note: I know it could be optimized, I just did not want to introduce bug.

The tests have passed, the new code produces exactly the same output for the same input.

Moreover, I've locally synced your branch and removed the aggressive inlining attribute (please see my comment) and benchmarked it by using the benchmarks from dotnet/performance#2950

For all cases except of NaN, the performance has improved. For NaNs the regression is acceptable (it's very small and I would not expect NaN to be a common input)

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22621.1848)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=8.0.100-preview.4.23259.14
  [Host]     : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2

LaunchCount=3
Method Job value Mean Ratio
HalfToSingle PR 12344 1.8440 ns 1.01
HalfToSingle main 12344 1.8201 ns 1.00
SingleToHalf PR NaN 1.3870 ns 1.51
SingleToHalf main NaN 0.9199 ns 1.00
SingleToHalf PR 6.097555E-05 1.3737 ns 0.37
SingleToHalf main 6.097555E-05 3.7275 ns 1.00
SingleToHalf PR 12345 1.3651 ns 0.53
SingleToHalf main 12345 2.6000 ns 1.00
HalfToSingle PR 6.1E-05 1.7831 ns 0.79
HalfToSingle main 6.1E-05 2.2691 ns 1.00
SingleToHalf PR 65520 1.3917 ns 0.67
SingleToHalf main 65520 2.0886 ns 1.00
HalfToSingle PR NaN 1.8133 ns 1.14
HalfToSingle main NaN 1.5873 ns 1.00

@MineCake147E thank you for your contribution! As soon as you apply my suggestion I am going to approve and merge the PR.

src/libraries/System.Private.CoreLib/src/System/Half.cs Outdated Show resolved Hide resolved
@adamsitnik adamsitnik added tenet-performance Performance related issue needs-author-action An issue or pull request that requires more info or actions from the author. labels Jul 7, 2023
Co-authored-by: Adam Sitnik <adam.sitnik@gmail.com>
@ghost ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Jul 7, 2023
Copy link
Member

@adamsitnik adamsitnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, again thank you for your contribution @MineCake147E !

@MineCake147E
Copy link
Contributor Author

I am wondering why the NaN cases have regressed more than I expected in Zen2.
Is it due to differences in microarchitecture?

@adamsitnik adamsitnik self-assigned this Jul 7, 2023
@adamsitnik adamsitnik merged commit 5a03596 into dotnet:main Jul 7, 2023
166 checks passed
@adamsitnik adamsitnik added this to the 8.0.0 milestone Jul 7, 2023
@adamsitnik
Copy link
Member

I am wondering why the NaN cases have regressed more than I expected in Zen2.

I don't know and I am sorry because I currently don't have the free cycles to run a dedicated investigation for that.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Numerics community-contribution Indicates that the PR has been added by a community member tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimize conversions between Half and Single
7 participants