Optimize conversions between `Half` and `Single` #69667

MineCake147E · 2022-05-23T08:09:58Z

Description

Conversions between `Half` and `float` are currently implemented only in software, which causes performance issues.
It would be ideal if issue #62416 were resolved, but a better software fallback is still needed for environments such as Sandy Bridge, which does not support hardware conversion.

Configuration

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19045
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100
  [Host]     : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
  DefaultJob : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT

Regression?

No

Data

I benchmarked the code below.
EDIT: Removed data biases.
EDIT2: Added random permutation.

Benchmark code for Half to Single conversion

using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Channels;
using System.Threading.Tasks;

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

namespace HalfConversionBenchmarks
{
    public enum InputValueType
    {
        Sequential,
        Permuted,
        RandomUniform,
        RandomSubnormal,
        RandomNormal,
        RandomInfNaN
    }

    [CategoriesColumn]
    [GroupBenchmarksBy(BenchmarkLogicalGroupRule.ByParams, BenchmarkLogicalGroupRule.ByCategory)]
    [SimpleJob(runtimeMoniker: RuntimeMoniker.HostProcess)]
    [DisassemblyDiagnoser(maxDepth: int.MaxValue)]
    [AnyCategoriesFilter(CategoryStandard)]
    public class HalfToSingleConversionBenchmarks
    {
        private const string CategorySimple = "Simple";
        private const string CategoryStandard = "Standard";
        private const string CategoryUnrolled = "Unrolled";

        private Half[] bufferA;
        private float[] bufferDst;

        [Params(65536)]
        public int Frames { get; set; }
        [Params(InputValueType.Sequential, InputValueType.Permuted)]
        public InputValueType InputValue { get; set; }
        [GlobalSetup]
        public void Setup()
        {
            var samples = Frames;
            bufferDst = new float[samples];
            var bA = bufferA = new Half[samples];
            var spanA = bA.AsSpan();
            switch (InputValue)
            {
                case InputValueType.Permuted:
                    FillSequential(spanA);
                    ref var x9 = ref MemoryMarshal.GetReference(spanA);
                    var length = spanA.Length;
                    var olen = length - 2;
                    for (var i = 0; i < olen; i++)
                    {
                        //Using RandomNumberGenerator in order to prevent predictability
                        var x = RandomNumberGenerator.GetInt32(i, length);
                        (Unsafe.Add(ref x9, x), Unsafe.Add(ref x9, i)) = (Unsafe.Add(ref x9, i), Unsafe.Add(ref x9, x));
                    }
                    break;
                case InputValueType.RandomUniform:
                    RandomNumberGenerator.Fill(MemoryMarshal.AsBytes(spanA));
                    break;
                case InputValueType.RandomSubnormal:
                    for (var i = 0; i < spanA.Length; i++)
                    {
                        var r = (ushort)RandomNumberGenerator.GetInt32(0x7fe);
                        spanA[i] = BitConverter.UInt16BitsToHalf(ushort.RotateRight(r, 1));
                    }
                    break;
                case InputValueType.RandomNormal:
                    for (var i = 0; i < spanA.Length; i++)
                    {
                        var r = (ushort)RandomNumberGenerator.GetInt32(0xF000);
                        spanA[i] = BitConverter.UInt16BitsToHalf((ushort)(ushort.RotateRight(r, 1) + 0x0400u));
                    }
                    break;
                case InputValueType.RandomInfNaN:
                    RandomNumberGenerator.Fill(MemoryMarshal.AsBytes(spanA));
                    for (var i = 0; i < spanA.Length; i++)
                    {
                        var r = BitConverter.HalfToUInt16Bits(spanA[i]);
                        spanA[i] = BitConverter.UInt16BitsToHalf((ushort)(r | 0x7c00u));
                    }
                    break;
                default:
                    FillSequential(spanA);
                    break;
            }
            static void FillSequential(Span<Half> spanA)
            {
                for (var i = 0; i < spanA.Length; i++)
                {
                    spanA[i] = BitConverter.UInt16BitsToHalf((ushort)i);
                }
            }
        }

        [BenchmarkCategory(CategorySimple, CategoryStandard)]
        [Benchmark(Baseline = true)]
        public void SimpleLoopStandard()
        {
            var bA = bufferA.AsSpan();
            var bD = bufferDst.AsSpan();
            ref var rsi = ref MemoryMarshal.GetReference(bA);
            ref var rdi = ref MemoryMarshal.GetReference(bD);
            nint i = 0, length = Math.Min(bA.Length, bD.Length);
            for (; i < length; i++)
            {
                Unsafe.Add(ref rdi, i) = (float)Unsafe.Add(ref rsi, i);
            }
        }

        #region Unrolled

        [BenchmarkCategory(CategoryUnrolled, CategoryStandard)]
        [Benchmark(Baseline = true)]
        public void UnrolledLoopStandard()
        {
            var bA = bufferA.AsSpan();
            var bD = bufferDst.AsSpan();
            ref var rsi = ref MemoryMarshal.GetReference(bA);
            ref var rdi = ref MemoryMarshal.GetReference(bD);
            nint i = 0, length = Math.Min(bA.Length, bD.Length);
            var olen = length - 3;
            for (; i < olen; i += 4)
            {
                Unsafe.Add(ref rdi, i + 0) = (float)Unsafe.Add(ref rsi, i + 0);
                Unsafe.Add(ref rdi, i + 1) = (float)Unsafe.Add(ref rsi, i + 1);
                Unsafe.Add(ref rdi, i + 2) = (float)Unsafe.Add(ref rsi, i + 2);
                Unsafe.Add(ref rdi, i + 3) = (float)Unsafe.Add(ref rsi, i + 3);
            }
            for (; i < length; i++)
            {
                Unsafe.Add(ref rdi, i) = (float)Unsafe.Add(ref rsi, i);
            }
        }
        #endregion
    }
}

| Method | Categories | Frames | InputValue | Mean | Error | StdDev | Ratio | Code Size |
|---|---|---|---|---|---|---|---|---|
| SimpleLoopStandard | Simple,Standard | 65536 | Sequential | 180.1 μs | 1.51 μs | 1.41 μs | 1.00 | 298 B |
| UnrolledLoopStandard | Unrolled,Standard | 65536 | Sequential | 196.4 μs | 1.40 μs | 1.24 μs | 1.00 | 397 B |
| SimpleLoopStandard | Simple,Standard | 65536 | Permuted | 372.2 μs | 2.63 μs | 2.33 μs | 1.00 | 298 B |
| UnrolledLoopStandard | Unrolled,Standard | 65536 | Permuted | 385.0 μs | 1.05 μs | 0.87 μs | 1.00 | 397 B |

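Sequential values convert about twice as fast as permuted ones (180.1 μs vs. 372.2 μs for the simple loop), so the current implementation seems to benefit from predictable input in some way, most likely through branch prediction.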

Benchmark code for Single to Half conversion

using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics.X86;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

namespace HalfConversionBenchmarks
{
    [CategoriesColumn]
    [SimpleJob(runtimeMoniker: RuntimeMoniker.HostProcess)]
    [GroupBenchmarksBy(BenchmarkLogicalGroupRule.ByParams, BenchmarkLogicalGroupRule.ByCategory)]
    [DisassemblyDiagnoser(maxDepth: int.MaxValue)]
    [AnyCategoriesFilter(CategoryStandard)]
    public class SingleToHalfConversionBenchmarks
    {
        [Params(65536)]
        public int Frames { get; set; }

        [ParamsAllValues]
        public InputValueType InputValue { get; set; }

        private const string CategorySimple = "Simple";
        private const string CategoryUnrolled = "Unrolled";
        private const string CategoryVectorized = "Vectorized";
        private const string CategoryStandard = "Standard";

        private float[] bufferSrc;
        private Half[] bufferDst;

        [GlobalSetup]
        public void Setup()
        {
            var samples = Frames;
            var vS = bufferSrc = new float[samples];
            bufferDst = new Half[samples];
            var vspan = vS.AsSpan();
            switch (InputValue)
            {
                case InputValueType.Permuted:
                    FillSequential(vspan);
                    //Random Permutation
                    ref var x9 = ref MemoryMarshal.GetReference(vspan);
                    var length = vspan.Length;
                    var olen = length - 2;
                    for (var i = 0; i < olen; i++)
                    {
                        //Using RandomNumberGenerator in order to prevent predictability
                        var x = RandomNumberGenerator.GetInt32(i, length);
                        (Unsafe.Add(ref x9, x), Unsafe.Add(ref x9, i)) = (Unsafe.Add(ref x9, i), Unsafe.Add(ref x9, x));
                    }
                    break;
                case InputValueType.RandomUniform:
                    RandomNumberGenerator.Fill(MemoryMarshal.AsBytes(vspan));
                    break;
                case InputValueType.RandomSubnormal:
                    for (var i = 0; i < vspan.Length; i++)
                    {
                        var r = (uint)RandomNumberGenerator.GetInt32(0x70FF_BFFE);
                        vspan[i] = BitConverter.UInt32BitsToSingle(uint.RotateRight(r, 1));
                    }
                    break;
                case InputValueType.RandomNormal:
                    for (var i = 0; i < vspan.Length; i++)
                    {
                        var r = (uint)RandomNumberGenerator.GetInt32(0x1E00_1FFE);
                        vspan[i] = BitConverter.UInt32BitsToSingle(uint.RotateRight(r, 1) + 947904512u);
                    }
                    break;
                case InputValueType.RandomInfNaN:
                    RandomNumberGenerator.Fill(MemoryMarshal.AsBytes(vspan));
                    for (var i = 0; i < vspan.Length; i++)
                    {
                        var r = BitConverter.SingleToUInt32Bits(vspan[i]);
                        vspan[i] = BitConverter.UInt32BitsToSingle(r | 0x7f80_0000u);
                    }
                    break;
                default:
                    FillSequential(vspan);
                    break;
            }

            static void FillSequential(Span<float> vspan)
            {
                for (var i = 0; i < vspan.Length; i++)
                {
                    vspan[i] = (float)BitConverter.UInt16BitsToHalf((ushort)i);
                }
            }
        }
        [BenchmarkCategory(CategorySimple, CategoryStandard)]
        [Benchmark(Baseline = true)]
        public void SimpleLoopStandard()
        {
            var bA = bufferSrc.AsSpan();
            var bD = bufferDst.AsSpan();
            ref var rsi = ref MemoryMarshal.GetReference(bA);
            ref var rdi = ref MemoryMarshal.GetReference(bD);
            nint i = 0, length = Math.Min(bA.Length, bD.Length);
            for (; i < length; i++)
            {
                Unsafe.Add(ref rdi, i) = (Half)Unsafe.Add(ref rsi, i);
            }
        }
        #region Unrolled
        [BenchmarkCategory(CategoryUnrolled, CategoryStandard)]
        [Benchmark]
        public void UnrolledLoopStandard()
        {
            var bA = bufferSrc.AsSpan();
            var bD = bufferDst.AsSpan();
            ref var rsi = ref MemoryMarshal.GetReference(bA);
            ref var rdi = ref MemoryMarshal.GetReference(bD);
            nint i = 0, length = Math.Min(bA.Length, bD.Length);
            var olen = length - 3;
            for (; i < olen; i += 4)
            {
                Unsafe.Add(ref rdi, i + 0) = (Half)Unsafe.Add(ref rsi, i + 0);
                Unsafe.Add(ref rdi, i + 1) = (Half)Unsafe.Add(ref rsi, i + 1);
                Unsafe.Add(ref rdi, i + 2) = (Half)Unsafe.Add(ref rsi, i + 2);
                Unsafe.Add(ref rdi, i + 3) = (Half)Unsafe.Add(ref rsi, i + 3);
            }
            for (; i < length; i++)
            {
                Unsafe.Add(ref rdi, i) = (Half)Unsafe.Add(ref rsi, i);
            }
        }
        #endregion
    }
}

| Method | Categories | Frames | InputValue | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|---|---|---|---|---|---|---|---|---|---|
| SimpleLoopStandard | Simple,Standard | 65536 | Sequential | 634.7 μs | 4.77 μs | 4.23 μs | 1.00 | 0.00 | 592 B |
| UnrolledLoopStandard | Unrolled,Standard | 65536 | Sequential | 619.9 μs | 2.95 μs | 2.62 μs | ? | ? | 699 B |
| SimpleLoopStandard | Simple,Standard | 65536 | Permuted | 674.4 μs | 1.47 μs | 1.22 μs | 1.00 | 0.00 | 592 B |
| UnrolledLoopStandard | Unrolled,Standard | 65536 | Permuted | 675.3 μs | 6.84 μs | 6.06 μs | ? | ? | 699 B |

Analysis

Converting `Half` to `float`

The current implementation uses many branches, which looks like a source of inefficiency.
Removing the branches and handling subnormal values with floating-point tricks improves performance on CPUs with fast FPUs.
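
To illustrate the kind of floating-point trick meant here, the following is a minimal sketch, not the code from the proposal itself. It converts a positive subnormal `Half` to `float` without a normalization loop. The names `SubnormalTrickDemo` and `SubnormalHalfToSingle` are hypothetical and exist only for this illustration. A subnormal `Half` with fraction bits m has the value m × 2^-24. Placing m into the fraction field of 2^-14 (bit pattern `0x3880_0000` in `Single`) yields 2^-14 × (1 + m/1024), and subtracting 2^-14 leaves exactly m × 2^-24, so the FPU performs the normalization.

```csharp
using System;

public static class SubnormalTrickDemo
{
    // Hypothetical helper: handles only inputs whose exponent field is 0
    // (positive zero and positive subnormal Half values).
    public static float SubnormalHalfToSingle(ushort halfBits)
    {
        const uint SmallestNormal = 0x3880_0000u;           // 2^-14 as Single bits
        uint fraction = (uint)(halfBits & 0x03FF) << 13;    // Align the 10-bit fraction with Single's 23-bit fraction
        // 2^-14 * (1 + m/1024): the fraction rides on top of an implicit leading 1
        float biased = BitConverter.UInt32BitsToSingle(SmallestNormal + fraction);
        // Subtracting 2^-14 removes the implicit 1 and leaves m * 2^-24 exactly
        return biased - BitConverter.UInt32BitsToSingle(SmallestNormal);
    }
}
```

Both operands of the subtraction lie in the same binade, so the result is exact for all 1024 fraction values.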

My proposal for new software fallback converting Half to float

EDIT: The previously proposed algorithm turned out to be slower with the new input data.
The code below converts `Half` to `float` about twice as fast as the current implementation.
I've tested it in a test project against all 65,536 possible `Half` values.

using System.Runtime.CompilerServices;

namespace BetterHalfToSingleConversion
{
    public static class HalfUtils
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining | MethodImplOptions.AggressiveOptimization)]
        public static float ConvertHalfToSingle2(Half value)
        {
            const uint ExponentLowerBound = 0x3880_0000u;   //The smallest positive normal number in Half, converted to Single
            const uint ExponentOffset = 0x3800_0000u;       //BitConverter.SingleToUInt32Bits(1.0f) - ((uint)BitConverter.HalfToUInt16Bits((Half)1.0f) << 13)
            const uint FloatSignMask = 0x8000_0000u;        //Mask for sign bit in Single
            var h = BitConverter.HalfToInt16Bits(value);    //Extract the internal representation of value
            var v = (uint)(int)h;   //Copy sign bit to upper bits
            var e = v & 0x7c00u;    //Extract exponent bits of value
            var c = e == 0u;        //true when value is subnormal
            var hc = (uint)-Unsafe.As<bool, byte>(ref c);   //~0u when c is true, 0 otherwise
            var b = e == 0x7c00u;   //true when value is either Infinity or NaN
            var hb = (uint)-Unsafe.As<bool, byte>(ref b);   //~0u when b is true, 0 otherwise
            var n = hc & ExponentLowerBound;    //n is 0x3880_0000u if c is true, 0 otherwise
            var j = ExponentOffset | n;         //j is now 0x3880_0000u if value is subnormal, 0x3800_0000u otherwise
            v <<= 13;                           //Match the position of the boundary of exponent bits and fraction bits with IEEE 754 Binary32(Single)
            j += j & hb;                        //Double the j if value is either Infinity or NaN
            var s = v & FloatSignMask;          //Extract sign bit of value
            v &= 0x0FFF_E000;                   //Extract exponent bits and fraction bits of value
            v += j;                             //Adjust exponent to match the range of exponent
            var k = BitConverter.SingleToUInt32Bits(BitConverter.UInt32BitsToSingle(v) - BitConverter.UInt32BitsToSingle(n));   //If value is subnormal, remove unnecessary 1 on top of fraction bits.
            return BitConverter.UInt32BitsToSingle(k | s);  //Merge sign bit with rest
        }
    }
}

Test and benchmark code is available in this repository, along with several alternative approaches.
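
The branch removal in the code above relies on turning a comparison result into an all-ones or all-zero mask and then selecting constants with bitwise operations. A minimal sketch of that idiom follows. The names `BranchlessSelectDemo`, `ToMask`, and `SelectExponentOffset` are hypothetical, and the sketch assumes that a `bool` produced by a C# comparison is stored as the byte 0 or 1.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class BranchlessSelectDemo
{
    // Turns a bool into 0xFFFF_FFFF (true) or 0 (false) without branching.
    // Assumption: the bool is stored as the byte 1 or 0.
    public static uint ToMask(bool condition)
        => unchecked((uint)-(int)Unsafe.As<bool, byte>(ref condition));

    // Selects the exponent adjustment for a Half bit pattern:
    // 0x3880_0000 when the exponent field is 0 (zero or subnormal), 0x3800_0000 otherwise.
    public static uint SelectExponentOffset(ushort halfBits)
    {
        bool isSubnormalOrZero = (halfBits & 0x7C00) == 0;
        uint mask = ToMask(isSubnormalOrZero);
        return 0x3800_0000u | (mask & 0x3880_0000u);
    }
}
```

For example, `SelectExponentOffset(0x0001)` selects the subnormal constant and `SelectExponentOffset(0x3C00)` (the bits of 1.0) selects the normal one, with no conditional jump in the source.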

The result is:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19045
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100
  [Host]     : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
  DefaultJob : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT

| Method | Categories | Frames | InputValue | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|---|---|---|---|---|---|---|---|---|---|
| SimpleLoopStandard | Simple,Standard | 65536 | Permuted | 389.1 μs | 3.36 μs | 2.98 μs | 1.00 | 0.00 | 298 B |
| SimpleLoopNew2 | Simple,New2 | 65536 | Permuted | 169.8 μs | 1.10 μs | 1.03 μs | ? | ? | 223 B |
| UnrolledLoopStandard | Unrolled,Standard | 65536 | Permuted | 388.2 μs | 2.54 μs | 2.37 μs | 1.00 | 0.00 | 397 B |
| UnrolledLoopNew2 | Unrolled,New2 | 65536 | Permuted | 154.5 μs | 3.05 μs | 2.85 μs | ? | ? | 745 B |

Converting `float` to `Half`

The current code has many branches, which is a likely source of inefficiency.
Again, removing the branches and handling subnormal values with floating-point tricks improves performance on CPUs with fast FPUs.
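
As an illustration of one such trick, the sketch below rounds a `float` to `Half` precision by letting the FPU do the rounding. It is a simplified demonstration, not the proposal itself: the names `RoundingTrickDemo` and `RoundToHalfPrecision` are hypothetical, and the helper assumes a positive finite input in the normal `Half` range (2^-14 to 65504), the default round-to-nearest-even mode, and arithmetic carried out in `Single` precision. For an input with exponent e, adding 2^(e+13) moves the sum into a binade whose unit in the last place is 2^(e-10), which is exactly the spacing of `Half` values at exponent e, so the addition discards the 13 surplus fraction bits with correct rounding.

```csharp
using System;

public static class RoundingTrickDemo
{
    // Hypothetical helper: valid only for positive finite values in [2^-14, 65504].
    public static float RoundToHalfPrecision(float value)
    {
        uint bits = BitConverter.SingleToUInt32Bits(value);
        uint exponent = bits & 0x7F80_0000u;    // Keep only the exponent field: 2^e
        // Raise the exponent by 13: 2^(e+13)
        float magic = BitConverter.UInt32BitsToSingle(exponent + 0x0680_0000u);
        float sum = value + magic;              // The FPU rounds the low 13 fraction bits away here
        return sum - magic;                     // Exact subtraction restores the original scale
    }
}
```

The result is still a `float`; a full conversion additionally has to rebias the exponent, clamp small and large inputs, and handle the sign and NaN, as the proposal below does.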

My proposal for new software fallback converting float to Half

The code below converts `float` to `Half` about twice as fast as the current implementation.
I've tested it in a test project against all 4,294,967,296 possible `float` values.

using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

namespace BetterHalfToSingleConversion
{
    public static class HalfUtils
    {
        //Among several approaches, I selected the fastest one (excluding vectorized ones).
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static Half ConvertSingleToHalf4(float value)
        {
            var v0 = Vector128.CreateScalarUnsafe(0x3880_0000u); //Minimum exponent for rounding
            var v1 = Vector128.CreateScalarUnsafe(0x3800_0000u); //Exponent displacement #1
            var v2 = Vector128.CreateScalarUnsafe(0x8000_0000u); //Sign bit
            var v3 = Vector128.CreateScalarUnsafe(0x7f80_0000u); //Exponent mask
            var v4 = Vector128.CreateScalarUnsafe(0x0680_0000u); //Exponent displacement #2
            var v5 = Vector128.CreateScalarUnsafe(65520.0f);     //Maximum value that is not Infinity in Half
            var v = BitConverter.SingleToUInt32Bits(value);
            var vval = Vector128.CreateScalarUnsafe(value);
            vval = (vval.AsUInt32() & ~v2).AsSingle();  //Clear sign bit
            var s = v & 0x8000_0000u;       //Extract sign bit
            vval = Vector128.Min(v5, vval); //Rectify values that are Infinity in Half
            var w = Vector128.Equals(vval, vval).AsUInt32();   //Detecting NaN(a != a if a is NaN)
            var y = Vector128.Max(v0, vval.AsUInt32()); //Rectify lower exponent
            y &= v3;        //Extract exponent
            y += v4;        //Add exponent by 13
            var z = y - v1; //Subtract exponent from y by 112
            z &= w;         //Zero whole z if value is NaN
            vval += y.AsSingle();                       //Round Single into Half's precision(NaN also gets modified here, just setting the MSB of fraction)
            vval = (vval.AsUInt32() - v1).AsSingle();   //Subtract exponent by 112
            vval -= z.AsSingle();                       //Clear Extra leading 1 set in rounding
            v = vval.AsUInt32().GetElement(0) >> 13;    //Now internal representation is the absolute value represented in Half, shifted 13 bits left, with some exceptions like NaN having strange exponents
            s >>>= 16;                              //Match the position of sign bit
            var hc = ~w.GetElement(0) & 0x7C00u;    //Only exponent bits will be modified if NaN
            v &= 0x7fffu;       //Clear the upper unnecessary bits
            var gc = hc | s;    //Merge sign bit with possible NaN exponent
            v &= ~hc;           //Clear exponents if value is NaN
            v |= gc;            //Merge sign bit and possible NaN exponent
            return BitConverter.UInt16BitsToHalf((ushort)v);    //The final result
        }
    }
}

Test and benchmark code is available in this repository, along with several alternative approaches.
The benchmark result is as follows:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19045
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.200-preview.22628.1
  [Host]     : .NET 7.0.2 (7.0.222.60605), X64 RyuJIT
  DefaultJob : .NET 7.0.2 (7.0.222.60605), X64 RyuJIT

| Method | Categories | Frames | InputValue | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|---|---|---|---|---|---|---|---|---|---|
| SimpleLoopStandard | Simple,Standard | 65536 | Permuted | 686.6 μs | 5.04 μs | 4.47 μs | 1.00 | 0.00 | 592 B |
| SimpleLoopNew4A | Simple,New4,AggressiveInlining | 65536 | Permuted | 327.3 μs | 1.78 μs | 1.58 μs | ? | ? | 370 B |
| SimpleLoopNew4U | Simple,New4,InliningUnspecified | 65536 | Permuted | 357.4 μs | 2.60 μs | 2.43 μs | ? | ? | 275 B |
| SimpleLoopNew4N | Simple,New4,NoInlining | 65536 | Permuted | 359.1 μs | 2.61 μs | 2.45 μs | ? | ? | 275 B |
| UnrolledLoopStandard | Unrolled,Standard | 65536 | Permuted | 676.7 μs | 5.00 μs | 4.68 μs | ? | ? | 699 B |
| UnrolledLoopNew4A | Unrolled,New4,AggressiveInlining | 65536 | Permuted | 301.1 μs | 2.68 μs | 2.38 μs | ? | ? | 1,088 B |
| UnrolledLoopNew4U | Unrolled,New4,InliningUnspecified | 65536 | Permuted | 354.0 μs | 2.93 μs | 2.60 μs | ? | ? | 382 B |
| UnrolledLoopNew4N | Unrolled,New4,NoInlining | 65536 | Permuted | 355.0 μs | 3.19 μs | 2.83 μs | ? | ? | 382 B |

dotnet-issue-labeler · 2022-05-23T08:10:01Z

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost · 2022-05-23T14:34:15Z

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

Conversions between `Half` and `float` are currently implemented only in software, which causes performance issues.
It would be ideal if issue #62416 were resolved, but a better software fallback is still needed for environments such as Sandy Bridge, which does not support hardware conversion.

Configuration

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1706 (21H2)
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.300-preview.22204.3
  [Host]     : .NET 6.0.5 (6.0.522.21309), X64 RyuJIT
  DefaultJob : .NET 6.0.5 (6.0.522.21309), X64 RyuJIT

Regression?

No

Data

I benchmarked the code below.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading.Channels;
using System.Threading.Tasks;

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

namespace HalfConversionBenchmarks
{
    [SimpleJob(runtimeMoniker: RuntimeMoniker.HostProcess)]
    [DisassemblyDiagnoser(maxDepth: int.MaxValue)]
    public class HalfToSingleConversionBenchmarks
    {
        [Params(65535)]
        public int Frames { get; set; }

        private float[] bufferDst;
        private Half[] bufferA;

        [GlobalSetup]
        public void Setup()
        {
            var samples = Frames;
            bufferDst = new float[samples];
            bufferA = new Half[samples];
            bufferA.AsSpan().Fill((Half)1.5f);
        }

        [Benchmark]
        public void SimpleLoop()
        {
            var bA = bufferA.AsSpan();
            var bD = bufferDst.AsSpan();
            ref var rsi = ref MemoryMarshal.GetReference(bA);
            ref var rdi = ref MemoryMarshal.GetReference(bD);
            nint i = 0, length = Math.Min(bA.Length, bD.Length);
            for (; i < length; i++)
            {
                Unsafe.Add(ref rdi, i) = (float)Unsafe.Add(ref rsi, i);
            }
        }

        [Benchmark]
        public void UnrolledLoop()
        {
            var bA = bufferA.AsSpan();
            var bD = bufferDst.AsSpan();
            ref var rsi = ref MemoryMarshal.GetReference(bA);
            ref var rdi = ref MemoryMarshal.GetReference(bD);
            nint i = 0, length = Math.Min(bA.Length, bD.Length);
            var olen = length - 3;
            for (; i < olen; i += 4)
            {
                Unsafe.Add(ref rdi, i + 0) = (float)Unsafe.Add(ref rsi, i + 0);
                Unsafe.Add(ref rdi, i + 1) = (float)Unsafe.Add(ref rsi, i + 1);
                Unsafe.Add(ref rdi, i + 2) = (float)Unsafe.Add(ref rsi, i + 2);
                Unsafe.Add(ref rdi, i + 3) = (float)Unsafe.Add(ref rsi, i + 3);
            }
            for (; i < length; i++)
            {
                Unsafe.Add(ref rdi, i) = (float)Unsafe.Add(ref rsi, i);
            }
        }
    }
}

| Method | Frames | Mean | Error | StdDev | Code Size |
|---|---|---|---|---|---|
| SimpleLoop | 65535 | 223.9 μs | 1.58 μs | 1.40 μs | 314 B |
| UnrolledLoop | 65535 | 205.6 μs | 0.89 μs | 0.74 μs | 432 B |

Analysis

The current implementation uses many branches, which looks like a source of inefficiency.
Removing the branches and handling subnormal values with floating-point tricks improves performance on CPUs with fast FPUs.

My proposal for new software fallback

I wrote this code, which converts `Half` to `float` by going through `double` first.
I've tested it in a test project against all 65,536 possible `Half` values.

using System.Runtime.CompilerServices;

namespace BetterHalfToSingleConversion
{
    public static class HalfUtils
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining | MethodImplOptions.AggressiveOptimization)]
        public static float ConvertHalfToSingle(Half value)
        {
            var h = BitConverter.HalfToInt16Bits(value);
            var v = (uint)(int)h;
            var b = (v & 0x7c00u) == 0x7c00u;
            var hb = (ulong)-(long)Unsafe.As<bool, byte>(ref b);
            v <<= 13;
            v &= 0x8FFF_E000;
            var j = 0x0700000000000000ul + (hb & 0x3F00000000000000ul);
            var d = BitConverter.DoubleToUInt64Bits((double)BitConverter.UInt32BitsToSingle(v));
            d += j;
            return (float)BitConverter.UInt64BitsToDouble(d);
        }
    }
}

Test code:

using System;

using BetterHalfToSingleConversion;

using NUnit.Framework;

namespace BetterHalfConversionTests
{
    [TestFixture]
    public class BetterHalfToSingleConversionTests
    {

        [Test]
        public void ConvertHalfToSingleConvertsAllValuesCorrectly()
        {
            for (uint i = 0; i <= ushort.MaxValue; i++)
            {
                var h = BitConverter.UInt16BitsToHalf((ushort)i);
                var exp = (float)h;
                var act = HalfUtils.ConvertHalfToSingle(h);
                Assert.AreEqual(exp, act, $"Evaluating {i}th value:");
            }
        }
    }
}

I benchmarked it with:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading.Channels;
using System.Threading.Tasks;

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

using BetterHalfToSingleConversion;

namespace HalfConversionBenchmarks
{
    [SimpleJob(runtimeMoniker: RuntimeMoniker.HostProcess)]
    [DisassemblyDiagnoser(maxDepth: int.MaxValue)]
    public class HalfToSingleConversionBenchmarks
    {
        [Params(65535)]
        public int Frames { get; set; }

        private float[] bufferDst;
        private Half[] bufferA;

        [GlobalSetup]
        public void Setup()
        {
            var samples = Frames;
            bufferDst = new float[samples];
            bufferA = new Half[samples];
            bufferA.AsSpan().Fill((Half)1.5f);
        }

        [Benchmark]
        public void SimpleLoopStandard()
        {
            var bA = bufferA.AsSpan();
            var bD = bufferDst.AsSpan();
            ref var rsi = ref MemoryMarshal.GetReference(bA);
            ref var rdi = ref MemoryMarshal.GetReference(bD);
            nint i = 0, length = Math.Min(bA.Length, bD.Length);
            for (; i < length; i++)
            {
                Unsafe.Add(ref rdi, i) = (float)Unsafe.Add(ref rsi, i);
            }
        }

        [Benchmark]
        public void UnrolledLoopStandard()
        {
            var bA = bufferA.AsSpan();
            var bD = bufferDst.AsSpan();
            ref var rsi = ref MemoryMarshal.GetReference(bA);
            ref var rdi = ref MemoryMarshal.GetReference(bD);
            nint i = 0, length = Math.Min(bA.Length, bD.Length);
            var olen = length - 3;
            for (; i < olen; i += 4)
            {
                Unsafe.Add(ref rdi, i + 0) = (float)Unsafe.Add(ref rsi, i + 0);
                Unsafe.Add(ref rdi, i + 1) = (float)Unsafe.Add(ref rsi, i + 1);
                Unsafe.Add(ref rdi, i + 2) = (float)Unsafe.Add(ref rsi, i + 2);
                Unsafe.Add(ref rdi, i + 3) = (float)Unsafe.Add(ref rsi, i + 3);
            }
            for (; i < length; i++)
            {
                Unsafe.Add(ref rdi, i) = (float)Unsafe.Add(ref rsi, i);
            }
        }
        [Benchmark]
        public void SimpleLoopNew()
        {
            var bA = bufferA.AsSpan();
            var bD = bufferDst.AsSpan();
            ref var rsi = ref MemoryMarshal.GetReference(bA);
            ref var rdi = ref MemoryMarshal.GetReference(bD);
            nint i = 0, length = Math.Min(bA.Length, bD.Length);
            for (; i < length; i++)
            {
                Unsafe.Add(ref rdi, i) = HalfUtils.ConvertHalfToSingle(Unsafe.Add(ref rsi, i));
            }
        }

        [Benchmark]
        public void UnrolledLoopNew()
        {
            var bA = bufferA.AsSpan();
            var bD = bufferDst.AsSpan();
            ref var rsi = ref MemoryMarshal.GetReference(bA);
            ref var rdi = ref MemoryMarshal.GetReference(bD);
            nint i = 0, length = Math.Min(bA.Length, bD.Length);
            var olen = length - 3;
            for (; i < olen; i += 4)
            {
                Unsafe.Add(ref rdi, i + 0) = HalfUtils.ConvertHalfToSingle(Unsafe.Add(ref rsi, i + 0));
                Unsafe.Add(ref rdi, i + 1) = HalfUtils.ConvertHalfToSingle(Unsafe.Add(ref rsi, i + 1));
                Unsafe.Add(ref rdi, i + 2) = HalfUtils.ConvertHalfToSingle(Unsafe.Add(ref rsi, i + 2));
                Unsafe.Add(ref rdi, i + 3) = HalfUtils.ConvertHalfToSingle(Unsafe.Add(ref rsi, i + 3));
            }
            for (; i < length; i++)
            {
                Unsafe.Add(ref rdi, i) = HalfUtils.ConvertHalfToSingle(Unsafe.Add(ref rsi, i));
            }
        }

    }
}

The result is:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1706 (21H2)
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.300-preview.22204.3
  [Host]     : .NET 6.0.5 (6.0.522.21309), X64 RyuJIT
  DefaultJob : .NET 6.0.5 (6.0.522.21309), X64 RyuJIT

| Method | Frames | Mean | Error | StdDev | Code Size |
|---|---|---|---|---|---|
| SimpleLoopStandard | 65535 | 223.1 μs | 3.10 μs | 2.75 μs | 314 B |
| UnrolledLoopStandard | 65535 | 220.5 μs | 1.13 μs | 1.06 μs | 432 B |
| SimpleLoopNew | 65535 | 156.4 μs | 0.81 μs | 0.76 μs | 211 B |
| UnrolledLoopNew | 65535 | 141.3 μs | 0.99 μs | 0.93 μs | 686 B |

I also created a new repository with some alternative approaches.

Author:	MineCake147E
Assignees:	-
Labels:	`area-System.Runtime`, `tenet-performance`, `untriaged`
Milestone:	-

MichalPetryka · 2023-07-07T15:02:06Z

@adamsitnik I think this should maybe be left open until the half conversions get F16C/AVX512-FP16 acceleration.

MineCake147E added the tenet-performance Performance related issue label May 23, 2022

ghost added the untriaged New issue has not been triaged by the area owner label May 23, 2022

jeffschwMSFT added the area-System.Runtime label May 23, 2022

tannergooding removed the untriaged New issue has not been triaged by the area owner label Jul 15, 2022

tannergooding added this to the Future milestone Jul 15, 2022

MineCake147E changed the title ~~Suboptimal Implementation of Half to Single conversion~~ Conversion between Half and Single is suboptimally implemented Nov 11, 2022

MineCake147E changed the title ~~Conversion between Half and Single is suboptimally implemented~~ Optimize conversions between Half and Single Nov 15, 2022

MineCake147E added a commit to MineCake147E/runtime that referenced this issue Feb 4, 2023

Optimized conversions between Half and Single.

0190f96

Closes dotnet#69667.

MineCake147E mentioned this issue Feb 4, 2023

Optimized conversions between Half and Single. #81632

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Feb 4, 2023

ghost added in-pr There is an active PR which will close this issue when it is merged and removed in-pr There is an active PR which will close this issue when it is merged labels Mar 29, 2023

adamsitnik closed this as completed in #81632 Jul 7, 2023

adamsitnik pushed a commit that referenced this issue Jul 7, 2023

Optimized conversions between Half and Single. (#81632)

5a03596

Closes #69667.

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jul 7, 2023

adamsitnik assigned MineCake147E Jul 7, 2023

adamsitnik modified the milestones: Future, 8.0.0 Jul 7, 2023

ghost locked as resolved and limited conversation to collaborators Aug 6, 2023
