Optimize conversions between Half and Single #69667

Closed
MineCake147E opened this issue May 23, 2022 · 3 comments · Fixed by #81632
MineCake147E commented May 23, 2022

Description

Currently, conversion between Half and float is implemented only in software, which leads to performance issues.
Resolving Issue #62416 would be ideal, but a better software fallback is still needed for environments such as Sandy Bridge, which does not support hardware conversion.

Configuration

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19045
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100
  [Host]     : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
  DefaultJob : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT

Regression?

No

Data

I benchmarked the code below.
EDIT: Removed data biases.
EDIT2: Added random permutation.

Benchmark code for Half to Single conversion
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Channels;
using System.Threading.Tasks;

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

namespace HalfConversionBenchmarks
{
    public enum InputValueType
    {
        Sequential,
        Permuted,
        RandomUniform,
        RandomSubnormal,
        RandomNormal,
        RandomInfNaN
    }

    [CategoriesColumn]
    [GroupBenchmarksBy(BenchmarkLogicalGroupRule.ByParams, BenchmarkLogicalGroupRule.ByCategory)]
    [SimpleJob(runtimeMoniker: RuntimeMoniker.HostProcess)]
    [DisassemblyDiagnoser(maxDepth: int.MaxValue)]
    [AnyCategoriesFilter(CategoryStandard)]
    public class HalfToSingleConversionBenchmarks
    {
        private const string CategorySimple = "Simple";
        private const string CategoryStandard = "Standard";
        private const string CategoryUnrolled = "Unrolled";

        private Half[] bufferA;
        private float[] bufferDst;

        [Params(65536)]
        public int Frames { get; set; }
        [Params(InputValueType.Sequential, InputValueType.Permuted)]
        public InputValueType InputValue { get; set; }
        [GlobalSetup]
        public void Setup()
        {
            var samples = Frames;
            bufferDst = new float[samples];
            var bA = bufferA = new Half[samples];
            var spanA = bA.AsSpan();
            switch (InputValue)
            {
                case InputValueType.Permuted:
                    FillSequential(spanA);
                    ref var x9 = ref MemoryMarshal.GetReference(spanA);
                    var length = spanA.Length;
                    var olen = length - 2;
                    for (var i = 0; i < olen; i++)
                    {
                        //Using RandomNumberGenerator in order to prevent predictability
                        var x = RandomNumberGenerator.GetInt32(i, length);
                        (Unsafe.Add(ref x9, x), Unsafe.Add(ref x9, i)) = (Unsafe.Add(ref x9, i), Unsafe.Add(ref x9, x));
                    }
                    break;
                case InputValueType.RandomUniform:
                    RandomNumberGenerator.Fill(MemoryMarshal.AsBytes(spanA));
                    break;
                case InputValueType.RandomSubnormal:
                    for (var i = 0; i < spanA.Length; i++)
                    {
                        var r = (ushort)RandomNumberGenerator.GetInt32(0x7fe);
                        spanA[i] = BitConverter.UInt16BitsToHalf(ushort.RotateRight(r, 1));
                    }
                    break;
                case InputValueType.RandomNormal:
                    for (var i = 0; i < spanA.Length; i++)
                    {
                        var r = (ushort)RandomNumberGenerator.GetInt32(0xF000);
                        spanA[i] = BitConverter.UInt16BitsToHalf((ushort)(ushort.RotateRight(r, 1) + 0x0400u));
                    }
                    break;
                case InputValueType.RandomInfNaN:
                    RandomNumberGenerator.Fill(MemoryMarshal.AsBytes(spanA));
                    for (var i = 0; i < spanA.Length; i++)
                    {
                        var r = BitConverter.HalfToUInt16Bits(spanA[i]);
                        spanA[i] = BitConverter.UInt16BitsToHalf((ushort)(r | 0x7c00u));
                    }
                    break;
                default:
                    FillSequential(spanA);
                    break;
            }
            static void FillSequential(Span<Half> spanA)
            {
                for (var i = 0; i < spanA.Length; i++)
                {
                    spanA[i] = BitConverter.UInt16BitsToHalf((ushort)i);
                }
            }
        }

        [BenchmarkCategory(CategorySimple, CategoryStandard)]
        [Benchmark(Baseline = true)]
        public void SimpleLoopStandard()
        {
            var bA = bufferA.AsSpan();
            var bD = bufferDst.AsSpan();
            ref var rsi = ref MemoryMarshal.GetReference(bA);
            ref var rdi = ref MemoryMarshal.GetReference(bD);
            nint i = 0, length = Math.Min(bA.Length, bD.Length);
            for (; i < length; i++)
            {
                Unsafe.Add(ref rdi, i) = (float)Unsafe.Add(ref rsi, i);
            }
        }

        #region Unrolled

        [BenchmarkCategory(CategoryUnrolled, CategoryStandard)]
        [Benchmark(Baseline = true)]
        public void UnrolledLoopStandard()
        {
            var bA = bufferA.AsSpan();
            var bD = bufferDst.AsSpan();
            ref var rsi = ref MemoryMarshal.GetReference(bA);
            ref var rdi = ref MemoryMarshal.GetReference(bD);
            nint i = 0, length = Math.Min(bA.Length, bD.Length);
            var olen = length - 3;
            for (; i < olen; i += 4)
            {
                Unsafe.Add(ref rdi, i + 0) = (float)Unsafe.Add(ref rsi, i + 0);
                Unsafe.Add(ref rdi, i + 1) = (float)Unsafe.Add(ref rsi, i + 1);
                Unsafe.Add(ref rdi, i + 2) = (float)Unsafe.Add(ref rsi, i + 2);
                Unsafe.Add(ref rdi, i + 3) = (float)Unsafe.Add(ref rsi, i + 3);
            }
            for (; i < length; i++)
            {
                Unsafe.Add(ref rdi, i) = (float)Unsafe.Add(ref rsi, i);
            }
        }
        #endregion
    }
}
| Method | Categories | Frames | InputValue | Mean | Error | StdDev | Ratio | Code Size |
|---|---|---|---|---|---|---|---|---|
| SimpleLoopStandard | Simple,Standard | 65536 | Sequential | 180.1 μs | 1.51 μs | 1.41 μs | 1.00 | 298 B |
| UnrolledLoopStandard | Unrolled,Standard | 65536 | Sequential | 196.4 μs | 1.40 μs | 1.24 μs | 1.00 | 397 B |
| SimpleLoopStandard | Simple,Standard | 65536 | Permuted | 372.2 μs | 2.63 μs | 2.33 μs | 1.00 | 298 B |
| UnrolledLoopStandard | Unrolled,Standard | 65536 | Permuted | 385.0 μs | 1.05 μs | 0.87 μs | 1.00 | 397 B |

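Sequential input converts roughly twice as fast as permuted input (about 180 μs versus 372 μs), which suggests that the current implementation benefits from something input-dependent, such as branch prediction.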

Benchmark code for Single to Half conversion
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics.X86;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

namespace HalfConversionBenchmarks
{
    [CategoriesColumn]
    [SimpleJob(runtimeMoniker: RuntimeMoniker.HostProcess)]
    [GroupBenchmarksBy(BenchmarkLogicalGroupRule.ByParams, BenchmarkLogicalGroupRule.ByCategory)]
    [DisassemblyDiagnoser(maxDepth: int.MaxValue)]
    [AnyCategoriesFilter(CategoryStandard)]
    public class SingleToHalfConversionBenchmarks
    {
        [Params(65536)]
        public int Frames { get; set; }

        [ParamsAllValues]
        public InputValueType InputValue { get; set; }

        private const string CategorySimple = "Simple";
        private const string CategoryUnrolled = "Unrolled";
        private const string CategoryVectorized = "Vectorized";
        private const string CategoryStandard = "Standard";

        private float[] bufferSrc;
        private Half[] bufferDst;

        [GlobalSetup]
        public void Setup()
        {
            var samples = Frames;
            var vS = bufferSrc = new float[samples];
            bufferDst = new Half[samples];
            var vspan = vS.AsSpan();
            switch (InputValue)
            {
                case InputValueType.Permuted:
                    FillSequential(vspan);
                    //Random Permutation
                    ref var x9 = ref MemoryMarshal.GetReference(vspan);
                    var length = vspan.Length;
                    var olen = length - 2;
                    for (var i = 0; i < olen; i++)
                    {
                        //Using RandomNumberGenerator in order to prevent predictability
                        var x = RandomNumberGenerator.GetInt32(i, length);
                        (Unsafe.Add(ref x9, x), Unsafe.Add(ref x9, i)) = (Unsafe.Add(ref x9, i), Unsafe.Add(ref x9, x));
                    }
                    break;
                case InputValueType.RandomUniform:
                    RandomNumberGenerator.Fill(MemoryMarshal.AsBytes(vspan));
                    break;
                case InputValueType.RandomSubnormal:
                    for (var i = 0; i < vspan.Length; i++)
                    {
                        var r = (uint)RandomNumberGenerator.GetInt32(0x70FF_BFFE);
                        vspan[i] = BitConverter.UInt32BitsToSingle(uint.RotateRight(r, 1));
                    }
                    break;
                case InputValueType.RandomNormal:
                    for (var i = 0; i < vspan.Length; i++)
                    {
                        var r = (uint)RandomNumberGenerator.GetInt32(0x1E00_1FFE);
                        vspan[i] = BitConverter.UInt32BitsToSingle(uint.RotateRight(r, 1) + 947904512u);
                    }
                    break;
                case InputValueType.RandomInfNaN:
                    RandomNumberGenerator.Fill(MemoryMarshal.AsBytes(vspan));
                    for (var i = 0; i < vspan.Length; i++)
                    {
                        var r = BitConverter.SingleToUInt32Bits(vspan[i]);
                        vspan[i] = BitConverter.UInt32BitsToSingle(r | 0x7f80_0000u);
                    }
                    break;
                default:
                    FillSequential(vspan);
                    break;
            }

            static void FillSequential(Span<float> vspan)
            {
                for (var i = 0; i < vspan.Length; i++)
                {
                    vspan[i] = (float)BitConverter.UInt16BitsToHalf((ushort)i);
                }
            }
        }
        [BenchmarkCategory(CategorySimple, CategoryStandard)]
        [Benchmark(Baseline = true)]
        public void SimpleLoopStandard()
        {
            var bA = bufferSrc.AsSpan();
            var bD = bufferDst.AsSpan();
            ref var rsi = ref MemoryMarshal.GetReference(bA);
            ref var rdi = ref MemoryMarshal.GetReference(bD);
            nint i = 0, length = Math.Min(bA.Length, bD.Length);
            for (; i < length; i++)
            {
                Unsafe.Add(ref rdi, i) = (Half)Unsafe.Add(ref rsi, i);
            }
        }
        #region Unrolled
        [BenchmarkCategory(CategoryUnrolled, CategoryStandard)]
        [Benchmark]
        public void UnrolledLoopStandard()
        {
            var bA = bufferSrc.AsSpan();
            var bD = bufferDst.AsSpan();
            ref var rsi = ref MemoryMarshal.GetReference(bA);
            ref var rdi = ref MemoryMarshal.GetReference(bD);
            nint i = 0, length = Math.Min(bA.Length, bD.Length);
            var olen = length - 3;
            for (; i < olen; i += 4)
            {
                Unsafe.Add(ref rdi, i + 0) = (Half)Unsafe.Add(ref rsi, i + 0);
                Unsafe.Add(ref rdi, i + 1) = (Half)Unsafe.Add(ref rsi, i + 1);
                Unsafe.Add(ref rdi, i + 2) = (Half)Unsafe.Add(ref rsi, i + 2);
                Unsafe.Add(ref rdi, i + 3) = (Half)Unsafe.Add(ref rsi, i + 3);
            }
            for (; i < length; i++)
            {
                Unsafe.Add(ref rdi, i) = (Half)Unsafe.Add(ref rsi, i);
            }
        }
        #endregion
    }
}
| Method | Categories | Frames | InputValue | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|---|---|---|---|---|---|---|---|---|---|
| SimpleLoopStandard | Simple,Standard | 65536 | Sequential | 634.7 μs | 4.77 μs | 4.23 μs | 1.00 | 0.00 | 592 B |
| UnrolledLoopStandard | Unrolled,Standard | 65536 | Sequential | 619.9 μs | 2.95 μs | 2.62 μs | ? | ? | 699 B |
| SimpleLoopStandard | Simple,Standard | 65536 | Permuted | 674.4 μs | 1.47 μs | 1.22 μs | 1.00 | 0.00 | 592 B |
| UnrolledLoopStandard | Unrolled,Standard | 65536 | Permuted | 675.3 μs | 6.84 μs | 6.06 μs | ? | ? | 699 B |

Analysis

Converting Half to float

The current implementation relies on many branches, which is a likely source of inefficiency.
Removing the branches and handling subnormal values with floating-point arithmetic instead makes the conversion faster on CPUs with fast FPUs.
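
The floating-point trick for subnormal values can be illustrated in isolation. The sketch below is not part of the proposal: `SubnormalTrickSketch` and `SubnormalHalfBitsToSingle` are hypothetical names, and the helper assumes a non-negative Half whose exponent field is zero. It places the 10 fraction bits under the exponent of 2^-14 and then subtracts 2^-14, so the FPU performs the normalization that would otherwise need a loop or a leading-zero count.

```csharp
using System;

public static class SubnormalTrickSketch
{
    // Hypothetical helper: converts the bits of a zero or subnormal,
    // non-negative Half (exponent field == 0) to float without branching.
    public static float SubnormalHalfBitsToSingle(ushort halfBits)
    {
        const uint MinNormalHalf = 0x3880_0000u; // bit pattern of 2^-14 as a float

        // Align Half's 10 fraction bits with the top of float's 23 fraction bits.
        uint fraction = (uint)(halfBits & 0x03FF) << 13;

        // Read as a float, this is 2^-14 * (1 + f / 1024) = 2^-14 + f * 2^-24.
        float withImplicitOne = BitConverter.UInt32BitsToSingle(MinNormalHalf | fraction);

        // Subtracting 2^-14 leaves f * 2^-24, the value of the subnormal Half.
        return withImplicitOne - BitConverter.UInt32BitsToSingle(MinNormalHalf);
    }
}
```

The subtraction is exact because both operands share the same exponent, so no rounding error is introduced by this step.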

My proposal for new software fallback converting Half to float

EDIT: The previously proposed algorithm turned out to be slower with the new input data.
The code below converts Half to float about twice as fast as the current implementation.
I've tested this code in a test project against all 65,536 possible Half values.

using System.Runtime.CompilerServices;

namespace BetterHalfToSingleConversion
{
    public static class HalfUtils
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining | MethodImplOptions.AggressiveOptimization)]
        public static float ConvertHalfToSingle2(Half value)
        {
            const uint ExponentLowerBound = 0x3880_0000u;   //The smallest positive normal number in Half, converted to Single
            const uint ExponentOffset = 0x3800_0000u;       //BitConverter.SingleToUInt32Bits(1.0f) - ((uint)BitConverter.HalfToUInt16Bits((Half)1.0f) << 13)
            const uint FloatSignMask = 0x8000_0000u;        //Mask for sign bit in Single
            var h = BitConverter.HalfToInt16Bits(value);    //Extract the internal representation of value
            var v = (uint)(int)h;   //Copy sign bit to upper bits
            var e = v & 0x7c00u;    //Extract exponent bits of value
            var c = e == 0u;        //true when value is zero or subnormal
            var hc = (uint)-Unsafe.As<bool, byte>(ref c);   //~0u when c is true, 0 otherwise
            var b = e == 0x7c00u;   //true when value is either Infinity or NaN
            var hb = (uint)-Unsafe.As<bool, byte>(ref b);   //~0u when b is true, 0 otherwise
            var n = hc & ExponentLowerBound;    //n is 0x3880_0000u if c is true, 0 otherwise
            var j = ExponentOffset | n;         //j is now 0x3880_0000u if value is subnormal, 0x3800_0000u otherwise
            v <<= 13;                           //Match the position of the boundary of exponent bits and fraction bits with IEEE 754 Binary32(Single)
            j += j & hb;                        //Double the j if value is either Infinity or NaN
            var s = v & FloatSignMask;          //Extract sign bit of value
            v &= 0x0FFF_E000;                   //Extract exponent bits and fraction bits of value
            v += j;                             //Adjust exponent to match the range of exponent
            var k = BitConverter.SingleToUInt32Bits(BitConverter.UInt32BitsToSingle(v) - BitConverter.UInt32BitsToSingle(n));   //If value is subnormal, remove unnecessary 1 on top of fraction bits.
            return BitConverter.UInt32BitsToSingle(k | s);  //Merge sign bit with rest
        }
    }
}
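
The `(uint)-Unsafe.As<bool, byte>(ref c)` expression used above turns a comparison result into an all-ones or all-zeros mask, which then replaces a conditional. A minimal sketch of that idiom follows; `BranchlessMaskSketch`, `ToMask`, and `Select` are hypothetical names introduced for illustration, and the sketch assumes the bool comes from a comparison, so its underlying byte is exactly 0 or 1.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class BranchlessMaskSketch
{
    // Returns 0xFFFF_FFFF when condition is true and 0 otherwise.
    public static uint ToMask(bool condition)
    {
        // Reinterpret the bool as a byte (1 or 0) and negate it.
        // -1 becomes all ones after the unchecked cast to uint.
        return unchecked((uint)-Unsafe.As<bool, byte>(ref condition));
    }

    // Branchless equivalent of: condition ? whenTrue : whenFalse
    public static uint Select(bool condition, uint whenTrue, uint whenFalse)
    {
        uint mask = ToMask(condition);
        return (whenTrue & mask) | (whenFalse & ~mask);
    }
}
```

For example, `Select(e == 0u, 0x3880_0000u, 0u)` yields the same value as `hc & ExponentLowerBound` in the conversion code, without a jump that the branch predictor could miss on unpredictable input.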

Test and benchmark code is available in this repository, along with several alternative approaches.

The result is:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19045
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100
  [Host]     : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
  DefaultJob : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT

| Method | Categories | Frames | InputValue | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|---|---|---|---|---|---|---|---|---|---|
| SimpleLoopStandard | Simple,Standard | 65536 | Permuted | 389.1 μs | 3.36 μs | 2.98 μs | 1.00 | 0.00 | 298 B |
| SimpleLoopNew2 | Simple,New2 | 65536 | Permuted | 169.8 μs | 1.10 μs | 1.03 μs | ? | ? | 223 B |
| UnrolledLoopStandard | Unrolled,Standard | 65536 | Permuted | 388.2 μs | 2.54 μs | 2.37 μs | 1.00 | 0.00 | 397 B |
| UnrolledLoopNew2 | Unrolled,New2 | 65536 | Permuted | 154.5 μs | 3.05 μs | 2.85 μs | ? | ? | 745 B |

Converting float to Half

The current implementation also relies on many branches, which is a likely source of inefficiency.
As with the Half to float conversion, removing the branches and handling subnormal values with floating-point arithmetic makes the conversion faster on CPUs with fast FPUs.
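
The rounding step of such an approach can be shown in isolation. The sketch below is a simplified illustration rather than the proposed implementation: `RoundingTrickSketch` and `RoundToHalfPrecision` are hypothetical names, and the helper assumes a positive input inside Half's normal range (2^-14 up to 65504). Adding a power of two whose exponent is 13 higher than the input's pushes the 13 surplus fraction bits out of the significand, so the FPU rounds to nearest even at Half's 10-bit precision; subtracting the same power of two restores the magnitude.

```csharp
using System;

public static class RoundingTrickSketch
{
    // Hypothetical helper: rounds a positive float within Half's normal range
    // to the nearest value representable with 10 fraction bits.
    public static float RoundToHalfPrecision(float x)
    {
        uint bits = BitConverter.SingleToUInt32Bits(x);
        uint exponent = bits & 0x7F80_0000u; // keep only the exponent field

        // 0x0680_0000 is 13 << 23, so magic = 2^(e + 13) where x = 1.m * 2^e.
        float magic = BitConverter.UInt32BitsToSingle(exponent + 0x0680_0000u);

        // The sum has an ulp of 2^(e - 10), which matches Half's precision.
        // The explicit casts keep intermediate results at float precision.
        float sum = (float)(x + magic);
        return (float)(sum - magic);
    }
}
```

Signs, subnormal results, overflow to Infinity, and NaN are deliberately left out of this sketch and need the additional handling that a full conversion provides.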

My proposal for new software fallback converting float to Half

The code below converts float to Half about twice as fast as the current implementation.
I've tested this code in a test project against all 4,294,967,296 possible float values.

using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

namespace BetterHalfToSingleConversion
{
    public static class HalfUtils
    {
        //Among several approaches, I selected the fastest one (excluding vectorized ones).
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static Half ConvertSingleToHalf4(float value)
        {
            var v0 = Vector128.CreateScalarUnsafe(0x3880_0000u); //Minimum exponent for rounding
            var v1 = Vector128.CreateScalarUnsafe(0x3800_0000u); //Exponent displacement #1
            var v2 = Vector128.CreateScalarUnsafe(0x8000_0000u); //Sign bit
            var v3 = Vector128.CreateScalarUnsafe(0x7f80_0000u); //Exponent mask
            var v4 = Vector128.CreateScalarUnsafe(0x0680_0000u); //Exponent displacement #2
            var v5 = Vector128.CreateScalarUnsafe(65520.0f);     //Smallest value that rounds to Infinity in Half (the largest finite Half is 65504)
            var v = BitConverter.SingleToUInt32Bits(value);
            var vval = Vector128.CreateScalarUnsafe(value);
            vval = (vval.AsUInt32() & ~v2).AsSingle();  //Clear sign bit
            var s = v & 0x8000_0000u;       //Extract sign bit
            vval = Vector128.Min(v5, vval); //Rectify values that are Infinity in Half
            var w = Vector128.Equals(vval, vval).AsUInt32();   //Detecting NaN(a != a if a is NaN)
            var y = Vector128.Max(v0, vval.AsUInt32()); //Rectify lower exponent
            y &= v3;        //Extract exponent
            y += v4;        //Add exponent by 13
            var z = y - v1; //Subtract exponent from y by 112
            z &= w;         //Zero whole z if value is NaN
            vval += y.AsSingle();                       //Round Single into Half's precision(NaN also gets modified here, just setting the MSB of fraction)
            vval = (vval.AsUInt32() - v1).AsSingle();   //Subtract exponent by 112
            vval -= z.AsSingle();                       //Clear Extra leading 1 set in rounding
            v = vval.AsUInt32().GetElement(0) >> 13;    //Now internal representation is the absolute value represented in Half, shifted 13 bits left, with some exceptions like NaN having strange exponents
            s >>>= 16;                              //Match the position of sign bit
            var hc = ~w.GetElement(0) & 0x7C00u;    //Only exponent bits will be modified if NaN
            v &= 0x7fffu;       //Clear the upper unnecessary bits
            var gc = hc | s;    //Merge sign bit with possible NaN exponent
            v &= ~hc;           //Clear exponents if value is NaN
            v |= gc;            //Merge sign bit and possible NaN exponent
            return BitConverter.UInt16BitsToHalf((ushort)v);    //The final result
        }
    }
}
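
An exhaustive check over float bit patterns, like the one described above, might be structured as in the following sketch. This is not the test code from the repository: `ExhaustiveCheckSketch` and `CountMismatches` are hypothetical names, the built-in explicit cast is assumed to be the reference, and results are compared bit by bit, which is stricter than value equality because it also distinguishes NaN payloads and signed zeros.

```csharp
using System;

public static class ExhaustiveCheckSketch
{
    // Counts the float bit patterns in [first, last] (first <= last) for which
    // the candidate conversion disagrees with the built-in explicit cast.
    public static long CountMismatches(Func<float, Half> candidate, uint first, uint last)
    {
        long mismatches = 0;
        uint bits = first;
        while (true)
        {
            float input = BitConverter.UInt32BitsToSingle(bits);
            ushort expected = BitConverter.HalfToUInt16Bits((Half)input);
            ushort actual = BitConverter.HalfToUInt16Bits(candidate(input));
            if (expected != actual)
            {
                mismatches++;
            }
            if (bits == last)
            {
                break; // checked before incrementing so last == uint.MaxValue cannot wrap
            }
            bits++;
        }
        return mismatches;
    }
}
```

Calling `CountMismatches(HalfUtils.ConvertSingleToHalf4, 0u, uint.MaxValue)` would cover every float value; smaller ranges are useful for quick checks while iterating on an implementation.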

Test and benchmark code is available in this repository, along with several alternative approaches.
The benchmark result is as follows:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19045
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.200-preview.22628.1
  [Host]     : .NET 7.0.2 (7.0.222.60605), X64 RyuJIT
  DefaultJob : .NET 7.0.2 (7.0.222.60605), X64 RyuJIT

| Method | Categories | Frames | InputValue | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|---|---|---|---|---|---|---|---|---|---|
| SimpleLoopStandard | Simple,Standard | 65536 | Permuted | 686.6 μs | 5.04 μs | 4.47 μs | 1.00 | 0.00 | 592 B |
| SimpleLoopNew4A | Simple,New4,AggressiveInlining | 65536 | Permuted | 327.3 μs | 1.78 μs | 1.58 μs | ? | ? | 370 B |
| SimpleLoopNew4U | Simple,New4,InliningUnspecified | 65536 | Permuted | 357.4 μs | 2.60 μs | 2.43 μs | ? | ? | 275 B |
| SimpleLoopNew4N | Simple,New4,NoInlining | 65536 | Permuted | 359.1 μs | 2.61 μs | 2.45 μs | ? | ? | 275 B |
| UnrolledLoopStandard | Unrolled,Standard | 65536 | Permuted | 676.7 μs | 5.00 μs | 4.68 μs | ? | ? | 699 B |
| UnrolledLoopNew4A | Unrolled,New4,AggressiveInlining | 65536 | Permuted | 301.1 μs | 2.68 μs | 2.38 μs | ? | ? | 1,088 B |
| UnrolledLoopNew4U | Unrolled,New4,InliningUnspecified | 65536 | Permuted | 354.0 μs | 2.93 μs | 2.60 μs | ? | ? | 382 B |
| UnrolledLoopNew4N | Unrolled,New4,NoInlining | 65536 | Permuted | 355.0 μs | 3.19 μs | 2.83 μs | ? | ? | 382 B |

@MineCake147E MineCake147E added the tenet-performance Performance related issue label May 23, 2022
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label May 23, 2022
@ghost

ghost commented May 23, 2022

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

Currently the conversion between Half and float is only implemented in software, leading to performance issues.
It would be ideal if Issue #62416 could be resolved, but better software fallback is still needed for environments like Sandy Bridge, which does not support hardware conversion.

Configuration

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1706 (21H2)
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.300-preview.22204.3
  [Host]     : .NET 6.0.5 (6.0.522.21309), X64 RyuJIT
  DefaultJob : .NET 6.0.5 (6.0.522.21309), X64 RyuJIT

Regression?

No

Data

I benchmarked the code below.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading.Channels;
using System.Threading.Tasks;

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

namespace HalfConversionBenchmarks
{
    [SimpleJob(runtimeMoniker: RuntimeMoniker.HostProcess)]
    [DisassemblyDiagnoser(maxDepth: int.MaxValue)]
    public class HalfToSingleConversionBenchmarks
    {
        [Params(65535)]
        public int Frames { get; set; }

        private float[] bufferDst;
        private Half[] bufferA;

        [GlobalSetup]
        public void Setup()
        {
            var samples = Frames;
            bufferDst = new float[samples];
            bufferA = new Half[samples];
            bufferA.AsSpan().Fill((Half)1.5f);
        }

        [Benchmark]
        public void SimpleLoop()
        {
            var bA = bufferA.AsSpan();
            var bD = bufferDst.AsSpan();
            ref var rsi = ref MemoryMarshal.GetReference(bA);
            ref var rdi = ref MemoryMarshal.GetReference(bD);
            nint i = 0, length = Math.Min(bA.Length, bD.Length);
            for (; i < length; i++)
            {
                Unsafe.Add(ref rdi, i) = (float)Unsafe.Add(ref rsi, i);
            }
        }

        [Benchmark]
        public void UnrolledLoop()
        {
            var bA = bufferA.AsSpan();
            var bD = bufferDst.AsSpan();
            ref var rsi = ref MemoryMarshal.GetReference(bA);
            ref var rdi = ref MemoryMarshal.GetReference(bD);
            nint i = 0, length = Math.Min(bA.Length, bD.Length);
            var olen = length - 3;
            for (; i < olen; i += 4)
            {
                Unsafe.Add(ref rdi, i + 0) = (float)Unsafe.Add(ref rsi, i + 0);
                Unsafe.Add(ref rdi, i + 1) = (float)Unsafe.Add(ref rsi, i + 1);
                Unsafe.Add(ref rdi, i + 2) = (float)Unsafe.Add(ref rsi, i + 2);
                Unsafe.Add(ref rdi, i + 3) = (float)Unsafe.Add(ref rsi, i + 3);
            }
            for (; i < length; i++)
            {
                Unsafe.Add(ref rdi, i) = (float)Unsafe.Add(ref rsi, i);
            }
        }
    }
}
| Method | Frames | Mean | Error | StdDev | Code Size |
|---|---|---|---|---|---|
| SimpleLoop | 65535 | 223.9 μs | 1.58 μs | 1.40 μs | 314 B |
| UnrolledLoop | 65535 | 205.6 μs | 0.89 μs | 0.74 μs | 432 B |

Analysis

The current implementation relies on many branches, which is a likely source of inefficiency.
Removing the branches and handling subnormal values with floating-point arithmetic instead makes the conversion faster on CPUs with fast FPUs.

My proposal for new software fallback

I wrote this code to convert Half to float by going through double first.
I've tested this code in a test project against all 65,536 possible Half values.

using System.Runtime.CompilerServices;

namespace BetterHalfToSingleConversion
{
    public static class HalfUtils
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining | MethodImplOptions.AggressiveOptimization)]
        public static float ConvertHalfToSingle(Half value)
        {
            var h = BitConverter.HalfToInt16Bits(value);
            var v = (uint)(int)h;
            var b = (v & 0x7c00u) == 0x7c00u;
            var hb = (ulong)-(long)Unsafe.As<bool, byte>(ref b);
            v <<= 13;
            v &= 0x8FFF_E000;
            var j = 0x0700000000000000ul + (hb & 0x3F00000000000000ul);
            var d = BitConverter.DoubleToUInt64Bits((double)BitConverter.UInt32BitsToSingle(v));
            d += j;
            return (float)BitConverter.UInt64BitsToDouble(d);
        }
    }
}

Test code:

using System;

using BetterHalfToSingleConversion;

using NUnit.Framework;

namespace BetterHalfConversionTests
{
    [TestFixture]
    public class BetterHalfToSingleConversionTests
    {

        [Test]
        public void ConvertHalfToSingleConvertsAllValuesCorrectly()
        {
            for (uint i = 0; i <= ushort.MaxValue; i++)
            {
                var h = BitConverter.UInt16BitsToHalf((ushort)i);
                var exp = (float)h;
                var act = HalfUtils.ConvertHalfToSingle(h);
                Assert.AreEqual(exp, act, $"Evaluating {i}th value:");
            }
        }
    }
}

I benchmarked it with:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading.Channels;
using System.Threading.Tasks;

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

using BetterHalfToSingleConversion;

namespace HalfConversionBenchmarks
{
    [SimpleJob(runtimeMoniker: RuntimeMoniker.HostProcess)]
    [DisassemblyDiagnoser(maxDepth: int.MaxValue)]
    public class HalfToSingleConversionBenchmarks
    {
        [Params(65535)]
        public int Frames { get; set; }

        private float[] bufferDst;
        private Half[] bufferA;

        [GlobalSetup]
        public void Setup()
        {
            var samples = Frames;
            bufferDst = new float[samples];
            bufferA = new Half[samples];
            bufferA.AsSpan().Fill((Half)1.5f);
        }

        [Benchmark]
        public void SimpleLoopStandard()
        {
            var bA = bufferA.AsSpan();
            var bD = bufferDst.AsSpan();
            ref var rsi = ref MemoryMarshal.GetReference(bA);
            ref var rdi = ref MemoryMarshal.GetReference(bD);
            nint i = 0, length = Math.Min(bA.Length, bD.Length);
            for (; i < length; i++)
            {
                Unsafe.Add(ref rdi, i) = (float)Unsafe.Add(ref rsi, i);
            }
        }

        [Benchmark]
        public void UnrolledLoopStandard()
        {
            var bA = bufferA.AsSpan();
            var bD = bufferDst.AsSpan();
            ref var rsi = ref MemoryMarshal.GetReference(bA);
            ref var rdi = ref MemoryMarshal.GetReference(bD);
            nint i = 0, length = Math.Min(bA.Length, bD.Length);
            var olen = length - 3;
            for (; i < olen; i += 4)
            {
                Unsafe.Add(ref rdi, i + 0) = (float)Unsafe.Add(ref rsi, i + 0);
                Unsafe.Add(ref rdi, i + 1) = (float)Unsafe.Add(ref rsi, i + 1);
                Unsafe.Add(ref rdi, i + 2) = (float)Unsafe.Add(ref rsi, i + 2);
                Unsafe.Add(ref rdi, i + 3) = (float)Unsafe.Add(ref rsi, i + 3);
            }
            for (; i < length; i++)
            {
                Unsafe.Add(ref rdi, i) = (float)Unsafe.Add(ref rsi, i);
            }
        }
        [Benchmark]
        public void SimpleLoopNew()
        {
            var bA = bufferA.AsSpan();
            var bD = bufferDst.AsSpan();
            ref var rsi = ref MemoryMarshal.GetReference(bA);
            ref var rdi = ref MemoryMarshal.GetReference(bD);
            nint i = 0, length = Math.Min(bA.Length, bD.Length);
            for (; i < length; i++)
            {
                Unsafe.Add(ref rdi, i) = HalfUtils.ConvertHalfToSingle(Unsafe.Add(ref rsi, i));
            }
        }

        [Benchmark]
        public void UnrolledLoopNew()
        {
            var bA = bufferA.AsSpan();
            var bD = bufferDst.AsSpan();
            ref var rsi = ref MemoryMarshal.GetReference(bA);
            ref var rdi = ref MemoryMarshal.GetReference(bD);
            nint i = 0, length = Math.Min(bA.Length, bD.Length);
            var olen = length - 3;
            for (; i < olen; i += 4)
            {
                Unsafe.Add(ref rdi, i + 0) = HalfUtils.ConvertHalfToSingle(Unsafe.Add(ref rsi, i + 0));
                Unsafe.Add(ref rdi, i + 1) = HalfUtils.ConvertHalfToSingle(Unsafe.Add(ref rsi, i + 1));
                Unsafe.Add(ref rdi, i + 2) = HalfUtils.ConvertHalfToSingle(Unsafe.Add(ref rsi, i + 2));
                Unsafe.Add(ref rdi, i + 3) = HalfUtils.ConvertHalfToSingle(Unsafe.Add(ref rsi, i + 3));
            }
            for (; i < length; i++)
            {
                Unsafe.Add(ref rdi, i) = HalfUtils.ConvertHalfToSingle(Unsafe.Add(ref rsi, i));
            }
        }

    }
}

The result is:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1706 (21H2)
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.300-preview.22204.3
  [Host]     : .NET 6.0.5 (6.0.522.21309), X64 RyuJIT
  DefaultJob : .NET 6.0.5 (6.0.522.21309), X64 RyuJIT

| Method | Frames | Mean | Error | StdDev | Code Size |
|---|---|---|---|---|---|
| SimpleLoopStandard | 65535 | 223.1 μs | 3.10 μs | 2.75 μs | 314 B |
| UnrolledLoopStandard | 65535 | 220.5 μs | 1.13 μs | 1.06 μs | 432 B |
| SimpleLoopNew | 65535 | 156.4 μs | 0.81 μs | 0.76 μs | 211 B |
| UnrolledLoopNew | 65535 | 141.3 μs | 0.99 μs | 0.93 μs | 686 B |

I also added a new repository with some alternative approaches.

Author: MineCake147E
Assignees: -
Labels:

area-System.Runtime, tenet-performance, untriaged

Milestone: -

@tannergooding tannergooding removed the untriaged New issue has not been triaged by the area owner label Jul 15, 2022
@tannergooding tannergooding added this to the Future milestone Jul 15, 2022
@MineCake147E MineCake147E changed the title Suboptimal Implementation of Half to Single conversion Conversion between Half and Single is suboptimally implemented Nov 11, 2022
@MineCake147E MineCake147E changed the title Conversion between Half and Single is suboptimally implemented Optimize conversions between Half and Single Nov 15, 2022
MineCake147E added a commit to MineCake147E/runtime that referenced this issue Feb 4, 2023
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Feb 4, 2023
@ghost ghost added in-pr There is an active PR which will close this issue when it is merged and removed in-pr There is an active PR which will close this issue when it is merged labels Mar 29, 2023
adamsitnik pushed a commit that referenced this issue Jul 7, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jul 7, 2023
@adamsitnik adamsitnik modified the milestones: Future, 8.0.0 Jul 7, 2023
@MichalPetryka

@adamsitnik I think this should maybe be left open until the half conversions get F16C/AVX512-FP16 acceleration.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 6, 2023