-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add methods to convert between hexadecimal strings and bytes #17837
Comments
Thanks for creating this issue @GSPP. cc: @terrajobst |
Replacement is not the ONLY way, you can just skip the char if you need to.
I agree that the format is useless in MOST cases... However if you think about it a little more and try to be as outside the box as possible you will clearly see that there is an extra index for every octet, finally then you can achieve some interesting things by realigning the data or using the extra indexes.... |
@GSPP how about augmenting the BitConverter type with these new methods? I feel like that would be a better place for these since that is already a type people use for converting between various formats. |
There are at least a couple of other issues tracking something like this: Let's track it with a single issue (this one). It does look like this is something for which there is interest in implementing. @GSPP, @GrabYourPitchforks @KrzysztofCwalina is anyone of you interested in writing up a formal API proposal for this? |
If the slices API lands before this, I'd go with something like: // all instance methods are thread-safe
public abstract class BinaryEncoding
{
protected abstract void ToBytesOverride(ReadOnlySpan<char> sourceChars, Span<byte> destinationBytes);
protected abstract int GetByteLengthOverride(int stringLength);
protected abstract void ToCharsOverride(ReadOnlySpan<byte> sourceBytes, Span<char> destinationChars);
protected abstract int GetStringLengthOverride(int byteLength);
public byte[] ToBytes(ReadOnlySpan<char> str);
public void ToBytes(ReadOnlySpan<char> sourceChars, Span<byte> destinationBytes);
public int GetByteLength(int stringLength);
public string ToString(ReadOnlySpan<byte> bytes);
public char[] ToChars(ReadOnlySpan<byte> bytes);
public void ToChars(ReadOnlySpan<byte> sourceBytes, Span<char> destinationChars);
public int GetStringLength(int byteLength);
}
public class HexEncoding
{
public static HexEncoding UpperCase { get; }
public static HexEncoding LowerCase { get; }
} perhaps add some additional overloads with |
We don't take any new API additions if they would be obsolete when |
Span has landed so this can be unblocked. |
I propose a slightly different API, instead putting this on the namespace System
{
// NEW methods on EXISTING type
public static class Convert
{
public static byte[] FromHexString(ReadOnlySpan<char> chars); // doesn't match an existing overload of FromBase64String, but could be useful
public static byte[] FromHexString(string s);
public static string ToHexString(byte[] inArray, HexFormattingOptions options = default);
public static string ToHexString(byte[] inArray, int offset, int length, HexFormattingOptions options = default);
public static string ToHexString(ReadOnlySpan<byte> bytes, HexFormattingOptions options = default);
}
[Flags]
public enum HexFormattingOptions
{
None = 0,
InsertLineBreaks = 1,
Lowercase = 2,
}
} It also makes sense IMO to have |
To give some parity with the public static class Convert
{
public bool TryFromHexString(ReadOnlySpan<char> chars, Span<byte> buffer, out int bytesWritten);
public bool TryFromHexString(string s, Span<byte> buffer, out int bytesWritten);
public bool TryToHexString(ReadOnlySpan<byte> bytes, Span<char> buffer, out int charsWritten, HexFormattingOptions options = default);
} The rationale for this is in my typical usage I expect a hex string to be short (think a SHA digest) which tops out at 64 bytes right now. I'm not strongly tied to these, but worth considering during review. Other considerations, does there need to be an options for the What about continuing from line breaks? |
@vcsjones I envision |
I'd not put it on Should this API accept any other input besides spans? It could accept arrays, strings, We certainly need methods that do not throw in the error case. My original proposal missed those.
Is this a widely known standard format? Is it only for base64 or also for hex? New API proposal:
I chose the name prefix to be "Get" so that "TryGet" is fitting. This succinct naming would not be possible when placing these methods in |
There are many variants of API. It can be used in web,cryptography and many other places. Some hex conversion is used in json for writing non-ASCII chars - \u0000 - \uFFFF I think that .NET Core must provide optimized variants for each purpose Writing API
All above give us more than 100 function declarations. If it will have byte[] inputs - it take much more For example just writing API for ReadOnlySpan as source, with all kinds of separators, and only lowercase give us 6 optimized variants
This is bad idea to place it in existing places, I think, that better to place this API in external netcoreapp2.1+netstandard2.1 package. Something like System.Text.Hex |
Also, as @vcsjones mentioned for short strings, we need some different allocation mechanism, to avoid dirty hacks for reducing allocations count: (https://docs.microsoft.com/en-us/dotnet/csharp/how-to/modify-string-contents#unsafe-modifications-to-string)
For example in this case we can allocate string memory (x2 of byte array length) and write to it, without copy contructor. |
I consolidated the different proposals and comments made in this discussion and came up with the following API. I also provide a sample implementation (internally utilizing AVX2, SSE4.1, SSSE3, SSE2 or Scalar/Lookup with the following performance characteristics) of the API to experiment with via a seperate GitHub repository and nuget package. Sample implementation to play arround withhttps://github.com/tkp1n/HexMate Adressed proposals and comments:
Open questions:
API proposalnamespace System
{
// NEW type
[Flags]
public enum HexFormattingOptions
{
None = 0,
InsertLineBreaks = 1,
Lowercase = 2
}
// NEW methods on EXISTING type
public static class Convert
{
// Decode from chars
public static byte[] FromHexCharArray(char[] inArray, int offset, int length) => throw null;
public static bool TryFromHexChars(ReadOnlySpan<char> chars, Span<byte> bytes, out int bytesWritten) => throw null;
// Decode from strings
public static byte[] FromHexString(string s) => throw null;
public static byte[] FromHexString(ReadOnlySpan<char> chars) => throw null;
public static bool TryFromHexString(string s, Span<byte> bytes, out int bytesWritten) => throw null;
// Encode to chars
public static int ToHexCharArray(byte[] inArray, int offsetIn, int length, char[] outArray, int offsetOut, HexFormattingOptions options = default) => throw null;
public static bool TryToHexChars(ReadOnlySpan<byte> bytes, Span<char> chars, out int charsWritten, HexFormattingOptions options = default) => throw null;
// Encode to strings
public static string ToHexString(byte[] inArray, HexFormattingOptions options = default) => throw null;
public static string ToHexString(byte[] inArray, int offset, int length, HexFormattingOptions options = default) => throw null;
public static string ToHexString(ReadOnlySpan<byte> bytes, HexFormattingOptions options = default) => throw null;
}
}
namespace System.Buffers.Text
{
// NEW type
public static class Hex
{
// Decode
public static OperationStatus DecodeFromUtf8(ReadOnlySpan<byte> utf8, Span<byte> bytes, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true) => throw null;
public static OperationStatus DecodeFromUtf8InPlace(Span<byte> buffer, out int bytesWritten) => throw null;
// Encode
public static OperationStatus EncodeToUtf8(ReadOnlySpan<byte> bytes, Span<byte> utf8, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true) => throw null;
public static OperationStatus EncodeToUtf8InPlace(Span<byte> buffer, int dataLength, out int bytesWritten) => throw null;
}
} |
I think we should demonstrate a need for inserting line breaks first. I am not aware of any common scenario where this would be used. If this feature is added then the line length must be variable. I cannot imagine any fixed line length to be appropriate. What are the intended use cases?
By default, the parsing routines should be maximally strict. This means no support for whitespace or any separator. When we parse data we generally want to be as strict as possible in order to find bugs in the originating systems and in order to not accidentally accept invalid data. The motto Parsing should accept any case of the letters I see no need for supporting a separator. This is a rare need. An easy workaround exists at a performance cost: |
@GrabYourPitchforks, it seems we should survey what internal implementations we have and what all the places are where we expose something similar. It seems new proposals were added after you marked it as read for review. |
Really looking forward to seeing this. I benchmarked @tkp1n implementation and it's faster than everything else in .NET Core. Compared to these methods: https://stackoverflow.com/a/624379/2828480 |
I have some follow-up thoughts on this. My instinct is that @tkp1n's comment at #17837 (comment) mostly nails it, but I'd recommend some further changes. First, let's not bother with anything in System.Buffers.Text just yet, reserving this specific issue for new APIs in I spent some time building up a list of places in the framework where we convert bytes to hex strings. See the end of this comment for a partial list. Every single one of these matches can be replaced with a call to one of the Now, for a rude Q and A. What's the primary scenario for this? The 99% use case for converting to hex is to produce a case-insensitive identifier from a binary buffer (say, a GUID or similar). It's also frequently used in cryptography to represent symmetric key material or digests. In particular, the majority of hex use cases are for small-ish inputs and outputs (say, 256 bits max in the common case). That's also about the bounds of what we would expect where identifiers might appear. This means that we don't need to worry about hyper-optimizing any such implementation. Until we get evidence that this particular API is bottlenecking real-world scenarios, a simple for loop should suffice as an implementation. What about customizing joiners or line breaks? For the primary scenario of identifiers and other small-ish payloads, callers typically don't want any type of joiner. Where there are exceptions they tend to be things we can't easily account for in a generalized API. For example, some callers want the format I anticipate that such callers will continue to use their own custom logic. Since they represent the minority use case I'm not too worried about them for this scenario. What about The behavior would be that it allows only Framework matches libraries\Common\tests\System\Net\Http\LoopbackServer.AuthenticationHelpers.cs:
299: sb.Append(b.ToString("x2"));
libraries\Common\tests\System\Security\Cryptography\ByteUtils.cs:
59: builder.Append(bytes[i].ToString("X2"));
libraries\System.CodeDom\src\Microsoft\CSharp\CSharpCodeGenerator.cs:
2512: Output.Write(b.ToString("X2", CultureInfo.InvariantCulture));
libraries\System.CodeDom\src\Microsoft\VisualBasic\VBCodeGenerator.cs:
2273: Output.Write(b.ToString("X2", CultureInfo.InvariantCulture));
libraries\System.DirectoryServices.AccountManagement\src\System\DirectoryServices\AccountManagement\Utils.cs:
62: stringizedArray.Append(b.ToString("x2", CultureInfo.InvariantCulture));
82: stringizedBinarySid.Append(b.ToString("x2", CultureInfo.InvariantCulture));
libraries\System.DirectoryServices.AccountManagement\src\System\DirectoryServices\AccountManagement\AD\ADStoreCtx_Query.cs:
496: stringguid.Append(b.ToString("x2", CultureInfo.InvariantCulture));
630: stringizedBinarySid.Append(b.ToString("x2", CultureInfo.InvariantCulture));
(below implementation is incorrect)
libraries\System.Drawing.Common\tests\mono\System.Drawing\BitmapTests.cs:
397: sOutput.Append(arrInput[i].ToString("X2"));
libraries\System.Net.Http\src\System\Net\Http\SocketsHttpHandler\AuthenticationHelper.Digest.cs:
245: bool formatted = result[i].TryFormat(byteX2, out int charsWritten, "x2");
libraries\System.Private.CoreLib\src\System\Diagnostics\Tracing\TraceLogging\XplatEventLogger.cs:
156: byteArray[i].TryFormat(hex, out int charsWritten, hexFormat);
libraries\System.Private.CoreLib\src\System\Reflection\AssemblyNameFormatter.cs:
75: sb.Append(b.ToString("x2", CultureInfo.InvariantCulture));
libraries\System.Reflection.MetadataLoadContext\tests\src\TestUtils\TestUtils.cs:
127: builder.Append(bytes[i].ToString("X2"));
libraries\System.Security.Cryptography.Xml\src\System\Security\Cryptography\Xml\SignedXmlDebugLog.cs:
203: builder.Append(b.ToString("x2", CultureInfo.InvariantCulture));
libraries\System.Security.Cryptography.Xml\tests\XmlDsigXsltTransformTest.cs:
117: sb.Append(b.ToString("X2"));
libraries\System.Text.Encodings.Web\tests\UrlEncoderTests.cs:
296: return string.Concat(_utf8EncodingThrowOnInvalidBytes.GetBytes(char.ConvertFromUtf32(codePoint)).Select(b => string.Format(CultureInfo.InvariantCulture, "%{0:X2}", b)));
libraries\System.Security.Cryptography.X509Certificates\src\Internal\Cryptography\Helpers.cs:
17: public static char[] ToHexArrayUpper(this byte[] bytes)
24: private static void ToHexArrayUpper(ReadOnlySpan<byte> bytes, Span<char> chars)
36: public static string ToHexStringUpper(this byte[] bytes) => |
@GrabYourPitchforks thank you for picking this up. I generally agree with the points you've brought up.
Let me add to your list two additional use cases I could think of: The tracing of binary data in motion (e.g., low-level network protocol) and hex dumps of binary files. Both typically encode a quantity of data well above the otherwise typical "max 256 bits"-cases. Ingesting those traces/dumps would consequently also use the decode methods with larger data sizes. Investigating whether a vectorized implementation was worth it for those cases showed that AVX2 could be up to 10x faster in decoding 2KB of hex-encoded data (see table below / full comparison here). There are already vectorized implementations in the broader ecosystem (e.g., this one in ASP.NET Core) that could be migrated to the API proposed here if we allow for some "hyper-optimization" to ensure the migration causes no perf regressions. I would suggest a for-loop approach for small input data sizes and using an SSSE3 or AVX2 approach above a given threshold. All of this is an implementation detail, however, and we should first settle on the API, I guess. More than happy to provide a PR by the way, once the API is approved. After integrating your feedback to my comment above, the API proposal would look like this: namespace System
{
// NEW methods on EXISTING type
public static class Convert
{
// Decode from chars
public static byte[] FromHexCharArray(char[] inArray, int offset, int length) => throw null;
public static bool TryFromHexChars(ReadOnlySpan<char> chars, Span<byte> bytes, out int bytesWritten) => throw null;
// Decode from strings
public static byte[] FromHexString(string s) => throw null;
public static byte[] FromHexString(ReadOnlySpan<char> chars) => throw null;
public static bool TryFromHexString(string s, Span<byte> bytes, out int bytesWritten) => throw null;
// Encode to chars
public static int ToHexCharArray(byte[] inArray, int offsetIn, int length, char[] outArray, int offsetOut, bool lowercase = false) => throw null;
public static bool TryToHexChars(ReadOnlySpan<byte> bytes, Span<char> chars, out int charsWritten, bool lowercase = false) => throw null;
// Encode to strings
public static string ToHexString(byte[] inArray, bool lowercase = false) => throw null;
public static string ToHexString(byte[] inArray, int offset, int length, bool lowercase = false) => throw null;
public static string ToHexString(ReadOnlySpan<byte> bytes, bool lowercase = false) => throw null;
}
} |
Maybe get in a simple implementation soon and rely on a community contribution to provide a super fast vectorized implementation. Sounds like a fun project for someone. |
@tkp1n For your scenarios, would those tracers want to do a straight hex dump "00112233..." or would they want things to be formatted "00-11-22-..."? I've never seen any amount of data larger than a few dozen bytes dumped to hex as-is without some kind of formatting to make it more human-readable. In the case of the aspnet core code you linked to above, it can be written in terms of the 100% fully type-safe
|
I did that already: https://github.com/tkp1n/HexMate/
I'd say that depends on whether the primary reason for tracing is the consumption of the trace by a human or by a machine (with maybe a human looking at it from time to time). I've seen both, but over a certain amount of data there is almost always some formatting applied. Another not yet mentioned source of "large" hex strings is the |
namespace System
{
public static class Convert
{
public static byte[] FromHexString(string s);
public static byte[] FromHexString(ReadOnlySpan<char> chars);
public static string ToHexString(byte[] inArray);
public static string ToHexString(byte[] inArray, int offset, int length);
public static string ToHexString(ReadOnlySpan<byte> bytes);
}
} |
@terrajobst or @GrabYourPitchforks feel free to assign this to me. I‘d be happy to provide a PR for this. |
@tkp1n done! Thanks :) |
It is quite common to need to convert bytes to hex strings and back. The .NET Framework does not have an API to do that. Look at what lengths people have gone to to solve this problem for them: How do you convert Byte Array to Hexadecimal String, and vice versa? Many of those solutions are somehow bad (e.g. slow or broken or not validating).
BitConverter.ToString(byte[])
generates strings of a practically useless format:BitConverter.ToString(new byte[] { 0x00, 0x01, 0xAA, 0xFF })
=>00-01-AA-FF
. I don't understand how this decision was made. This seems to be a special solution for some specific purpose (debug output?). Using this API requiresReplace(str, "-", "")
which is ugly and slow. Also, this API is not discoverable because it does not have "hex" in it's name.I propose adding an API that does the following:
The use case for char[] in- and outputs is low-allocation scenarios.
The API could look like this:
A possible extension would be to support
Stream
andIEnumerable<byte>
as sources and destinations. I would consider this out of scope.I feel that converting hexadecimal strings is an important and frequent enough thing that this should be supported well. Of course, a "light" version of this proposal is also feasible (just support for string and byte[] with no user-provided output buffer). Not sure if anyone wants/needs lowercase hex, I don't but others might.
This seems cheap enough to implement. Testing should be easily automatable. I don't see any particular usability problems. This is suitable for community contribution.
The text was updated successfully, but these errors were encountered: