
[stdlib] Use SIMD to make `b64encode` 4.7x faster #3443

Conversation

@gabrieldemarmiesse (Contributor) commented Sep 2, 2024

Dependencies

The following PR should be merged first:

* #3397

Description of the changes

`b64encode` is the function that encodes bytes to base 64. Base 64 encoding is used massively across the industry, be it to write secrets as text or to send data across the internet.

Since it's going to be used a lot, we should make sure it is fast. As such, this PR provides a new implementation of `b64encode` around 5 times faster than the current one.

This implementation was taken from the following papers:

Wojciech Muła, Daniel Lemire, Base64 encoding and decoding at almost the speed of a memory copy, Software: Practice and Experience 50 (2), 2020. https://arxiv.org/abs/1910.05109

Wojciech Muła, Daniel Lemire, Faster Base64 Encoding and Decoding using AVX2 Instructions, ACM Transactions on the Web 12 (3), 2018. https://arxiv.org/abs/1704.00605

Note that there are substantial differences between the papers and this implementation. There are two reasons for this:

  • We want to avoid using assembly/LLVM intrinsics directly and instead use the functions provided by the stdlib.
  • We want to keep the complexity low, so we don't write a slightly different algorithm for each SIMD size and each CPU architecture.

In a nutshell, we decide on a SIMD size, let's say 32. At each iteration, we load 32 bytes, reshuffle the first 24 bytes, and convert them to base 64, which expands them into 32 bytes; we then store those 32 bytes in the output buffer.
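For intuition, here is the scalar version of the 3-bytes-to-4-sextets step that the SIMD kernel performs many lanes at a time (an illustrative sketch, not the PR's actual code; `_encode_triple` is a hypothetical helper, and the caller would still map each six-bit index to its ASCII character):

```mojo
fn _encode_triple(b0: UInt8, b1: UInt8, b2: UInt8) -> SIMD[DType.uint8, 4]:
    # Split 3 input bytes (24 bits) into 4 six-bit indices into the
    # base 64 alphabet.
    var s0 = b0 >> 2
    var s1 = ((b0 & 0x03) << 4) | (b1 >> 4)
    var s2 = ((b1 & 0x0F) << 2) | (b2 >> 6)
    var s3 = b2 & 0x3F
    return SIMD[DType.uint8, 4](s0, s1, s2, s3)
```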

We have a final iteration for the last, incomplete chunk, where we shouldn't load everything at once, otherwise we would get out-of-bounds errors. We then use partial loads and stores and masking, but the main SIMD algorithm stays the same.
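The tail handling boils down to something like the following (a simplified sketch under assumed names; the real code works on raw pointers and masks the stores as well):

```mojo
fn _load_tail[
    simd_width: Int
](data: List[UInt8], offset: Int) -> SIMD[DType.uint8, simd_width]:
    # Copy the remaining input bytes into a zero-padded vector so the
    # same SIMD kernel can run without reading out of bounds.
    var padded = SIMD[DType.uint8, simd_width](0)
    for i in range(len(data) - offset):
        padded[i] = data[offset + i]
    return padded
```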

The reasons for the speedups are similar to the ones provided in #3401.

API changes

The existing API is

```mojo
fn b64encode(str: String) -> String:
```

and has several limitations:

  1. The input of the function is raw bytes. It doesn't have to represent text. Requiring the user to provide a `String` forces them to handle null termination of the bytes and whatever other requirements `String` might impose.
  2. It is not possible to write the produced bytes into an existing buffer.
  3. It is hard to benchmark, as the signature implies that the function allocates memory on the heap.
  4. It supposes that the input value owns the underlying data, meaning that it's not possible to use the function if the data is not owned. `Span` would be a better choice here.

In this PR we keep the existing signature for backward compatibility and add new overloads. The signatures are now:

```mojo
fn b64encode(input_bytes: List[UInt8, _], inout result: List[UInt8, _])
fn b64encode(input_bytes: List[UInt8, _]) -> String
fn b64encode(input_string: String) -> String
```

Note that this could be further improved in future PRs: `Span` is not easy to use currently, but it would be the right fit for the input value. We could also remove `fn b64encode(input_string: String) -> String` in the future.

Note that the Python API takes `bytes` as input and returns `bytes`.
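For illustration, calling the new buffer overload might look like this (a sketch assuming the signatures above; the byte values and capacity management are up to the caller):

```mojo
from base64 import b64encode

fn main():
    # "Hi!" as raw bytes; any bytes work, not just text.
    var input = List[UInt8]()
    input.append(72)   # 'H'
    input.append(105)  # 'i'
    input.append(33)   # '!'
    var output = List[UInt8]()
    b64encode(input, output)  # encodes into the caller-provided buffer
    print(b64encode("Hi!"))   # the existing String overload still works
```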

Benchmarking

Benchmarking is harder than usual here because the base function allocates memory. To keep the allocation out of the benchmark, we must modify the original function to add the overloads described above. With that change we can benchmark, and on my system:

```
WSL2 windows 11
Intel(R) Core(TM) i7-10700KF CPU @ 3.80GHz
Base speed:	3,80 GHz
Sockets:	1
Cores:	8
Logical processors:	16
Virtualization:	Enabled
L1 cache:	512 KB
L2 cache:	2,0 MB
L3 cache:	16,0 MB
```

We get around 5x speedup.

I don't provide the benchmark script here because it won't work out of the box (see the issue mentioned above), but if that's really necessary to get this merged, I'll provide the diff plus the benchmark script.
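A rough sketch of such a harness, assuming the buffer overload above and the stdlib `benchmark` module (the sizes and the `clear()` reset are illustrative, not the script used for the numbers above):

```mojo
import benchmark
from base64 import b64encode

fn main():
    # Build the input once so allocation stays out of the hot loop.
    var input = List[UInt8](capacity=10000)
    for i in range(10000):
        input.append(UInt8(i & 0xFF))
    var output = List[UInt8](capacity=14000)

    @parameter
    fn workload():
        output.clear()  # assumes the overload appends into `result`
        b64encode(input, output)

    var report = benchmark.run[workload]()
    print("mean time (ms):", report.mean("ms"))
```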

Future work

As said before, this PR is not an exact re-implementation of the papers and the state-of-the-art implementation that accompanies them, the [simdutf](https://github.com/simdutf/simdutf) library.

This is to keep the implementation simple and portable: it will work on any CPU with a SIMD size of at least 4 bytes and at most 64 bytes.
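The width can follow the machine's native vector width, e.g. via `simdwidthof` (a small sketch of the idea, not the PR's exact selection logic):

```mojo
from sys import simdwidthof

fn main():
    # Native SIMD width for bytes on the current CPU:
    # e.g. 16 on SSE/NEON, 32 on AVX2, 64 on AVX-512.
    alias simd_width = simdwidthof[DType.uint8]()
    print("processing", simd_width, "bytes per iteration")
```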

In future PRs, we could provide further speedups by using SIMD algorithms that are specific to each architecture. This would greatly increase the complexity of the code, so I'll leave this decision to the maintainers.

We can also rewrite `b64decode` using SIMD; we expect speedups there as well. This can be the topic of another PR too.

@martinvuyk (Contributor)

FYI, since I saw intrinsics in your code, this might get blocked by issue #933 if Modular is using `b64encode` at compile time. This works fine currently (nightly Mojo 2024.9.105):

```mojo
from base64 import b64encode

fn main():
    var data = b64encode("asd")
    alias data2 = b64encode("asd")
    print(data)
    print(data2)
```

@gabrieldemarmiesse (Contributor, Author)

I'll let the stdlib team chime in and tell us if they use b64 at compile time currently, or if compile-time b64 can wait until the compiler improves.

@lemire commented Sep 3, 2024

Currently, `b64decode` does not appear to handle whitespace characters. I would have expected the following to print 'Bonjour', but it does not:

```mojo
from base64 import b64decode

def main():
    var data = b64decode("Qm9 uam91cg==")
    print(data)
```

It is possible to handle spaces and do validation at high speed. We have such algorithms in simdutf, and I am working on porting them to C#.

@lemire commented Sep 6, 2024

> We have such algorithms in simdutf, and I am working on porting them to C#.

Just an update: we now have SIMD base64 decoding with spaces (skipping spaces) and full validation in C++ (in simdutf). That's in production (released versions of Node.js and Bun).

It is not yet public, but we did the same for C#/.NET. The results are good. We just need to finish the AVX-512 kernel (which should be the best and fastest).

So I expect it should be portable to Mojo. If I can do it in C#, surely Mojo can do it too. :-)

@gabrieldemarmiesse (Contributor, Author)

@lemire, many thanks for the insights. Indeed, the current b64 decoding seems to be lacking in many ways and will definitely benefit from being rewritten with the algorithm in simdutf. That's something I can look into in another pull request. This one is already proving quite big and complex, especially since it's difficult to both get performance and stay generic over the SIMD width.

@Mogball (Contributor) commented Sep 7, 2024

We don't use `b64encode` internally.

@gabrieldemarmiesse force-pushed the use_simd_on_b64encode branch 2 times, most recently from b28a439 to d3f6954, September 8, 2024
@gabrieldemarmiesse marked this pull request as ready for review September 8, 2024
@gabrieldemarmiesse requested a review from a team as a code owner September 8, 2024
@JoeLoser (Collaborator) left a comment:

!sync

@JoeLoser self-assigned this Oct 13, 2024
@JoeLoser (Collaborator)
!sync

@JoeLoser (Collaborator)
!sync

@modularbot added the `imported-internally` label (Signals that a given pull request has been imported internally) Oct 22, 2024
```mojo
    # fmt: on
    elif simd_width == 64:
        # fmt: off
        return input_vector.shuffle[
```
@JoeLoser (Collaborator) commented Oct 23, 2024

Question: I think this is failing on Intel (such as the m7i we test on internally):

```
mojo --debug-level full -D ASSERT=all /mnt/engflow/worker/work/0/exec/bazel-out/k8-opt-release/bin/open-source/mojo/stdlib/test/base64/test_base64.mojo.test.runfiles/_main/open-source/mojo/stdlib/test/base64/test_base64.mojo
/mnt/engflow/worker/work/0/exec/open-source/mojo/stdlib/stdlib/builtin/_startup.mojo:113:4: error: call expansion failed
/mnt/engflow/worker/work/0/exec/open-source/mojo/stdlib/stdlib/builtin/_startup.mojo:96:4: note: function instantiation failed
/mnt/engflow/worker/work/0/exec/open-source/mojo/stdlib/stdlib/builtin/_startup.mojo:108:57: note: call expansion failed
/mnt/engflow/worker/work/0/exec/open-source/mojo/stdlib/stdlib/builtin/_startup.mojo:68:4: note: function instantiation failed
/mnt/engflow/worker/work/0/exec/open-source/mojo/stdlib/stdlib/builtin/_startup.mojo:84:18: note: call expansion failed
/mnt/engflow/worker/work/0/exec/open-source/mojo/stdlib/stdlib/builtin/_startup.mojo:105:8: note: function instantiation failed
/mnt/engflow/worker/work/0/exec/open-source/mojo/stdlib/stdlib/base64/_b64encode.mojo:383:38: note: call expansion failed
/mnt/engflow/worker/work/0/exec/open-source/mojo/stdlib/stdlib/base64/_b64encode.mojo:218:4: note: function instantiation failed
/mnt/engflow/worker/work/0/exec/open-source/mojo/stdlib/stdlib/base64/_b64encode.mojo:224:48: note: call expansion failed
/mnt/engflow/worker/work/0/exec/open-source/mojo/stdlib/stdlib/base64/_b64encode.mojo:155:4: note: function instantiation failed
/mnt/engflow/worker/work/0/exec/open-source/mojo/stdlib/stdlib/base64/_b64encode.mojo:211:10: note: call expansion failed
/mnt/engflow/worker/work/0/exec/open-source/mojo/stdlib/stdlib/builtin/simd.mojo:1965:45: note: call expansion failed
/mnt/engflow/worker/work/0/exec/open-source/mojo/stdlib/stdlib/builtin/simd.mojo:1910:10: note: call expansion failed
/mnt/engflow/worker/work/0/exec/open-source/mojo/stdlib/stdlib/builtin/constrained.mojo:56:6: note: constraint failed: size of the mask must match the output SIMD size
mojo: error: failed to run the pass manager
```

@gabrieldemarmiesse (Contributor, Author)
Thanks, I'll take a look. I wonder if we can add a similar CPU in GitHub Actions to catch those errors in the public CI.

@soraros (Contributor)
I think the fix should be straightforward, just remove the 63.

@gabrieldemarmiesse (Contributor, Author)
Thanks for the fix @soraros! I pushed the change in the latest commit!

A collaborator commented:
When we set up the GH workflows for the OSS stdlib, we intentionally didn't want to run on anything internal (e.g. using any of our infra that sits on top of AWS EC2 VMs like m7i, m7g, etc.). This is both a security consideration and a cost one. In the current state, we're just running basic stdlib unit tests on free GitHub-provided hosts. This is "mostly sufficient", as we've seen, compared to running every OSS PR on the flurry of hardware we test on internally.

@JoeLoser (Collaborator) commented Oct 24, 2024
That's still not quite right: the input shuffled vector still has too many elements (more than 64). I think you want

```diff
-            48, 49, 49, 50,
-            51, 52, 52, 53,
-            54, 55, 55, 56,
-            57, 58, 58, 59,
-            60, 61, 61, 62,
```

as a diff, which brings the input shuffled vector down to 64 elements. This passes on an m7i locally for me, for example. I just pushed this change internally to your PR to check CI.

```mojo
from memory.maybe_uninitialized import UnsafeMaybeUninitialized


fn _subtract_with_saturation[
```
A contributor commented:
There is an LLVM-intrinsic-powered `_sub_with_saturation` (#3654) in `_utf8_validation.mojo`.

@gabrieldemarmiesse (Contributor, Author)
Good call. I put this function in the `simd.mojo` file as a private function to avoid duplication.
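For context, such a helper typically wraps the unsigned saturating-subtract intrinsic; a hedged sketch of what it might look like (the name, signature, and exact import path are assumptions, not the stdlib's actual code):

```mojo
from sys.intrinsics import llvm_intrinsic

fn _sub_with_saturation[
    width: Int
](a: SIMD[DType.uint8, width], b: SIMD[DType.uint8, width]) -> SIMD[
    DType.uint8, width
]:
    # Unsigned saturating subtraction: clamps to 0 instead of wrapping
    # around, i.e. max(a - b, 0) elementwise.
    return llvm_intrinsic["llvm.usub.sat", SIMD[DType.uint8, width]](a, b)
```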

```mojo
alias END_SECOND_RANGE = 51


fn _get_simd_range_values[simd_width: Int]() -> SIMD[DType.uint8, simd_width]:
```
A contributor commented:
Nit: add a FIXME indicating that this function is introduced because `math.iota` doesn't run at compile time. Maybe even consider renaming it to `_iota`.

@gabrieldemarmiesse (Contributor, Author)
Good call. I didn't know we had a function for this, probably because of how cryptic the name `iota` is (not everyone is a Greek fan). I moved the function next to `iota` and added a TODO.
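For reference, a compile-time-friendly fallback can be as simple as a plain loop (an illustrative sketch; the name `_iota` follows the suggestion above):

```mojo
fn _iota[simd_width: Int]() -> SIMD[DType.uint8, simd_width]:
    # Fill a vector with 0, 1, 2, ... using a plain loop, which the
    # compiler can also evaluate at compile time.
    var result = SIMD[DType.uint8, simd_width](0)
    for i in range(simd_width):
        result[i] = UInt8(i)
    return result
```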

@gabrieldemarmiesse (Contributor, Author)

@JoeLoser can you retry on the m7i? With @soraros's fix this should be OK now.

@JoeLoser (Collaborator)
!sync

@modularbot
✅🟣 This contribution has been merged 🟣✅

Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the nightly branch during the next Mojo nightly release, typically within the next 24-48 hours.

We use Copybara to merge external contributions.

@modularbot added the `merged-internally` (Indicates that this pull request has been merged internally) and `merged-externally` (Merged externally in public mojo repo) labels Oct 29, 2024
modularbot pushed a commit that referenced this pull request Oct 30, 2024:

[External] [stdlib] Use SIMD to make `b64encode` 4.7x faster (Closes #3443)
@modularbot
Landed in a4b7e55! Thank you for your contribution 🎉

@modularbot modularbot closed this Oct 30, 2024
Ahajha pushed a commit to Ahajha/mojo that referenced this pull request Oct 31, 2024:

[External] [stdlib] Use SIMD to make `b64encode` 4.7x faster (Closes modularml#3443)