Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chore: BufferTransformer and Blowfish Refactor #4151

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

reneme
Copy link
Collaborator

@reneme reneme commented Jun 26, 2024

This picks up the work started in #3942 by proposing a BufferTransformer class; which is essentially an orchestrator for BufferSlicer and BufferStuffer. As mentioned yesterday, I added some template fanciness, to handle the "block processing" necessary for various block ciphers. As suggested previously, this uses Blowfish to demonstrate the functionality (+ spanifying the blowfish implementation).

@randombit Is the process_blocks_of<> implementation in line with what you had in mind previously:

I wonder if we can fit in some other common functionality there, namely the word level loads/stores. That's a very common task (encrypt+decrypt for N ciphers plus alternative implementations)

Here's the idea of BufferTransformer::process_blocks_of, where the multi-block loop handling and buffer management is entirely abstracted. Also, as an additional goodie, we now have explicit compile-time knowledge of the blocked data length inside the processing lambdas. Whether the compiler uses this or not has to be seen. FWIW: I don't see a difference in the performance of Blowfish on my Mac.

void encrypt_n(const uint8_t inb[], uint8_t outb[], size_t blocks) {
   std::span<const uint8_t> in(inb, inb + blocks*BS);
   std::span<uint8_t> out(outb, outb + blocks*BS);
   
   BufferTransformer bt(in, out);
   bt.process_blocks_of<BS*4, BS>(
      overloaded{
         [](std::span<const uint8_t, BS*4> in, std::span<uint8_t, BS*4> out) {
            // encrypt four blocks at a time
         },
         [](std::span<const uint8_t, BS> in, std::span<uint8_t, BS> out) {
            // encrypt a single block
         }
      }
   );
}

No hard feelings if that doesn't make it into 3.5.0 at all. I just nerd-sniped myself yesterday and wanted to sketch this out. And frankly, I think the BufferTransformer implementation is quite concise. 😄

@reneme reneme added the enhancement Enhancement or new feature label Jun 26, 2024
@reneme reneme requested a review from randombit June 26, 2024 15:16
@reneme reneme force-pushed the rene/buffer_transformer branch 2 times, most recently from 4feb67d to 87d4035 Compare June 27, 2024 07:15
@reneme reneme mentioned this pull request Jun 27, 2024
@coveralls
Copy link

Coverage Status

coverage: 91.742% (+0.02%) from 91.72%
when pulling 7b3c762 on Rohde-Schwarz:rene/buffer_transformer
into 66216e1 on randombit:master.

@reneme
Copy link
Collaborator Author

reneme commented Aug 5, 2024

As suggested, I also applied this to the AES-NI-128 encryption path to verify that BufferTransformer boils down to a tight loop without significant overhead. I didn't add the patch to the PR as I didn't transform all encrypt/decrypt code paths for this test.

See the resulting code here (clang 18): #4287 (comment)

... and the corresponding C++ code:

BufferTransformer(std::span{in, blocks * BLOCK_SIZE}, std::span{out, blocks * BLOCK_SIZE})
    .process_blocks_of<4 * BLOCK_SIZE, BLOCK_SIZE>(overloaded{
        [&](std::span<const uint8_t, 4 * BLOCK_SIZE> in, std::span<uint8_t, 4 * BLOCK_SIZE> out) {
            SIMD_4x32 B0 = SIMD_4x32::load_le(in.data() + 16 * 0);
            SIMD_4x32 B1 = SIMD_4x32::load_le(in.data() + 16 * 1);
            SIMD_4x32 B2 = SIMD_4x32::load_le(in.data() + 16 * 2);
            SIMD_4x32 B3 = SIMD_4x32::load_le(in.data() + 16 * 3);

            keyxor(K0, B0, B1, B2, B3);
            aesenc(K1, B0, B1, B2, B3);
            aesenc(K2, B0, B1, B2, B3);
            aesenc(K3, B0, B1, B2, B3);
            aesenc(K4, B0, B1, B2, B3);
            aesenc(K5, B0, B1, B2, B3);
            aesenc(K6, B0, B1, B2, B3);
            aesenc(K7, B0, B1, B2, B3);
            aesenc(K8, B0, B1, B2, B3);
            aesenc(K9, B0, B1, B2, B3);
            aesenclast(K10, B0, B1, B2, B3);

            B0.store_le(out.data() + 16 * 0);
            B1.store_le(out.data() + 16 * 1);
            B2.store_le(out.data() + 16 * 2);
            B3.store_le(out.data() + 16 * 3);
            },
        [&](std::span<const uint8_t, BLOCK_SIZE> in, std::span<uint8_t, BLOCK_SIZE> out) {
            SIMD_4x32 B0 = SIMD_4x32::load_le(in.data());

            B0 ^= K0;
            aesenc(K1, B0);
            aesenc(K2, B0);
            aesenc(K3, B0);
            aesenc(K4, B0);
            aesenc(K5, B0);
            aesenc(K6, B0);
            aesenc(K7, B0);
            aesenc(K8, B0);
            aesenc(K9, B0);
            aesenclast(K10, B0);

            B0.store_le(out.data());
        },
    });
}

@coveralls
Copy link

Coverage Status

coverage: 91.37% (+0.03%) from 91.344%
when pulling c88455d on Rohde-Schwarz:rene/buffer_transformer
into 13edb92 on randombit:master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants