-
Notifications
You must be signed in to change notification settings - Fork 10
/
blake3.tex
1548 lines (1328 loc) · 84.6 KB
/
blake3.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\documentclass[11pt,notitlepage,a4paper]{article}
%\usepackage[a4paper,hmargin=1.3in,vmargin=1.3in]{geometry}
\usepackage[
bookmarks=true, % show bookmarks bar?
unicode=true, % non-Latin characters in Acrobat’s bookmarks
pdftoolbar=false, % show Acrobat’s toolbar?
pdfmenubar=false, % show Acrobat’s menu?
pdffitwindow=false, % window fit to page when opened
pdfstartview={FitH}, % fits the width of the page to the window
pdftitle={BLAKE3}, % title
pdfauthor={BLAKE3 team}, % author
pdfsubject={BLAKE3}, % subject of the document
pdfnewwindow=true, % links in new window
colorlinks=true, % false: boxed links; true: colored links
linkcolor=orangered, % color of internal links
citecolor=orangered, % color of links to bibliography
filecolor=magenta, % color of file links
urlcolor=orangered, % color of external links
]{hyperref}
\usepackage{lmodern}
\usepackage{url}
%\usepackage{fullpage}
\usepackage{a4wide}
\usepackage{amsmath,amssymb,cite,dsfont}
\usepackage{verbatim,footmisc}
\usepackage{array}
%\usepackage{euler}
%\usepackage{helvet}
%\renewcommand{\familydefault}{\sfdefault}
%\fontfamily{phv}\selectfont
\renewcommand{\familydefault}{\sfdefault}
\usepackage[scaled=1]{helvet}
\usepackage[helvet]{sfmath}
\everymath={\sf}
\usepackage[T1]{fontenc} % improves underscore rendering
\usepackage{textcomp}
\usepackage{microtype}
\usepackage{multirow}
\usepackage{booktabs}
\usepackage{datetime}
\usepackage{xcolor}
\usepackage{graphicx}
\usepackage{subfigure}
\usepackage[english]{babel}
\usepackage[autostyle,english=american]{csquotes}
\usepackage{xspace}
\usepackage{booktabs}
\usepackage{listings}
\usepackage{tikz}
\usepackage[shortcuts]{extdash} % unbreakable dashes, e.g. SHA\=/2
\usepackage[outputdir=build]{minted} % syntax highlighting using Pygments
\usepackage[title]{appendix} % add "Appendix" to each appendix section title
\usepackage[font=small,labelfont=bf]{caption}
\usetikzlibrary
{
calc,
positioning,
shapes,
shapes.geometric,
arrows
}
\usepackage{pgfplots}
\pgfplotsset{
compat=newest,
% As per https://tex.stackexchange.com/a/350514/201018, this lets us force
% one graph line to the top, so that it's always drawn overlapping other
% lines. Used in img/avx512.tikz.
layers/my layer set/.define layer set={main, foreground}{},
set layers=my layer set,
}
\definecolor{darkred}{RGB}{170,0,0}
\definecolor{darkblue}{RGB}{0,0,170}
\definecolor{darkgreen}{RGB}{0,170,0}
\definecolor{orangered}{RGB}{255,69,0}
\definecolor{bluecompl}{RGB}{0,186,255}
\newdateformat{mydate}{\THEYEAR.\twodigit\THEMONTH.\twodigit\THEDAY}
\newcommand{\GG}{\mathsf{G}}
\newcommand{\IV}{\text{IV}}
\newcommand{\cpress}{\text{\textbf{compress}}}
\newcommand{\lto}{\leftarrow}
\newcommand{\BB}{\mathsf{B2}}
\newcommand{\flag}[1]{\texttt{\detokenize{#1}}\xspace}
\newcommand{\alert}[1]{\textcolor{red}{#1}}
\newcommand{\mytitle}{BLAKE3}
\title{\mytitle}
\begin{document}
\fontfamily{phv}\selectfont
\pagestyle{plain}
{\let\thefootnote\relax\footnotetext{Version \texttt{\pdfdate}.}}
\begin{center}
{\Huge \bf \mytitle}
\medskip
{\Large \bf one function, fast everywhere}
\medskip
Jack O'Connor (\texttt{@oconnor663}) \\
Jean-Philippe Aumasson (\texttt{@veorq}) \\
Samuel Neves (\texttt{@sevenps}) \\
Zooko Wilcox-O'Hearn (\texttt{@zooko}) \\
\medskip
{\large \url{https://blake3.io}}
\end{center}
\medskip
\begin{center}
\begin{minipage}{0.92\linewidth}
We present BLAKE3, an evolution of the BLAKE2 cryptographic hash that is
both faster and also more consistently fast across different platforms
and input sizes. BLAKE3 supports an unbounded degree of parallelism,
using a tree structure that scales up to any number of SIMD lanes and CPU
cores. On Intel Cascade Lake-SP, peak single-threaded throughput is
$4\times$ that of BLAKE2b, $8\times$ that of SHA\=/512, and $12\times$
that of SHA\=/256, and it can scale further using multiple threads.
BLAKE3 is also efficient on smaller architectures: throughput on a 32-bit
ARM1176 core is $1.3\times$ that of SHA\=/256 and $3\times$ that of
BLAKE2b and SHA\=/512. Unlike BLAKE2 and SHA\=/2, with different variants
better suited for different platforms, BLAKE3 is a single algorithm with
no variants. It provides a simplified API supporting all the use cases of
BLAKE2, including keying and extendable output. The tree structure also
supports new use cases, such as verified streaming and incremental
updates.
\end{minipage}
\end{center}
\newpage
\tableofcontents
\newpage
\section{Introduction}\label{sec:intro}
Since its announcement in 2012, BLAKE2~\cite{DBLP:conf/acns/AumassonNWW13} has
seen widespread adoption, in large part because of its superior performance in
software. BLAKE2b and BLAKE2s are included in OpenSSL and in the Python and Go
standard libraries. BLAKE2b is also included as the \texttt{b2sum} utility in
GNU Coreutils, as the \texttt{generichash} API in Libsodium, and as the
underlying hash function for Argon2~\cite{DBLP:conf/eurosp/BiryukovDK16}, the
winner of the Password Hashing Competition in 2015.
A drawback of BLAKE2 has been its large number of incompatible variants.
The performance tradeoffs between different variants are subtle, and
library support is uneven. BLAKE2b is the most widely supported, but it is not
the fastest on most platforms. BLAKE2bp and BLAKE2sp are more than twice as
fast on modern x86 processors, but they are sparsely supported and rarely
adopted.
BLAKE3 eliminates this drawback. It is a single algorithm with no variants,
designed for consistent high performance in software on all platforms. The
biggest changes from BLAKE2 to BLAKE3 are:
\begin{itemize}
\item \textbf{A binary tree structure}, providing an unbounded
degree of parallelism.
\item \textbf{Fewer rounds} in the BLAKE2s-based compression
function (7 instead of 10).
\item \textbf{Three modes}: hashing, keyed hashing, and key derivation.
These modes replace the BLAKE2 parameter block and provide a simpler
API.
\item \textbf{Zero-cost keyed hashing}, using the space formerly occupied
by the parameter block for the key.
\item \textbf{Built-in extendable output} (a.k.a. XOF), which is
parallelizable and seekable, like BLAKE2X but unlike SHA\=/3 or HKDF.
\end{itemize}
BLAKE3 splits its input into 1~KiB chunks and arranges them as the leaves of a
binary tree. Each chunk is compressed independently, so the degree of
parallelism is equal to the number of
chunks~\cite{DBLP:journals/tosc/AtighehchiB17,DBLP:journals/tc/AtighehchiR17}.
This makes BLAKE3 very fast on modern processors, where it can take advantage
of SIMD instructions and multiple cores. Another benefit of compressing chunks
in parallel is that the implementation can use SIMD vectors of any width,
regardless of the word size of the compression function (see
\S\ref{sec:wordsize}). That leaves us free to select a compression function
that is efficient on smaller architectures, without sacrificing peak throughput
on x86\=/64. Thus our choice of BLAKE2s instead of BLAKE2b.
BLAKE3 has the same 128-bit security level and 256-bit default output size as
BLAKE2s. The internal mixing logic of the compression function is also the
same, with a simplified message schedule derived from a single repeated
permutation. Based on existing cryptanalysis of BLAKE and BLAKE2, BLAKE3
reduces the number of rounds in the compression function from 10 to 7
(see~\cite{TMC} for a detailed rationale). BLAKE3 also changes the setup and
finalization steps of the compression function to support the internal tree
structure, more efficient keying, and extendable output.
\section{Specification}\label{sec:specification}
\subsection{Tree Structure}\label{sec:tree}
The input of BLAKE3 is split into contiguous chunks of 1024 bytes, such
that the last chunk may be shorter, but not empty, unless the entire input is empty.
If there is only one chunk, that chunk
is the root node and only node of the tree.
Otherwise, the chunks are assembled with parent nodes, each parent node having exactly two children.
Two rules determine the structure of the tree:
\begin{enumerate}
\item Left subtrees are full. Each left subtree is a complete binary tree,
with all its chunks at the same depth, and a number of chunks that is a
power of 2.
\item Left subtrees are big. Each left subtree contains a number of chunks
greater than or equal to the number of chunks in its sibling right
subtree.
\end{enumerate}
In other words, given a message of $n > 1024$ bytes, the left subtree consists
of the first $$2^{10 + \left\lfloor \log_2 \left(\left\lfloor \frac{n-1}{1024}
\right\rfloor\right) \right\rfloor}$$ bytes, and the right subtree consists of
the remainder. For example, trees from 1 to 4 chunks have the structure shown
in Figure~\ref{fig:fourchunks}.
\begin{figure}[h]
\centering
\input{img/tree1to4.tikz}
\caption{Example tree structures, from 1 to 4 chunks.}
\label{fig:fourchunks}
\end{figure}
The compression function is used to derive a chaining value from each chunk and
parent node. The chaining value of the root node, encoded as 32 bytes in
little-endian order, is the default-length BLAKE3 hash of the input. BLAKE3
supports input of any byte length $0 \leq \ell < 2^{64}$.
\subsection{Compression Function}\label{sec:compression}
The compression function works with 32-bit words, and takes the
following inputs:
\begin{itemize}
\item The input chaining value, $h_{0} \ldots h_{7}$ (256 bits).
\item The message block, $m_{0} \ldots m_{15}$ (512 bits).
\item A 64-bit counter, $t=t_{0},t_{1}$, with $t_{0}$ the lower order word
and $t_{1}$ the higher order word.
\item The number of input bytes in the block, $b$ (32 bits).
\item A set of domain separation bit flags, $d$ (32 bits).
\end{itemize}
The compression function initializes its 16-word internal state $v_{0} \ldots
v_{15}$ as follows:
\begin{equation*}
\begin{pmatrix}
v_{0} & v_{1} & v_{2} & v_{3} \\
v_{4} & v_{5} & v_{6} & v_{7} \\
v_{8} & v_{9} & v_{10} & v_{11} \\
v_{12} & v_{13} & v_{14} & v_{15} \\
\end{pmatrix}
\leftarrow
\begin{pmatrix}
h_{0} & h_{1} & h_{2} & h_{3} \\
h_{4} & h_{5} & h_{6} & h_{7} \\
\IV_{0} & \IV_{1} & \IV_{2} & \IV_{3} \\
t_{0} & t_{1} & b & d
\end{pmatrix}
\end{equation*}
The $\IV_{0} \ldots \IV_{7}$ constants are the same as in
BLAKE2s, and they are reproduced in Table~\ref{tab:constants}.
\begin{table}
\caption{Constants $\IV_0\dots\IV_7$ used in the initialization of BLAKE3.}%
\label{tab:constants}
\centering
\begin{tabular}{cc}
\toprule
$\IV_0$ & \texttt{0x6a09e667} \\
$\IV_1$ & \texttt{0xbb67ae85} \\
$\IV_2$ & \texttt{0x3c6ef372} \\
$\IV_3$ & \texttt{0xa54ff53a} \\
$\IV_4$ & \texttt{0x510e527f} \\
$\IV_5$ & \texttt{0x9b05688c} \\
$\IV_6$ & \texttt{0x1f83d9ab} \\
$\IV_7$ & \texttt{0x5be0cd19} \\
\bottomrule
\end{tabular}
\end{table}
The compression function applies a 7-round keyed permutation $v' = E(m, v)$ to
the state $v_0 \dots v_{15}$, keyed by the message $m_0 \dots m_{15}$.
The round function is the same as in BLAKE2s. A round consists of the following 8 operations:
\begin{align*}
\GG_{0}&(v_{0}, v_{4}, v_{8}, v_{12}) &
\GG_{1}&(v_{1}, v_{5}, v_{9}, v_{13}) &
\GG_{2}&(v_{2}, v_{6}, v_{10}, v_{14}) &
\GG_{3}&(v_{3}, v_{7}, v_{11}, v_{15}) \\
\GG_{4}&(v_{0}, v_{5}, v_{10}, v_{15}) &
\GG_{5}&(v_{1}, v_{6}, v_{11}, v_{12}) &
\GG_{6}&(v_{2}, v_{7}, v_{8}, v_{13}) &
\GG_{7}&(v_{3}, v_{4}, v_{9}, v_{14})\,.
\end{align*}
That is, a round applies a G function to each column of the $4\times 4$ state in parallel,
and then to each of the diagonals in parallel.
The ``quarter-round'' $\GG_i(a, b, c, d)$ is
defined as follows. Let $\oplus$ denote xor, $+$ denote addition modulo $2^{32}$, $\ggg$ denote
bitwise right rotation, and $m_{x}$ be the $x$th message word:
\begin{align*}
a \ & \leftarrow \ a + b + m_{2i+0} \\
d \ & \leftarrow \ (d \oplus a) \ggg 16 \\
c \ & \leftarrow \ c + d \\
b \ & \leftarrow \ (b \oplus c) \ggg 12 \\
a \ & \leftarrow \ a + b + m_{2i+1} \\
d \ & \leftarrow \ (d \oplus a) \ggg 8 \\
c \ & \leftarrow \ c + d \\
b \ & \leftarrow \ (b \oplus c) \ggg 7
\end{align*}
After each round (except the last one where it would be useless), the message words are permuted according to Table~\ref{tab:perm}.
\begin{table}
\centering
\caption{Permutational key schedule for BLAKE3's keyed permutation.}\label{tab:perm}
\begin{tabular}{cccccccccccccccc}
\toprule
0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15 \\ \midrule
2 & 6 & 3 & 10 & 7 & 0 & 4 & 13 & 1 & 11 & 12 & 5 & 9 & 14 & 15 & 8 \\
\bottomrule
\end{tabular}
\end{table}
The output of the compression function $h_{0}' \ldots h_{15}'$ is
defined as:
\begin{align*}
h_{0}' \ & \leftarrow \ v'_{0} \oplus v'_{8} &
h_{8}' \ & \leftarrow \ v'_{8} \oplus h_{0} \\
h_{1}' \ & \leftarrow \ v'_{1} \oplus v'_{9} &
h_{9}' \ & \leftarrow \ v'_{9} \oplus h_{1} \\
h_{2}' \ & \leftarrow \ v'_{2} \oplus v'_{10} &
h_{10}' \ & \leftarrow \ v'_{10} \oplus h_{2} \\
h_{3}' \ & \leftarrow \ v'_{3} \oplus v'_{11} &
h_{11}' \ & \leftarrow \ v'_{11} \oplus h_{3} \\
h_{4}' \ & \leftarrow \ v'_{4} \oplus v'_{12} &
h_{12}' \ & \leftarrow \ v'_{12} \oplus h_{4} \\
h_{5}' \ & \leftarrow \ v'_{5} \oplus v'_{13} &
h_{13}' \ & \leftarrow \ v'_{13} \oplus h_{5} \\
h_{6}' \ & \leftarrow \ v'_{6} \oplus v'_{14} &
h_{14}' \ & \leftarrow \ v'_{14} \oplus h_{6} \\
h_{7}' \ & \leftarrow \ v'_{7} \oplus v'_{15} &
h_{15}' \ & \leftarrow \ v'_{15} \oplus h_{7}\,.
\end{align*}
If we define $v_l$ (resp. $v'_l$) and $v_h$ (resp. $v'_h$) as the first and last 8-words of the input (resp. output) of $E(m, v)$, the compression function may be written as
\[
\begin{pmatrix}
1 & 1 \\
0 & 1
\end{pmatrix}\cdot%
\begin{pmatrix}
v'_l \\ v'_h
\end{pmatrix} +
\begin{pmatrix}
0 \\ v_l
\end{pmatrix}\,.
\]
The output of the compression function is most often truncated to produce 256-bit chaining values. %We defer to analysis of the compression function to \S\ref{sec:security}.
%\subsubsection{Flags}
The compression function input $d$ is a bitfield, with each individual flag consisting of a power of 2. The value of $d$ is the sum of all the flags defined for a given
compression. Their names and values are given in Table~\ref{tab:flags}.
\begin{table}
\centering
\caption{Admissible values for input $d$ in the BLAKE3 compression function.}%
\label{tab:flags}
\begin{tabular}{cc}
\toprule
Flag name & Value \\ \midrule
\flag{CHUNK_START} & $2^0$ \\
\flag{CHUNK_END} & $2^1$ \\
\flag{PARENT} & $2^2$ \\
\flag{ROOT} & $2^3$ \\
\flag{KEYED_HASH} & $2^4$ \\
\flag{DERIVE_KEY_CONTEXT} & $2^5$ \\
\flag{DERIVE_KEY_MATERIAL} & $2^6$ \\
\bottomrule
\end{tabular}
\end{table}
\subsection{Modes}\label{sec:modes}
BLAKE3 defines three domain-separated modes: \flag{hash(input)},
\flag{keyed_hash(key, input)}, and \flag{derive_key(context, key_material)}.
These are intended to be user-facing APIs in an implementation. The \flag{hash}
mode takes an input of any byte length $0 \leq \ell < 2^{64}$. The
\flag{keyed_hash} mode takes an input of any length together with a 256-bit
key. The \flag{derive_key} mode takes a context string and key material, both
of any length. As described in \S\ref{sec:extendable}, all three modes can
produce output of any length. The intended usage of \flag{derive_key} is given
in \S\ref{sec:kdf}.
The modes differ from each other in their key words $k_{0} \ldots k_{7}$ and in
the additional flags they set for every call to the compression function. In
the \flag{hash} mode, $k_0 \ldots k_7$ are the constants $\IV_0 \ldots \IV_7$,
and no additional flags are set. In the \flag{keyed_hash} mode, $k_0 \ldots
k_7$ are parsed in little-endian order from the 256-bit key given by the
caller, and the \flag{KEYED_HASH} flag is set for every compression.
The third mode, \flag{derive_key}, has two stages. First the context string is
hashed, with $k_0 \ldots k_7$ set to the constants $\IV_0 \ldots \IV_7$, and
the \flag{DERIVE_KEY_CONTEXT} flag set for every compression. Then the key
material is hashed, with $k_0 \ldots k_7$ set to the first 8 output words of
the first stage, and the \flag{DERIVE_KEY_MATERIAL} flag set for every
compression.
\subsection{Chunk Chaining Values}\label{sec:chunk}
Processing a chunk is structurally similar to the sequential hashing mode of BLAKE2. Each chunk of up to 1024 bytes is split into blocks of up to 64 bytes.
The last block of the last chunk may be shorter, but not empty, unless the
entire input is empty. If necessary, the last block is padded with zeros to be 64
bytes.
Each block is parsed in little-endian order into message words $m_{0}
\ldots m_{15}$ and compressed. The input chaining value $h_{0} \ldots h_{7}$
for the first block of each chunk is composed of the key words $k_{0} \ldots k_{7}$. The
input chaining value for subsequent blocks in each chunk is the output of the truncated
compression function for the previous block.
The remaining compression function parameters are handled as follows (see also Figure~\ref{fig:chunk} for an example):
\begin{itemize}
\item The counter $t$ for each block is the chunk index, i.e., $0$ for all
blocks in the first chunk, $1$ for all blocks in the second chunk, and so on.
\item The block length $b$ is the
number of input bytes in each block, i.e., $64$ for all blocks except the last block of
the last chunk, which may be short.
\item The first block of each chunk sets the
\flag{CHUNK_START} flag (cf. Table~\ref{tab:flags}), and the last block of each chunk sets the
\flag{CHUNK_END} flag. If a chunk contains only one block, that block sets
both \flag{CHUNK_START} and \flag{CHUNK_END}. If a chunk is the root of
its tree, the last block of that chunk also sets the \flag{ROOT} flag.
\end{itemize}
The output of the truncated compression function for the last block in a chunk
is the chaining value of that chunk.
\begin{figure}
\centering
\input{img/chunk.tikz}
\caption{Example of compression function inputs when hashing a 181-byte input $(m_0, m_1, m_2)$ into a 128-byte output $(h_0, h_1)$. Trapezia indicate that the compression function output is truncated to 256 bits.}\label{fig:chunk}
\end{figure}
\subsection{Parent Node Chaining Values}\label{sec:parent}
Each parent node has exactly two children, each either a chunk or another
parent node. The chaining value of each parent node is given by a single call
to the compression function. The input chaining value $h_{0} \ldots h_{7}$ is
the key words $k_{0} \ldots k_{7}$. The message words $m_{0} \ldots m_{7}$ are
the chaining value of the left child, and the message words $m_{8} \ldots
m_{15}$ are the chaining value of the right child. The counter $t$ for parent
nodes is always 0. The number of bytes $b$ for parent nodes is always 64.
Parent nodes set the \flag{PARENT} flag. If a parent node is the root of the
tree, it also sets the \flag{ROOT} flag. The output of the truncated
compression function is the chaining value of the parent node.
\subsection{Extendable Output}\label{sec:extendable}
BLAKE3 can produce outputs of any byte length $0 \leq \ell < 2^{64}$.
This is done by repeating the root compression---that is, the very last
call to the compression function, which sets the \flag{ROOT} flag---with
incrementing values of the counter $t$. The results of these repeated root
compressions are then concatenated to form the output.
When building the output, BLAKE3 uses the full output of the
compression function (cf. \S\ref{sec:compression}). Each 16-word output is
encoded as 64 bytes in little-endian order.
Observe that based on \S\ref{sec:chunk} and \S\ref{sec:parent} above, the first
root compression always uses the counter value $t = 0$. That is either because
it is the last block of the only chunk, which has a chunk index of $0$, or
because it is a parent node. After the first root compression, as long as more
output bytes are needed, $t$ is incremented by 1, and the root compression is
repeated on otherwise the same inputs. If the target output length is not a
multiple of 64, the final compression output is truncated. See
Figure~\ref{fig:chunk} for an example.
Because the repeated root compressions differ only in the value of $t$, the
implementation can execute any number of them in parallel. The caller can also
adjust $t$ to seek to any point in the output stream.
Note that in contrast to BLAKE2 and BLAKE2X, BLAKE3 does not domain separate
outputs of different lengths. Shorter outputs are prefixes of longer ones. The
caller can extract as much output as needed, without knowing the final length
in advance.
\section{Performance}\label{sec:performance}
\begin{figure}[h]
\centering
\input{./img/avx512.tikz}
\caption{Throughput for single-threaded BLAKE3 and other hash functions on an
AWS c5.metal instance, measured in cycles per byte. This is an Intel
Cascade Lake-SP processor with AVX-512 support. Lower is faster.}%
\label{fig:avx512}
\end{figure}
\begin{figure}[h]
\centering
\input{./img/threads.tikz}
\caption{Throughput for multi-threaded BLAKE3 on the same c5.metal instance
as in Figure~\ref{fig:avx512}, measured in GiB/s, varying the number of
threads. This machine has 48 physical cores. Higher is faster.}%
\label{fig:threads}
\end{figure}
\begin{figure}[h]
\centering
%\input{./benchmarks/results/rpizero.pgf}
\input{./img/rpizero.tikz}
\caption{Throughput for single-threaded BLAKE3 and other hash functions on a
Rasperry Pi Zero, measured in cycles per byte. This is a 32-bit ARM1176
processor. Lower is faster.}%
\label{fig:rpizero}
\end{figure}
Several factors contribute to the performance of BLAKE3, depending on the
platform and the size of the input:
\begin{itemize}
\item The implementation can compress groups of chunks in parallel using
SIMD (cf.~\S\ref{sec:simd}). This is visible at 2~KiB and above in
Figure~\ref{fig:avx512}.
\item The implementation can use multiple threads
(cf.~\S\ref{sec:multithreading}). This is the subject of
Figure~\ref{fig:threads}. Note that Figures~\ref{fig:avx512}
and~\ref{fig:rpizero} are single-threaded.
\item The compression function performs fewer rounds than in BLAKE2s. This
improves performance across the board, but it is especially visible in
the difference between BLAKE3 and BLAKE2s in Figure~\ref{fig:rpizero}.
\item The compression function uses 32-bit words (cf.
\S\ref{sec:wordsize}), which perform well on smaller architectures.
This is the subject of Figure~\ref{fig:rpizero}.
\item The 64-byte block size is relatively small, which helps performance
for very short inputs. This is visible at 64 bytes (the left edge) in
Figures~\ref{fig:avx512} and~\ref{fig:rpizero}.
\end{itemize}
Figure~\ref{fig:avx512} shows the throughput of single-threaded BLAKE3 on
modern server hardware. This benchmark ran on an AWS c5.metal instance with a
pair of Intel Xeon Platinum 8275CL (Cascade Lake-SP) processors, which support
AVX-512 vector instructions.
The BLAKE3 curve passes through several different regimes in
Figure~\ref{fig:avx512}. For a single chunk of input, up to 1~KiB, BLAKE3
closely mirrors BLAKE2s (as it does throughout all of
Figure~\ref{fig:rpizero}). This is because their compression functions are very
similar, apart from the number of rounds. Between 1 and 2~KiB, the gap between
BLAKE3 and BLAKE2s closes slightly. This is because BLAKE3 divides the input
into two chunks with a parent node, and compressing the parent node adds
overhead. At 2~KiB, BLAKE3 begins compressing multiple chunks in parallel. This
implementation uses 2-way row-based parallelism at 2~KiB, 4-way row-based
parallelism at 4~KiB, 8-way word-based parallelism at 8~KiB, and 16-way
word-based parallelism at 16~KiB (see \S\ref{sec:simd}). Beyond 16~KiB,
throughput is relatively stable, until memory bandwidth effects appear for very
long inputs.
On this platform, the biggest performance difference between algorithms is how
much they can benefit from SIMD. Only BLAKE3 and KangarooTwelve are able to
take full advantage of AVX-512. The fixed-interleaved tree structures of
BLAKE2bp and BLAKE2sp limit those algorithms to 256-bit vectors. Other hash
functions are much slower, because they cannot compress multiple blocks in
parallel. BLAKE3 and KangarooTwelve are close to each other in peak throughput, but
KangarooTwelve needs more input to reach its peak, in part because of its
larger 8~KiB chunk size.
Figure~\ref{fig:threads} shows the throughput of multi-threaded BLAKE3. This
benchmark ran on the same c5.metal instance as in Figure~\ref{fig:avx512},
which has 48 physical cores. This implementation uses the Rust Rayon library to
manage threading, with the recursive divide-and-conquer approach described in
\S\ref{sec:multithreading}. Below 128~KiB of input, the overhead of threading
is high, and increasing the number of threads causes a performance drop. For
longer inputs, throughput scales almost linearly with the number of threads, up
to 16 threads. Above 16 threads, throughput levels off due to memory bandwidth
effects.
Figure~\ref{fig:rpizero} shows the throughput of single-threaded BLAKE3 on a
smaller embedded platform. This benchmark ran on a Raspberry Pi Zero with a
32-bit ARM1176 processor. Here, the biggest difference between algorithms is
whether they use 32-bit or 64-bit words. As noted above, BLAKE3 and BLAKE2s
have similar performance profiles here, because their compression functions are
closely related, and because there is no multi-core or SIMD parallelism for
BLAKE3 to exploit.
Figures~\ref{fig:avx512} and~\ref{fig:rpizero} also highlight that BLAKE3
performs well for very short inputs, at or below its 64-byte block size. Very
short inputs get padded up to a full block, and algorithms with smaller block
sizes benefit from compressing less padding. The fixed-interleaved tree
structures of BLAKE2bp and BLAKE2sp are especially costly when the input is
short, because they always compress a parent node and a fixed number of leaves.
The advantage of a fixed-interleaved tree structure comes at medium input
lengths, where BLAKE3 has not yet reached a high degree of parallelism.
This is the regime around 1--4\,KiB in Figure~\ref{fig:avx512}, where BLAKE2bp
and BLAKE2sp pull ahead.
The SHA\=/2 implementations used in these benchmarks are from OpenSSL 1.1.1.d,
and the KangarooTwelve implementations are from the eXtended Keccak Code
Package (compiled with \texttt{gcc -Os} for the ARM1176).
\section{Security}\label{sec:security}
\subsection{Security Goals}\label{sec:goals}
BLAKE3 targets $128$-bit security for all of its security goals. That
is, $128$-bit security against (second-)preimage, collision, or
differentiability attacks.
The rationale for targeting $128$-bit security can be found
in~\cite[\S2]{TMC}.
The key length is nevertheless $256$ bits, as an extra defense layer
against weakly chosen keys, key material leaks, and potentially
multi-target attacks.
%Users who need $256$-bit security can remain using BLAKE2.
\subsection{Compression Function}\label{sec:compressindiff}
The BLAKE3 compression function is, at a high level, identical to that of BLAKE2s, with
the exception that:
\begin{itemize}
\item BLAKE2's $f_0$ and $f_1$ ``flags'' are replaced by the $b$ and $d$ inputs, which
encode the functional context of the compression function instance.
\item Finalization returns $16$ words instead of $8$, and only applies the feed-forward on the second half of the output.
\end{itemize}
The latter change causes the BLAKE3 compression function to no longer be indifferentiable, unlike its predecessor~\cite{DBLP:journals/tosc/LuykxMN16}. Indeed, an attacker with oracle access to $C(\dots)$ and $E(\dots)$ that obtains $y = C(x, m, b, d)$ can compute $(x, b, d) = E^{-1}(m, (y_l, y_h) \oplus (y_h, 0) \oplus (0, x_l))$; a simulator with access to a random oracle cannot perform the same inversion, and as such this scheme---much like Davies-Meyer---is not indifferentiable from a random oracle. This causes some surmountable complications in the next section.
We do note that the truncated compression function used in every compression except the root nodes is indifferentiable. The simulator and proof are very similar to those in~\cite{DBLP:journals/tosc/LuykxMN16}, with the only change being the removal of the feed-forward, which has no effect on security when truncating the output.\footnote{A recent work~\cite{DBLP:conf/asiacrypt/ChoiLL19} improves upon~\cite{DBLP:conf/fse/DodisRRS09,DBLP:journals/tosc/LuykxMN16} with a tighter bound for the indifferentiability of truncated permutations. A similar approach would work for BLAKE2 or truncated BLAKE3, but the benefits for our parameters are negligible.}
\subsection{Mode of Operation}\label{sec:mode}
One important aspect of any hash mode, parallel or otherwise, is its \emph{soundness}. Several works~\cite{DBLP:conf/fse/DodisRRS09,DBLP:journals/iacr/DaemenDA11,DBLP:journals/ijisec/BertoniDPA14,DBLP:journals/tosc/DaemenMA18} have studied the requirements for the mode of operation of a hash function such that, when instantiated with an ideal compression function, block cipher, or permutation, it remains indistinguishable from a random oracle. We adopt the requirements from Daemen et al.~\cite{DBLP:journals/tosc/DaemenMA18}, which are:
\begin{itemize}
\item Subtree-freeness, i.e, no valid tree may be a subtree of another;
\item Radical-decodability, i.e., preventing collisions by ambiguity of the tree shape;
\item Message-completeness, i.e., the entire message can be recovered from the inputs to the primitive in question.
\end{itemize}
\paragraph{Subtree-freeness}{
Subtree-freeness ensures that generalized length-extension attacks, like the ones that plagued Merkle-Damg{\aa}rd, cannot happen.
To ensure subtree-freeness, BLAKE3 defines the \flag{ROOT} flag, which is only set on the last compression output. Thus, for any valid tree, a valid subtree must have the same root node. There are two cases to consider here:
\begin{enumerate}
\item The \flag{ROOT} flag is combined with \flag{CHUNK_END}; in this case we are dealing with a single-chunk message, and the root nodes must be the same, as well as its parents, grandparents, etc, until we reach a \flag{CHUNK_START} node. Because both subtrees must start with \flag{CHUNK_START}, we conclude that both trees must be the same.
\item The \flag{ROOT} flag is applied to a parent node. Here the root node necessarily has two parents, each of which must be equal, recursing until they hit the same set of leaves marked by the \flag{CHUNK_END} flag.
\end{enumerate}
}
\paragraph{Radical-decodability}{
For radical-decodability, we can define the subset of final nodes. For each final node, one can define a function $\text{radical}()$ that returns a CV position for any final node. Here we have, again, two cases:
\begin{enumerate}
\item The final node has the \flag{CHUNK_END} flag set; here the radical is in the CV bits.
\item The final node is a parent node; here the CV comes in the message bits.
\end{enumerate}
}
\paragraph{Message-completeness}{
The entire input can be recovered from the message bits input to the chunks.
}
If only considering 256-bit outputs or shorter, by \cite[Theorem~1]{DBLP:journals/tosc/DaemenMA18}, for a distinguisher $\mathcal{D}$ of total complexity $q$ we have
\begin{align}
\mathsf{Adv}_{BLAKE3}^{\text{diff}}(\mathcal{D}) \le & ~\frac{\binom{q}{2}}{2^{256}} +\label{eq:thm1} \\
& ~\frac{\binom{q}{2}}{2^{512}} + \frac{\binom{q}{2}}{2^{256}} + \frac{q}{2^{128}} \label{eq:thm2} \,.
\end{align}
Here (\ref{eq:thm1}) corresponds to the contribution of \cite[Theorem~1]{DBLP:journals/tosc/DaemenMA18} and (\ref{eq:thm2}) corresponds to the indifferentiability of the truncated compression function~\cite{DBLP:journals/tosc/LuykxMN16}.
For the full-length output, we do not have a truncated ideal cipher, but instead use a feed-forward to prevent inversion. We remark, however, that the truncation requirement in \cite[Theorem~3]{DBLP:journals/tosc/DaemenMA18} is in place purely to prevent the distinguisher from inverting an output; as such, the feed-forward serves the same purpose as truncation, and the result still follows as long as the other soundness requirements are in place.
\subsection{Cryptanalysis}\label{sec:cryptanalysis}
\begin{table}[t]
\centering
\caption{Summary of cryptanalysis on BLAKE and BLAKE2.}%
\label{tab:cryptanalysis}
\begin{tabular}{lcccc}
\toprule
Primitive & Type & Rounds & Complexity & Reference \\
\midrule
BLAKE-256 k.p. & Boomerang & 7 & $2^{44}$ & \cite{DBLP:journals/iet-ifs/BaiYWW15,DBLP:conf/cisc/Hao14} \\
BLAKE-256 k.p. & Boomerang & 8 & $2^{198}$ & \cite{DBLP:conf/cisc/Hao14} \\
BLAKE-256 k.p. & Differential & 4 & $2^{192}$ & \cite{DK11} \\
BLAKE-256 k.p. & Differential & 6 & $2^{456}$ & \cite{DK11} \\
BLAKE-256 k.p. & Impossible Differential & 6.5 & -- & \cite{DBLP:conf/ctrsa/0001KNWW14} \\
BLAKE-256 c.f. & Boomerang & 7 & $2^{242}$ & \cite{DBLP:conf/fse/BiryukovNR11} \\
BLAKE-256 c.f. & Near collision (152/256) & 4 & $2^{21}$ & \cite{DBLP:conf/cans/SuWWD10} \\
BLAKE-256 c.f. & Near collision (232/256) & 4 & $2^{56}$ & \cite{DBLP:conf/fse/AumassonGKMM10} \\
BLAKE-256 c.f. & Preimage & 2.5 & $2^{241}$ & \cite{DBLP:journals/iacr/JiL09} \\
BLAKE-256 c.f. & Pseudo-preimage & 6.75 & $2^{253.9}$ & \cite{DBLP:conf/crypto/EspitauFK15} \\
\midrule
BLAKE2s k.p. & Boomerang & 7.5 & $2^{184}$ & \cite{DBLP:conf/cisc/Hao14} \\
BLAKE2s k.p. & Impossible Differential & 6.5 & -- & \cite{DBLP:conf/ctrsa/0001KNWW14} \\
BLAKE2s k.p. & Rotational & 4 & $2^{489}$ & \cite{DBLP:conf/ctrsa/0001KNWW14,DBLP:conf/fse/KhovratovichNPS15} \\
BLAKE2s c.f. & Boomerang & 5 & $\approx 2^{86}$ & \cite{DBLP:conf/fse/BiryukovNR11,DBLP:conf/ctrsa/0001KNWW14} \\
BLAKE2s c.f. & Pseudo-preimage & 6.75 & $2^{253.8}$ & \cite{DBLP:conf/crypto/EspitauFK15} \\
\bottomrule
\end{tabular}
\end{table}
The round reduction from 10 to 7 rounds is based on existent cryptanalysis, along with novel research.
High confidence in the security of 7 rounds follows from cryptanalysis
efforts over 10 years, first on
BLAKE~\cite{DBLP:conf/cans/SuWWD10,DBLP:journals/ipl/VidaliNP10,DBLP:journals/iet-ifs/BaiYWW15,DBLP:conf/fse/AumassonGKMM10,DBLP:journals/iacr/JiL09,DK11,DBLP:conf/fse/BiryukovNR11}
and then on BLAKE2~\cite{DBLP:conf/ctrsa/0001KNWW14,DBLP:conf/cisc/Hao14,DBLP:conf/crypto/EspitauFK15}. Table~\ref{tab:cryptanalysis} summarizes the current state of the art in BLAKE(2) cryptanalysis.
Furthermore, our own searches for good\footnote{Optimal under certain assumptions, which are not entirely correct, cf. \cite{cryptoeprint:2013:328}. Nevertheless we find these results useful as a first approximation.} differential trails, shown in Table~\ref{tab:trails}, show that by restricting input differences the initialization of BLAKE2/3 makes things harder for the attacker, with probabilities quickly becoming very low. This was also observed in~\cite[\S7]{DBLP:conf/ctrsa/0001KNWW14}.
The boomerang attacks of, e.g.,
\cite{DBLP:conf/cisc/Hao14,DBLP:conf/fse/BiryukovNR11,DBLP:journals/iet-ifs/BaiYWW15}
exploit the high probabilities on the first few rounds to connect two
trails and get through a few more rounds. This is an established
technique to get around the quick lowering of differential
characteristics' probabilities. But such attacks do not present a threat
to BLAKE3, considering its input constraints on the compression function
(that is, the IV constants), plus the $128$-bit security target.
The general survey and analysis in~\cite{TMC} also pleads in favor
of reduced round values, arguing that BLAKE2s and BLAKE2b would be as
safe with 7 and 8 rounds, respectively.
\begin{table}[t]
\centering
\caption{Best differential trail probabilities for increasing round numbers of the compression functions of BLAKE-256, BLAKE2s, and BLAKE3, respecting their constraints on inputs to the keyed permutation. Probabilities marked with ${}^\ast$ indicate rotationally symmetric differences were sought exclusively.}%
\label{tab:trails}
\begin{tabular}{lccccccc}
\toprule
Function & 0.5 & 1 & 1.5 & 2 & 2.5 & 3 & 3.5 \\ \midrule
BLAKE-256 k.p/c.f. & $2^{-0}$ & $2^{-0}$ & $2^{-0}$ & $2^{-1}$ & $2^{-6}$ & $2^{-7}$ & $2^{-38}$ \\
BLAKE2s k.p. & $2^{-0}$ & $2^{-0}$ & $2^{-0}$ & $2^{-1}$ & $2^{-6}$ & $2^{-7}$ & $2^{-38}$ \\
BLAKE3 k.p. & $2^{-0}$ & $2^{-0}$ & $2^{-0}$ & $2^{-1}$ & $2^{-7}$ & $2^{-21}$ & $2^{-52}$ \\
BLAKE2s c.f. & $2^{-0}$ & $2^{-0}$ & $2^{-0}$ & $2^{-1}$ & $2^{-32}$ & $2^{-88\le w < -48}$ & - \\
BLAKE3 c.f. & $2^{-0}$ & $2^{-0}$ & $2^{-0}$ & $2^{-1}$ & $2^{-32}$ & $2^{-153 \le w < -48}$ & - \\
BLAKE\{-256,2s\} (hash) & $2^{-0}$ & $2^{-1}$ & $2^{-32}$ & ${\ge 2^{-190}}^\ast$ & - & - & - \\ % 2^{-190} with 8-symmetric $\Delta$, $2^{-230}$ with 4-symmetric $\Delta$ m10 = 0x88888888
BLAKE3 (hash) & $2^{-0}$ & $2^{-1}$ & $2^{-29}$ & ${\ge 2^{-185}}^\ast$ & - & - & - \\ % 2^{-196} 4-symmetric, 2^{-185} 8-symmetric
BLAKE-256 c.f. ($m=0$) & $2^{-0}$ & $2^{-2}$ & $2^{-12}$ & $2^{-39}$ & - & - & - \\
BLAKE\{2s,3\} c.f. ($m=0$) & $2^{-0}$ & $2^{-8}$ & $2^{-32}$ & ${\ge 2^{-71}}$ & - & - & - \\ % 2^{-161} with 8-symmetric $\Delta$, $2^{-161}$ with 4-symmetric $\Delta$
\bottomrule
\end{tabular}
\end{table}
\section{Implementation}\label{sec:implementation}
\subsection{Incremental Hashing}\label{sec:incremental}
An incremental implementation of BLAKE3 has two major components: the state of
the current chunk and a stack of subtree chaining values (the ``CV stack'').
The chunk state is structurally similar to an incremental implementation of
BLAKE2b or BLAKE2s, and it will be familiar to implementers of other hash
functions. The CV stack is less familiar. Simplifying the management of the CV
stack is especially important for a clear and correct implementation of BLAKE3.
This section describes how this is done in the
\href{https://github.com/BLAKE3-team/BLAKE3/blob/master/reference_impl/reference_impl.rs}{reference
implementation}, as an aid to implementers.
\subsubsection{Chunk State}\label{sec:chunkstate}
The chunk state contains the 32-byte CV of the previous block and a 64-byte
input buffer for the next block, and typically also the compression function
parameters $t$ and $d$. Input bytes from the caller are copied into the buffer
until it is full. Then the buffer is compressed together with the previous CV,
using the truncated compression function. The output CV overwrites the previous
CV, and the buffer is cleared. An important detail here is that the last block
of a chunk is compressed differently, setting the \flag{CHUNK_END} flag and
possibly the \flag{ROOT} flag. In an incremental setting, any block could
potentially be the last, until the caller has supplied enough input to place at
least 1 byte in the block after. For that reason, the chunk state waits to
compress the next block until both the buffer is full and the caller has
supplied more input. Note that the CV stack takes the same approach immediately
below: chunk CVs are added only after the caller supplies at least 1 byte for
the following chunk.
\subsubsection{Chaining Value Stack}\label{sec:cvstack}
To help picture the role of the CV stack, Figure~\ref{fig:incrementaltrees}
shows a growing tree as chunk CVs are added incrementally. As just discussed
above, chunk CVs are added to this tree only after the caller has supplied at
least 1 byte for the following chunk, so we know that none of these chunks or
parent nodes is the root of the tree, and we do not need to worry about the
\flag{ROOT} flag yet.
\begin{figure}[h]
\centering
\input{img/cvstack.tikz}
\caption{An incomplete tree growing incrementally from 1 to 4 chunks. Dotted
boxes represent CVs that no longer need to be stored.}
\label{fig:incrementaltrees}
\end{figure}
When the first chunk CV ($c_0$) is added, it is alone. When the second chunk CV
($c_1$) is added, it completes a two-chunk subtree, and we can compute the
first parent CV ($p_0$). The third chunk CV ($c_2$) does not complete any
subtrees. Its final position in the tree structure will depend on whether it
gets a right sibling (see Figure~\ref{fig:fourchunks}), so we cannot create any
parent nodes for it yet. The fourth chunk CV ($c_3$) provides a right sibling
for $c_2$ and completes two subtrees, one of two chunks and one of four chunks.
First it merges with $c_2$ to compute $p_1$, then $p_1$ merges with $p_0$ to
compute $p_2$.
Note that once a subtree is complete, none of its child CVs will be used again,
and we do not need to store them any longer. These unneeded CVs are represented
by dotted boxes in Figure~\ref{fig:incrementaltrees}. Solid boxes represent CVs
that are still needed. These are what we store in the CV stack, ordered based
on when we computed them, with newer CVs on top.
Look through Figure~\ref{fig:incrementaltrees} again, this time paying
attention to the state of the CV stack at each step. When $c_0$ is added to the
tree, it is the only entry in the stack. When $c_1$ is added, we pop $c_0$ off
the stack to merge it, and $p_0$ becomes the only entry. When $c_2$ is added,
there are two entries in the stack, with $c_2$ on top because it is newer. When
$c_3$ is added, we perform two merges, first popping off $c_2$ and then popping
off $p_0$, and only $p_2$ remains in the stack.
Note how the stack order of $c_2$ and $p_0$ above matches the order in which we
need to retrieve them for merging. By ordering CVs in the stack from newest at
the top to oldest at the bottom, we also implicitly order them from tree-right
to tree-left, and from the smallest subtree to the largest subtree. This
invariant means that the next CV we need to merge is always on top of the
stack.
The key question is then, when a new chunk CV is added to the tree, how many
subtrees does it complete? Which is to say, how many CVs should we pop off the
stack and merge with it, before pushing on the final result?
To answer this question, note another pattern. The entries in the CV stack
behave like the 1-bits in the binary representation of the total number of
chunks so far. For example with 3 chunks (\texttt{0b11}), there are two entries in the
stack, corresponding to the two 1-bits. With 4 chunks (\texttt{0b100}), only one entry
remains, corresponding to the single 1-bit.
To see why this pattern holds, note that any completed subtree represented by a
CV in the stack contains a number of chunks that is a power of 2. Furthermore,
each subtree is a distinct power of 2, because when we have two subtrees of the
same size, we always merge them. Thus, the subtrees in the stack correspond to
the distinct powers of 2 that sum up to the current total number of chunks.
This is equivalent to the 1-bits in the binary representation of that number.
In this sense, popping a CV off the stack to merge it is like flipping a 1-bit
to 0, and pushing the final result back onto the stack is like flipping a 0-bit
to 1. The fact that we do two merges when adding the fourth chunk, corresponds
to the fact that two 1-bits flip to 0 as the total changes from 3 to 4. In
other words, we do two merges when we add the fourth chunk, because there are
two trailing 1-bits in the number 3, and equivalently two trailing 0-bits in
the number 4.
This pattern leads to an algorithm for adding a new chunk CV to the tree: For
each trailing 0-bit in the new total number of chunks, pop a CV off the stack,
and merge it with the new CV we were about to add. Finally, push the fully
merged result onto the stack. Listing~\ref{listing:push_chunk_chaining_value}
shows this algorithm as it appears in the Rust reference implementation.
\begin{listing}[h]
\begin{minted}[fontsize=\footnotesize]{rust}
fn add_chunk_chaining_value(&mut self, mut new_cv: [u32; 8], mut total_chunks: u64) {
// This chunk might complete some subtrees. For each completed subtree,
// its left child will be the current top entry in the CV stack, and
// its right child will be the current value of `new_cv`. Pop each left
// child off the stack, merge it with `new_cv`, and overwrite `new_cv`
// with the result. After all these merges, push the final value of
// `new_cv` onto the stack. The number of completed subtrees is given
// by the number of trailing 0-bits in the new total number of chunks.
while total_chunks & 1 == 0 {
new_cv = parent_cv(self.pop_stack(), new_cv, self.key, self.flags);
total_chunks >>= 1;
}
self.push_stack(new_cv);
}
\end{minted}
\caption{The algorithm in the Rust reference implementation that manages the
chaining value stack when a new chunk CV is added.}
\label{listing:push_chunk_chaining_value}
\end{listing}
Once we reach the end of the input, we can use a simpler algorithm to compute
the root hash. Finalize the CV of the current chunk, which may be incomplete or
empty. Then, merge that CV with every CV currently in the stack, regardless of
the number of chunks so far. Set the \flag{ROOT} flag for the last parent node
to be merged. Or if there are no CVs in the stack, and thus no parents to
merge, set the \flag{ROOT} flag for chunk finalization instead.
\subsection{Multi-threading}\label{sec:multithreading}
Most of the work of computing a BLAKE3 hash is compressing chunks. Each chunk
can be compressed independently, and one approach to multi-threading is to farm
out individual chunks or groups of chunks to tasks on a thread pool. In this
approach, a leader thread owns the chaining value stack (see
\S\ref{sec:cvstack}) and awaits a CV from each task in order.
This leader-workers approach has some inefficiencies. Spawning tasks and
creating channels usually requires heap allocation, which is a performance cost
that needs to be amortized over larger groups of chunks. At high degrees of
parallelism, managing the CV stack itself can become a bottleneck.
A recursive divide-and-conquer approach can be more efficient. The input is
split into left and right parts. As per the rules in \S\ref{sec:tree}, the left
part receives the largest power-of-2 number of chunks that leaves at least 1
byte for the right part. Each part then repeats this splitting step
recursively, until the parts are chunk-sized, and each chunk is compressed into
a CV. On the way back up the call stack, each pair of left and right child CVs
is compressed into a parent CV.
This recursive approach is good fit for a fork-join concurrency model, like
those provided by OpenMP, Cilk, and Rayon (Rust). Each left and right part
becomes a separate task, and a work-stealing runtime parallelizes those tasks
across however many threads are available. This can work without heap
allocation, because the runtime can make a fixed-size stack allocation at each
recursive call site.
The recursive approach is simplest when the entire input is available at once,
since no CV stack is needed in that case. In an incremental setting, a hybrid
approach is also possible. A large buffer of input can be compressed
recursively into a single subtree CV, and that CV can be pushed onto the CV
stack using the same algorithm as in \S\ref{sec:cvstack}. If each incremental
input is a fixed power-of-2 number of chunks in size (for example if all input
is copied into an internal buffer before compression), the algorithm works with
no modification. If each incremental input is instead a variable size (for
example if input from the caller is compressed directly without copying), the
implementation needs to carefully maintain the largest-to-smallest ordering
invariant of the CV stack. The size of each subtree being compressed must
evenly divide the total number of input bytes received so far, and to achieve
that the implementation might need to break up an incremental input into
smaller pieces. In the non-buffering case, the implementation must also take
care not to compress a parent node that could be the root, when it is unknown
whether more input is coming.
\subsection{SIMD}\label{sec:simd}
There are two primary approaches to using SIMD in a BLAKE3 implementation, and both are
important for high performance at different input lengths. The first approach
is to use 128-bit vectors to represent the 4-word rows of the state matrix. The
second approach is to use vectors of any size to represent words in multiple
states, which are compressed in parallel.
The first approach is similar to how SIMD is used in BLAKE2b or BLAKE2s, and it
is applicable to inputs of any length, particularly short inputs where the
second approach does not apply. The state $v_0 \ldots v_{15}$ is arranged into
four 128-bit vectors. The first vector contains the state words $v_0 \ldots
v_3$, the second vector contains the state words $v_4 \ldots v_7$, and so on.
Implementing the G function (see \S\ref{sec:compression}) with vector
instructions thus mixes all four columns of the state matrix in parallel. A
diagonalization step then rotates the words within each row so that each
diagonal now lies along a column, and the vectorized G function is repeated to
mix diagonals. Finally the state is undiagonalized, to prepare for the column
step of the following round.
The second approach is similar to how SIMD is used in BLAKE2bp or BLAKE2sp. In
this approach, multiple chunks are compressed in parallel, and each vector
contains one word from the state matrix of each chunk. That is, the first
vector contains the $v_0$ word from each state, the second vector contains the
$v_1$ word from each state, and so on, using 16 vectors in total. The width of
the vectors determines the number of chunks, so for example 128-bit vectors
compress 4 chunks in parallel, and 256-bit vectors compress 8 chunks in
parallel. Here the G function operates on one column or diagonal at a time, but
across all of the states, and no diagonalization step is required. When enough
input is available, this approach is much more efficient than the first
approach. It also scales to wider instruction sets like AVX2 and AVX-512.
The second approach can be integrated with the CV stack algorithm from
\S\ref{sec:cvstack} by computing multiple chunk CVs in parallel and then
pushing each of them into the CV stack one at a time. It can also be combined
with either of the multi-threading strategies from \S\ref{sec:multithreading}.
Rather than having each task or recursive leaf compress one chunk at a time,
each can compress multiple chunks in parallel.
A third approach is also possible, though it is less widely applicable than the
first two. Entire 4-word rows can be stored together within a vector, as in the
first approach, but larger vectors can be used to store multiple such rows,
each belonging to a different state matrix. For example, 256-bit vectors can
store two full rows, thus compressing two chunks in parallel. Diagonalization
and undiagonalization steps are still required in this arrangement, so the
second approach is preferable when enough input is available. However, on
platforms like x86 that do not support 64-bit vectors, this approach is the
only way to implement 2-way SIMD parallelism for 2-chunk inputs. The benefit of
using wider vectors may also outweigh the cost of diagonalization, though this
varies between different microarchitectures. The performance of this approach
is visible at 2~KiB and 4~KiB in Figure~\ref{fig:avx512}.
\subsection{Memory Requirements}\label{sec:memory}