
Improve performance by 2x #31

Merged · 4 commits merged into openai:main on Feb 26, 2023

Conversation

@nistath (Contributor) commented Feb 13, 2023

TODO

The first commit of this PR has been reverted until @hauntsaninja can confirm PCRE2 behaves as expected.

Summary

This PR contributes a 2x performance improvement via optimizations that I implemented this weekend.
It achieves this improvement by using PCRE2 for regex parsing and by changing byte_pair_merge to reduce rank lookups, reduce branching, and improve CPU-cache efficiency.

List of changes

In particular, the improvements to byte_pair_merge are the following (a combined sketch appears after this list):

  • Changing the parts vector to avoid repetition of data.

    This vector used to store ranges for which the invariant parts[i].end == parts[i + 1].start holds, which makes the vector twice as big as it needs to be.
    Keeping this vector small improves CPU-cache efficiency.

  • Using usize::MAX as a sentinel in lieu of Option for the computation of the minimum rank.

    This change removes branching from the loop that computes the minimum rank, generating assembly that uses conditional moves instead.

    Ideally, we could keep the Option and inform it of the sentinel, much like Option<NonZeroUsize>. As far as I can tell, custom sentinels for Option are the subject of an old Rust RFC that has stalled, so we don't get to have nice things.

  • Minimizing the number of lookups into ranks by looking up ranks once and iteratively updating them after each merge.

    This reduces the number of rank lookups from n*m to n + O(m).
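
To make these three changes concrete, here is a minimal, self-contained sketch of the resulting inner loop. It is illustrative rather than the exact code in this PR; the function name byte_pair_merge_sketch and the HashMap<Vec<u8>, usize> rank table are assumptions for the example.

use std::collections::HashMap;

fn byte_pair_merge_sketch(piece: &[u8], ranks: &HashMap<Vec<u8>, usize>) -> Vec<(usize, usize)> {
    // Each entry stores (start offset, rank of the pair beginning there).
    // The end of part i is implicitly parts[i + 1].0, so each boundary is
    // stored once; usize::MAX is the sentinel for "no valid rank".
    let mut parts: Vec<(usize, usize)> =
        (0..piece.len() + 1).map(|offset| (offset, usize::MAX)).collect();

    // Rank of the pair spanning parts[i] .. parts[i + skip + 2].
    let get_rank = |parts: &Vec<(usize, usize)>, i: usize, skip: usize| -> usize {
        if i + skip + 2 < parts.len() {
            ranks
                .get(&piece[parts[i].0..parts[i + skip + 2].0])
                .copied()
                .unwrap_or(usize::MAX)
        } else {
            usize::MAX
        }
    };

    // Look up every rank once up front...
    for i in 0..parts.len().saturating_sub(2) {
        parts[i].1 = get_rank(&parts, i, 0);
    }

    // ...then repeatedly merge the minimum-rank pair, refreshing only the
    // two ranks adjacent to the merge point.
    while parts.len() > 1 {
        // Comparing plain usizes against the MAX sentinel keeps this scan
        // branch-light; the compiler can emit conditional moves.
        let mut min = (usize::MAX, 0usize);
        for (i, &(_, rank)) in parts[..parts.len() - 1].iter().enumerate() {
            if rank < min.0 {
                min = (rank, i);
            }
        }
        if min.0 == usize::MAX {
            break;
        }
        let i = min.1;
        parts[i].1 = get_rank(&parts, i, 1);
        if i > 0 {
            parts[i - 1].1 = get_rank(&parts, i - 1, 1);
        }
        parts.remove(i + 1);
    }
    parts
}

Note how each merge refreshes at most two adjacent ranks instead of re-querying the whole table, which is where the reduction from n*m to n + O(m) lookups comes from.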

Benchmarking

Background

This gist contains supplemental information for the benchmarks I used.
Before collecting measurements, I quiesced my laptop (Intel i7-7700HQ @ 2.80GHz) by setting up isolcpus and following this guide.
The benchmarks were run with --reps 30.

The only change my code makes to parallel performance is the thread-local model for the regex parser. The PCRE2 library is mostly thread-safe, and the rust-pcre2 bindings crate I forked uses the thread_local crate for any scratch space used during regex matching. I have left notes on this in the diff.
I do not repeat the scaling measurements as I don't have that many cores, but I do one measurement to validate that the above difference doesn't play a big role.
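
As an illustration of that thread-local model, here is a hypothetical sketch in the spirit of what the forked bindings do with the thread_local crate; the Matcher type and its fields are invented for this example and are not the rust-pcre2 API.

use std::cell::RefCell;
use thread_local::ThreadLocal; // the `thread_local` crate

// Invented type for illustration: a shared matcher whose mutable scratch
// space (e.g. a PCRE2 match-data buffer) lives in per-thread storage.
struct Matcher {
    scratch: ThreadLocal<RefCell<Vec<usize>>>,
}

impl Matcher {
    fn new() -> Self {
        Matcher { scratch: ThreadLocal::new() }
    }

    fn find(&self, haystack: &[u8]) -> Option<(usize, usize)> {
        // Each thread lazily creates, then reuses, its own buffer, so the
        // matcher can be shared across threads without locking.
        let scratch = self.scratch.get_or(|| RefCell::new(Vec::new()));
        let mut buf = scratch.borrow_mut();
        buf.clear();
        // ... the real code would run the PCRE2 match into `buf` here ...
        let _ = haystack;
        None
    }
}

Because each thread owns its scratch buffer, the shared matcher needs no lock on the hot path; the only cost is a one-time allocation per thread.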

Datasets

Since the original benchmarking dataset is not available, I used two files: 64KB of Lorem Ipsum (lorem_ipsum_64k.txt) and 64K emojis (252KB) generated with ./emoji_gen.py 64000 > emojis.txt. The Lorem Ipsum file and the script to generate random emoji sequences are available in the gist.

Measurements

I compare 3 incremental versions of tiktoken: the original, the one with PCRE2 (commit 1), and the one with PCRE2 and my byte_pair_merge optimizations (commit 2).
The original version of tiktoken fails to complete emojis.txt and I gave up on waiting for it after a few minutes (compared to 24 ms for the version in this PR).

The following table shows throughput in bytes per second for each version, with the incremental speedup (Δ) between adjacent columns.

| encoding | file | cores | original (B/s) | Δ | commit 1 (B/s) | Δ | commit 2 (B/s) | total speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt2 | lorem_ipsum_64k | 1 | 4224157 | +75.11% | 7396769 | +40.83% | 10417089 | 2.47x |
| r50k_base | lorem_ipsum_64k | 1 | 4225109 | +76.48% | 7456639 | +36.46% | 10175065 | 2.41x |
| gpt2 | lorem_ipsum_64k | 4 | 14436023 | +54.91% | 22363385 | +27.29% | 28465609 | 1.97x |
| gpt2 | emojis | 1 | 0 (did not finish) | n/a | 10659178 | +2.38% | 10912785 | n/a |
| r50k_base | emojis | 1 | 0 (did not finish) | n/a | 10413159 | +3.10% | 10736463 | n/a |

With commit 1 reverted (i.e., commit 2 alone), the speedup is 15% on lorem_ipsum_64k, and emojis at least terminates, with a whopping 13933 bytes per second of throughput. The speedups from commit 2 are more evident once the regex bottleneck is dealt with first, but they are still important in isolation.

Testing

Since most of the tests are not available either, I simply ran both the original and my version against a bunch of text and ensured the outputs were the same.

Discussion

I also briefly attempted using a heap, as suggested in the comments.
I quickly abandoned that approach because, as mentioned in the diff, n = pieces.len() is very small (usually less than 10), so it is more cache-efficient to loop through a vector than to operate on a heap; that cache efficiency vastly outweighs any algorithmic speedup. A sketch of the rejected heap approach follows.
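
For context, here is a hypothetical sketch (none of it from this PR) of the bookkeeping a heap would force on us: each merge changes the ranks of the two neighboring pairs, so heap entries go stale and every pop has to be validated against the current ranks.

use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Hypothetical sketch (not PR code): find the index of the minimum-rank
// pair using a min-heap plus lazy invalidation. `current` holds the live
// rank of each pair; stale heap entries are skipped on pop.
fn pop_min_rank(
    heap: &mut BinaryHeap<Reverse<(usize, usize)>>,
    current: &[usize],
) -> Option<usize> {
    while let Some(Reverse((rank, i))) = heap.pop() {
        // An entry is stale if the pair's rank changed after a merge.
        if current.get(i) == Some(&rank) {
            return Some(i);
        }
    }
    None
}

fn main() {
    let current = vec![5, 2, 9];
    let mut heap: BinaryHeap<Reverse<(usize, usize)>> = current
        .iter()
        .copied()
        .enumerate()
        .map(|(i, r)| Reverse((r, i)))
        .collect();
    assert_eq!(pop_min_rank(&mut heap, &current), Some(1));
}

Every push and pop is O(log n), with index chasing on top; for n usually under 10, the straight scan over one small contiguous vector wins.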

Also, it is interesting to note that different datasets split their time differently between regex matching and byte_pair_merge.
In the original code on lorem_ipsum_64k, the regex and byte_pair_merge took roughly equal amounts of time according to perf.
From the table above, it is evident that emojis is mostly bottlenecked on regex matching.

@hauntsaninja (Collaborator) left a comment:

Wow, thanks, these changes look awesome!

Would you mind separating these changes out into two PRs? The changes to _byte_pair_merge look great and I can merge them quickly, but I'd need to do a little more homework on pcre2. Specifically, I'd need to:

  • Make sure that pcre2 doesn't cause problems for any internal-only regexes we might be using
  • Get approval for using your forked bindings. This could be easiest if your fork could be upstreamed, but is likely fine to install from git as long as we pin a commit hash
  • The Python bindings in encode_batch currently do not reuse threads. Easily fixable, but I do want to take the time to understand the pcre2 threading details myself :-)


// NOTE: using a macro here because a closure fails to get inlined
// according to optimization remarks.
macro_rules! get_rank {
@hauntsaninja (Collaborator) commented:

I think you can #[inline(always)] a closure, you just might need to surround it in braces

@nistath (Contributor, Author) replied:

Unfortunately, that runs into error[E0658]: attributes on expressions are experimental. So error[E0518]: attribute should be applied to function or closure is a little optimistic for now.

@nistath (Contributor, Author) added:

Also, the best closure I can come up with is

let get_rank_skip = |parts: &Vec<(usize, usize)>, start_idx: usize, skip: usize| {
    if (start_idx + skip + 2) < parts.len() {
        ranks
            .get(&piece[parts[start_idx].0..parts[start_idx + skip + 2].0])
            .copied()
    } else {
        None
    }
};

let get_rank = |parts: &Vec<(usize, usize)>, start_idx: usize| {
    get_rank_skip(parts, start_idx, 0)
};

Passing parts by reference, even though it can be captured, seems to be necessary. Otherwise, the borrow checker will complain that we mutably borrow (when assigning to parts) while the closure has an immutable borrow via the captured reference. It's possible there is a better solution, but I am not aware of one.
I think the best option is to keep the macro.

Generates assembly identical to the prior commit.
@hauntsaninja (Collaborator) commented Feb 26, 2023

Thanks again for the PR!

I got some time to reproduce benchmarks of the current version of this PR. On your lorem ipsum, it was about a 16% improvement. On internal datasets, it was somewhere between 5-20% better.

I tried benchmarking the version with PCRE, but I couldn't get it to work. I ran into undefined symbol: _pcre2_script_run_8 / symbol not found in flat namespace (__pcre2_script_run_8) on a work machine / laptop; is it obvious to you what the issue is there?

Re the macro, the following compiles and seems to perform equivalently (the trick is rustc wants more curly braces to let you put the attribute on the closure), but I didn't confirm the assembly looks the same:

diff --git a/tiktoken/src/lib.rs b/tiktoken/src/lib.rs
index b44d9c8b21..40241be910 100644
--- a/tiktoken/src/lib.rs
+++ b/tiktoken/src/lib.rs
@@ -26,10 +26,9 @@ fn _byte_pair_merge<T>(
     // A closure also cannot capture a reference to `piece` without
     // the borrow checker complaining about the mutable borrows during
     // the assignments later in this code.
-    macro_rules! get_rank {
-        ($start_idx:expr, $skip:expr) => {{
-            let start_idx: usize = $start_idx;
-            let skip: usize = $skip;
+    let get_rank = {
+        #[inline(always)]
+        |parts: &Vec<(usize, usize)>, start_idx: usize, skip: usize| {
             if (start_idx + skip + 2) < parts.len() {
                 ranks
                     .get(&piece[parts[start_idx].0..parts[start_idx + skip + 2].0])
@@ -37,16 +36,13 @@ fn _byte_pair_merge<T>(
             } else {
                 None
             }
-        }};
-        ($idx:expr) => {{
-            get_rank!($idx, 0)
-        }};
-    }
+        }
+    };
 
     // We look up the ranks once in the beginning and iteratively update
     // them during each merge, which reduces the number of rank lookups.
     for i in 0..parts.len() - 2 {
-        match get_rank!(i) {
+        match get_rank(&parts, i, 0) {
             Some(rank) => {
                 // usize::MAX is a sentinel value and cannot be a valid rank
                 debug_assert!(rank != usize::MAX);
@@ -89,9 +85,9 @@ fn _byte_pair_merge<T>(
             // parts[i] and parts[i-1] before removing, which could thrash
             // the cache. Thus, we update the rank calculation by skipping over
             // parts[i + 1], by invoking `get_rank!` with `skip = 1`.
-            parts[i].1 = get_rank!(i, 1).unwrap_or(usize::MAX);
+            parts[i].1 = get_rank(&parts, i, 1).unwrap_or(usize::MAX);
             if i > 0 {
-                parts[i - 1].1 = get_rank!(i - 1, 1).unwrap_or(usize::MAX);
+                parts[i - 1].1 = get_rank(&parts, i - 1, 1).unwrap_or(usize::MAX);
             }
 
             parts.remove(i + 1);

@hauntsaninja merged commit c4b8770 into openai:main on Feb 26, 2023