
Improve performance by 2x #31

Merged · 4 commits merged into openai:main on Feb 26, 2023

Conversation

@nistath (Contributor) commented Feb 13, 2023

TODO

The first commit of this PR has been reverted until @hauntsaninja can confirm PCRE2 behaves as expected.

Summary

This PR contributes a 2x performance improvement via optimizations that I implemented this weekend.
It achieves this improvement by using PCRE2 for regex parsing and by changing byte_pair_merge to reduce rank lookups, reduce branching, and improve CPU-cache efficiency.

List of changes

In particular, the improvements to byte_pair_merge are the following (a combined sketch appears after this list):

  • Changing the parts vector to avoid repetition of data.

    This vector used to store ranges for which the invariant parts[i].end == parts[i + 1].start holds, which makes the vector twice as big as it needs to be.
    Keeping this vector small improves CPU-cache efficiency.

  • Using usize::MAX as a sentinel in lieu of Option for the computation of the minimum rank.

    This change removes branching from the loop that computes the minimum rank, generating assembly that uses conditional moves instead.

    Ideally, we could keep the Option and inform it of the sentinel, much like Option<NonZeroUsize>. As far as I can tell, custom sentinels for Option are the subject of an old Rust RFC that has stalled, so we don't get to have nice things.

  • Minimizing the number of lookups into ranks by looking up ranks once and iteratively updating them after each merge.

    This reduces the number of rank lookups from n*m to n + O(m).
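
To make these three changes concrete, here is a minimal, self-contained sketch of the resulting inner loop. It is illustrative rather than the exact code in this PR; the function name byte_pair_merge_sketch and the HashMap<Vec<u8>, usize> rank table are assumptions for the example.

use std::collections::HashMap;

fn byte_pair_merge_sketch(piece: &[u8], ranks: &HashMap<Vec<u8>, usize>) -> Vec<(usize, usize)> {
    // Each entry stores (start offset, rank of the pair beginning there).
    // The end of part i is implicitly parts[i + 1].0, so each boundary is
    // stored once; usize::MAX is the sentinel for "no valid rank".
    let mut parts: Vec<(usize, usize)> =
        (0..piece.len() + 1).map(|offset| (offset, usize::MAX)).collect();

    // Rank of the pair spanning parts[i] .. parts[i + skip + 2].
    let get_rank = |parts: &Vec<(usize, usize)>, i: usize, skip: usize| -> usize {
        if i + skip + 2 < parts.len() {
            ranks
                .get(&piece[parts[i].0..parts[i + skip + 2].0])
                .copied()
                .unwrap_or(usize::MAX)
        } else {
            usize::MAX
        }
    };

    // Look up every rank once up front...
    for i in 0..parts.len().saturating_sub(2) {
        parts[i].1 = get_rank(&parts, i, 0);
    }

    // ...then repeatedly merge the minimum-rank pair, refreshing only the
    // two ranks adjacent to the merge point.
    while parts.len() > 1 {
        // Comparing plain usizes against the MAX sentinel keeps this scan
        // branch-light; the compiler can emit conditional moves.
        let mut min = (usize::MAX, 0usize);
        for (i, &(_, rank)) in parts[..parts.len() - 1].iter().enumerate() {
            if rank < min.0 {
                min = (rank, i);
            }
        }
        if min.0 == usize::MAX {
            break;
        }
        let i = min.1;
        parts[i].1 = get_rank(&parts, i, 1);
        if i > 0 {
            parts[i - 1].1 = get_rank(&parts, i - 1, 1);
        }
        parts.remove(i + 1);
    }
    parts
}

Note how each merge refreshes at most two adjacent ranks instead of re-querying the whole table, which is where the reduction from n*m to n + O(m) lookups comes from.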

Benchmarking

Background

This gist contains supplemental information for the benchmarks I used.
Before collecting measurements, I quiesced my laptop (Intel i7-7700HQ @ 2.80GHz) by setting up isolcpus and following this guide.
The benchmarks were run with --reps 30.

The only change my code makes to parallel performance is the thread-local model for the regex parser. The PCRE2 library is mostly thread-safe, and the rust-pcre2 bindings crate I forked uses the thread_local crate for any scratch space used during regex matching. I have left notes on this in the diff.
I do not repeat the scaling measurements as I don't have that many cores, but I do one measurement to validate that the above difference doesn't play a big role.
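
As an illustration of that thread-local model, here is a hypothetical sketch in the spirit of what the forked bindings do with the thread_local crate; the Matcher type and its fields are invented for this example and are not the rust-pcre2 API.

use std::cell::RefCell;
use thread_local::ThreadLocal; // the `thread_local` crate

// Invented type for illustration: a shared matcher whose mutable scratch
// space (e.g. a PCRE2 match-data buffer) lives in per-thread storage.
struct Matcher {
    scratch: ThreadLocal<RefCell<Vec<usize>>>,
}

impl Matcher {
    fn new() -> Self {
        Matcher { scratch: ThreadLocal::new() }
    }

    fn find(&self, haystack: &[u8]) -> Option<(usize, usize)> {
        // Each thread lazily creates, then reuses, its own buffer, so the
        // matcher can be shared across threads without locking.
        let scratch = self.scratch.get_or(|| RefCell::new(Vec::new()));
        let mut buf = scratch.borrow_mut();
        buf.clear();
        // ... the real code would run the PCRE2 match into `buf` here ...
        let _ = haystack;
        None
    }
}

Because each thread owns its scratch buffer, the shared matcher needs no lock on the hot path; the only cost is a one-time allocation per thread.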

Datasets

Since the original benchmarking dataset is not available, I used two files: 64KB of Lorem Ipsum (lorem_ipsum_64k.txt) and 64K emojis (252KB) generated with ./emoji_gen.py 64000 > emojis.txt. The Lorem Ipsum file and the script to generate random emoji sequences are available in the gist.

Measurements

I compare 3 incremental versions of tiktoken: the original, the one with PCRE2 (commit 1), and the one with PCRE2 and my byte_pair_merge optimizations (commit 2).
The original version of tiktoken fails to complete emojis.txt and I gave up on waiting for it after a few minutes (compared to 24 ms for the version in this PR).

The following table shows throughput in bytes per second for each version, with the incremental speedup (Δ) between adjacent columns.

| encoding | file | cores | original (B/s) | Δ | commit 1 (B/s) | Δ | commit 2 (B/s) | total speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt2 | lorem_ipsum_64k | 1 | 4224157 | +75.11% | 7396769 | +40.83% | 10417089 | 2.47x |
| r50k_base | lorem_ipsum_64k | 1 | 4225109 | +76.48% | 7456639 | +36.46% | 10175065 | 2.41x |
| gpt2 | lorem_ipsum_64k | 4 | 14436023 | +54.91% | 22363385 | +27.29% | 28465609 | 1.97x |
| gpt2 | emojis | 1 | 0 (did not finish) | n/a | 10659178 | +2.38% | 10912785 | n/a |
| r50k_base | emojis | 1 | 0 (did not finish) | n/a | 10413159 | +3.10% | 10736463 | n/a |

With commit 1 reverted (i.e., commit 2 alone), the speedup is 15% on lorem_ipsum_64k, and emojis at least terminates, with a whopping 13933 bytes per second of throughput. The speedups from commit 2 are more evident once the regex bottleneck is dealt with first, but they are still important in isolation.

Testing

Since most of the tests are not available either, I simply ran both the original and my version against a bunch of text and ensured the outputs were the same.

Discussion

I also briefly attempted using a heap, as suggested in the comments.
I quickly abandoned that approach because, as mentioned in the diff, n = pieces.len() is very small (usually less than 10), so it is more cache-efficient to loop through a vector than to operate on a heap; that cache efficiency vastly outweighs any algorithmic speedup. A sketch of the rejected heap approach follows.
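
For context, here is a hypothetical sketch (none of it from this PR) of the bookkeeping a heap would force on us: each merge changes the ranks of the two neighboring pairs, so heap entries go stale and every pop has to be validated against the current ranks.

use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Hypothetical sketch (not PR code): find the index of the minimum-rank
// pair using a min-heap plus lazy invalidation. `current` holds the live
// rank of each pair; stale heap entries are skipped on pop.
fn pop_min_rank(
    heap: &mut BinaryHeap<Reverse<(usize, usize)>>,
    current: &[usize],
) -> Option<usize> {
    while let Some(Reverse((rank, i))) = heap.pop() {
        // An entry is stale if the pair's rank changed after a merge.
        if current.get(i) == Some(&rank) {
            return Some(i);
        }
    }
    None
}

fn main() {
    let current = vec![5, 2, 9];
    let mut heap: BinaryHeap<Reverse<(usize, usize)>> = current
        .iter()
        .copied()
        .enumerate()
        .map(|(i, r)| Reverse((r, i)))
        .collect();
    assert_eq!(pop_min_rank(&mut heap, &current), Some(1));
}

Every push and pop is O(log n), with index chasing on top; for n usually under 10, the straight scan over one small contiguous vector wins.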

Also, it is interesting to note that different datasets split their time differently between regex matching and byte_pair_merge.
In the original code on lorem_ipsum_64k, the regex and byte_pair_merge took roughly equal amounts of time according to perf.
From the table above, it is evident that emojis is mostly bottlenecked on regex matching.

@hauntsaninja (Collaborator) left a comment:

Wow, thanks, these changes look awesome!

Would you mind separating these changes out into two PRs? The changes to _byte_pair_merge look great and I can merge them quickly, but I'd need to do a little more homework on pcre2. Specifically, I'd need to:

  • Make sure that pcre2 doesn't cause problems for any internal-only regexes we might be using
  • Get approval for using your forked bindings. This could be easiest if your fork could be upstreamed, but is likely fine to install from git as long as we pin a commit hash
  • The Python bindings in encode_batch currently do not reuse threads. Easily fixable, but I do want to take the time to understand the pcre2 threading details myself :-)


// NOTE: using a macro here because a closure fails to get inlined
// according to optimization remarks.
macro_rules! get_rank {
@hauntsaninja (Collaborator) commented:

I think you can #[inline(always)] a closure, you just might need to surround it in braces

@nistath (Contributor, Author) replied:

Unfortunately, that runs into error[E0658]: attributes on expressions are experimental. So error[E0518]: attribute should be applied to function or closure is a little optimistic for now.

@nistath (Contributor, Author) added:

Also, the best closure I can come up with is

let get_rank_skip = |parts: &Vec<(usize, usize)>, start_idx: usize, skip: usize| {
    if (start_idx + skip + 2) < parts.len() {
        ranks
            .get(&piece[parts[start_idx].0..parts[start_idx + skip + 2].0])
            .copied()
    } else {
        None
    }
};

let get_rank = |parts: &Vec<(usize, usize)>, start_idx: usize| {
    get_rank_skip(parts, start_idx, 0)
};

Passing parts by reference, even though it can be captured, seems to be necessary. Otherwise, the borrow checker will complain that we mutably borrow (when assigning to parts) while the closure has an immutable borrow via the captured reference. It's possible there is a better solution, but I am not aware of one.
I think the best option is to keep the macro.

Generates assembly identical to the prior commit.
@hauntsaninja (Collaborator) commented Feb 26, 2023

Thanks again for the PR!

I got some time to reproduce benchmarks of the current version of this PR. On your lorem ipsum, it was about a 16% improvement. On internal datasets, it was somewhere between 5-20% better.

I tried benchmarking the version with PCRE, but I couldn't get it to work. I ran into undefined symbol: _pcre2_script_run_8 / symbol not found in flat namespace (__pcre2_script_run_8) on a work machine / laptop; is it obvious to you what the issue is there?

Re the macro, the following compiles and seems to perform equivalently (the trick is rustc wants more curly braces to let you put the attribute on the closure), but I didn't confirm the assembly looks the same:

diff --git a/tiktoken/src/lib.rs b/tiktoken/src/lib.rs
index b44d9c8b21..40241be910 100644
--- a/tiktoken/src/lib.rs
+++ b/tiktoken/src/lib.rs
@@ -26,10 +26,9 @@ fn _byte_pair_merge<T>(
     // A closure also cannot capture a reference to `piece` without
     // the borrow checker complaining about the mutable borrows during
     // the assignments later in this code.
-    macro_rules! get_rank {
-        ($start_idx:expr, $skip:expr) => {{
-            let start_idx: usize = $start_idx;
-            let skip: usize = $skip;
+    let get_rank = {
+        #[inline(always)]
+        |parts: &Vec<(usize, usize)>, start_idx: usize, skip: usize| {
             if (start_idx + skip + 2) < parts.len() {
                 ranks
                     .get(&piece[parts[start_idx].0..parts[start_idx + skip + 2].0])
@@ -37,16 +36,13 @@ fn _byte_pair_merge<T>(
             } else {
                 None
             }
-        }};
-        ($idx:expr) => {{
-            get_rank!($idx, 0)
-        }};
-    }
+        }
+    };
 
     // We look up the ranks once in the beginning and iteratively update
     // them during each merge, which reduces the number of rank lookups.
     for i in 0..parts.len() - 2 {
-        match get_rank!(i) {
+        match get_rank(&parts, i, 0) {
             Some(rank) => {
                 // usize::MAX is a sentinel value and cannot be a valid rank
                 debug_assert!(rank != usize::MAX);
@@ -89,9 +85,9 @@ fn _byte_pair_merge<T>(
             // parts[i] and parts[i-1] before removing, which could thrash
             // the cache. Thus, we update the rank calculation by skipping over
             // parts[i + 1], by invoking `get_rank!` with `skip = 1`.
-            parts[i].1 = get_rank!(i, 1).unwrap_or(usize::MAX);
+            parts[i].1 = get_rank(&parts, i, 1).unwrap_or(usize::MAX);
             if i > 0 {
-                parts[i - 1].1 = get_rank!(i - 1, 1).unwrap_or(usize::MAX);
+                parts[i - 1].1 = get_rank(&parts, i - 1, 1).unwrap_or(usize::MAX);
             }
 
             parts.remove(i + 1);

@hauntsaninja merged commit c4b8770 into openai:main on Feb 26, 2023