Improve performance by 2x #31
Conversation
Wow, thanks, these changes look awesome!
Would you mind separating these changes out into two PRs? The changes to `_byte_pair_merge` look great and I can merge them quickly, but I'd need to do a little more homework on pcre2. Specifically, I'd need to:
- Make sure that pcre2 doesn't cause problems for any internal-only regexes we might be using
- Get approval for using your forked bindings. This could be easiest if your fork could be upstreamed, but is likely fine to install from git as long as we pin a commit hash
- The Python bindings in `encode_batch` currently do not reuse threads. This is easily fixable, but I do want to take the time to understand the pcre2 details myself when it comes to threads :-)
```rust
// NOTE: using a macro here because a closure fails to get inlined
// according to optimization remarks.
macro_rules! get_rank {
```
I think you can `#[inline(always)]` a closure, you just might need to surround it in braces.
Unfortunately, `error[E0658]: attributes on expressions are experimental`. So `error[E0518]: attribute should be applied to function or closure` is a little optimistic for now.
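For illustration, a minimal standalone sketch of the two placements (my own example, not code from the PR):

```rust
fn main() {
    // Directly attributing the closure expression trips E0658 on stable:
    //     let f = #[inline(always)] |x: usize| x + 1;
    // Wrapping the closure in braces attaches the attribute to the block's
    // tail expression, which the compiler accepts:
    let f = {
        #[inline(always)]
        |x: usize| x + 1
    };
    assert_eq!(f(1), 2);
}
```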
Also, the best closure I can come up with is:

```rust
let get_rank_skip = |parts: &Vec<(usize, usize)>, start_idx: usize, skip: usize| {
    if (start_idx + skip + 2) < parts.len() {
        ranks
            .get(&piece[parts[start_idx].0..parts[start_idx + skip + 2].0])
            .map(|r| *r)
    } else {
        None
    }
};
let get_rank = |parts: &Vec<(usize, usize)>, start_idx: usize| {
    get_rank_skip(parts, start_idx, 0)
};
```
Passing `parts` by reference, even though it could be captured, seems to be necessary. Otherwise, the borrow checker complains that we mutably borrow `parts` (when assigning to it) while the closure still holds an immutable borrow via the captured reference. It's possible there is a better solution, but I am not aware of one.
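A standalone sketch of this borrow-checker interaction (toy code of my own, not the PR's):

```rust
fn main() {
    let mut parts: Vec<(usize, usize)> = vec![(0, 0), (1, 0), (2, 0)];

    // Capturing `parts` would hold a shared borrow for the closure's whole
    // lifetime, so any later assignment to `parts` is rejected:
    //     let get_start = |i: usize| parts[i].0;
    //     parts[0].1 = get_start(1); // error[E0502]: cannot borrow `parts`
    //                                // as mutable because it is also
    //                                // borrowed as immutable

    // Taking `parts` as a parameter borrows it only for the duration of
    // each call, so mutating it between calls compiles:
    let get_start = |parts: &Vec<(usize, usize)>, i: usize| parts[i].0;
    parts[0].1 = get_start(&parts, 1);
    assert_eq!(parts[0], (0, 1));
}
```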
I think the best option is to keep the macro.
Generates assembly identical to the prior commit.
This reverts commit f775e7f.
Thanks again for the PR! I got some time to reproduce benchmarks of the current version of this PR. On your lorem ipsum, it was about a 16% improvement. On internal datasets, it was somewhere between 5% and 20% better. I tried benchmarking the version with PCRE, but I couldn't get it to work; I ran into an error.

Re the macro, the following compiles and seems to perform equivalently (the trick is that rustc wants more curly braces to let you put the attribute on the closure), but I didn't confirm the assembly looks the same:

```diff
diff --git a/tiktoken/src/lib.rs b/tiktoken/src/lib.rs
index b44d9c8b21..40241be910 100644
--- a/tiktoken/src/lib.rs
+++ b/tiktoken/src/lib.rs
@@ -26,10 +26,9 @@ fn _byte_pair_merge<T>(
     // A closure also cannot capture a reference to `piece` without
     // the borrow checker complaining about the mutable borrows during
     // the assignments later in this code.
-    macro_rules! get_rank {
-        ($start_idx:expr, $skip:expr) => {{
-            let start_idx: usize = $start_idx;
-            let skip: usize = $skip;
+    let get_rank = {
+        #[inline(always)]
+        |parts: &Vec<(usize, usize)>, start_idx: usize, skip: usize| {
             if (start_idx + skip + 2) < parts.len() {
                 ranks
                     .get(&piece[parts[start_idx].0..parts[start_idx + skip + 2].0])
@@ -37,16 +36,13 @@ fn _byte_pair_merge<T>(
             } else {
                 None
             }
-        }};
-        ($idx:expr) => {{
-            get_rank!($idx, 0)
-        }};
-    }
+        }
+    };
 
     // We look up the ranks once in the beggining and iteratively update
     // them during each merge, which reduces the number of rank lookups.
     for i in 0..parts.len() - 2 {
-        match get_rank!(i) {
+        match get_rank(&parts, i, 0) {
             Some(rank) => {
                 // usize::MAX is a sentinel value and cannot be a valid rank
                 debug_assert!(rank != usize::MAX);
@@ -89,9 +85,9 @@ fn _byte_pair_merge<T>(
             // parts[i] and parts[i-1] before removing, which could thrash
             // the cache. Thus, we update the rank calculation by skipping over
             // parts[i + 1], by invoking `get_rank!` with `skip = 1`.
-            parts[i].1 = get_rank!(i, 1).unwrap_or(usize::MAX);
+            parts[i].1 = get_rank(&parts, i, 1).unwrap_or(usize::MAX);
             if i > 0 {
-                parts[i - 1].1 = get_rank!(i - 1, 1).unwrap_or(usize::MAX);
+                parts[i - 1].1 = get_rank(&parts, i - 1, 1).unwrap_or(usize::MAX);
             }
             parts.remove(i + 1);
```
TODO
The first commit of this PR has been reverted until @hauntsaninja can confirm PCRE2 behaves as expected.
Summary
This PR contributes a 2x performance improvement via optimizations that I implemented this weekend.
It achieves this improvement by using PCRE2 for regex parsing and by changing `byte_pair_merge` to reduce rank lookups, reduce branching, and improve CPU-cache efficiency.

List of changes
In particular, the improvements to `byte_pair_merge` are:
- Changing the `parts` vector to avoid repetition of data. This vector used to store ranges for which the invariant `parts[i].end == parts[i + 1].start` holds, which made the vector twice as big as it needed to be. Keeping this vector small improves CPU-cache efficiency.
- Using `usize::MAX` as a sentinel in lieu of `Option` for the computation of the minimum rank. This change removes branching from the loop that computes the minimum rank, generating assembly that uses conditional moves instead (see the sketch after this list). Ideally, we could keep the `Option` and inform it of the sentinel, much like `Option<NonZeroUsize>`. As far as I can tell, specifying custom sentinels for `Option` is the subject of an old Rust RFC that has stalled, so we don't get to have nice things.
- Minimizing the number of lookups into `ranks` by looking up ranks once and iteratively updating them after each merge. This reduces the number of rank lookups from `n*m` to `n + O(m)`.
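To illustrate the sentinel idea in isolation (a minimal sketch of my own, not the PR's exact code): with `usize::MAX` meaning "no mergeable pair here", the minimum search never branches on an `Option`:

```rust
// Minimal sketch: find the lowest-rank pair in `parts`, where `.1` holds
// the precomputed rank. `usize::MAX` marks "no valid rank", so the loop
// body is a plain comparison that can compile to conditional moves.
fn min_rank(parts: &[(usize, usize)]) -> (usize, usize) {
    let mut min_rank = usize::MAX; // sentinel: nothing found yet
    let mut min_idx = usize::MAX;
    for (i, &(_, rank)) in parts.iter().enumerate() {
        if rank < min_rank {
            min_rank = rank;
            min_idx = i;
        }
    }
    (min_rank, min_idx)
}
```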
Benchmarking
Background
This gist contains supplemental information for the benchmarks I used.
Before collecting measurements, I quiesced my laptop (Intel i7-7700HQ @ 2.80GHz) by setting up isolcpus and following this guide.
The benchmarks were run with `--reps 30`.

The only change my code makes to parallel performance is the thread-local model for the regex parser. The PCRE2 library is mostly thread-safe, and the `rust-pcre2` bindings crate I forked uses the `thread_local` crate for any scratch space used for regex matching. I have left notes on this in the diff. I do not repeat the scaling measurements, as I don't have that many cores, but I did run one measurement to validate that the above difference doesn't play a big role.
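For context, here is a minimal sketch of that per-thread scratch pattern using the `thread_local` crate; the types and the matching stand-in are illustrative, not the actual `rust-pcre2` internals:

```rust
use std::cell::RefCell;
use thread_local::ThreadLocal;

struct Matcher {
    // One scratch buffer per thread that ever calls `match_len`; no
    // locking is needed on the hot path.
    scratch: ThreadLocal<RefCell<Vec<u8>>>,
}

impl Matcher {
    fn new() -> Self {
        Matcher { scratch: ThreadLocal::new() }
    }

    fn match_len(&self, haystack: &[u8]) -> usize {
        // `get_or` returns this thread's scratch, creating it on first use.
        let cell = self.scratch.get_or(|| RefCell::new(Vec::new()));
        let mut buf = cell.borrow_mut();
        buf.clear();
        buf.extend_from_slice(haystack); // stand-in for real PCRE2 matching
        buf.len()
    }
}
```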
Datasets
Since the original benchmarking dataset is not available, I used two files: 64KB of Lorem Ipsum (`lorem_ipsum_64k.txt`) and 64K emojis (252KB) generated with `./emoji_gen.py 64000 > emojis.txt`. The Lorem Ipsum file and the script to generate random emoji sequences are available in the gist.

Measurements
I compare three incremental versions of tiktoken: the original, the one with PCRE2 (commit 1), and the one with PCRE2 plus my `byte_pair_merge` optimizations (commit 2). The original version of tiktoken fails to complete `emojis.txt`; I gave up waiting for it after a few minutes (compared to 24 ms for the version in this PR). The following table shows the results in bytes per second, as well as incremental speedups.
[Table: throughput in bytes per second and incremental speedups for the gpt2 and r50k_base encodings on `lorem_ipsum_64k` and `emojis`.]
With commit 1 reverted, the speedup is 15% for `lorem_ipsum_64k`, and `emojis` terminates with a whopping 13933 bytes per second of throughput. The speedups of commit 2 are more evident once the core regex bottleneck is dealt with first, but they're still important in isolation.

Testing
Since most of the tests are not available either, I simply ran both the original and my version against a bunch of text and ensured the outputs were the same.
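A sketch of that differential check in Rust, with `encode_old`/`encode_new` as hypothetical stand-ins for the two builds under test:

```rust
// Hypothetical differential test: both implementations must produce the
// same token stream for every input in the corpus.
fn differential_test(
    corpus: &[&str],
    encode_old: impl Fn(&str) -> Vec<usize>,
    encode_new: impl Fn(&str) -> Vec<usize>,
) {
    for text in corpus {
        assert_eq!(encode_old(text), encode_new(text), "mismatch on {:?}", text);
    }
}
```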
Discussion
I also briefly attempted using a heap, as suggested in the comments. I quickly abandoned that approach because, as mentioned in the diff, `n = pieces.len()` is very small (usually less than 10), and as such it is more cache-efficient to just loop through a vector than to operate on a heap. This cache efficiency vastly outweighs any algorithmic speedup.
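To make the trade-off concrete, a toy comparison of the two ways to find the minimum rank (my sketch, not the PR's code):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Linear scan: one pass over contiguous memory; caches and branch
// predictors handle this very well when n is small.
fn min_by_scan(ranks: &[usize]) -> Option<usize> {
    ranks.iter().copied().min()
}

// Min-heap: O(n) to build plus O(log n) per pop, but the allocation and
// pointer shuffling dominate when n is usually below 10.
fn min_by_heap(ranks: &[usize]) -> Option<usize> {
    let mut heap: BinaryHeap<Reverse<usize>> =
        ranks.iter().copied().map(Reverse).collect();
    heap.pop().map(|Reverse(r)| r)
}
```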
It is also interesting to note that different datasets spend a different ratio of time between regex matching and `byte_pair_merge`. In the original code on `lorem_ipsum_64k`, the regex and `byte_pair_merge` took roughly equal amounts of time according to perf. From the table above, it is evident that `emojis` is mostly bottlenecked on regex matching.