-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
2) Optimize byte pair merge for small and big character sequences - 8.2s to 3.9s #76
Merged
Merged
Changes from all commits
Commits
Show all changes
14 commits
Select commit
Hold shift + click to select a range
8b244b6
Split out encodedToDecoded and SpecialEncoder from TokenEncoder
9a035e1
Optimize TokenEncoder
f73b823
Optimize countTokens
2b8efc9
Add addTokensAndGetCountLarge with linear byte pair merge
5b28725
Optimize decodeBytes
e98e30d
Remove cloning from ImmutableByteArray and rename it to ByteArrayWrapper
91c1430
Add ensureCapacity to calculateTokensSmall to minimize copies
fe46f89
Use simple HashMap instead of ConcurrentHashMap
17a83be
Iterate over all occurrences of the found minimum token in the TokenE…
78974d4
Move decodeToken to TokenEncoder
ad7292f
Add Cl100kLargeTokenizerTest to run every Cl100k tokenizer test with …
f95d368
Split Cl100kTest to LargeTokenizer version as well
5ee81a1
Fix Hungarian test prompt
03fcd3b
Don't remove same tokens during TokenEncoderLarge's identical tokens …
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -150,7 +150,7 @@ public static void main(String[] args) throws Exception { | |
"'s", "'t", "'re", "'ve", "'m", "'ll", "'d", "'x", | ||
"x", | ||
"123", | ||
"ő", | ||
"a", | ||
",.!?:; \n", | ||
"\n \n", | ||
" ", | ||
|
@@ -177,7 +177,7 @@ public static void main(String[] args) throws Exception { | |
} | ||
|
||
var totalSize = calculateTotalFileSize(rootFolder); | ||
if (totalSize != 99_945_290) { | ||
if (totalSize != 99_925_295) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. reran the before/after with these data |
||
throw new AssertionError("Total size did not match expected value, actual: " + totalSize); | ||
} | ||
} | ||
|
11 changes: 11 additions & 0 deletions
11
benchmark/src/jmh/java/com/knuddels/jtokkit/SingleThreadedBenchmark.java
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
74 changes: 74 additions & 0 deletions
74
lib/src/main/java/com/knuddels/jtokkit/ByteArrayWrapper.java
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,74 @@ | ||
package com.knuddels.jtokkit; | ||
|
||
import java.util.Arrays; | ||
|
||
class ByteArrayWrapper { | ||
private final byte[] array; | ||
|
||
/* | ||
* Creates a new instance of ByteArrayWrapper from the given array. | ||
* The given array is not copied, so every calling method in this class must make sure | ||
* to never pass an array which reference leaked to the outside. Since some of our | ||
* construction methods already create new arrays, we do not want to copy here in this | ||
* constructor again. | ||
*/ | ||
ByteArrayWrapper(byte[] array) { | ||
tox-p marked this conversation as resolved.
Show resolved
Hide resolved
|
||
this.array = array; | ||
} | ||
|
||
/** | ||
* Returns the length of this array. | ||
* | ||
* @return the length of this array. | ||
*/ | ||
public int length() { | ||
return array.length; | ||
} | ||
|
||
/** | ||
* Returns the bytes of this array from startIndex (inclusive) to endIndex (exclusive). The returned array is a copy | ||
* of the original array. | ||
* | ||
* @param startIndex the index from which to start copying (inclusive) | ||
* @param endIndex the index at which to stop copying (exclusive) | ||
* @return a new {@link ByteArrayWrapper} containing the bytes from startIndex (inclusive) to endIndex (exclusive) | ||
* @throws IllegalArgumentException if startIndex is out of bounds, endIndex is out of bounds or endIndex is less than | ||
* startIndex | ||
*/ | ||
public ByteArrayWrapper getBytesBetween(int startIndex, int endIndex) { | ||
if (startIndex < 0 || startIndex >= array.length) { | ||
throw new IndexOutOfBoundsException("startIndex out of bounds: " + startIndex + " (" + this + ")"); | ||
} else if (endIndex < 0 || endIndex > array.length) { | ||
throw new IndexOutOfBoundsException("endIndex out of bounds: " + endIndex + " (" + this + ")"); | ||
} else if (startIndex >= endIndex) { | ||
throw new IllegalArgumentException("startIndex must be less than endIndex: " + startIndex + " >= " + endIndex); | ||
} | ||
|
||
int length = endIndex - startIndex; | ||
byte[] result = new byte[length]; | ||
System.arraycopy(array, startIndex, result, 0, length); | ||
return new ByteArrayWrapper(result); | ||
} | ||
|
||
@Override | ||
public boolean equals(Object other) { | ||
if (this == other) { | ||
return true; | ||
} else if (other == null || getClass() != other.getClass()) { | ||
return false; | ||
} | ||
|
||
ByteArrayWrapper that = (ByteArrayWrapper) other; | ||
return Arrays.equals(array, that.array); | ||
} | ||
|
||
@Override | ||
public int hashCode() { | ||
return Arrays.hashCode(array); | ||
} | ||
|
||
@Override | ||
public String toString() { | ||
return Arrays.toString(array); | ||
} | ||
} |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
testing a slightly different setup here since
aaaaaa
is still a single token, so this makes it a more difficult case