Piece Table isn't a very common data structure and therefore isn't well documented.
Here are a few helpful article that inspired me:
from Wikipedia:
A piece table is a data structure typically used to represent a series of edits on a text document. An initial reference (or 'span') to the whole of the original file is created, with subsequent inserts and deletes being created as combinations of one, two, or three references to sections of either the original document or of the spans associated with earlier inserts.
Typically the text of the original document is held in one immutable block, and the text of each subsequent insert is stored in new immutable blocks. Because even deleted text is still included in the piece table, this makes multi-level or unlimited undo easier to implement with a piece table than with alternative data structures such as a gap buffer.
I was curious about the implementation details of such a data structure, since I myself work on the text editor of Visual Studio at Microsoft. However, existing implementation available on GitHub, including the one of Visual Studio itself, are either:
-
Complex to read because tightly linked to a product like VS or VS Code, and not generic enough to be reused as is.
-
Incomplete and/or buggy.
-
Not documented / commented.
-
In functional or procedural language (nothing against it but I like oriented object programming).
So I thought let's have fun and do it myself, hoping to have something reliable, generic and easy to understand
.
If you take 10 minutes to read The Piece Table, by Darren Burns and understand well how the Piece Table is supposed to work, then reading the code in this repository should be relatively easy as it reuses some terms you can find in this blog article.
-
TextDocumentBuffer
hold the buffers that allow to represent and rebuild the text as astring
after an insertion / deletion in the text document. -
TextPieceTable
hosts thepiece table
itself and the logic for inserting / deleting a piece. It has a bunch of logic for translating coordinates from the text document to a piece in the buffer. I uses aLinkedList<Piece>
for representing the list of pieces. -
Piece
represents a piece in the piece table. It has the following information:-
Whether this piece is from the original or append buffer.
-
A
Span
representing where the piece is in the buffer.
-
-
Span
represents a range in a buffer as well in the text document with a start position and length.
A unit test project validates the good behavior of the implementation. Hopefully it's covering enough scenarios to make it reliable.
A benchmark project helped me at identifying some flaws in the implementation and improve it.
To run the benchmark, simply do in a PowerShell command prompt:
> cd Benchmarks
> dotnet run -c Release
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.101
[Host] : .NET 6.0.1 (6.0.121.56705), X64 RyuJIT
Job=InProcess Toolchain=InProcessEmitToolchain InvocationCount=1
UnrollFactor=1
Method | Initial amount of pieces in the piece table | Mean | Median | Allocated |
---|---|---|---|---|
Insertion near the end of the document | 1000 | 2.126 μs | 1.900 μs | 960 B |
Insertion near the beginning of the document | 1000 | 1.857 μs | 1.600 μs | 480 B |
Insertion in the middle of the document | 1000 | 2.621 μs | 2.500 μs | 704 B |
Insertion near the end of the document | 10000 | 1.397 μs | 1.300 μs | 816 B |
Insertion near the beginning of the document | 10000 | 1.338 μs | 1.200 μs | 816 B |
Insertion in the middle of the document | 10000 | 10.286 μs | 10.300 μs | 704 B |
Insertion near the end of the document | 100000 | 3.839 μs | 3.800 μs | 816 B |
Insertion near the beginning of the document | 100000 | 4.171 μs | 3.950 μs | 1,104 B |
Insertion in the middle of the document | 100000 | 120.888 μs | 120.850 μs | 704 B |
Insertion near the end of the document | 1000000 | 5.581 μs | 5.500 μs | 816 B |
Insertion near the beginning of the document | 1000000 | 6.025 μs | 5.900 μs | 528 B |
Insertion in the middle of the document | 1000000 | 3,145.587 μs | 3,106.300 μs | 992 B |
Delete near the end of the document | 1000 | 1.851 μs | 1.800 μs | 1,248 B |
Delete near the beginning of the document | 1000 | 2.038 μs | 1.900 μs | 1,248 B |
Delete in the middle of the document | 1000 | 3.008 μs | 2.800 μs | 2,296 B |
Delete near the end of the document | 10000 | 1.444 μs | 1.300 μs | 1,584 B |
Delete near the beginning of the document | 10000 | 1.413 μs | 1.300 μs | 1,248 B |
Delete in the middle of the document | 10000 | 11.212 μs | 11.200 μs | 1,200 B |
Delete near the end of the document | 100000 | 5.727 μs | 5.700 μs | 912 B |
Delete near the beginning of the document | 100000 | 6.146 μs | 6.100 μs | 912 B |
Delete in the middle of the document | 100000 | 110.200 μs | 109.400 μs | 1,200 B |
Delete near the end of the document | 1000000 | 8.709 μs | 8.450 μs | 1,872 B |
Delete near the beginning of the document | 1000000 | 8.664 μs | 8.300 μs | 960 B |
Delete in the middle of the document | 1000000 | 2,887.207 μs | 2,852.200 μs | 624 B |
Get some text near the end of the document | 1000 | 3,467.0 ns | 2,400.0 ns | 1,080 B |
Get some text near the beginning of the document | 1000 | 2,008.1 ns | 1,800.0 ns | 1,416 B |
Get some text in the middle of the document | 1000 | 3,249.5 ns | 3,300.0 ns | 1,408 B |
Get a character near the end of the document | 1000 | 595.7 ns | 600.0 ns | - |
Get a character near the beginning of the document | 1000 | 390.9 ns | 400.0 ns | - |
Get a character in the middle of the document | 1000 | 1,563.5 ns | 1,600.0 ns | 672 B |
Get all the text in the document | 1000 | 28,003.2 ns | 27,500.0 ns | 61,424 B |
Get some text near the end of the document | 10000 | 1,840.4 ns | 1,700.0 ns | 1,472 B |
Get some text near the beginning of the document | 10000 | 1,944.9 ns | 1,800.0 ns | 792 B |
Get some text in the middle of the document | 10000 | 12,395.6 ns | 12,000.0 ns | 1,408 B |
Get a character near the end of the document | 10000 | 344.7 ns | 300.0 ns | - |
Get a character near the beginning of the document | 10000 | 401.0 ns | 400.0 ns | 624 B |
Get a character in the middle of the document | 10000 | 10,010.0 ns | 9,900.0 ns | 672 B |
Get all the text in the document | 10000 | 226,790.3 ns | 225,700.0 ns | 739,568 B |
Get some text near the end of the document | 100000 | 7,812.1 ns | 7,700.0 ns | 840 B |
Get some text near the beginning of the document | 100000 | 8,168.5 ns | 8,100.0 ns | 840 B |
Get some text in the middle of the document | 100000 | 114,848.5 ns | 112,700.0 ns | 1,120 B |
Get a character near the end of the document | 100000 | 1,102.1 ns | 1,100.0 ns | - |
Get a character near the beginning of the document | 100000 | 1,732.0 ns | 1,700.0 ns | - |
Get a character in the middle of the document | 100000 | 105,804.0 ns | 103,400.0 ns | 960 B |
Get all the text in the document | 100000 | 3,166,211.5 ns | 3,144,600.0 ns | 6,209,184 B |
Get some text near the end of the document | 1000000 | 11,806.2 ns | 10,950.0 ns | 1,128 B |
Get some text near the beginning of the document | 1000000 | 11,375.3 ns | 11,000.0 ns | 1,080 B |
Get some text in the middle of the document | 1000000 | 2,890,640.2 ns | 2,841,000.0 ns | 832 B |
Get a character near the end of the document | 1000000 | 2,253.3 ns | 2,100.0 ns | 576 B |
Get a character near the beginning of the document | 1000000 | 2,310.5 ns | 2,300.0 ns | 336 B |
Get a character in the middle of the document | 1000000 | 2,763,792.9 ns | 2,756,250.0 ns | 672 B |
Get all the text in the document | 1000000 | 37,957,515.0 ns | 38,045,400.0 ns | 53,602,496 B |
That's because we're reading the table of pieces (which is a LinkedList<Piece>
) sequentially to find where does the given span / text should be inserted / deleted. Based on the location in the text document where we want to do the insertion / deletion, the program decides whether it should navigate forward (from the beginning to the end) or backward (from the end to the beginning) in the table of pieces.
Since the table of pieces is a LinkedList
, we have to read it sequentially. When we're trying to access to the middle of the it, reading forward or backward like explained above doesn't help. We have to pay the cost of going through more items to reach the middle.
A potential way to improve it is to transform the piece table into a "piece-tree" that would basically represents the pieces in a hierarchical manner where each node also give us some information on the state of the text document. Pretty much what's explained here: https://code.visualstudio.com/blogs/2018/03/23/text-buffer-reimplementation#_boost-line-lookup-by-using-a-balanced-binary-tree
This approach should allow to perform a binary search.
Sure, in C#, reading the list would be faster since we can access to any item through the indexer like myList[i]
and could perform binary search. However, assuming this data structure is used in the context of a text editor, as a user, I can typing anywhere in the document, which would require to do a List.Insert
instead of List.Add
. Unfortunately, List.Insert
is very slow on a large list and internally does a copy of the list. The impact of it is that inserting and deleting in the piece table would be much slower. Assuming again that this data structure is used in the context of a text editor, this means that the typing responsiveness (how fast does the text editor answers to user type) would likely be impacted.
A good StackOverlow answer about performance difference between LinkedList and List: c# - When should I use a List vs a LinkedList - Stack Overflow
With all that said, it might worth experimenting with other collection types like ImmutableList
for example. But at the end of the day, a true performance improvement, in the context of a text editor, is not to do a piece table, but a "piece tree" using, for example, an AVL Tree or Red-Black Tree.
Found an optimization to do, a bug or a scenario not covered by unit tests? Feel free to open an issue or a pull request.