-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RichText: internal value refactoring idea #29870
Comments
@ellatrix its exciting to see you bring your attention to this. RichText is awesome. Do we have any hard data making clear the performance implications for this? Is it a nice to have? A should have always had it? or a necessary escape from a current bottleneck? We’ve used the range tree on WordPress.com notifications without hassle but those are also very small documents that are essentially read-once. My experience is that range trees are convenient for building DOM and inconvenient for editing, as the burden lies on transforming an edited tree to prevent overlapping ranges and improperly nested elements. The current strategy is trivial in this regard because, by construction, we cannot have overlapping regions. Have you built any prototype to compare against? |
That's right. The current data structure is designed for making it easy to insert, split, remove etc. Actions where we need to reset the start and end values of the formats and objects. There's only two things that need to be fast in RichText though: inserting a character on input (typing) and rebuilding a tree from that.
We need to rebuild a tree on every keystroke, so this aspect is quite important for RichText performance, especially for larger amounts of texts like a long list (which is currently entirely handled by RichText).
No, it's just based on the above.
I haven't build anything so far, this is just me dumping my ideas for the record. |
What's a range tree? My proposal isn't a tree at all. We'd keep the flat text and have a Map that stores formats-range pairs. |
I'm not sure this is a great assumption. If the character falls at the boundary we might have to compute whether or not to extend or shift ranges. This might also be something we currently have to deal with anyway.
We might be doing something very inefficient with this. The current model works best at maintaining text models, but you're reminding me that we do everything on every change, otherwise we wouldn't be able to update the editor. Even if we make the change wouldn't we still be going full circle on every input event?
a Map that stores formats and range pair might end up identical to a de-optimized tree. if we're looking at formats with ranges… {
'em': [{49, 51}],
'strong': [{1, 3}, {48, 53}]
} …then we might find quickly-staggering complexity to sort them all out. I had assumed you were discussing an ordered set of ranges for in-editor updates since those can be somewhat optimized, especially for lookup. {'ranges': [
{1, 3, [], 'strong'},
{48, 53, [{49, 51, [], 'em'}], 'strong'}
]} are you saying we read from the DOM itself inside the RichText editor whenever running |
The structure I'm proposing stores ranges by format instance, not by format type.
If you enter a character at index 2, then all numbers after index 2 must be increased by 1. That seems simple enough. The same thing needs to happen at any index? Increase all numbers after the index, leave the rest alone? I'd have to make a quick prototype to see if there's any problems with this. I've checked the performance of current rich text, and it's fairly good, but that's all for small pieces of rich text which is what we have most of the time in Gutenberg. Actually, the performance of the block editor is many times worse than rich text. If you type in a random paragraph, the time it takes to process rich text is way smaller than the time it takes to process the update for the block editor. If we want to do any significant performance improvement, we should probably be looking at the data stores (cc @youknowriad). I suspect the proposed structure here will only really affect larger rich text values, such as a big list, and I purely made this issue from my desire to make rich text as good as it can be. :) So the priority here is not so high, but I think it would be fun to figure out what the best structure would be. |
One thing to be careful with this structure is overlapping regions. That's a big problem for HTML representations too. What does this mean? {
{ type: 'core/italic' }: [1, 8],
{ type: 'core/bold' }: [5, 13]
} At least at its face that this cannot map back to HTML without transforming back into a proper tree. |
It's possible to do that, but it needs to be split: We have the same issue with the current data structure, though it's handled in For the new structure, Sure, it's something extra to handle, but it doesn't seem too crazy. |
total agreement. though the question I have is whether after doing this we will still have the performance advantage over the current structure, since converting back to the tree is not a trivial operation. would love to see benchmarks |
We need to make sure we are very explicit in defining what string indexes count. Possibilities:
Let's apply a format to the female pilot; how do we get the range? JavaScript const s = ' 👩✈️ ';
// utf16 code units is how JS counts characters/strings
start = s.indexOf( '👩✈️' )
end = start + '👩✈️'.length
JSON = { text: ' 👩✈️ ', formats: { 'bold': [ [ 1, 6 ] ] } };
const toUTF8 = ( s, index ) => {
s = s.slice( 0, index )
s = encodeURIComponent( s )
s = s.replace( /%../g, '.' )
return s.length;
}
JSON = { text: ' 👩✈️ ', formats: { 'bold': [ [ 1, 14 ] ] } };
const toCodePointCounts = ( s, index ) => {
return [ ...( s.slice( 0, index ) ) ].length;
}
JSON = { text: ' 👩✈️ ', formats: { 'bold': [ [ 1, 5 ] ] } };
const toGraphemeClusterCounts = ( s, index ) => {
s = s.slice( 0, index )
return someLibrary.clusterCount( s )
}
JSON = { text: ' 👩✈️ ', formats: { 'bold': [ [ 1, 2 ] ] } }; |
I think you've managed to solve this right @ellatrix what do you think is left here? |
@youknowriad Actually no, this is still open. I simply haven't had the time to finish the PR, it's not been a priority. I do want to get to it when I find some time. |
Currently, RichText's internal value has separated text from formatting, which is great for searching the text and easily applying formatting. The formatting data is a sparse array that is filled at the positions where the text is respectively formatted. For example, if text is formatted from position 2-5 in bold and from position 2-10 with a link, then you'll find arrays in the sparse array from position 2-10 that includes the bold format from 2-5 and the link format from 2-10.
This format is great for manipulating the data, but not so great for serialising it back to a tree format (DOM or HTML). In order to do so, you need to go over each position one by one and build up a tree based on what formats you have.
How can we make that better?
One way could be to store formats with ranges in a Map, for example
{ type: 'bold' }: [ 2, 5 ]
. We can then make a custom iterator that gives us the formats in the right order for a tree: ordered by range start and then the length of the range.Now we no longer have to go over each position to build up a tree. We can add slices of text in one go as we go over formats in order.
Format APIs should also be straightforward to implement.
applyFormat( value, format, start, end )
is just a matter of adding the format to the Map. Forslice
it would mean splitting the intersecting formats at a certain position. Some operations become easier, some a bit harder, but that's ok. The area where we should optimise for speed is converting to/from a tree, not the value operations, which run significantly less often.Thoughts?
The text was updated successfully, but these errors were encountered: