-
Notifications
You must be signed in to change notification settings - Fork 10k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multiline text blocks #4629
Multiline text blocks #4629
Conversation
Isn't this somewhat redundant, given the current Edit: As a general rule, please strict equalities/inequalities in new code. |
You're right, overlooked textLayer=visible, though debugTextLayer shows a bit more, mainly things i needed to check that the code works as intended. |
Personally I would prefer if you modified (or added to) the current Also, the file font_metrics.js is missing a license header (and editor mode lines). |
No problem, i can do that, will remove the debugTextLayer flag. |
/botio-linux preview |
From: Bot.io (Linux)ReceivedCommand cmd_preview from @timvandermeij received. Current queue size: 0 Live output at: http://107.21.233.14:8877/d1e01dc257d7d00/output.txt |
From: Bot.io (Linux)SuccessFull output at http://107.21.233.14:8877/d1e01dc257d7d00/output.txt Total script time: 0.38 mins Published
|
The preview does not work:
|
Ok, will fix it. |
Adds a feature flag to support multi-line text layer. The main purpose is to make text selection a bit more natural and to keep the DOM as leightweight as possible by using as much as possible inline elements.
Was just working on something similar and was about to PR it until I saw this. Mine did text coalescing on chars to sentences. One particular problem I faced was a space for justified text doesn't always match up. It's quite evident in this file: http://www.music.mcgill.ca/~cmckay/papers/musictech/ISMIR_2006_Codaich.pdf |
Yes, i can see that too, very obvious also in the PDF 1.7 spec. AFAICT this is caused by the fact that the wordSpacing value is not taken into account currently in buildTextGeometry. I'm looking into these details now actually. |
There was a very recent (yesterday) PR, #4633, that fixed some spacing issues in the textLayer. Could that be relevant for this issue? |
Not really relevant for this issue here, but it is related. |
Guys, here's my code that deals with the justified text problem. In short, I interpolate the spacing between two characters with a space in the middle and set the width of the space. I choose to use |
Also noticed another bug. Unfortunately, Safari (both iOS and OSX v 7.0) does not support fractional letter-spacing. https://bugs.webkit.org/show_bug.cgi?id=20606 Though the bug is fixed, it won't be fixed until a newer version of Safari is released. |
My comment from #4843: Interesting, this probably improves performance a bit, but does not really address the issue what happens when you move the mouse outside the multi-line block? And does it address the issue of text segments being in the wrong order? Just combining them in the order provided by pdf.js does not seem to work in all cases. |
My idea was to start with this from scratch. I see three main areas to be improved:
The performance issues are mainly caused by to expensive text layers in terms of DOM. Basically for some types of PDFs each single character ends up in its own composite layer (in its own scaled absolute positioned element). This makes e.g. scrolling of these documents very slow. There are IMO two main issues to improve matching of the text, ensure that really the correct font family gets selected (the PDFs are often lying about the font family, i did some tests with xor'ing the target font with all main font families, that turnes out to work pretty well). The other issue is word spacing. Currently the word spacing is not exposed to the text layer (there is some code which tries to address that with inserting spaces between text fragments if the white space is bigger than an according white space, but that covers only limited set of PDFs, e.g. not the ones which define word spacing instead). Exposing it makes adjustments in the text layer a bit more complicated, but it should IMO be possible. @Hengjie code mainly tries to improve the text matching. UX issue, like flickering while selecting whole text blocks, are caused by the current DOM structure, e.g. an elemnt of the text layer contains at most only a single line. This PR here addresses all issues to some extend. Most of the changes are in the text layer. But i think it would make more sense to move most of that code down to buildTextGeometry, basically move that call to it's own class and add to the according geometry objects also line numbers and ideally text block numbers. There are four PS instructions which call buildTextGeometry to create text geometry objects, OPS.showSpacedText, OPS.showText, OPS.nextLineShowText, OPS.nextLineSetSpacingShowText. Right now only OPS.showSpacedText accumulates text fragments in a geometry object, but at least OPS.showText should do that too (this is the main reason for the performance issues). But the main issue is that i don't really have time anymore to work on this. I can probably update this PR here to the newest mainline, but i have no time to do the changes which i have had in mind. Perhaps discuss with @yurydelendik if that makes sense or if you would like to restart from scratch with these improvements? |
I addressed this in #4843. If you make sure that text segments are in correct order of text flow, then flickering is fixed, I believe. Try it here: https://peerlibrary.org/p/tfgbf3kfxbQ4ubunj/trace-based-just-in-time-type-specialization-for-dynamic-languages |
@mitar That appears to work much smoother indeed, but I do notice that, while text selection is working, right-click and Copy does not work. Right-clicking undoes the selection and displays the regular context menu without Copy option. |
@timvandermeij: This is some our text annotation code interfering I think. I will look into it, but I do not see that it is related to the text selection process itself. |
@mitar Oh, it's not, I just thought I'd mention it for your library (which looks nice by the way!) :-) |
Sorry for the late response. Yes, I'd be more than happy to take over the task of coalesing the texts. @chriskr do you know where the word spacing information is? How can I extract it? The way I'm doing it right now is inferring that information based on the letter's absolute positions and the maximum char width of the font. It's a hacky solution and I'd prefer to remove all of that interpolation code. The biggest problem I have with the code is that it completely breaks search highlighting. I'm unsure how to fix it and will probably need help as I'm unfamiliar with the rest of the PDF.js code. I've also done a lot of work in fixing up selecting text and how dragging a pixel beyond the text will result in all text being selected. |
Word spacing is exposed on TextState as wordSpacing, see core/evaluator.js. I guess you will have to dig into that file a bit. |
@Hengjie Thank you! If you have a prototype, then we can look into why it breaks search highlighting. Fortunately, the search highlighting code isn't that complicated from what I've seen from it, so we can always help you look into that. |
Yup, my code is kinda old, I'll polish it up with the latest master HEAD before I have a prototype. |
Just wanted to update this issue, I'm in the middle of merging master into my branch. |
Wanted to chime in with a pdf that has severe performance issues on selection: Don't know if it helps, or if it's even too relevant. If it's not, I'll file a new bug. |
@jleclanche I can confirm that, but please open a separate issue for that. |
Opened #5107 |
Guys, what do you think about doing the coalescing in getTextContent() in evaluator.js instead? I feel like this will resolve the exposing data to text_layer_builder.js problem that @chriskr described but also remove the absolute positioning data I'm inferring from the DOM in my code (for performance). I'm not familiar with the evaluator.js but is it run in the worker? Edit: Looks like it's from worker. A note about the flickering, the issue with ordering is that sometimes PDFs would order it different and we'd have to reorder it. It also depends on the language, such ltr or rtl text. I hacked up some code that's running in production that does the coalescing and fix flickering to an extent at https://web.notablepdf.com that you guys can try out. It's not perfect, really just hacky and not in a state that I think is PR-able. For eg, search highlighting depends on rangy-highlighter to work. |
@Hengjie yes, that's the indented location for that. We already running some heuristics there to minimize amount of divs. |
Is anyone working on this? I was wondering if it's possible to try this feature with the conflicts. |
@Hengjie was the last one working on this if I am correct. |
I've started working on this again and have coalescing code working in the worker's The performance of doing this is far better than @chriskr's approach because it doesn't rely on the DOM to get positioning information. |
@chriskr re Right now, I just fallback to the original PDF.js method in
|
I guess you would have to define a minimal font size to measure the ascent and scale the font accordingly before measuring it if it is smaller. That could probably be done in fontmetrics.js. |
@chriskr thanks. Got it working. |
Any updates on this pull request by any chance? |
Support for multi-line text blocks
Adds a feature flag to enable multi-line text blocks (disableMultilineTextLayer=false), disabled by default.
The purpose is to make the DOM lighter and the text selection more natural, e.g. less flickering on text selection. Additionally added a flag to make the text layer visible (debugTextLayer). Mainly checked the code with
Especially with the last two documents the scroll performance has improved remarkably, mainly due to concatenating of chars to words.
Also changed the highlight color to system color as a suggestion.