Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiline text blocks #4629

Closed
wants to merge 2 commits into from
Closed

Conversation

chriskr
Copy link
Contributor

@chriskr chriskr commented Apr 15, 2014

Support for multi-line text blocks

Adds a feature flag to enable multi-line text blocks (disableMultilineTextLayer=false), disabled by default.
The purpose is to make the DOM lighter and the text selection more natural, e.g. less flickering on text selection. Additionally added a flag to make the text layer visible (debugTextLayer). Mainly checked the code with

Especially with the last two documents the scroll performance has improved remarkably, mainly due to concatenating of chars to words.

Also changed the highlight color to system color as a suggestion.

@Snuffleupagus
Copy link
Collaborator

Additionally added a flag to make the text layer visible (debugTextLayer)

Isn't this somewhat redundant, given the current textLayer hash parameter, see: https://github.com/mozilla/pdf.js/wiki/Debugging-pdf.js#url-parameters.

Edit: As a general rule, please strict equalities/inequalities in new code.

@chriskr
Copy link
Contributor Author

chriskr commented Apr 15, 2014

You're right, overlooked textLayer=visible, though debugTextLayer shows a bit more, mainly things i needed to check that the code works as intended.

@Snuffleupagus
Copy link
Collaborator

though debugTextLayer shows a bit more

Personally I would prefer if you modified (or added to) the current textLayer debugging, instead of basically duplicating existing code. @yurydelendik What do you think?

Also, the file font_metrics.js is missing a license header (and editor mode lines).

@chriskr
Copy link
Contributor Author

chriskr commented Apr 15, 2014

No problem, i can do that, will remove the debugTextLayer flag.

@yurydelendik yurydelendik self-assigned this Apr 15, 2014
@timvandermeij
Copy link
Contributor

/botio-linux preview

@pdfjsbot
Copy link

From: Bot.io (Linux)


Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://107.21.233.14:8877/d1e01dc257d7d00/output.txt

@timvandermeij
Copy link
Contributor

The preview does not work:

ReferenceError: FontMetrics is not defined

@chriskr
Copy link
Contributor Author

chriskr commented Apr 15, 2014

Ok, will fix it.

Christian Krebs added 2 commits April 15, 2014 22:36
Adds a feature flag to support multi-line text layer. The main purpose is to
make text selection a bit more natural and to keep the DOM as leightweight
as possible by using as much as possible inline elements.
@Hengjie
Copy link
Contributor

Hengjie commented Apr 17, 2014

Was just working on something similar and was about to PR it until I saw this. Mine did text coalescing on chars to sentences. One particular problem I faced was a space for justified text doesn't always match up. It's quite evident in this file: http://www.music.mcgill.ca/~cmckay/papers/musictech/ISMIR_2006_Codaich.pdf

@chriskr
Copy link
Contributor Author

chriskr commented Apr 17, 2014

justified text doesn't always match up

Yes, i can see that too, very obvious also in the PDF 1.7 spec. AFAICT this is caused by the fact that the wordSpacing value is not taken into account currently in buildTextGeometry. I'm looking into these details now actually.

@Snuffleupagus
Copy link
Collaborator

AFAICT this is caused by the fact that the wordSpacing value is not taken into account currently in buildTextGeometry.

There was a very recent (yesterday) PR, #4633, that fixed some spacing issues in the textLayer. Could that be relevant for this issue?

@chriskr
Copy link
Contributor Author

chriskr commented Apr 17, 2014

Could that be relevant for this issue?

Not really relevant for this issue here, but it is related.

@Hengjie
Copy link
Contributor

Hengjie commented Apr 18, 2014

Guys, here's my code that deals with the justified text problem. In short, I interpolate the spacing between two characters with a space in the middle and set the width of the space. I choose to use width instead of letter-spacing because of other selection issues. Most of the functionality is fairly self contained into a this.coalesce() method. This is basically a drop in replacement to text_layer_builder.js. If there is interest, I'd be happy to tidy it up and PR it, but I'd rather that we work together to make text coalescing better in this PR.

https://gist.github.com/Hengjie/11019860

@Hengjie
Copy link
Contributor

Hengjie commented Apr 28, 2014

Also noticed another bug. Unfortunately, Safari (both iOS and OSX v 7.0) does not support fractional letter-spacing. https://bugs.webkit.org/show_bug.cgi?id=20606 Though the bug is fixed, it won't be fixed until a newer version of Safari is released.

@mitar
Copy link
Contributor

mitar commented May 26, 2014

My comment from #4843:

Interesting, this probably improves performance a bit, but does not really address the issue what happens when you move the mouse outside the multi-line block? And does it address the issue of text segments being in the wrong order? Just combining them in the order provided by pdf.js does not seem to work in all cases.

@timvandermeij
Copy link
Contributor

Let's get this PR rolling again. @chriskr Would it be useful to use some of the code from @Hengjie? Because text selection is still a bit of a bottleneck for PDF.js, there is definitely interest to improve its performance.

@chriskr
Copy link
Contributor Author

chriskr commented Jun 18, 2014

My idea was to start with this from scratch. I see three main areas to be improved:

  • performance
  • matching of the text layer and the canvas text
  • UX, e.g. less flickering on selecting bigger text blocks

The performance issues are mainly caused by to expensive text layers in terms of DOM. Basically for some types of PDFs each single character ends up in its own composite layer (in its own scaled absolute positioned element). This makes e.g. scrolling of these documents very slow.

There are IMO two main issues to improve matching of the text, ensure that really the correct font family gets selected (the PDFs are often lying about the font family, i did some tests with xor'ing the target font with all main font families, that turnes out to work pretty well). The other issue is word spacing. Currently the word spacing is not exposed to the text layer (there is some code which tries to address that with inserting spaces between text fragments if the white space is bigger than an according white space, but that covers only limited set of PDFs, e.g. not the ones which define word spacing instead). Exposing it makes adjustments in the text layer a bit more complicated, but it should IMO be possible. @Hengjie code mainly tries to improve the text matching.

UX issue, like flickering while selecting whole text blocks, are caused by the current DOM structure, e.g. an elemnt of the text layer contains at most only a single line.

This PR here addresses all issues to some extend. Most of the changes are in the text layer. But i think it would make more sense to move most of that code down to buildTextGeometry, basically move that call to it's own class and add to the according geometry objects also line numbers and ideally text block numbers. There are four PS instructions which call buildTextGeometry to create text geometry objects, OPS.showSpacedText, OPS.showText, OPS.nextLineShowText, OPS.nextLineSetSpacingShowText. Right now only OPS.showSpacedText accumulates text fragments in a geometry object, but at least OPS.showText should do that too (this is the main reason for the performance issues).

But the main issue is that i don't really have time anymore to work on this. I can probably update this PR here to the newest mainline, but i have no time to do the changes which i have had in mind. Perhaps discuss with @yurydelendik if that makes sense or if you would like to restart from scratch with these improvements?

@mitar
Copy link
Contributor

mitar commented Jun 20, 2014

UX issue, like flickering while selecting whole text blocks, are caused by the current DOM structure, e.g. an elemnt of the text layer contains at most only a single line.

I addressed this in #4843. If you make sure that text segments are in correct order of text flow, then flickering is fixed, I believe. Try it here: https://peerlibrary.org/p/tfgbf3kfxbQ4ubunj/trace-based-just-in-time-type-specialization-for-dynamic-languages

@timvandermeij
Copy link
Contributor

@mitar That appears to work much smoother indeed, but I do notice that, while text selection is working, right-click and Copy does not work. Right-clicking undoes the selection and displays the regular context menu without Copy option.

@mitar
Copy link
Contributor

mitar commented Jun 20, 2014

@timvandermeij: This is some our text annotation code interfering I think. I will look into it, but I do not see that it is related to the text selection process itself.

@timvandermeij
Copy link
Contributor

@mitar Oh, it's not, I just thought I'd mention it for your library (which looks nice by the way!) :-)

@timvandermeij
Copy link
Contributor

@Hengjie Are you interested in continuing with your solution, taking the tips from @chriskr and @mitar into account?

@Hengjie
Copy link
Contributor

Hengjie commented Jul 15, 2014

Sorry for the late response. Yes, I'd be more than happy to take over the task of coalesing the texts.

@chriskr do you know where the word spacing information is? How can I extract it? The way I'm doing it right now is inferring that information based on the letter's absolute positions and the maximum char width of the font. It's a hacky solution and I'd prefer to remove all of that interpolation code.

The biggest problem I have with the code is that it completely breaks search highlighting. I'm unsure how to fix it and will probably need help as I'm unfamiliar with the rest of the PDF.js code.

I've also done a lot of work in fixing up selecting text and how dragging a pixel beyond the text will result in all text being selected.

@chriskr
Copy link
Contributor Author

chriskr commented Jul 15, 2014

Word spacing is exposed on TextState as wordSpacing, see core/evaluator.js. I guess you will have to dig into that file a bit.

@timvandermeij
Copy link
Contributor

@Hengjie Thank you! If you have a prototype, then we can look into why it breaks search highlighting. Fortunately, the search highlighting code isn't that complicated from what I've seen from it, so we can always help you look into that.

@Hengjie
Copy link
Contributor

Hengjie commented Jul 16, 2014

Yup, my code is kinda old, I'll polish it up with the latest master HEAD before I have a prototype.

@Hengjie
Copy link
Contributor

Hengjie commented Jul 30, 2014

Just wanted to update this issue, I'm in the middle of merging master into my branch.

@jleclanche
Copy link

Wanted to chime in with a pdf that has severe performance issues on selection:
http://www.meghnacambridge.co.uk/menu.pdf

Don't know if it helps, or if it's even too relevant. If it's not, I'll file a new bug.

@timvandermeij
Copy link
Contributor

@jleclanche I can confirm that, but please open a separate issue for that.

@jleclanche
Copy link

Opened #5107

@Hengjie
Copy link
Contributor

Hengjie commented Aug 13, 2014

Guys, what do you think about doing the coalescing in getTextContent() in evaluator.js instead? I feel like this will resolve the exposing data to text_layer_builder.js problem that @chriskr described but also remove the absolute positioning data I'm inferring from the DOM in my code (for performance).

I'm not familiar with the evaluator.js but is it run in the worker?

Edit: Looks like it's from worker.

A note about the flickering, the issue with ordering is that sometimes PDFs would order it different and we'd have to reorder it. It also depends on the language, such ltr or rtl text.

I hacked up some code that's running in production that does the coalescing and fix flickering to an extent at https://web.notablepdf.com that you guys can try out. It's not perfect, really just hacky and not in a state that I think is PR-able. For eg, search highlighting depends on rangy-highlighter to work.

@yurydelendik
Copy link
Contributor

Guys, what do you think about doing the coalescing in getTextContent() in evaluator.js instead?

@Hengjie yes, that's the indented location for that. We already running some heuristics there to minimize amount of divs.

@fjaguero
Copy link

Is anyone working on this? I was wondering if it's possible to try this feature with the conflicts.

@timvandermeij
Copy link
Contributor

@Hengjie was the last one working on this if I am correct.

@Hengjie
Copy link
Contributor

Hengjie commented Dec 25, 2014

I've started working on this again and have coalescing code working in the worker's evaluator.js. I'm currently only coalescing words, but eventually want to coalesce to sentences and PR it. Longer term, the idea is to make it into a single div, so that we're coalescing paragraphs and columns together into a to reduce selection flickering further.

The performance of doing this is far better than @chriskr's approach because it doesn't rely on the DOM to get positioning information.

@Hengjie
Copy link
Contributor

Hengjie commented Dec 25, 2014

Btw, I've also written my own text selection padding algorithm but it looks like #5545 is far more sophisticated. I would prefer to wait until #5545 is merged since coalescing and padding are both complementary to better text selection goals.

@Hengjie
Copy link
Contributor

Hengjie commented Jan 3, 2015

@chriskr re font_metrics.js, the ascent seems to become 0 when zoom drops below 50% for the compressed.tracemonkey-pldi-09.pdf. The reason why is because the fontHeight is so small that getFirstNoneWhitePixelTop()'s xy iterations aren't granular enough. I haven't looked further than that, but I'd really prefer to use your font_metrics for the coalescing code I'm writing. Is there a solution to this problem?

Right now, I just fallback to the original PDF.js method in text_layer_builder.js's appendText(). See below:

      // More accurate ascent
      var font = fontHeight.toFixed(3) + 'px ' + style.fontFamily;
      var fontAscent = this.fontMetrics.getFontAscent(font, geom.str[0], style);

      // Fallback to less accurate since getFontAscent() can't get below a certain percision
      if (fontAscent === 0) {
        if (style.ascent) {
          fontAscent = style.ascent * fontHeight;
        } else if (style.descent) {
          fontAscent = (1 + style.descent) * fontHeight;
        }
      }

@chriskr
Copy link
Contributor Author

chriskr commented Jan 3, 2015

I guess you would have to define a minimal font size to measure the ascent and scale the font accordingly before measuring it if it is smaller. That could probably be done in fontmetrics.js.

@Hengjie
Copy link
Contributor

Hengjie commented Jan 3, 2015

@chriskr thanks. Got it working.

@Hengjie Hengjie mentioned this pull request Jan 4, 2015
@ziahamza
Copy link

Any updates on this pull request by any chance?

@timvandermeij
Copy link
Contributor

This PR is superseded by #6590 and #6591 (and possibly others). Since this PR is stale and its functionality is now incorporated in PDF.js by the mentioned PRs, I'm closing this PR. If there are more improvements to make for the current text layer code, please open a new issue or PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants