Multiline text blocks #4629

chriskr · 2014-04-15T14:12:59Z

Support for multi-line text blocks

Adds a feature flag to enable multi-line text blocks (disableMultilineTextLayer=false), disabled by default.
The purpose is to make the DOM lighter and the text selection more natural, e.g. less flickering on text selection. Additionally added a flag to make the text layer visible (debugTextLayer). Mainly checked the code with

the Trace Monkey document
the NASA document (Poor performance on NASA Budget #4339)
the PDF 1.7 spec
some Russian document (Poor performance on a pdf document #4626)
and some French document (pdf.js worst case: openclassrooms pdfs #4487)

Especially with the last two documents the scroll performance has improved remarkably, mainly due to concatenating of chars to words.

Also changed the highlight color to system color as a suggestion.

Snuffleupagus · 2014-04-15T14:55:29Z

Additionally added a flag to make the text layer visible (debugTextLayer)

Isn't this somewhat redundant, given the current textLayer hash parameter, see: https://github.com/mozilla/pdf.js/wiki/Debugging-pdf.js#url-parameters.

Edit: As a general rule, please strict equalities/inequalities in new code.

chriskr · 2014-04-15T15:20:49Z

You're right, overlooked textLayer=visible, though debugTextLayer shows a bit more, mainly things i needed to check that the code works as intended.

Snuffleupagus · 2014-04-15T15:27:46Z

though debugTextLayer shows a bit more

Personally I would prefer if you modified (or added to) the current textLayer debugging, instead of basically duplicating existing code. @yurydelendik What do you think?

Also, the file font_metrics.js is missing a license header (and editor mode lines).

chriskr · 2014-04-15T15:37:03Z

No problem, i can do that, will remove the debugTextLayer flag.

timvandermeij · 2014-04-15T16:43:47Z

/botio-linux preview

pdfjsbot · 2014-04-15T16:43:48Z

From: Bot.io (Linux)

Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://107.21.233.14:8877/d1e01dc257d7d00/output.txt

pdfjsbot · 2014-04-15T16:44:11Z

From: Bot.io (Linux)

Success

Full output at http://107.21.233.14:8877/d1e01dc257d7d00/output.txt

Total script time: 0.38 mins

Published

Viewer: http://107.21.233.14:8877/d1e01dc257d7d00/web/viewer.html
B2G Viewer: http://107.21.233.14:8877/d1e01dc257d7d00/extensions/b2g/content/web/viewer.html
Extension: http://107.21.233.14:8877/d1e01dc257d7d00/extensions/firefox/pdf.js.xpi
Extension (AMO): http://107.21.233.14:8877/d1e01dc257d7d00/extensions/firefox/pdf.js.amo.xpi

timvandermeij · 2014-04-15T16:45:00Z

The preview does not work:

ReferenceError: FontMetrics is not defined

chriskr · 2014-04-15T18:02:07Z

Ok, will fix it.

Adds a feature flag to support multi-line text layer. The main purpose is to make text selection a bit more natural and to keep the DOM as leightweight as possible by using as much as possible inline elements.

Hengjie · 2014-04-17T14:30:00Z

Was just working on something similar and was about to PR it until I saw this. Mine did text coalescing on chars to sentences. One particular problem I faced was a space for justified text doesn't always match up. It's quite evident in this file: http://www.music.mcgill.ca/~cmckay/papers/musictech/ISMIR_2006_Codaich.pdf

chriskr · 2014-04-17T14:51:47Z

justified text doesn't always match up

Yes, i can see that too, very obvious also in the PDF 1.7 spec. AFAICT this is caused by the fact that the wordSpacing value is not taken into account currently in buildTextGeometry. I'm looking into these details now actually.

Snuffleupagus · 2014-04-17T15:10:25Z

AFAICT this is caused by the fact that the wordSpacing value is not taken into account currently in buildTextGeometry.

There was a very recent (yesterday) PR, #4633, that fixed some spacing issues in the textLayer. Could that be relevant for this issue?

chriskr · 2014-04-17T15:50:44Z

Could that be relevant for this issue?

Not really relevant for this issue here, but it is related.

Hengjie · 2014-04-18T01:12:28Z

Guys, here's my code that deals with the justified text problem. In short, I interpolate the spacing between two characters with a space in the middle and set the width of the space. I choose to use width instead of letter-spacing because of other selection issues. Most of the functionality is fairly self contained into a this.coalesce() method. This is basically a drop in replacement to text_layer_builder.js. If there is interest, I'd be happy to tidy it up and PR it, but I'd rather that we work together to make text coalescing better in this PR.

https://gist.github.com/Hengjie/11019860

Hengjie · 2014-04-28T05:15:11Z

Also noticed another bug. Unfortunately, Safari (both iOS and OSX v 7.0) does not support fractional letter-spacing. https://bugs.webkit.org/show_bug.cgi?id=20606 Though the bug is fixed, it won't be fixed until a newer version of Safari is released.

mitar · 2014-05-26T22:02:29Z

My comment from #4843:

Interesting, this probably improves performance a bit, but does not really address the issue what happens when you move the mouse outside the multi-line block? And does it address the issue of text segments being in the wrong order? Just combining them in the order provided by pdf.js does not seem to work in all cases.

timvandermeij · 2014-06-17T21:53:40Z

Let's get this PR rolling again. @chriskr Would it be useful to use some of the code from @Hengjie? Because text selection is still a bit of a bottleneck for PDF.js, there is definitely interest to improve its performance.

chriskr · 2014-06-18T14:44:48Z

My idea was to start with this from scratch. I see three main areas to be improved:

performance
matching of the text layer and the canvas text
UX, e.g. less flickering on selecting bigger text blocks

The performance issues are mainly caused by to expensive text layers in terms of DOM. Basically for some types of PDFs each single character ends up in its own composite layer (in its own scaled absolute positioned element). This makes e.g. scrolling of these documents very slow.

There are IMO two main issues to improve matching of the text, ensure that really the correct font family gets selected (the PDFs are often lying about the font family, i did some tests with xor'ing the target font with all main font families, that turnes out to work pretty well). The other issue is word spacing. Currently the word spacing is not exposed to the text layer (there is some code which tries to address that with inserting spaces between text fragments if the white space is bigger than an according white space, but that covers only limited set of PDFs, e.g. not the ones which define word spacing instead). Exposing it makes adjustments in the text layer a bit more complicated, but it should IMO be possible. @Hengjie code mainly tries to improve the text matching.

UX issue, like flickering while selecting whole text blocks, are caused by the current DOM structure, e.g. an elemnt of the text layer contains at most only a single line.

This PR here addresses all issues to some extend. Most of the changes are in the text layer. But i think it would make more sense to move most of that code down to buildTextGeometry, basically move that call to it's own class and add to the according geometry objects also line numbers and ideally text block numbers. There are four PS instructions which call buildTextGeometry to create text geometry objects, OPS.showSpacedText, OPS.showText, OPS.nextLineShowText, OPS.nextLineSetSpacingShowText. Right now only OPS.showSpacedText accumulates text fragments in a geometry object, but at least OPS.showText should do that too (this is the main reason for the performance issues).

But the main issue is that i don't really have time anymore to work on this. I can probably update this PR here to the newest mainline, but i have no time to do the changes which i have had in mind. Perhaps discuss with @yurydelendik if that makes sense or if you would like to restart from scratch with these improvements?

mitar · 2014-06-20T08:45:01Z

UX issue, like flickering while selecting whole text blocks, are caused by the current DOM structure, e.g. an elemnt of the text layer contains at most only a single line.

I addressed this in #4843. If you make sure that text segments are in correct order of text flow, then flickering is fixed, I believe. Try it here: https://peerlibrary.org/p/tfgbf3kfxbQ4ubunj/trace-based-just-in-time-type-specialization-for-dynamic-languages

timvandermeij · 2014-06-20T17:45:05Z

@mitar That appears to work much smoother indeed, but I do notice that, while text selection is working, right-click and Copy does not work. Right-clicking undoes the selection and displays the regular context menu without Copy option.

mitar · 2014-06-20T17:49:51Z

@timvandermeij: This is some our text annotation code interfering I think. I will look into it, but I do not see that it is related to the text selection process itself.

timvandermeij · 2014-06-20T17:56:45Z

@mitar Oh, it's not, I just thought I'd mention it for your library (which looks nice by the way!) :-)

timvandermeij · 2014-06-20T17:57:00Z

@Hengjie Are you interested in continuing with your solution, taking the tips from @chriskr and @mitar into account?

Hengjie · 2014-07-15T04:40:16Z

Sorry for the late response. Yes, I'd be more than happy to take over the task of coalesing the texts.

@chriskr do you know where the word spacing information is? How can I extract it? The way I'm doing it right now is inferring that information based on the letter's absolute positions and the maximum char width of the font. It's a hacky solution and I'd prefer to remove all of that interpolation code.

The biggest problem I have with the code is that it completely breaks search highlighting. I'm unsure how to fix it and will probably need help as I'm unfamiliar with the rest of the PDF.js code.

I've also done a lot of work in fixing up selecting text and how dragging a pixel beyond the text will result in all text being selected.

chriskr · 2014-07-15T06:30:21Z

Word spacing is exposed on TextState as wordSpacing, see core/evaluator.js. I guess you will have to dig into that file a bit.

timvandermeij · 2014-07-15T18:26:23Z

@Hengjie Thank you! If you have a prototype, then we can look into why it breaks search highlighting. Fortunately, the search highlighting code isn't that complicated from what I've seen from it, so we can always help you look into that.

Hengjie · 2014-07-16T03:11:32Z

Yup, my code is kinda old, I'll polish it up with the latest master HEAD before I have a prototype.

Hengjie · 2014-07-30T12:13:18Z

Just wanted to update this issue, I'm in the middle of merging master into my branch.

jleclanche · 2014-07-31T16:39:58Z

Wanted to chime in with a pdf that has severe performance issues on selection:
http://www.meghnacambridge.co.uk/menu.pdf

Don't know if it helps, or if it's even too relevant. If it's not, I'll file a new bug.

timvandermeij · 2014-07-31T17:51:40Z

@jleclanche I can confirm that, but please open a separate issue for that.

jleclanche · 2014-07-31T17:53:17Z

Opened #5107

Hengjie · 2014-08-13T08:01:56Z

Guys, what do you think about doing the coalescing in getTextContent() in evaluator.js instead? I feel like this will resolve the exposing data to text_layer_builder.js problem that @chriskr described but also remove the absolute positioning data I'm inferring from the DOM in my code (for performance).

I'm not familiar with the evaluator.js but is it run in the worker?

Edit: Looks like it's from worker.

A note about the flickering, the issue with ordering is that sometimes PDFs would order it different and we'd have to reorder it. It also depends on the language, such ltr or rtl text.

I hacked up some code that's running in production that does the coalescing and fix flickering to an extent at https://web.notablepdf.com that you guys can try out. It's not perfect, really just hacky and not in a state that I think is PR-able. For eg, search highlighting depends on rangy-highlighter to work.

yurydelendik · 2014-08-13T14:52:48Z

Guys, what do you think about doing the coalescing in getTextContent() in evaluator.js instead?

@Hengjie yes, that's the indented location for that. We already running some heuristics there to minimize amount of divs.

fjaguero · 2014-11-10T17:59:03Z

Is anyone working on this? I was wondering if it's possible to try this feature with the conflicts.

timvandermeij · 2014-11-10T18:01:06Z

@Hengjie was the last one working on this if I am correct.

Hengjie · 2014-12-25T02:53:12Z

I've started working on this again and have coalescing code working in the worker's evaluator.js. I'm currently only coalescing words, but eventually want to coalesce to sentences and PR it. Longer term, the idea is to make it into a single div, so that we're coalescing paragraphs and columns together into a to reduce selection flickering further.

The performance of doing this is far better than @chriskr's approach because it doesn't rely on the DOM to get positioning information.

Hengjie · 2014-12-25T02:56:43Z

Btw, I've also written my own text selection padding algorithm but it looks like #5545 is far more sophisticated. I would prefer to wait until #5545 is merged since coalescing and padding are both complementary to better text selection goals.

Hengjie · 2015-01-03T10:44:09Z

@chriskr re font_metrics.js, the ascent seems to become 0 when zoom drops below 50% for the compressed.tracemonkey-pldi-09.pdf. The reason why is because the fontHeight is so small that getFirstNoneWhitePixelTop()'s xy iterations aren't granular enough. I haven't looked further than that, but I'd really prefer to use your font_metrics for the coalescing code I'm writing. Is there a solution to this problem?

Right now, I just fallback to the original PDF.js method in text_layer_builder.js's appendText(). See below:

      // More accurate ascent
      var font = fontHeight.toFixed(3) + 'px ' + style.fontFamily;
      var fontAscent = this.fontMetrics.getFontAscent(font, geom.str[0], style);

      // Fallback to less accurate since getFontAscent() can't get below a certain percision
      if (fontAscent === 0) {
        if (style.ascent) {
          fontAscent = style.ascent * fontHeight;
        } else if (style.descent) {
          fontAscent = (1 + style.descent) * fontHeight;
        }
      }

chriskr · 2015-01-03T11:39:48Z

I guess you would have to define a minimal font size to measure the ascent and scale the font accordingly before measuring it if it is smaller. That could probably be done in fontmetrics.js.

Hengjie · 2015-01-03T15:11:55Z

@chriskr thanks. Got it working.

ziahamza · 2015-03-17T03:05:22Z

Any updates on this pull request by any chance?

timvandermeij · 2015-11-07T14:43:20Z

This PR is superseded by #6590 and #6591 (and possibly others). Since this PR is stale and its functionality is now incorporated in PDF.js by the mentioned PRs, I'm closing this PR. If there are more improvements to make for the current text layer code, please open a new issue or PR.

yurydelendik added the 4-text-selection label Apr 15, 2014

yurydelendik self-assigned this Apr 15, 2014

Christian Krebs added 2 commits April 15, 2014 22:36

Support for multi-line text layer

3c6ea9c

Adds a feature flag to support multi-line text layer. The main purpose is to make text selection a bit more natural and to keep the DOM as leightweight as possible by using as much as possible inline elements.

Get the font ascent of the font used in the text layer

e3be5eb

Snuffleupagus mentioned this pull request May 26, 2014

Improving the way how text selection works #4843

Closed

yurydelendik mentioned this pull request May 29, 2014

words broken up into letters #4863

Closed

mitar mentioned this pull request Jun 20, 2014

Unable to copy text out using a context menu peerlibrary/peerlibrary#516

Open

timvandermeij added the 4-work-in-progress label Jun 23, 2014

Snuffleupagus mentioned this pull request Jul 2, 2014

Bug 810636 - Poor copy & paste behavior with pdf.js #2989

Closed

tilgovi mentioned this pull request Aug 20, 2014

Improve PDF selection experience using methods inspired by PeerLibrary hypothesis/h#905

Closed

fjaguero mentioned this pull request Nov 10, 2014

Support paragraphs grouping in the same div #5482

Closed

speedplane mentioned this pull request Dec 16, 2014

Improved Text Selection By Padding Text Elements #5545

Closed

Hengjie mentioned this pull request Jan 4, 2015

textLayer slightly off #5558

Closed

idea-launch-lab mentioned this pull request Oct 5, 2015

Text block in PDF-to-text conversion being broken up into multiple lines #6504

Closed

timvandermeij closed this Nov 7, 2015

Multiline text blocks #4629

Multiline text blocks #4629

Conversation

chriskr commented Apr 15, 2014

Snuffleupagus commented Apr 15, 2014

chriskr commented Apr 15, 2014

Snuffleupagus commented Apr 15, 2014

chriskr commented Apr 15, 2014

timvandermeij commented Apr 15, 2014

pdfjsbot commented Apr 15, 2014

From: Bot.io (Linux)

Received

pdfjsbot commented Apr 15, 2014

From: Bot.io (Linux)

Success

Published

timvandermeij commented Apr 15, 2014

chriskr commented Apr 15, 2014

Hengjie commented Apr 17, 2014

chriskr commented Apr 17, 2014

Snuffleupagus commented Apr 17, 2014

chriskr commented Apr 17, 2014

Hengjie commented Apr 18, 2014

Hengjie commented Apr 28, 2014

mitar commented May 26, 2014

timvandermeij commented Jun 17, 2014

chriskr commented Jun 18, 2014

mitar commented Jun 20, 2014

timvandermeij commented Jun 20, 2014

mitar commented Jun 20, 2014

timvandermeij commented Jun 20, 2014

timvandermeij commented Jun 20, 2014

Hengjie commented Jul 15, 2014

chriskr commented Jul 15, 2014

timvandermeij commented Jul 15, 2014

Hengjie commented Jul 16, 2014

Hengjie commented Jul 30, 2014

jleclanche commented Jul 31, 2014

timvandermeij commented Jul 31, 2014

jleclanche commented Jul 31, 2014

Hengjie commented Aug 13, 2014

yurydelendik commented Aug 13, 2014

fjaguero commented Nov 10, 2014

timvandermeij commented Nov 10, 2014

Hengjie commented Dec 25, 2014

Hengjie commented Dec 25, 2014

Hengjie commented Jan 3, 2015

chriskr commented Jan 3, 2015

Hengjie commented Jan 3, 2015

ziahamza commented Mar 17, 2015

timvandermeij commented Nov 7, 2015