Improved Text Selection By Padding Text Elements #5545

speedplane · 2014-12-16T17:32:05Z

The Problem

Text selection in pdf.js does not work so well. If while dragging, your cursor is not directly on top of a text element, then you get unexpected results. This occurs with nearly all examples, including the tracemonkey example. Try starting a selection, then dragging the mouse to a blank whitespace area beneath the initial selection. You'll see the selection gets reversed. If you keep moving your mouse, it flickers and is generally a horrible experience. I'm not the first to report this, it is the subject of at least #4629 and #4843.

The Solution

The reason why selection doesn't work is because most browsers rely on padding between text to detect text flow. The solution, therefore, is to increase padding of each text element so that each element runs right up next to the next element.

This Solution

In this solution, the padding is calculated when the text elements are added. For each added text element, we find the nearest neighbor to the right that is on the same line. We then set the padding-right of the text element to that distance. Similarly, for padding-bottom we find the next line that is underneath the text element, and set the padding to the difference between lines. Setting padding-bottom is especially helpful for PDFs that are double spaced (lots of space between lines).

The solution isn't all milk and honey though. The computation is O(N^2), which could be bad for complex PDFs. There are fancier algorithms to find the nearest neighbor, but for now, for each text element, we look at every other text element to see if it's nearest.

UPDATE: As discussed below, the O(N^2) issue in this pull request has been resolved by using a quadtree datastructure.

mitar · 2014-12-16T17:36:01Z

Does this solution still require text fragments to be ordered in the text flow order?

And what about if there are gaps between segments one cannot cover with a rectangle? So if you would make it too big, you would cover some other part, if you would make it too small, there would be a gap?

speedplane · 2014-12-16T17:52:30Z

Two answer @mitar's questions:

No, it does not assume ordered text flow. That's why there's an O(N^2) algorithm in there. We could reduce that algorithmic complexity if we made that assumption, but I agree, it's not a good assumption to make.
What type of situations are you considering?

timvandermeij · 2014-12-16T18:24:05Z

@speedplane Please squash the commits into one commit. See https://github.com/mozilla/pdf.js/wiki/Squashing-Commits on how to do this easily. Secondly, please make the code adhere to the style guide at https://github.com/mozilla/pdf.js/wiki/Style-Guide and make sure that linting passes by running node make lint locally.

speedplane · 2014-12-16T18:32:33Z

@timvandermeij will do

…t element so that they are placed next to eachother. Without any space between text elements, text selection is much more consistent.

speedplane · 2014-12-16T18:58:51Z

The most recent commit is squashed and LINT tests now pass.

timvandermeij · 2014-12-16T19:02:21Z

Hm, there are still lots of commits in this PR. Maybe you forgot to force-push during the squashing process?

speedplane · 2014-12-16T19:27:42Z

Okay, I'll try that over again. But there are some additional changes I want to make as well.

timvandermeij · 2014-12-16T19:28:32Z

Sure. The easiest way is to make those changes first, commit them and then squash all commits into one commit using the method described on the wiki page.

speedplane · 2014-12-16T19:49:28Z

@mitar After a bit more investigation, it looks like my fix does not fix the text flow issue. I don't think it will be too hard to fix though.

To clarify ... Selection appears to be dependent on both the padding AND the order of the text in a document. While my change creates padding that is order independent, it does not re-order the DOM elements, so it will not result in a nice flow.

speedplane · 2014-12-16T20:18:17Z

@mitar We can pretty easily re-order the text flow according to some metric, but I'm not sure what that metric would actually be. Take a look at the following image below. Assuming we were able to re-order the fragments any way we wanted, how would we do it? If you can think up an algorithm or heuristic, I can give it a try.

mitar · 2014-12-16T20:22:59Z

Can you check how my approach is working with this document?

speedplane · 2014-12-16T20:53:21Z

@mitar How do I run your approach? The document I am referring to is located here

mitar · 2014-12-16T20:54:15Z

You can import it into http://peerlibrary.org/ and open it there.

speedplane · 2014-12-16T20:57:28Z

@mitar I just tried it, and it doesn't work quite right. For example, on page 2 of the document, click in the whitespace between the line number 24 and the text "JUDGE GIANNETTI", and then drag down. If it was working properly, the line with JUDGE GIANNETTI would be highlighted and the 24 would not. However, it looks like the 24 is being highlighted.

speedplane · 2014-12-16T21:35:43Z

This change entirely breaks copy/paste. But then again, it looks like copy/paste is totally broken in master as well. Try doing any copying and pasting in tracemonkey.

timvandermeij · 2014-12-16T21:36:38Z

@speedplane Copy-paste works fine for me for most PDFs, and Tracemonkey is no exception. It will be a problem if this PR breaks that.

speedplane · 2014-12-16T21:45:07Z

@timvandermeij You sure? Try copying and pasting the portion of text in TraceMonkey indicated in the image below.

You get the following: Since no concrete type informationis available,
But you should get the following: Since no concrete type information is available,

There clearly is a problem with handling new lines and copy paste.

speedplane · 2014-12-16T22:09:36Z

@mitar Returning to the image below, how can you tell that the line numbers are not a skinny column of text and should be in a single text group?

mitar · 2014-12-16T22:11:06Z

I think you should do some image segmentation. Searching for sections of text segments which are close to each other.

speedplane · 2014-12-16T22:23:11Z

I'm working on the following heuristic for horizontal lines.

For each textDiv

If there is textDiv directly to the right, then it is part of the same line, and the right textDiv is ordered in the text flow appropriately.
However, to take care of columns, if there is a textDiv directly below, and the textDiv's width is more than 20% of the page width, then assume it's part of a text column.

Vertical lines are similar but reversed. This works with (1) non-column text with good text ordering, (2) column text with good text ordering, and (3) non-columned text with bad text ordering. It probably will not work for column text with bad text ordering, but I doubt any PDF reader works well with those (Adobe doesn't).

speedplane · 2014-12-16T22:31:13Z

@timvandermeij Take a look at how Adobe PDF copy/pastes the first two sentences of TraceMonkey:

Dynamic languages such as JavaScript are more difficult to compile
than statically typed ones. Since no concrete type information
is available,

Beyond using line-breaks, notice that they fixed the hyphenated comp-ile into a single word. Impressive.

speedplane · 2014-12-17T00:30:50Z

Phew... that was a lot of fast and furious coding. You can see the new viewer in action here: www.docketalarm.com/cases/PTAB/IPR2013-00142/Inter_Partes_Review_of_U.S._Pat._6931558/06-25-2014-PM-23445/Notice-50-Oral_Hearing_Transcript/?pdfjs

Take a look at how selection works, also try loading in other documents in the viewer, I think it's working out okay.

Some ongoing concerns: the O(N^2) issue, right-to-left text, rows of vertical text (not sure how often that comes up), and there is a small chance that this messes up column text selection (TraceMonkey and other Latex articles are fine).

I'll get to squashing pretty soon, but I think this is a big improvement.

speedplane · 2014-12-17T00:39:01Z

Grr, not quite, if you take a look at page 4 of TraceMonkey with my new viewer, you can see there is a regression with the word "question" in the second column. I have an idea how to fix it.

mitar · 2014-12-17T05:17:47Z

Checked it out. This N^2 is really killing it. I am not sure if padding segments is not better performance wise.

But otherwise it works very well. I am impressed. I think that my issue was that I was using static padding on bottom and right, hoping that later text segments will always be over the previous and that didn't work.

I still see that selection jumps up sometimes on edges between text segments, though. You can see that well if you move the mouse slowly while selecting. But I still think this is acceptable.

But not everything works well, please look at this tricky PDF: https://peerlibrary.org/p/JCCuLbqynAPwwuos8 If you open left-top menu you will have a download link. Open it in your system and start selecting the title and then move down, below the title. You would expect that whole title gets selected, but in fact selection jumps up and selects header instead. See related ticket here: peerlibrary/peerlibrary#387

Have you tested this with papers with equations?

speedplane · 2014-12-17T06:00:49Z

@mitar I tried that example. The reason selection is messed up is because everything above the title is a header, and it occurs later on in the PDF despite being on top of the title. On the one hand, Adobe does the same thing we do now, so if we're good as Adobe, then we're arguably good enough. On the other hand, screw that.

I have a potential solution for this floating in my head, but it will take another pass at all of the elements, possibly another N^2 pass. I'm not sure it's worthwhile.

Regarding fixing the N^2 issue, that too is possible. We're basically performing a nearest neighbor search, and there are a variety of algorithms that are less than N^2 to do that. Implementing those is another story however.

mitar · 2014-12-17T06:03:58Z

On the one hand, Adobe does the same thing we do now, so if we're good as Adobe, then we're arguably good enough. On the other hand, screw that.

For me Adobe Reader 9 works well there. And Mac OS X Preview as well. So I think it is something we should support.

mitar · 2014-12-17T06:05:09Z

One general solution would be that PDF.js would reorder text segments. This would allow also other improvements. Like when using PDF.js to extract text from the PDF, the text would come in reading flow.

mitar · 2015-03-11T06:41:50Z

It does here: https://github.com/peerlibrary/peerlibrary/blob/development/client/lib/highlighter.coffee#L244

zhsc · 2015-04-02T18:10:52Z

These changes do improve the text selection experience, but I did want to point out that the padding changes do complicate getting the coordinates of a text Range via getClientRects() as the rectangles include the extra padding in the dimensions.

It doesn't directly impact the pdf viewer functionality, but any extensions people have made to do annotation or highlighting, etc could be impacted (like myself).

mitar · 2015-04-03T04:08:58Z

It doesn't directly impact the pdf viewer functionality, but any extensions people have made to do annotation or highlighting, etc could be impacted (like myself).

Side note (coming from annotation word): maybe use open annotation standard and store annotations in text-form instead of coordinates. See http://hypothes.is/ as an example how it works with pdf.js.

zhsc · 2015-04-03T13:32:19Z

@mitar Thanks for the tip, that's a nice site. I don't want to hijack this thread too much, but for my specific use-case, I need the coordinates so I can later draw the highlights / annotations onto an actual PDF file itself later. I will experiment with that site some more and see if I can glean their approach better.

I'm currently working on a workaround to try and subtract the padding out of the results of the getClientRects results, with mixed results.

I believe there was a different approach you (@mitar) took for your site with dynamic padding. Do you know whether it would suffer the same issue. My thought is it wouldn't unless the mouse was still over the last text block when getting the coordinates...

zhsc · 2015-04-03T15:04:40Z

Just to close the loop (in case anyone else stumbles on this), I was able to work around this issue by simply walking the original selection range and creating sub-ranges only for the text nodes and then calling getClientRects on those sub ranges. This gives me the coordinates like before without the padding being included. Thanks

Hengjie · 2015-07-08T01:23:22Z

I found and fixed two bugs. Here's a diff file to fix it:

It fixes issues extending dyn padding to the top of page

---
 src/core/text_layout_evaluator.js | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/core/text_layout_evaluator.js b/src/core/text_layout_evaluator.js
index 780e7b1..bcd872c 100644
--- a/src/core/text_layout_evaluator.js
+++ b/src/core/text_layout_evaluator.js
@@ -118,7 +118,7 @@ var TextLayoutEvaluator = (function TextLayoutEvaluatorClosure() {
         // We're looking for items above this item, so start from the bottom.
         it = quadtree_vert.retrieveYInc(obj.x1, obj.y1, obj.x2 - obj.x1);
         while (objN = it.next()) {
-          if (objN.id !== obj.id && objN.y + objN.width > top) {
+          if (objN.id !== obj.id && objN.y + objN.height > top) {
             obj.top = objN.id;
             break;
           }
@@ -149,7 +149,7 @@ var TextLayoutEvaluator = (function TextLayoutEvaluatorClosure() {
         }

         // Objects do not get extended upwards unless there is nothing above.
-        if (obj.top === undefined) {
+        if (obj.top === undefined || obj.top === null) {
           // Extend to the top
           obj.fullHeight = bounds.height - obj.fullY1;
         }
-- 
2.3.2

nschloe · 2015-07-29T12:56:42Z

What is required for this to be merged, other than the conflict resolution? Any tests missing?

timvandermeij · 2015-07-29T17:55:33Z

This needs to be reviewed and tested more, but since the main developers are currently on vacation, that might take some time.

timvandermeij · 2015-11-07T14:47:01Z

Now that we have #6590 and #6591 (among others) merged, is this PR still needed? Since it does add a lot of complexity, are there still PDF files that benefit from this solution that are not already fixed by the mentioned PRs?

speedplane · 2015-11-07T15:59:25Z

The other PRs improve the situation, but don't fix it. You still get flashing text, but it flashes downwards instead of upwards. When you select text and move your mouse into a blank area, the entire page gets selected. Instead, the user would expect that the selection advances line by line. This PR offers that smooth text selection experience.

I have time to work on this more now if the team thinks it's eventually worth merging.

yurydelendik · 2015-11-07T18:34:26Z

@speedplane sounds good. I'm still considering this PR -- whole or in parts depending on risk we are willing to take. The #6590 reduces load on the algorithm in this PR.

You still get flashing text, but it flashes downwards instead of upwards.

Notice that currently Firefox has no such issue. Non-Firefox browser does not implement -webkit-user-select: none properly IMHO (see comments in #6591).

speedplane · 2015-11-07T18:45:53Z

Yes, I took a look. #6590 seems to have solved the issue for Firefox, just not webkit. Very tempting to go with the simpler solution and hope that the others fix -webkit-user-select.

nschloe · 2015-11-07T21:09:38Z

What's missing for me from HEAD is the vertical positioning in the text layer as outlined in the above comment.

yurydelendik · 2015-11-07T23:20:33Z

What's missing for me from HEAD is the vertical positioning

@nschloe Preview on Mac OSX has exact selection alignment for me -- looks like PDF might contain some weird font metrics (and Adobe Reader knows how to handle them).

nschloe · 2015-11-09T20:05:04Z

@yurydelendik I'm running on pdfjs-dist 1.2.83 and still see the issue for some PDFs. -- Perhaps the fix is not in there yet? Anyhow, not only Adobe Reader had that fixed, but also the OP's PR.

yurydelendik · 2015-11-09T20:44:13Z

still see the issue for

I'm afraid you were not specific enough about issues you see. I would recommend opening issues for both of these PDF and describe in details what you see (see https://github.com/mozilla/pdf.js/blob/master/CONTRIBUTING.md). Proposed solution here adds the padding to the text (were that positioned wrong or right). The grouping PR6590 is trying to fix positioning and reduce amount of text runs we will be dealing here. The user-select PR6591 is disabling selection of non-text on the page.

If the last two PRs broke the behavior of some PDFs we would like to know that. However that is off-topic of this discussion.

nschloe · 2015-11-09T20:55:45Z

@yurydelendik Right on, right on, I'll open a new issue for it.

yurydelendik · 2015-11-11T16:50:39Z

I reviewed most of the patch, however I stopped at text_layer_builder.js -- the text layer builder changes needs to be tested and I'm blocking this PR by #6619 . More review will be need once review comments are addressed.

With accepting of this PR we will take a big performance/memory hit due to DOM elements having more data and probably override PR #5221 improvement (so I also block this by testing it with PDF files of fixed issues, e.g. #4626). Due to that (and having simpler solution for FF at #6591), we want to pref this algorithm off and, probably, move it back to src/display/ namespace.

Thank you.

Reviewed 7 of 9 files at r1.
Review status: 7 of 9 files reviewed at latest revision, 19 unresolved discussions, some commit checks failed.

^{src/core/quadtree.js, line 12 [r1] (raw file):}
Here and below most of "bY" instances must be "by" in comments.

^{src/core/quadtree.js, line 21 [r1] (raw file):}
Use consistent (with PDF.js code base) comments style -- use where it is possible jsdoc comment. (I think we shall avoid *******-type dividers). I see jsdoc comments below but they look strangely formatted for me.

^{src/core/quadtree.js, line 52 [r1] (raw file):}
Remove Array support from here -- unneeded recursion.

^{src/core/quadtree.js, line 58 [r1] (raw file):}
b.x + b.width or b.y + b.height is used a lot -- will it better to have left,top,right,bottom or (x1,y1,x2,y2) as bounds?

^{src/core/quadtree.js, line 70 [r1] (raw file):}
Probably we shall remove all print-like code from the production code.

^{src/core/quadtree.js, line 93 [r1] (raw file):}
The sorter function (here and three below) can be defined once at the "module" level.

^{src/core/quadtree.js, line 97 [r1] (raw file):}
There is no need to create side1 and side2 array -- it will be better if two items are passed as variables

^{src/core/quadtree.js, line 160 [r1] (raw file):}
Having those as a properties rather than constants might reduce the performance of the code

^{src/core/quadtree.js, line 190 [r1] (raw file):}
This kind of check is use a lot -- will it be better to split QNode into two object QNode and QLeaf? That shall reduce the recursion that call itself (e.g. this.insert())

^{src/core/quadtree.js, line 247 [r1] (raw file):}
Can we avoid of creation of new object? e.g. use flags here, also all indices/properties will be strings at this point.

^{src/core/quadtree.js, line 260 [r1] (raw file):}
We don't have to make it bHalfXXX integer.

^{src/core/quadtree.js, line 307 [r1] (raw file):}
nit: avoid space at end of lines

^{src/core/quadtree.js, line 335 [r1] (raw file):}
This adds O(N logN) complexity at leaf level, also it constantly sorting element back and forward. I think it you will define limited set of sorters you can cache four instances of sorted array.

^{src/core/quadtree.js, line 373 [r1] (raw file):}
(It will be cool if NodeIterator and LeafIterator were implemented using https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Iteration_protocols api )

^{src/core/quadtree.js, line 379 [r1] (raw file):}
you have to null it1/it0 once last item was found -- if tree is not well balanced, will we calling next() on all non-leaf nodes without a need

^{src/core/text_layout_evaluator.js, line 162 [r1] (raw file):}
id shall be a string here to make !(c.id in this.deduper) efficient.

^{web/text_layer_builder.js, line 167 [r1] (raw file):}
Not sure what does it mean, but adding unneeded DOM data might be a reason for performance hit. (See also #5221)

^{web/viewer.css, line 1274 [r1] (raw file):}
Are those changes related to this PR? If yes, will it be a better place for them at text_layer_builder.css?

^{web/viewer.css, line 1280 [r1] (raw file):}
Fixes #6614 ?

Comments from the review on Reviewable.io

fdstevex · 2016-07-18T16:02:50Z

For what it's worth, I applied this PR to the pdfjs we're using and it's a significant improvement in the usability of drag-selecting text. I understand it has the potential to affect performance, but in our case, with some large and complex PDFs, we've had no problems.

mgol · 2016-07-26T11:28:29Z

@timvandermeij Also, regarding enforcing a particular style, the following project may be helfpul: https://github.com/jscs-dev/node-jscs

@speedplane JSCS is deprecated now, it got merged with ESLint.

Snuffleupagus · 2016-07-26T11:31:45Z

Should this PR be closed in favour of #6663, given #6663 (comment)?

speedplane · 2016-07-26T17:15:58Z

@Snuffleupagus I'm pretty sure that #6663 has a O(N^2) issue with respect to the number of divs on the page. I think you'll need a quadtree structure to do efficient spacial lookups and avoid that. Once that is added, I'm not sure how much simpler one implementation is than the other.

timvandermeij · 2016-09-01T14:42:30Z

Closing as superseded by #7539. We file a follow-up issue for enabling it by default, for which we need to make some performance improvements. With those in place, we should have similar performance as in this PR, but with a simpler solution.

Nevertheless, thank you for working on this, though. It definitely inspired others to look into this issue and now get the most important part merged in the codebase!

mitar · 2016-09-01T19:23:44Z

I am really glad that we saw progress here.

jlukic · 2017-01-19T01:52:33Z

@mitar you are all over the web. Also congrats on the hardwork here.

speedplane mentioned this pull request Dec 16, 2014

Improving the way how text selection works #4843

Closed

timvandermeij added the text-selection label Dec 16, 2014

yurydelendik mentioned this pull request Nov 4, 2015

Non/less flashing text layer during selection. #6591

Merged

yurydelendik self-assigned this Nov 7, 2015

nschloe mentioned this pull request Nov 9, 2015

Vertical positioning of text layer off (for some documents) #6614

Closed

yurydelendik mentioned this pull request Nov 19, 2015

Expanding divs to improve selection. #6663

Closed

4 tasks

timvandermeij closed this Sep 1, 2016

Improved Text Selection By Padding Text Elements #5545

Improved Text Selection By Padding Text Elements #5545

Conversation

speedplane commented Dec 16, 2014

The Problem

The Solution

Other Solutions

This Solution

mitar commented Dec 16, 2014

speedplane commented Dec 16, 2014

timvandermeij commented Dec 16, 2014

speedplane commented Dec 16, 2014

speedplane commented Dec 16, 2014

timvandermeij commented Dec 16, 2014

speedplane commented Dec 16, 2014

timvandermeij commented Dec 16, 2014

speedplane commented Dec 16, 2014

speedplane commented Dec 16, 2014

mitar commented Dec 16, 2014

speedplane commented Dec 16, 2014

mitar commented Dec 16, 2014

speedplane commented Dec 16, 2014

speedplane commented Dec 16, 2014

timvandermeij commented Dec 16, 2014

speedplane commented Dec 16, 2014

speedplane commented Dec 16, 2014

mitar commented Dec 16, 2014

speedplane commented Dec 16, 2014

speedplane commented Dec 16, 2014

speedplane commented Dec 17, 2014

speedplane commented Dec 17, 2014

mitar commented Dec 17, 2014

speedplane commented Dec 17, 2014

mitar commented Dec 17, 2014

mitar commented Dec 17, 2014

mitar commented Mar 11, 2015

zhsc commented Apr 2, 2015

mitar commented Apr 3, 2015

zhsc commented Apr 3, 2015

zhsc commented Apr 3, 2015

Hengjie commented Jul 8, 2015

nschloe commented Jul 29, 2015

timvandermeij commented Jul 29, 2015

timvandermeij commented Nov 7, 2015

speedplane commented Nov 7, 2015

yurydelendik commented Nov 7, 2015

speedplane commented Nov 7, 2015

nschloe commented Nov 7, 2015

yurydelendik commented Nov 7, 2015

nschloe commented Nov 9, 2015

yurydelendik commented Nov 9, 2015

nschloe commented Nov 9, 2015

yurydelendik commented Nov 11, 2015

fdstevex commented Jul 18, 2016

mgol commented Jul 26, 2016

Snuffleupagus commented Jul 26, 2016

speedplane commented Jul 26, 2016

timvandermeij commented Sep 1, 2016

mitar commented Sep 1, 2016

jlukic commented Jan 19, 2017 • edited Loading

jlukic commented Jan 19, 2017 •

edited

Loading