Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved Text Selection By Padding Text Elements #5545

Closed

Conversation

speedplane
Copy link

The Problem

Text selection in pdf.js does not work so well. If while dragging, your cursor is not directly on top of a text element, then you get unexpected results. This occurs with nearly all examples, including the tracemonkey example. Try starting a selection, then dragging the mouse to a blank whitespace area beneath the initial selection. You'll see the selection gets reversed. If you keep moving your mouse, it flickers and is generally a horrible experience. I'm not the first to report this, it is the subject of at least #4629 and #4843.

The Solution

The reason why selection doesn't work is because most browsers rely on padding between text to detect text flow. The solution, therefore, is to increase padding of each text element so that each element runs right up next to the next element.

Other Solutions

Adding padding to each text element was recognized as a possible solution by @mitar in #4843. However, in #4843, the suggested solution was relatively complex, which included modifying the padding dynamically whenever a mouse moves. It was also written in CoffeeScript for a completely separate viewer and porting it seemed difficult.

This Solution

In this solution, the padding is calculated when the text elements are added. For each added text element, we find the nearest neighbor to the right that is on the same line. We then set the padding-right of the text element to that distance. Similarly, for padding-bottom we find the next line that is underneath the text element, and set the padding to the difference between lines. Setting padding-bottom is especially helpful for PDFs that are double spaced (lots of space between lines).

The solution isn't all milk and honey though. The computation is O(N^2), which could be bad for complex PDFs. There are fancier algorithms to find the nearest neighbor, but for now, for each text element, we look at every other text element to see if it's nearest.

UPDATE: As discussed below, the O(N^2) issue in this pull request has been resolved by using a quadtree datastructure.

Review on Reviewable

@mitar
Copy link
Contributor

mitar commented Dec 16, 2014

Does this solution still require text fragments to be ordered in the text flow order?

And what about if there are gaps between segments one cannot cover with a rectangle? So if you would make it too big, you would cover some other part, if you would make it too small, there would be a gap?

@speedplane
Copy link
Author

Two answer @mitar's questions:

  • No, it does not assume ordered text flow. That's why there's an O(N^2) algorithm in there. We could reduce that algorithmic complexity if we made that assumption, but I agree, it's not a good assumption to make.
  • What type of situations are you considering?

@timvandermeij
Copy link
Contributor

@speedplane Please squash the commits into one commit. See https://github.com/mozilla/pdf.js/wiki/Squashing-Commits on how to do this easily. Secondly, please make the code adhere to the style guide at https://github.com/mozilla/pdf.js/wiki/Style-Guide and make sure that linting passes by running node make lint locally.

@speedplane
Copy link
Author

@timvandermeij will do

speedplane pushed a commit to speedplane/pdf.js that referenced this pull request Dec 16, 2014
…t element so that they are placed next to eachother. Without any space between text elements, text selection is much more consistent.
@speedplane
Copy link
Author

The most recent commit is squashed and LINT tests now pass.

@timvandermeij
Copy link
Contributor

Hm, there are still lots of commits in this PR. Maybe you forgot to force-push during the squashing process?

@speedplane
Copy link
Author

Okay, I'll try that over again. But there are some additional changes I want to make as well.

@timvandermeij
Copy link
Contributor

Sure. The easiest way is to make those changes first, commit them and then squash all commits into one commit using the method described on the wiki page.

@speedplane
Copy link
Author

@mitar After a bit more investigation, it looks like my fix does not fix the text flow issue. I don't think it will be too hard to fix though.

To clarify ... Selection appears to be dependent on both the padding AND the order of the text in a document. While my change creates padding that is order independent, it does not re-order the DOM elements, so it will not result in a nice flow.

@speedplane
Copy link
Author

@mitar We can pretty easily re-order the text flow according to some metric, but I'm not sure what that metric would actually be. Take a look at the following image below. Assuming we were able to re-order the fragments any way we wanted, how would we do it? If you can think up an algorithm or heuristic, I can give it a try.

image

@mitar
Copy link
Contributor

mitar commented Dec 16, 2014

Can you check how my approach is working with this document?

@speedplane
Copy link
Author

@mitar How do I run your approach? The document I am referring to is located here

@mitar
Copy link
Contributor

mitar commented Dec 16, 2014

You can import it into http://peerlibrary.org/ and open it there.

@speedplane
Copy link
Author

@mitar I just tried it, and it doesn't work quite right. For example, on page 2 of the document, click in the whitespace between the line number 24 and the text "JUDGE GIANNETTI", and then drag down. If it was working properly, the line with JUDGE GIANNETTI would be highlighted and the 24 would not. However, it looks like the 24 is being highlighted.

@speedplane
Copy link
Author

This change entirely breaks copy/paste. But then again, it looks like copy/paste is totally broken in master as well. Try doing any copying and pasting in tracemonkey.

@timvandermeij
Copy link
Contributor

@speedplane Copy-paste works fine for me for most PDFs, and Tracemonkey is no exception. It will be a problem if this PR breaks that.

@speedplane
Copy link
Author

@timvandermeij You sure? Try copying and pasting the portion of text in TraceMonkey indicated in the image below.

You get the following: Since no concrete type informationis available,
But you should get the following: Since no concrete type information is available,

There clearly is a problem with handling new lines and copy paste.

image

@speedplane
Copy link
Author

@mitar Returning to the image below, how can you tell that the line numbers are not a skinny column of text and should be in a single text group?

image

@mitar
Copy link
Contributor

mitar commented Dec 16, 2014

I think you should do some image segmentation. Searching for sections of text segments which are close to each other.

@speedplane
Copy link
Author

I'm working on the following heuristic for horizontal lines.

For each textDiv

  • If there is textDiv directly to the right, then it is part of the same line, and the right textDiv is ordered in the text flow appropriately.
  • However, to take care of columns, if there is a textDiv directly below, and the textDiv's width is more than 20% of the page width, then assume it's part of a text column.

Vertical lines are similar but reversed. This works with (1) non-column text with good text ordering, (2) column text with good text ordering, and (3) non-columned text with bad text ordering. It probably will not work for column text with bad text ordering, but I doubt any PDF reader works well with those (Adobe doesn't).

@speedplane
Copy link
Author

@timvandermeij Take a look at how Adobe PDF copy/pastes the first two sentences of TraceMonkey:

Dynamic languages such as JavaScript are more difficult to compile
than statically typed ones. Since no concrete type information
is available,

Beyond using line-breaks, notice that they fixed the hyphenated comp-ile into a single word. Impressive.

@speedplane
Copy link
Author

Phew... that was a lot of fast and furious coding. You can see the new viewer in action here: www.docketalarm.com/cases/PTAB/IPR2013-00142/Inter_Partes_Review_of_U.S._Pat._6931558/06-25-2014-PM-23445/Notice-50-Oral_Hearing_Transcript/?pdfjs

Take a look at how selection works, also try loading in other documents in the viewer, I think it's working out okay.

Some ongoing concerns: the O(N^2) issue, right-to-left text, rows of vertical text (not sure how often that comes up), and there is a small chance that this messes up column text selection (TraceMonkey and other Latex articles are fine).

I'll get to squashing pretty soon, but I think this is a big improvement.

@speedplane
Copy link
Author

Grr, not quite, if you take a look at page 4 of TraceMonkey with my new viewer, you can see there is a regression with the word "question" in the second column. I have an idea how to fix it.

@mitar
Copy link
Contributor

mitar commented Dec 17, 2014

Checked it out. This N^2 is really killing it. I am not sure if padding segments is not better performance wise.

But otherwise it works very well. I am impressed. I think that my issue was that I was using static padding on bottom and right, hoping that later text segments will always be over the previous and that didn't work.

I still see that selection jumps up sometimes on edges between text segments, though. You can see that well if you move the mouse slowly while selecting. But I still think this is acceptable.

But not everything works well, please look at this tricky PDF: https://peerlibrary.org/p/JCCuLbqynAPwwuos8 If you open left-top menu you will have a download link. Open it in your system and start selecting the title and then move down, below the title. You would expect that whole title gets selected, but in fact selection jumps up and selects header instead. See related ticket here: peerlibrary/peerlibrary#387

Have you tested this with papers with equations?

@speedplane
Copy link
Author

@mitar I tried that example. The reason selection is messed up is because everything above the title is a header, and it occurs later on in the PDF despite being on top of the title. On the one hand, Adobe does the same thing we do now, so if we're good as Adobe, then we're arguably good enough. On the other hand, screw that.

I have a potential solution for this floating in my head, but it will take another pass at all of the elements, possibly another N^2 pass. I'm not sure it's worthwhile.

Regarding fixing the N^2 issue, that too is possible. We're basically performing a nearest neighbor search, and there are a variety of algorithms that are less than N^2 to do that. Implementing those is another story however.

@mitar
Copy link
Contributor

mitar commented Dec 17, 2014

On the one hand, Adobe does the same thing we do now, so if we're good as Adobe, then we're arguably good enough. On the other hand, screw that.

For me Adobe Reader 9 works well there. And Mac OS X Preview as well. So I think it is something we should support.

@mitar
Copy link
Contributor

mitar commented Dec 17, 2014

One general solution would be that PDF.js would reorder text segments. This would allow also other improvements. Like when using PDF.js to extract text from the PDF, the text would come in reading flow.

@mitar
Copy link
Contributor

mitar commented Mar 11, 2015

@zhsc
Copy link

zhsc commented Apr 2, 2015

These changes do improve the text selection experience, but I did want to point out that the padding changes do complicate getting the coordinates of a text Range via getClientRects() as the rectangles include the extra padding in the dimensions.

It doesn't directly impact the pdf viewer functionality, but any extensions people have made to do annotation or highlighting, etc could be impacted (like myself).

@mitar
Copy link
Contributor

mitar commented Apr 3, 2015

It doesn't directly impact the pdf viewer functionality, but any extensions people have made to do annotation or highlighting, etc could be impacted (like myself).

Side note (coming from annotation word): maybe use open annotation standard and store annotations in text-form instead of coordinates. See http://hypothes.is/ as an example how it works with pdf.js.

@zhsc
Copy link

zhsc commented Apr 3, 2015

@mitar Thanks for the tip, that's a nice site. I don't want to hijack this thread too much, but for my specific use-case, I need the coordinates so I can later draw the highlights / annotations onto an actual PDF file itself later. I will experiment with that site some more and see if I can glean their approach better.

I'm currently working on a workaround to try and subtract the padding out of the results of the getClientRects results, with mixed results.

I believe there was a different approach you (@mitar) took for your site with dynamic padding. Do you know whether it would suffer the same issue. My thought is it wouldn't unless the mouse was still over the last text block when getting the coordinates...

@zhsc
Copy link

zhsc commented Apr 3, 2015

Just to close the loop (in case anyone else stumbles on this), I was able to work around this issue by simply walking the original selection range and creating sub-ranges only for the text nodes and then calling getClientRects on those sub ranges. This gives me the coordinates like before without the padding being included. Thanks

@Hengjie
Copy link
Contributor

Hengjie commented Jul 8, 2015

I found and fixed two bugs. Here's a diff file to fix it:

It fixes issues extending dyn padding to the top of page

---
 src/core/text_layout_evaluator.js | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/core/text_layout_evaluator.js b/src/core/text_layout_evaluator.js
index 780e7b1..bcd872c 100644
--- a/src/core/text_layout_evaluator.js
+++ b/src/core/text_layout_evaluator.js
@@ -118,7 +118,7 @@ var TextLayoutEvaluator = (function TextLayoutEvaluatorClosure() {
         // We're looking for items above this item, so start from the bottom.
         it = quadtree_vert.retrieveYInc(obj.x1, obj.y1, obj.x2 - obj.x1);
         while (objN = it.next()) {
-          if (objN.id !== obj.id && objN.y + objN.width > top) {
+          if (objN.id !== obj.id && objN.y + objN.height > top) {
             obj.top = objN.id;
             break;
           }
@@ -149,7 +149,7 @@ var TextLayoutEvaluator = (function TextLayoutEvaluatorClosure() {
         }

         // Objects do not get extended upwards unless there is nothing above.
-        if (obj.top === undefined) {
+        if (obj.top === undefined || obj.top === null) {
           // Extend to the top
           obj.fullHeight = bounds.height - obj.fullY1;
         }
-- 
2.3.2

@nschloe
Copy link
Contributor

nschloe commented Jul 29, 2015

What is required for this to be merged, other than the conflict resolution? Any tests missing?

@timvandermeij
Copy link
Contributor

This needs to be reviewed and tested more, but since the main developers are currently on vacation, that might take some time.

@timvandermeij
Copy link
Contributor

Now that we have #6590 and #6591 (among others) merged, is this PR still needed? Since it does add a lot of complexity, are there still PDF files that benefit from this solution that are not already fixed by the mentioned PRs?

@speedplane
Copy link
Author

The other PRs improve the situation, but don't fix it. You still get flashing text, but it flashes downwards instead of upwards. When you select text and move your mouse into a blank area, the entire page gets selected. Instead, the user would expect that the selection advances line by line. This PR offers that smooth text selection experience.

I have time to work on this more now if the team thinks it's eventually worth merging.

@yurydelendik
Copy link
Contributor

@speedplane sounds good. I'm still considering this PR -- whole or in parts depending on risk we are willing to take. The #6590 reduces load on the algorithm in this PR.

You still get flashing text, but it flashes downwards instead of upwards.

Notice that currently Firefox has no such issue. Non-Firefox browser does not implement -webkit-user-select: none properly IMHO (see comments in #6591).

@yurydelendik yurydelendik self-assigned this Nov 7, 2015
@speedplane
Copy link
Author

Yes, I took a look. #6590 seems to have solved the issue for Firefox, just not webkit. Very tempting to go with the simpler solution and hope that the others fix -webkit-user-select.

@nschloe
Copy link
Contributor

nschloe commented Nov 7, 2015

What's missing for me from HEAD is the vertical positioning in the text layer as outlined in the above comment.

@yurydelendik
Copy link
Contributor

What's missing for me from HEAD is the vertical positioning

@nschloe Preview on Mac OSX has exact selection alignment for me -- looks like PDF might contain some weird font metrics (and Adobe Reader knows how to handle them).

@nschloe
Copy link
Contributor

nschloe commented Nov 9, 2015

@yurydelendik I'm running on pdfjs-dist 1.2.83 and still see the issue for some PDFs. -- Perhaps the fix is not in there yet? Anyhow, not only Adobe Reader had that fixed, but also the OP's PR.

@yurydelendik
Copy link
Contributor

still see the issue for

I'm afraid you were not specific enough about issues you see. I would recommend opening issues for both of these PDF and describe in details what you see (see https://github.com/mozilla/pdf.js/blob/master/CONTRIBUTING.md). Proposed solution here adds the padding to the text (were that positioned wrong or right). The grouping PR6590 is trying to fix positioning and reduce amount of text runs we will be dealing here. The user-select PR6591 is disabling selection of non-text on the page.

If the last two PRs broke the behavior of some PDFs we would like to know that. However that is off-topic of this discussion.

@nschloe
Copy link
Contributor

nschloe commented Nov 9, 2015

@yurydelendik Right on, right on, I'll open a new issue for it.

@yurydelendik
Copy link
Contributor

I reviewed most of the patch, however I stopped at text_layer_builder.js -- the text layer builder changes needs to be tested and I'm blocking this PR by #6619 . More review will be need once review comments are addressed.

With accepting of this PR we will take a big performance/memory hit due to DOM elements having more data and probably override PR #5221 improvement (so I also block this by testing it with PDF files of fixed issues, e.g. #4626). Due to that (and having simpler solution for FF at #6591), we want to pref this algorithm off and, probably, move it back to src/display/ namespace.

Thank you.


Reviewed 7 of 9 files at r1.
Review status: 7 of 9 files reviewed at latest revision, 19 unresolved discussions, some commit checks failed.


src/core/quadtree.js, line 12 [r1] (raw file):
Here and below most of "bY" instances must be "by" in comments.


src/core/quadtree.js, line 21 [r1] (raw file):
Use consistent (with PDF.js code base) comments style -- use where it is possible jsdoc comment. (I think we shall avoid *******-type dividers). I see jsdoc comments below but they look strangely formatted for me.


src/core/quadtree.js, line 52 [r1] (raw file):
Remove Array support from here -- unneeded recursion.


src/core/quadtree.js, line 58 [r1] (raw file):
b.x + b.width or b.y + b.height is used a lot -- will it better to have left,top,right,bottom or (x1,y1,x2,y2) as bounds?


src/core/quadtree.js, line 70 [r1] (raw file):
Probably we shall remove all print-like code from the production code.


src/core/quadtree.js, line 93 [r1] (raw file):
The sorter function (here and three below) can be defined once at the "module" level.


src/core/quadtree.js, line 97 [r1] (raw file):
There is no need to create side1 and side2 array -- it will be better if two items are passed as variables


src/core/quadtree.js, line 160 [r1] (raw file):
Having those as a properties rather than constants might reduce the performance of the code


src/core/quadtree.js, line 190 [r1] (raw file):
This kind of check is use a lot -- will it be better to split QNode into two object QNode and QLeaf? That shall reduce the recursion that call itself (e.g. this.insert())


src/core/quadtree.js, line 247 [r1] (raw file):
Can we avoid of creation of new object? e.g. use flags here, also all indices/properties will be strings at this point.


src/core/quadtree.js, line 260 [r1] (raw file):
We don't have to make it bHalfXXX integer.


src/core/quadtree.js, line 307 [r1] (raw file):
nit: avoid space at end of lines


src/core/quadtree.js, line 335 [r1] (raw file):
This adds O(N logN) complexity at leaf level, also it constantly sorting element back and forward. I think it you will define limited set of sorters you can cache four instances of sorted array.


src/core/quadtree.js, line 373 [r1] (raw file):
(It will be cool if NodeIterator and LeafIterator were implemented using https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Iteration_protocols api )


src/core/quadtree.js, line 379 [r1] (raw file):
you have to null it1/it0 once last item was found -- if tree is not well balanced, will we calling next() on all non-leaf nodes without a need


src/core/text_layout_evaluator.js, line 162 [r1] (raw file):
id shall be a string here to make !(c.id in this.deduper) efficient.


web/text_layer_builder.js, line 167 [r1] (raw file):
Not sure what does it mean, but adding unneeded DOM data might be a reason for performance hit. (See also #5221)


web/viewer.css, line 1274 [r1] (raw file):
Are those changes related to this PR? If yes, will it be a better place for them at text_layer_builder.css?


web/viewer.css, line 1280 [r1] (raw file):
Fixes #6614 ?


Comments from the review on Reviewable.io

@fdstevex
Copy link

For what it's worth, I applied this PR to the pdfjs we're using and it's a significant improvement in the usability of drag-selecting text. I understand it has the potential to affect performance, but in our case, with some large and complex PDFs, we've had no problems.

@mgol
Copy link
Contributor

mgol commented Jul 26, 2016

@timvandermeij Also, regarding enforcing a particular style, the following project may be helfpul: https://github.com/jscs-dev/node-jscs

@speedplane JSCS is deprecated now, it got merged with ESLint.

@Snuffleupagus
Copy link
Collaborator

Should this PR be closed in favour of #6663, given #6663 (comment)?

@speedplane
Copy link
Author

@Snuffleupagus I'm pretty sure that #6663 has a O(N^2) issue with respect to the number of divs on the page. I think you'll need a quadtree structure to do efficient spacial lookups and avoid that. Once that is added, I'm not sure how much simpler one implementation is than the other.

@timvandermeij
Copy link
Contributor

Closing as superseded by #7539. We file a follow-up issue for enabling it by default, for which we need to make some performance improvements. With those in place, we should have similar performance as in this PR, but with a simpler solution.

Nevertheless, thank you for working on this, though. It definitely inspired others to look into this issue and now get the most important part merged in the codebase!

@mitar
Copy link
Contributor

mitar commented Sep 1, 2016

I am really glad that we saw progress here.

@jlukic
Copy link

jlukic commented Jan 19, 2017

@mitar you are all over the web. Also congrats on the hardwork here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.