-
Notifications
You must be signed in to change notification settings - Fork 10k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved Text Selection By Padding Text Elements #5545
Conversation
Does this solution still require text fragments to be ordered in the text flow order? And what about if there are gaps between segments one cannot cover with a rectangle? So if you would make it too big, you would cover some other part, if you would make it too small, there would be a gap? |
Two answer @mitar's questions:
|
@speedplane Please squash the commits into one commit. See https://github.com/mozilla/pdf.js/wiki/Squashing-Commits on how to do this easily. Secondly, please make the code adhere to the style guide at https://github.com/mozilla/pdf.js/wiki/Style-Guide and make sure that linting passes by running |
@timvandermeij will do |
…t element so that they are placed next to eachother. Without any space between text elements, text selection is much more consistent.
The most recent commit is squashed and LINT tests now pass. |
Hm, there are still lots of commits in this PR. Maybe you forgot to force-push during the squashing process? |
Okay, I'll try that over again. But there are some additional changes I want to make as well. |
Sure. The easiest way is to make those changes first, commit them and then squash all commits into one commit using the method described on the wiki page. |
@mitar After a bit more investigation, it looks like my fix does not fix the text flow issue. I don't think it will be too hard to fix though. To clarify ... Selection appears to be dependent on both the padding AND the order of the text in a document. While my change creates padding that is order independent, it does not re-order the DOM elements, so it will not result in a nice flow. |
@mitar We can pretty easily re-order the text flow according to some metric, but I'm not sure what that metric would actually be. Take a look at the following image below. Assuming we were able to re-order the fragments any way we wanted, how would we do it? If you can think up an algorithm or heuristic, I can give it a try. |
Can you check how my approach is working with this document? |
@mitar How do I run your approach? The document I am referring to is located here |
You can import it into http://peerlibrary.org/ and open it there. |
@mitar I just tried it, and it doesn't work quite right. For example, on page 2 of the document, click in the whitespace between the line number 24 and the text "JUDGE GIANNETTI", and then drag down. If it was working properly, the line with JUDGE GIANNETTI would be highlighted and the 24 would not. However, it looks like the 24 is being highlighted. |
This change entirely breaks copy/paste. But then again, it looks like copy/paste is totally broken in |
@speedplane Copy-paste works fine for me for most PDFs, and Tracemonkey is no exception. It will be a problem if this PR breaks that. |
@timvandermeij You sure? Try copying and pasting the portion of text in TraceMonkey indicated in the image below. You get the following: There clearly is a problem with handling new lines and copy paste. |
@mitar Returning to the image below, how can you tell that the line numbers are not a skinny column of text and should be in a single text group? |
I think you should do some image segmentation. Searching for sections of text segments which are close to each other. |
I'm working on the following heuristic for horizontal lines. For each
Vertical lines are similar but reversed. This works with (1) non-column text with good text ordering, (2) column text with good text ordering, and (3) non-columned text with bad text ordering. It probably will not work for column text with bad text ordering, but I doubt any PDF reader works well with those (Adobe doesn't). |
@timvandermeij Take a look at how Adobe PDF copy/pastes the first two sentences of TraceMonkey:
Beyond using line-breaks, notice that they fixed the hyphenated comp-ile into a single word. Impressive. |
Phew... that was a lot of fast and furious coding. You can see the new viewer in action here: www.docketalarm.com/cases/PTAB/IPR2013-00142/Inter_Partes_Review_of_U.S._Pat._6931558/06-25-2014-PM-23445/Notice-50-Oral_Hearing_Transcript/?pdfjs Take a look at how selection works, also try loading in other documents in the viewer, I think it's working out okay. Some ongoing concerns: the O(N^2) issue, right-to-left text, rows of vertical text (not sure how often that comes up), and there is a small chance that this messes up column text selection (TraceMonkey and other Latex articles are fine). I'll get to squashing pretty soon, but I think this is a big improvement. |
Grr, not quite, if you take a look at page 4 of TraceMonkey with my new viewer, you can see there is a regression with the word "question" in the second column. I have an idea how to fix it. |
Checked it out. This N^2 is really killing it. I am not sure if padding segments is not better performance wise. But otherwise it works very well. I am impressed. I think that my issue was that I was using static padding on bottom and right, hoping that later text segments will always be over the previous and that didn't work. I still see that selection jumps up sometimes on edges between text segments, though. You can see that well if you move the mouse slowly while selecting. But I still think this is acceptable. But not everything works well, please look at this tricky PDF: https://peerlibrary.org/p/JCCuLbqynAPwwuos8 If you open left-top menu you will have a download link. Open it in your system and start selecting the title and then move down, below the title. You would expect that whole title gets selected, but in fact selection jumps up and selects header instead. See related ticket here: peerlibrary/peerlibrary#387 Have you tested this with papers with equations? |
@mitar I tried that example. The reason selection is messed up is because everything above the title is a header, and it occurs later on in the PDF despite being on top of the title. On the one hand, Adobe does the same thing we do now, so if we're good as Adobe, then we're arguably good enough. On the other hand, screw that. I have a potential solution for this floating in my head, but it will take another pass at all of the elements, possibly another N^2 pass. I'm not sure it's worthwhile. Regarding fixing the N^2 issue, that too is possible. We're basically performing a nearest neighbor search, and there are a variety of algorithms that are less than N^2 to do that. Implementing those is another story however. |
For me Adobe Reader 9 works well there. And Mac OS X Preview as well. So I think it is something we should support. |
One general solution would be that PDF.js would reorder text segments. This would allow also other improvements. Like when using PDF.js to extract text from the PDF, the text would come in reading flow. |
These changes do improve the text selection experience, but I did want to point out that the padding changes do complicate getting the coordinates of a text Range via getClientRects() as the rectangles include the extra padding in the dimensions. It doesn't directly impact the pdf viewer functionality, but any extensions people have made to do annotation or highlighting, etc could be impacted (like myself). |
Side note (coming from annotation word): maybe use open annotation standard and store annotations in text-form instead of coordinates. See http://hypothes.is/ as an example how it works with pdf.js. |
@mitar Thanks for the tip, that's a nice site. I don't want to hijack this thread too much, but for my specific use-case, I need the coordinates so I can later draw the highlights / annotations onto an actual PDF file itself later. I will experiment with that site some more and see if I can glean their approach better. I'm currently working on a workaround to try and subtract the padding out of the results of the getClientRects results, with mixed results. I believe there was a different approach you (@mitar) took for your site with dynamic padding. Do you know whether it would suffer the same issue. My thought is it wouldn't unless the mouse was still over the last text block when getting the coordinates... |
Just to close the loop (in case anyone else stumbles on this), I was able to work around this issue by simply walking the original selection range and creating sub-ranges only for the text nodes and then calling getClientRects on those sub ranges. This gives me the coordinates like before without the padding being included. Thanks |
I found and fixed two bugs. Here's a diff file to fix it: It fixes issues extending dyn padding to the top of page
|
What is required for this to be merged, other than the conflict resolution? Any tests missing? |
This needs to be reviewed and tested more, but since the main developers are currently on vacation, that might take some time. |
The other PRs improve the situation, but don't fix it. You still get flashing text, but it flashes downwards instead of upwards. When you select text and move your mouse into a blank area, the entire page gets selected. Instead, the user would expect that the selection advances line by line. This PR offers that smooth text selection experience. I have time to work on this more now if the team thinks it's eventually worth merging. |
@speedplane sounds good. I'm still considering this PR -- whole or in parts depending on risk we are willing to take. The #6590 reduces load on the algorithm in this PR.
Notice that currently Firefox has no such issue. Non-Firefox browser does not implement |
Yes, I took a look. #6590 seems to have solved the issue for Firefox, just not webkit. Very tempting to go with the simpler solution and hope that the others fix |
What's missing for me from HEAD is the vertical positioning in the text layer as outlined in the above comment. |
@nschloe Preview on Mac OSX has exact selection alignment for me -- looks like PDF might contain some weird font metrics (and Adobe Reader knows how to handle them). |
@yurydelendik I'm running on pdfjs-dist 1.2.83 and still see the issue for some PDFs. -- Perhaps the fix is not in there yet? Anyhow, not only Adobe Reader had that fixed, but also the OP's PR. |
I'm afraid you were not specific enough about issues you see. I would recommend opening issues for both of these PDF and describe in details what you see (see https://github.com/mozilla/pdf.js/blob/master/CONTRIBUTING.md). Proposed solution here adds the padding to the text (were that positioned wrong or right). The grouping PR6590 is trying to fix positioning and reduce amount of text runs we will be dealing here. The user-select PR6591 is disabling selection of non-text on the page. If the last two PRs broke the behavior of some PDFs we would like to know that. However that is off-topic of this discussion. |
@yurydelendik Right on, right on, I'll open a new issue for it. |
I reviewed most of the patch, however I stopped at text_layer_builder.js -- the text layer builder changes needs to be tested and I'm blocking this PR by #6619 . More review will be need once review comments are addressed. With accepting of this PR we will take a big performance/memory hit due to DOM elements having more data and probably override PR #5221 improvement (so I also block this by testing it with PDF files of fixed issues, e.g. #4626). Due to that (and having simpler solution for FF at #6591), we want to pref this algorithm off and, probably, move it back to src/display/ namespace. Thank you. Reviewed 7 of 9 files at r1. src/core/quadtree.js, line 12 [r1] (raw file): src/core/quadtree.js, line 21 [r1] (raw file): src/core/quadtree.js, line 52 [r1] (raw file): src/core/quadtree.js, line 58 [r1] (raw file): src/core/quadtree.js, line 70 [r1] (raw file): src/core/quadtree.js, line 93 [r1] (raw file): src/core/quadtree.js, line 97 [r1] (raw file): src/core/quadtree.js, line 160 [r1] (raw file): src/core/quadtree.js, line 190 [r1] (raw file): src/core/quadtree.js, line 247 [r1] (raw file): src/core/quadtree.js, line 260 [r1] (raw file): src/core/quadtree.js, line 307 [r1] (raw file): src/core/quadtree.js, line 335 [r1] (raw file): src/core/quadtree.js, line 373 [r1] (raw file): src/core/quadtree.js, line 379 [r1] (raw file): src/core/text_layout_evaluator.js, line 162 [r1] (raw file): web/text_layer_builder.js, line 167 [r1] (raw file): web/viewer.css, line 1274 [r1] (raw file): web/viewer.css, line 1280 [r1] (raw file): Comments from the review on Reviewable.io |
For what it's worth, I applied this PR to the pdfjs we're using and it's a significant improvement in the usability of drag-selecting text. I understand it has the potential to affect performance, but in our case, with some large and complex PDFs, we've had no problems. |
@speedplane JSCS is deprecated now, it got merged with ESLint. |
Should this PR be closed in favour of #6663, given #6663 (comment)? |
@Snuffleupagus I'm pretty sure that #6663 has a O(N^2) issue with respect to the number of divs on the page. I think you'll need a quadtree structure to do efficient spacial lookups and avoid that. Once that is added, I'm not sure how much simpler one implementation is than the other. |
Closing as superseded by #7539. We file a follow-up issue for enabling it by default, for which we need to make some performance improvements. With those in place, we should have similar performance as in this PR, but with a simpler solution. Nevertheless, thank you for working on this, though. It definitely inspired others to look into this issue and now get the most important part merged in the codebase! |
I am really glad that we saw progress here. |
@mitar you are all over the web. Also congrats on the hardwork here. |
The Problem
Text selection in pdf.js does not work so well. If while dragging, your cursor is not directly on top of a text element, then you get unexpected results. This occurs with nearly all examples, including the tracemonkey example. Try starting a selection, then dragging the mouse to a blank whitespace area beneath the initial selection. You'll see the selection gets reversed. If you keep moving your mouse, it flickers and is generally a horrible experience. I'm not the first to report this, it is the subject of at least #4629 and #4843.
The Solution
The reason why selection doesn't work is because most browsers rely on padding between text to detect text flow. The solution, therefore, is to increase padding of each text element so that each element runs right up next to the next element.
Other Solutions
Adding padding to each text element was recognized as a possible solution by @mitar in #4843. However, in #4843, the suggested solution was relatively complex, which included modifying the padding dynamically whenever a mouse moves. It was also written in CoffeeScript for a completely separate viewer and porting it seemed difficult.
This Solution
In this solution, the padding is calculated when the text elements are added. For each added text element, we find the nearest neighbor to the right that is on the same line. We then set the
padding-right
of the text element to that distance. Similarly, forpadding-bottom
we find the next line that is underneath the text element, and set the padding to the difference between lines. Settingpadding-bottom
is especially helpful for PDFs that are double spaced (lots of space between lines).The solution isn't all milk and honey though. The computation is
O(N^2)
, which could be bad for complex PDFs. There are fancier algorithms to find the nearest neighbor, but for now, for each text element, we look at every other text element to see if it's nearest.UPDATE: As discussed below, the
O(N^2)
issue in this pull request has been resolved by using a quadtree datastructure.