Improving annotations API to get access to all annotations stored in the PDF #5283

Closed
mitar opened this issue Sep 9, 2014 · 10 comments

@mitar
Contributor

mitar commented Sep 9, 2014

Currently, support for annotations is incomplete. Besides missing support for some annotation types, the API itself gives only incomplete access to information about annotations.

I am interested in using PDF.js to convert annotations stored in PDFs into the Open Annotation standard, which is used by the new W3C Web Annotation Working Group. For that, I would like a PDF.js API that returns all annotations and highlights stored in the PDF. Even if PDF.js does not support rendering them, they could at least be returned for consumption through the API.
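As a rough illustration of the target of such a conversion, the sketch below builds a JSON object in the shape of the W3C Web Annotation Data Model (the later successor of Open Annotation) from an already extracted highlight. The function name `toWebAnnotation` and the input fields `quote` and `comment` are hypothetical; PDF.js itself does not provide them.

```javascript
// Hypothetical helper: wraps an extracted highlight in a Web Annotation object.
// `highlight.quote` is the highlighted text, `highlight.comment` an optional note.
function toWebAnnotation(highlight, pdfUrl) {
  const annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    type: "Annotation",
    motivation: highlight.comment ? "commenting" : "highlighting",
    target: {
      source: pdfUrl,
      // TextQuoteSelector identifies the target by its exact text.
      selector: { type: "TextQuoteSelector", exact: highlight.quote },
    },
  };
  if (highlight.comment) {
    // Only annotations that carry a note get a textual body.
    annotation.body = { type: "TextualBody", value: highlight.comment };
  }
  return annotation;
}
```

This is why the text quote matters for the conversion: without the selected text, the selector above cannot be filled in.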

In particular, the issues I am observing (see this example PDF) are:

  • highlights made with Mac OS X Preview are always rendered: they are part of the page rendering process and cannot be turned on or off as wanted (see #5252, "Don't render highlights done in Preview by default")
  • highlights are not returned through the annotations API, and there is no way to access them at all; they are not even available among the annotations used during page rendering; it seems they have to be found and processed specially, then added to the annotations API output
  • best would be if highlights could be returned through API with text quote which they are selecting already available, and some other information about the position of the text on the page; maybe indexes of text layer elements overlapping with the highlight or something like that; this is needed for conversion to open annotation standard; so some way to use highlighting information as returned through API (maybe in combination with text layer information) to determine the selected text content and position on the page
  • for all other annotations, the API should return all known information, even if PDF.js does not know how to handle a particular annotation type or piece of information
  • in particular, information about links between annotations should be provided: is there a link between a highlight and an annotation made next to it? is there a link between an annotation that displays an icon and the annotation holding the text content shown above that icon? (it seems something like this is already available, but it was unclear how to get it through the API)
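
The idea of matching highlights against text layer elements can be sketched roughly as follows. This is only an illustration under assumptions: annotation objects are taken to carry `subtype`, `id` and a `rect` of the form `[x1, y1, x2, y2]`, and text items are taken to carry `str`, `transform`, `width` and `height`, with unrotated text. The helper names are hypothetical and not part of PDF.js.

```javascript
// Approximate bounding box [x1, y1, x2, y2] of an unrotated text item.
// Assumes transform[4] and transform[5] hold the item's x and y origin.
function textItemRect(item) {
  const x = item.transform[4];
  const y = item.transform[5];
  return [x, y, x + item.width, y + item.height];
}

// True if two [x1, y1, x2, y2] rectangles share a non-empty area.
function rectsOverlap(a, b) {
  return a[0] < b[2] && b[0] < a[2] && a[1] < b[3] && b[1] < a[3];
}

// For each Highlight annotation, collect the indexes and the text of
// the text items whose bounding box overlaps the highlight rectangle.
function quoteHighlights(annotations, textItems) {
  return annotations
    .filter((annotation) => annotation.subtype === "Highlight")
    .map((annotation) => {
      const itemIndexes = [];
      textItems.forEach((item, index) => {
        if (rectsOverlap(annotation.rect, textItemRect(item))) {
          itemIndexes.push(index);
        }
      });
      return {
        id: annotation.id,
        itemIndexes,
        quote: itemIndexes.map((i) => textItems[i].str).join(" "),
      };
    });
}
```

A rectangle test like this over-selects when a highlight covers only part of a text item, so a real implementation would probably need per-character positions or the highlight's individual quadrilaterals.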
@mitar
Contributor Author

mitar commented Sep 9, 2014

cc @HeXXiiiZ

@hags37

hags37 commented Feb 19, 2015

Do we have any resolution for this?

@mitar
Contributor Author

mitar commented Feb 19, 2015

Somebody has to implement it. :-)

@wscalf

wscalf commented Aug 4, 2015

I see that it's not assigned - do we know if anyone is working on it?

I have a personal interest in the annotations API describing form elements more completely and have been digging around in the AcroForms section of the PDF spec and pdf.js code lately. I might be able to contribute.

@timvandermeij
Contributor

I'm working on the annotation layer to refactor it (see https://github.com/mozilla/pdf.js/commits/master/src/core/annotation.js for an idea of the kind of patches I make), but I'm not touching the API for that. Feel free to work on this issue and create a PR once you have a working version to get early feedback.

@timvandermeij
Contributor

@mitar By the way, doesn't https://github.com/mozilla/pdf.js/blob/master/src/display/api.js#L743 at least partially do what you want? Maybe you were already familiar with it; in that case you can ignore my comment.
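
For readers unfamiliar with that part of the API, a minimal sketch of walking all pages and collecting whatever annotation data is returned might look like this. It assumes `pdfDocument` behaves like a loaded PDF.js document (`numPages`, `getPage(n)` with 1-based page numbers) and that each page offers `getAnnotations()`; the function name `collectAnnotations` and the choice of copied fields are only illustrative.

```javascript
// Collect basic annotation data from every page of a loaded document.
// `pdfDocument` is assumed to behave like a PDF.js document object.
async function collectAnnotations(pdfDocument) {
  const result = [];
  for (let pageNumber = 1; pageNumber <= pdfDocument.numPages; pageNumber++) {
    const page = await pdfDocument.getPage(pageNumber);
    const annotations = await page.getAnnotations();
    for (const annotation of annotations) {
      // Copy only a few fields that are assumed to be present.
      result.push({
        pageNumber,
        id: annotation.id,
        subtype: annotation.subtype,
        rect: annotation.rect,
      });
    }
  }
  return result;
}
```

Whatever this returns is limited to what the library chooses to expose, which is the limitation this issue describes.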

@mitar
Contributor Author

mitar commented Nov 10, 2015

No, this is what this ticket is about. I listed the limitations of the current API above. It does not give you access to all annotations and all their properties. It seems the API provides only things that are rendered or used by pdf.js, while things that other apps add are not available, even though they seem to be part of the normal PDF standard.

@jlegewie

jlegewie commented Apr 8, 2016

best would be if highlights could be returned through API with text quote which they are selecting already available, and some other information about the position of the text on the page; maybe indexes of text layer elements overlapping with the highlight or something like that; this is needed for conversion to open annotation standard; so some way to use highlighting information as returned through API (maybe in combination with text layer information) to determine the selected text content and position on the page

This is something I have worked on for a long time. I have a modified pdf.js version that supports this feature here (sorry, the fork is a mess), but it's not based on the current pdf.js version. I use it in zotfile to extract highlighted text from PDF files. It would be great if the API supported getting the annotation text directly, but I am not sure whether that is in the scope of pdf.js.

@bjohas

bjohas commented Dec 7, 2018

Hello, I'm also interested in this. What kind of annotations can PDF.js extract? E.g. can the comment text be extracted too? Can the colour be extracted?

@Snuffleupagus
Collaborator

At this point in time we support a lot more Annotation types, compared to when this issue was opened (7 years ago), and for any unsupported types we'll return "generic" Annotation-data. Hence all Annotations should now, at least to some extent, be accessible through the API (and we cannot return arbitrary unverified data for unsupported Annotations).
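
As a hedged sketch of consuming such data defensively, the helper below reads a few fields and falls back when they are missing. The field names `contentsObj`, `contents` and `color` (an array-like of three 0 to 255 values) are assumptions about the returned data and should be checked against the PDF.js version in use; `summarizeAnnotation` and `toHex` are hypothetical names.

```javascript
// Convert an array-like of three 0-255 values to a "#rrggbb" string.
function toHex(color) {
  return "#" + Array.from(color, (c) => c.toString(16).padStart(2, "0")).join("");
}

// Summarize one annotation, tolerating missing fields so that
// "generic" data for unsupported types does not cause errors.
function summarizeAnnotation(annotation) {
  const text = annotation.contentsObj?.str ?? annotation.contents ?? "";
  return {
    subtype: annotation.subtype ?? "Unknown",
    text,
    color: annotation.color ? toHex(annotation.color) : null,
  };
}
```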

Given the age of this issue, and that #5283 (comment) mentions a bunch of different things, it doesn't seem useful to keep this open any more. If specific issues are still encountered, please open a new issue for each problem observed; see also https://github.com/mozilla/pdf.js/blob/master/.github/CONTRIBUTING.md
