Reading text contents of page #93

Closed
matthopson opened this issue Apr 15, 2019 · 3 comments
@matthopson

Hi, thanks for working on this project. It has met my needs beautifully with one exception (and it's probably a lack of understanding on my part).

I couldn't find a very intuitive way to get the contents of a page and verify its text content.

Generating a new page and inserting it into a document was very straightforward, but I'd also like to test this functionality, including that the expected contents end up on the page (it's dynamically generated). So when writing a test, I'd like to create a page, insert several lines of text, and then read that page back in to verify that the expected lines of text exist on it.

Am I overlooking something obvious, or is there no straightforward way to do this?

Thanks!

@Hopding
Owner

Hopding commented Apr 16, 2019

Hello @matthopson. pdf-lib is primarily focused on creating and editing PDFs right now. It does not currently have functionality to extract text content from them. Though, this is functionality I've considered adding at some point in the future.

For your use case, I'd suggest using pdf.js to extract text from the documents you create/modify with pdf-lib. pdf.js is a library specifically designed to extract text, images, etc. from PDFs for rendering. Here's an example of using it in Node.
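As a rough sketch of what that could look like in a test: the helper below walks a pdf.js-style document and returns one string per page. It is not code from either library. The interfaces are hypothetical minimal types covering only the members used (`numPages`, `getPage`, `getTextContent`, `str`), and `joinTextItems` and `extractPageTexts` are illustrative names.

```typescript
// Hypothetical minimal types: only the pdf.js members this sketch relies on.
interface TextItemLike {
  str?: string; // marked-content items carry no string
}
interface PageLike {
  getTextContent(): Promise<{ items: TextItemLike[] }>;
}
interface DocumentLike {
  numPages: number;
  getPage(pageNumber: number): Promise<PageLike>;
}

// Join the text fragments of one page into a single space-separated string.
function joinTextItems(items: TextItemLike[]): string {
  return items.map((item) => item.str ?? "").join(" ");
}

// Collect the text of every page, in page order.
async function extractPageTexts(doc: DocumentLike): Promise<string[]> {
  const texts: string[] = [];
  // pdf.js page numbers start at 1, not 0.
  for (let n = 1; n <= doc.numPages; n++) {
    const page = await doc.getPage(n);
    const content = await page.getTextContent();
    texts.push(joinTextItems(content.items));
  }
  return texts;
}

// Assumed usage in a test, with pdf-lib and pdfjs-dist installed:
//   const bytes = await pdfDoc.save();                               // pdf-lib
//   const doc = await pdfjsLib.getDocument({ data: bytes }).promise; // pdf.js
//   const [firstPage] = await extractPageTexts(doc);
//   // then assert that firstPage contains the expected lines
```

pdf.js may split a visual line into several fragments, so checking that the extracted text contains a substring is likely more robust than comparing whole lines exactly.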

Let me know if you have any further questions!

@Hopding Hopding closed this as completed Apr 16, 2019
@matthopson
Author

Thanks for the response. I had considered this, but was hoping not to have to use two separate PDF libraries. It sounds like that's my best bet for the time being.

Thanks!

@themaxempire23

Hi, thank you for working on this cool library, what a team.

I would like to find out if it's possible to use pdf-lib to get specific text from a PDF file using coordinates, as in where the specific text is on the page.

I'm working on a simple feature in a React OCR (optical character recognition, using Tesseract) app, with Node.js and Express as the server. My goal is for a user to simply upload a scanned PDF and have a specific number extracted from the document.

Looking forward to your response.

Kind regards
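No answer to this was posted in the thread. As noted earlier, pdf-lib does not extract text, so any lookup by position would have to work on output from another tool. A minimal sketch, assuming text items shaped like those from pdf.js's `getTextContent()`, where `transform[4]` and `transform[5]` hold the x and y position: `PositionedItem`, `Box`, and `textInBox` are hypothetical names. A scanned PDF usually has no text layer, in which case the items would have to come from OCR output instead.

```typescript
// Hypothetical shape: a text fragment plus its 6-element transform matrix.
interface PositionedItem {
  str: string;
  transform: number[]; // [a, b, c, d, x, y]
}

// Rectangle in the same coordinate space as the item positions.
interface Box {
  xMin: number;
  xMax: number;
  yMin: number;
  yMax: number;
}

// Keep only the fragments whose origin falls inside the box, then join them.
function textInBox(items: PositionedItem[], box: Box): string {
  return items
    .filter((item) => {
      const x = item.transform[4];
      const y = item.transform[5];
      return x >= box.xMin && x <= box.xMax && y >= box.yMin && y <= box.yMax;
    })
    .map((item) => item.str)
    .join(" ");
}
```

Only each fragment's origin point is tested here. A stricter version could also take the fragment's width and height into account.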
