How to extract text bounding box coordinates? #204

Closed
hoangthanh283 opened this issue Apr 11, 2023 · 1 comment

Comments

hoangthanh283 commented Apr 11, 2023

Thanks for your work!
I looked through the documentation for an API to extract text bounding boxes but could not find one. So I wonder do we support extracting text bounding boxes or not.

I tried with:

searcher = textpage.search("something", match_case=False, match_whole_word=False)
first_occurrence = searcher.get_next()

But it returns a tuple (int, int), the start character index and character count of the next occurrence, instead of the list of bounding boxes of the form (left, bottom, right, top) that the README mentions.

mara004 added the question and conversation labels Apr 11, 2023
mara004 self-assigned this Apr 11, 2023
mara004 (Member) commented Apr 11, 2023

> So I wonder do we support extracting text bounding boxes or not.

Yes, we do.

You'll want the PdfTextPage API, notably count_rects() and get_rect().
Sorry about the outdated README comment; that API changed with v4. I'll fix that.

I guess you only looked at the README and missed the docs on Read the Docs (RTD), right?
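
For readers landing here, a minimal sketch of how the search result might be combined with count_rects() and get_rect() is shown below. It is not taken from the project docs. The helper names find_text_boxes and demo and the path argument are hypothetical. The sketch assumes the v4 behaviour described above: get_next() returns (index, count) or None, count_rects(index, count) returns the number of rectangles covering that character range, and get_rect(i) returns (left, bottom, right, top) for the i-th rectangle of the most recent count_rects() call.

```python
def find_text_boxes(textpage, query):
    """Collect bounding boxes for every occurrence of `query` on a text page.

    `textpage` is assumed to be a pypdfium2 PdfTextPage (v4 API).
    Returns one list of (left, bottom, right, top) tuples per occurrence;
    an occurrence that wraps across lines may yield several rectangles.
    """
    searcher = textpage.search(query, match_case=False, match_whole_word=False)
    occurrences = []
    while True:
        hit = searcher.get_next()
        if hit is None:  # no further occurrences
            break
        char_index, char_count = hit
        # Count the rectangles for this character range, then fetch each one.
        n_rects = textpage.count_rects(index=char_index, count=char_count)
        occurrences.append([textpage.get_rect(i) for i in range(n_rects)])
    return occurrences


def demo(pdf_path):
    """Hypothetical usage: print the boxes for "something" on the first page."""
    import pypdfium2 as pdfium  # imported here so the helper above stays dependency-free

    pdf = pdfium.PdfDocument(pdf_path)
    textpage = pdf[0].get_textpage()
    for boxes in find_text_boxes(textpage, "something"):
        print(boxes)
```

Calling count_rects() immediately before the get_rect() loop matters in this sketch, because the rectangle indices refer to the range passed to the latest count_rects() call.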
