Need coordinate conversion help #228

samshelley · 2023-06-19T22:39:08Z

Firstly, thanks so much for creating this great library @mara004

I read #204 and especially #214, but neither seems to answer my question. I'm looking to get bounding boxes, as percentages of the canvas, for text found via search. I execute the search using the textpage.search method to get the starting index. Then I loop through and use get_charbox with the loose option to build my bounding boxes, as seen in the snippet below:

index, count = searcher.get_next()
for i in range(count):
    left, bottom, right, top = textpage.get_charbox(index + i, loose=True)
    c_width = right - left
    c_height = bottom - top
    bounding_boxes.append(dict(
        left=(left/width)*100,
        top=(top/height)*100,
        width=c_width/width*100,
        height=c_height/height*100,
        page=page_index + 1,  # there's an outer loop for each page
    ))

This almost works, but I'm noticing two broken behaviors that seem potentially related:

1. The top value seems to be greater than the bottom value, which I think doesn't make sense in a coordinate system?
2. The calculated values as percentages for width & left are exactly correct, but the values for top & height are not -- they are off by a bit.

As a way to compare and check my logic here, I opened the pdf in Mac Preview and drew a rectangle in approximately the same area of the PDF that I was looking to extract. Here, the left/right values were again accurate, but top & bottom were off by ~40-60 canvas units.

Do you have any recommendations here? Am I using the APIs incorrectly? Apologies if this was already answered elsewhere or is included in the documentation.

Thanks so much for taking a look. If you need me to provide a full working example with an attached pdf I can do that as well, just wanted to see if it was something obvious first.


samshelley · 2023-06-20T00:57:26Z

I just solved my own issue, while researching Apache PDFBox: https://stackoverflow.com/a/54045861.

I think you sort of alluded to it in other support tickets, but I had to convert from x,y in the bottom left hand corner of the document to x,y in the top left.
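
A minimal sketch of that conversion, assuming an unrotated page whose PDF origin is the bottom left corner at (0, 0) with y pointing up. The function name and the sample numbers are made up for illustration; only the left, bottom, right, top order comes from get_charbox as used above.

```python
def charbox_to_percent(left, bottom, right, top, page_width, page_height):
    """Convert a PDF-space box (origin bottom left, y up) into
    top-left-origin percentages (y down), as a web overlay expects.

    Assumes an unrotated page whose origin sits at (0, 0).
    """
    return {
        "left": left / page_width * 100,
        # Distance from the top edge of the page to the top edge of the box.
        "top": (page_height - top) / page_height * 100,
        "width": (right - left) / page_width * 100,
        # In PDF space top > bottom, so this difference is positive.
        "height": (top - bottom) / page_height * 100,
    }


# Hypothetical character box on a 612 x 792 pt (US Letter) page.
box = charbox_to_percent(72.0, 700.0, 144.0, 720.0, 612.0, 792.0)
print(box)
```

This also explains the two symptoms in the original snippet: top is larger than bottom because y grows upward in PDF space, and top/height were off because top was measured from the bottom edge rather than from the top edge.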

I'm going to leave it open only as I think the docs would benefit from a brief section explaining how to convert "PDF Canvas Units" to typical x/y coordinate space. Feel free to close if you disagree!

mara004 · 2023-06-20T10:52:43Z

Hi, nice to hear you essentially figured it out already.

Yes, in PDF, the coordinate system's origin is typically the bottom left corner (unlike top left for bitmaps), though in theory the PDF spec allows the coordinate system to be laid out between any opposite corners (I think, anyway).

As you say, comments #214 (comment) and #214 (comment) kind of discuss that already.

As this seems to be a common problem, I suppose you're right the docs would deserve a section on coordinate conversion. Maybe even some support model around FPDF_PageToDevice() / FPDF_DeviceToPage().
I'll need some time to consider, though.

samshelley · 2023-06-20T11:10:26Z

Thanks! Re-reading those comments it's clear in retrospect, I just didn't grasp it the first time.

I implemented it just using python and not considering rotation. For completeness, it seems like you are suggesting that this will work most of the time, but not all. Is rotation the only additional case to consider?

Or is the easiest solution just to use the raw APIs for each coordinate pair in the bounding box since it will handle it reliably?

mara004 · 2023-06-20T11:27:43Z

> I implemented it just using python and not considering rotation. For completeness, it seems like you are suggesting that this will work most of the time, but not all. Is rotation the only additional case to consider?

Yes, that's what I meant.
I think rotation is probably not the only additional case, though (there's the aforementioned "any opposite corners" problem, for one thing), and I'd indeed recommend calling these raw API functions, since that seems safest/easiest.

samshelley · 2023-06-20T11:35:29Z

Got it! I'm very unfamiliar with ctypes, but based on the method signature, it seems that the method I would be using, FPDF_PageToDevice, returns values as integers instead of float/double, which would be a problem since all of the values I'm working with have many decimal places.

FPDF_EXPORT FPDF_BOOL FPDF_CALLCONV FPDF_PageToDevice(FPDF_PAGE page,
                                                      int start_x,
                                                      int start_y,
                                                      int size_x,
                                                      int size_y,
                                                      int rotate,
                                                      double page_x,
                                                      double page_y,
                                                      int* device_x,
                                                      int* device_y);

Am I understanding this incorrectly?

If I'm reading it correctly, the logic in FPDF_PageToDevice turns out not to be that complicated, so if that method doesn't work I'll likely come back to this later and re-implement it in Python using the helper methods you made for PdfMatrix.

Is it currently possible to easily call raw methods on a page object, like CPDF_Page->GetDisplayMatrix?

mara004 · 2023-06-20T11:46:20Z

> based on the method signature, it seems that the method I would be using, FPDF_PageToDevice, returns values as integers instead of float/double, which would be a problem since all of the values I'm working with have many decimal places.

Ooh, yes. If you're not actually targeting a bitmap to draw on, that sounds like a problem.
I guess you can use a large bitmap and then downscale so you don't run into real precision trouble, but yes, that's inelegant. Need to think about this...
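
To put a number on that workaround, here is a rough sketch of the "large virtual bitmap, then scale down" idea. The integer conversion is replaced by a stand-in function (fake_page_to_device, hypothetical, with its own rounding rule) so the snippet runs without pdfium; in real code that callable would wrap FPDF_PageToDevice() and read back its int out-parameters. The 612 x 792 page size is an assumption as well.

```python
def fake_page_to_device(page_x, page_y, size_x, size_y,
                        page_w=612.0, page_h=792.0):
    """Stand-in for an integer-only page-to-device conversion.

    Unrotated page, device origin at the top left. This is NOT pdfium code,
    it only mimics the loss of the fractional part.
    """
    device_x = round(page_x / page_w * size_x)
    device_y = round((1 - page_y / page_h) * size_y)
    return device_x, device_y


def to_fraction(page_to_device, page_x, page_y, virtual_size=1_000_000):
    """Map a page point to (x, y) fractions of the page via integer pixels.

    A very large virtual canvas keeps the rounding error per axis at
    roughly 0.5 / virtual_size.
    """
    dx, dy = page_to_device(page_x, page_y, virtual_size, virtual_size)
    return dx / virtual_size, dy / virtual_size


fx, fy = to_fraction(fake_page_to_device, 100.25, 700.5)
print(fx * 100, fy * 100)  # percentages for a web overlay
```

Nothing has to be rendered for this: the size arguments only describe a virtual canvas, so the precision cost is a matter of choosing a large enough integer.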

> Is it currently possible to easily call raw methods on a page object, like CPDF_Page->GetDisplayMatrix?

Sadly the CPDF_* API layer is pdfium's private C++ backend which we can't access with ABI bindings / ctypes.
This is an unfortunate but known limitation of our (one could say, quick and dirty) bindings concept :(

samshelley · 2023-06-20T11:52:50Z

OK thank you! This has been incredibly helpful -- really appreciate the pointers. Yes I think I'm doing something a bit different than others here (but it does work!)

GetDisplayMatrix is actually really simple as well so we've solved my issue for now -- https://pdfium.googlesource.com/pdfium.git/+/798e18f5e5cfb672c7f3186f6358b84c5ff7785b/core/fpdfapi/page/cpdf_page.cpp
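
For later readers, a float-preserving conversion that also covers page rotation in 90 degree steps could look roughly like the sketch below. It is not a port of pdfium's GetDisplayMatrix, just plain arithmetic under stated assumptions: the input point is in unrotated page space with the origin at the bottom left corner (0, 0), rotation is clockwise as with the PDF /Rotate entry, and the function name is invented for this example.

```python
def page_point_to_normalized(x, y, page_width, page_height, rotation=0):
    """Map a PDF page point to (u, v) in [0, 1] with a top-left origin.

    x, y: point in unrotated page space (origin bottom left, y up).
    page_width, page_height: size of the unrotated page.
    rotation: clockwise display rotation in degrees (0, 90, 180 or 270).
    """
    rotation %= 360
    fx = x / page_width    # fraction of the way from the left edge
    fy = y / page_height   # fraction of the way from the bottom edge
    if rotation == 0:
        return fx, 1 - fy
    if rotation == 90:     # original left edge is displayed at the top
        return fy, fx
    if rotation == 180:    # original bottom edge is displayed at the top
        return 1 - fx, fy
    if rotation == 270:    # original right edge is displayed at the top
        return 1 - fy, 1 - fx
    raise ValueError("rotation must be a multiple of 90 degrees")


# Track the unrotated page's top left corner on a 612 x 792 pt page.
for angle in (0, 90, 180, 270):
    print(angle, page_point_to_normalized(0.0, 792.0, 612.0, 792.0, angle))
```

Multiplying u and v by 100 gives percentages of the displayed page. Pages whose coordinate system does not start at (0, 0) would need the box origin subtracted first, which this sketch leaves out.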

mara004 · 2023-06-20T12:11:37Z

That's good to hear, thanks!

However, I'm still left to think what I should do with pypdfium2 now.
And I'm sort of wondering why you want to change coordinate representation if you don't actually work with device pixels?

samshelley · 2023-06-20T12:18:13Z

I am rendering a "highlight" layer in a web interface to highlight specific text in a displayed pdf. The rendering engine uses percentage values to determine where to place items so I need to use the right coordinate space.

I'm fairly new to all of this so honestly not sure if my suggestion is too narrow....but as far as what would be helpful to my use-case, if you had a python API implementation of FPDF_PageToDevice that maintained precision, I would 100% use that instead of what I'm likely to implement when I come back to this. But this also might be too narrow a use-case, so just a note somewhere in the docs that explains PDF coordinate space (and then a reference to it in the API docs for all of the methods that return coordinates) would have also been totally sufficient!

mara004 · 2023-06-20T12:49:09Z

I see, thank you for elaborating.

Maybe, as an alternative to a python re-implementation, we could ask pdfium to add a float equivalent of FPDF_PageToDevice()? We don't need any bitmap parameters, just two functions for (almost-)lossless back and forth translation between normalized and native PDF coordinates.

samshelley · 2023-06-20T13:11:29Z

That would work perfectly!

mara004 · 2023-08-08T23:54:09Z

Commit a379ecc (in the devel branch) adds a helper around FPDF_PageToDevice() / FPDF_DeviceToPage(), but only to translate between a page and a corresponding bitmap rendering.

The quest for float coordinate normalization still stands.

samshelley · 2023-08-11T11:48:09Z

Thanks for the update!

mara004 · 2023-08-11T18:57:24Z

Our docs often mention coordinate order, such as left, bottom, right, top for rectangle return.
That feels problematic. At least we should add something like "relative to the PDF coordinate system".
Or maybe we should avoid these terms entirely and use unspecific variable names instead, e.g. x0, y0, x1, y1?

mara004 · 2023-12-07T23:43:09Z

I think I'll convert this to a discussion, because I've concluded it isn't a good idea to implement coordinate conversion from scratch in pypdfium2 (nor would I have the time to do so), especially given that FPDF_PageToDevice() / FPDF_DeviceToPage() already exist and cover the main use case.

However, to any users affected, feel free to file a feature request at pdfium for float coordinate normalization (or perhaps even contribute a patch yourself).

mara004 changed the title ~~Using get_charbox, bottom & top values seem to be inaccurate~~ Add help regarding coordinate conversion Jun 20, 2023

mara004 added the documentation, enhancement, pdfium, minor, and api labels Jun 24, 2023

mara004 mentioned this issue Jul 3, 2023

Correct way to deal with rotated documents for text extraction? #234

Closed

mara004 removed the minor label Aug 25, 2023

mara004 changed the title ~~Add help regarding coordinate conversion~~ Need coordinate conversion help Dec 7, 2023

pypdfium2-team locked and limited conversation to collaborators Dec 7, 2023

mara004 converted this issue into discussion #284 Dec 7, 2023


This issue was moved to a discussion.
