This issue was moved to a discussion.

Need coordinate conversion help #228

Closed
samshelley opened this issue Jun 19, 2023 · 15 comments
Labels
api, documentation, enhancement, pdfium


samshelley commented Jun 19, 2023

Firstly, thanks so much for creating this great library @mara004

I read #204 and especially #214, but neither seems to answer my question. I'm looking to get bounding boxes, as percentages of the canvas, for text found via search. I execute the search using the textpage.search method to get the starting index. Then I loop through and use get_charbox with the loose option to build my bounding boxes, as seen in the snippet below:

index, count = searcher.get_next()
for i in range(count):
    left, bottom, right, top = textpage.get_charbox(index + i, loose=True)
    c_width = right - left
    c_height = bottom - top
    bounding_boxes.append(dict(
        left=(left/width)*100,
        top=(top/height)*100,
        width=c_width/width*100,
        height=c_height/height*100,
        page=page_index+1 # there's an outer loop for each page
    ))

This almost works, but I'm noticing two broken behaviors that seem potentially related:

  1. The top value seems to be greater than the bottom value, which I think doesn't make sense in a coordinate system?
  2. The calculated values as percentages for width & left are exactly correct, but the values for top & height are not -- they are off by a bit.

As a way to compare and check my logic here, I opened the pdf in Mac Preview and drew a rectangle in approximately the same area of the PDF that I was looking to extract. Here, the left/right values were again accurate, but top & bottom were off by ~40-60 canvas units.

Do you have any recommendations here? Am I using the APIs incorrectly? Apologies if this was already answered elsewhere or is included in the documentation.

Thanks so much for taking a look. If you need me to provide a full working example with an attached pdf I can do that as well, just wanted to see if it was something obvious first.

Author

samshelley commented Jun 20, 2023

I just solved my own issue, while researching Apache PDFBox: https://stackoverflow.com/a/54045861.

I think you sort of alluded to it in other support tickets, but I had to convert from x,y in the bottom left hand corner of the document to x,y in the top left.

I'm going to leave it open only as I think the docs would benefit from a brief section explaining how to convert "PDF Canvas Units" to typical x/y coordinate space. Feel free to close if you disagree!
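
For readers who land here with the same problem, the conversion described above can be sketched roughly as follows. This is a minimal sketch, assuming an unrotated page whose coordinate origin is the bottom left corner at (0, 0); the helper name charbox_to_percent is hypothetical and not part of pypdfium2.

```python
def charbox_to_percent(box, page_width, page_height):
    """Convert a PDF-space box to percentages with a top-left origin.

    box is (left, bottom, right, top) in PDF canvas units, as returned
    by get_charbox(). Assumes an unrotated page with its origin at the
    bottom left corner (an assumption, not guaranteed for every PDF).
    """
    left, bottom, right, top = box
    return {
        "left": left / page_width * 100,
        # Flip the y axis: distance from the top edge of the page.
        "top": (page_height - top) / page_height * 100,
        "width": (right - left) / page_width * 100,
        # In PDF space top > bottom, so subtract in this order.
        "height": (top - bottom) / page_height * 100,
    }


# Hypothetical box on a 200 x 100 unit page.
print(charbox_to_percent((20, 60, 60, 80), 200, 100))
```

Compared to the snippet in the opening comment, the two changes are the flipped top value and the order of the subtraction for the height.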

Member

mara004 commented Jun 20, 2023

Hi, nice to hear you essentially figured it out already.

Yes, in PDF, the coordinate system's origin is typically the bottom left corner (unlike top left for bitmaps), though in theory the PDF spec allows the coordinate system to be laid out between any opposite corners (I think, anyway).

As you say, comments #214 (comment) and #214 (comment) kind of discuss that already.

As this seems to be a common problem, I suppose you're right the docs would deserve a section on coordinate conversion. Maybe even some support model around FPDF_PageToDevice() / FPDF_DeviceToPage().
I'll need some time to consider, though.

mara004 changed the title from "Using get_charbox, bottom & top values seem to be inaccurate" to "Add help regarding coordinate conversion" on Jun 20, 2023
Author

samshelley commented Jun 20, 2023

Thanks! Re-reading those comments it's clear in retrospect, I just didn't grasp it the first time.

I implemented it just using python and not considering rotation. For completeness, it seems like you are suggesting that this will work most of the time, but not all. Is rotation the only additional case to consider?

Or is the easiest solution just to use the raw APIs for each coordinate pair in the bounding box since it will handle it reliably?

Member

mara004 commented Jun 20, 2023

> I implemented it just using python and not considering rotation. For completeness, it seems like you are suggesting that this will work most of the time, but not all. Is rotation the only additional case to consider?

Yes, that's what I meant.
I think rotation is probably not the only additional case, though (there's the aforementioned "any opposite corners" problem, for one thing), and I'd indeed recommend calling these raw API functions, since that seems safest and easiest.

Author

samshelley commented Jun 20, 2023

Got it! I'm very unfamiliar with ctypes, but based on the method signature it seems to suggest that the method I would be using FPDF_PageToDevice returns values as integers instead of float/double which would be a problem since all of the values I'm working with have lots of decimal values.

FPDF_EXPORT FPDF_BOOL FPDF_CALLCONV FPDF_PageToDevice(FPDF_PAGE page,
                                                      int start_x,
                                                      int start_y,
                                                      int size_x,
                                                      int size_y,
                                                      int rotate,
                                                      double page_x,
                                                      double page_y,
                                                      int* device_x,
                                                      int* device_y);

Am I understanding this incorrectly?

If so, the logic for the method in FPDF_PageToDevice turns out to not actually be that complicated so if that method doesn't work I'll likely just come back to this later and re-implement it in python using the helper methods you made for PDFMatrix.

Is it currently possible to easily call the raw methods on a page object, like CPDF_Page->GetDisplayMatrix?

Member

mara004 commented Jun 20, 2023

> based on the method signature it seems to suggest that the method I would be using FPDF_PageToDevice returns values as integers instead of float/double which would be a problem since all of the values I'm working with have lots of decimal values.

Ooh, yes. If you're not actually targeting a bitmap to draw on, that sounds like a problem.
I guess you can use a large bitmap and then downscale so you don't run into real precision trouble, but yes, that's inelegant. Need to think about this...
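
To get a feel for how much precision the "large bitmap, then downscale" workaround would keep, here is a rough simulation in plain Python. It does not call pdfium at all: it only models an integer device coordinate, assuming the value is rounded to the nearest pixel (an assumption about the behaviour, not something confirmed in this thread). The function name roundtrip_error is hypothetical.

```python
def roundtrip_error(x, page_width, device_size):
    """Error in PDF units after squeezing x through an integer pixel.

    Simulates page -> integer device coordinate -> page for a virtual
    bitmap that is device_size pixels wide. Rounding to the nearest
    pixel is an assumption made for this sketch.
    """
    device_x = round(x / page_width * device_size)  # integer, like an int* output
    recovered = device_x / device_size * page_width
    return abs(recovered - x)


# The worst-case error is half a pixel, i.e. 0.5 * page_width / device_size,
# so a larger virtual bitmap gives a proportionally smaller error.
for size in (612, 6120, 612000):
    print(size, roundtrip_error(123.456789, 612, size))
```

So the workaround bounds the error but never removes it, which is why a float variant of the function would be the cleaner solution.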

> Is it possible currently to easily call the raw methods on a page object like CPDF_Page->GetDisplayMatrix?

Sadly the CPDF_* API layer is pdfium's private C++ backend which we can't access with ABI bindings / ctypes.
This is an unfortunate but known limitation of our (one could say, quick and dirty) bindings concept :(

Author

samshelley commented Jun 20, 2023

OK thank you! This has been incredibly helpful -- really appreciate the pointers. Yes I think I'm doing something a bit different than others here (but it does work!)

GetDisplayMatrix is actually really simple as well so we've solved my issue for now -- https://pdfium.googlesource.com/pdfium.git/+/798e18f5e5cfb672c7f3186f6358b84c5ff7785b/core/fpdfapi/page/cpdf_page.cpp
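
As a rough illustration of what a float-preserving re-implementation in Python might look like, the sketch below maps a PDF-space point to normalized coordinates with a top left origin and takes page rotation into account. It is not a port of pdfium's code and not a pypdfium2 API: the function name page_to_normalized is hypothetical, and it assumes the page box starts at (0, 0), that page_width and page_height are the unrotated dimensions, and that rotation is a clockwise multiple of 90 degrees.

```python
def page_to_normalized(x, y, page_width, page_height, rotation=0):
    """Map a PDF-space point to (u, v) in [0, 1] with a top-left origin.

    page_width and page_height are the unrotated page dimensions, and
    rotation is the clockwise page rotation in degrees. Assumes the
    page box origin is (0, 0); other layouts are not handled here.
    """
    fx = x / page_width
    fy = y / page_height
    rotation %= 360
    if rotation == 0:
        return fx, 1 - fy      # only the y axis is flipped
    if rotation == 90:
        return fy, fx          # PDF x runs downward, PDF y runs rightward
    if rotation == 180:
        return 1 - fx, fy      # PDF x runs leftward, PDF y runs downward
    if rotation == 270:
        return 1 - fy, 1 - fx  # PDF x runs upward, PDF y runs leftward
    raise ValueError("rotation must be a multiple of 90 degrees")


# The PDF origin (bottom left of the unrotated page) ends up in a
# different corner of the displayed page for each rotation.
for rot in (0, 90, 180, 270):
    print(rot, page_to_normalized(0, 0, 612, 792, rot))
```

Multiplying u and v by 100 gives the percentages a web highlight layer would need. To convert a box, transform two opposite corners and take the minimum and maximum of each axis, since rotation can swap which corner ends up top left.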

Member

mara004 commented Jun 20, 2023

That's good to hear, thanks!

However, I'm still left to think what I should do with pypdfium2 now.
And I'm sort of wondering why you want to change coordinate representation if you don't actually work with device pixels?

Author

samshelley commented Jun 20, 2023

I am rendering a "highlight" layer in a web interface to highlight specific text in a displayed pdf. The rendering engine uses percentage values to determine where to place items so I need to use the right coordinate space.

I'm fairly new to all of this so honestly not sure if my suggestion is too narrow....but as far as what would be helpful to my use-case, if you had a python API implementation of FPDF_PageToDevice that maintained precision, I would 100% use that instead of what I'm likely to implement when I come back to this. But this also might be too narrow a use-case, so just a note somewhere in the docs that explains PDF coordinate space (and then a reference to it in the API docs for all of the methods that return coordinates) would have also been totally sufficient!

Member

mara004 commented Jun 20, 2023

I see, thank you for elaborating.

Maybe, as an alternative to a python re-implementation, we could ask pdfium to add a float equivalent of FPDF_PageToDevice()? We don't need any bitmap parameters, just two functions for (almost-)lossless back and forth translation between normalized and native PDF coordinates.

@samshelley
Author

That would work perfectly!

mara004 added the documentation, enhancement, pdfium, minor, and api labels on Jun 24, 2023
Member

mara004 commented Aug 8, 2023

Commit a379ecc (in the devel branch) adds a helper around FPDF_PageToDevice() / FPDF_DeviceToPage(), but only to translate between a page and a corresponding bitmap rendering.

The quest for float coordinate normalization still stands.

@samshelley
Author

Thanks for the update!

Member

mara004 commented Aug 11, 2023

Our docs often mention coordinate order, such as left, bottom, right, top for rectangle return.
That feels problematic. At least we should add something like "relative to the PDF coordinate system".
Or maybe we should avoid these terms entirely and use unspecific variable names instead, e.g. x0, y0, x1, y1?

mara004 removed the minor label on Aug 25, 2023
mara004 changed the title from "Add help regarding coordinate conversion" to "Need coordinate conversion help" on Dec 7, 2023
Member

mara004 commented Dec 7, 2023

I think I'll convert this to a discussion, because I don't think it is a good idea to implement coordinate conversion from scratch in pypdfium2 (nor would I have the time to do so). Especially given that FPDF_PageToDevice() / FPDF_DeviceToPage() already exist and cover the main use case.

However, to any users affected, feel free to file a feature request at pdfium for float coordinate normalization (or perhaps even contribute a patch yourself).

pypdfium2-team locked and limited conversation to collaborators on Dec 7, 2023
mara004 converted this issue into discussion #284 on Dec 7, 2023
