Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ViewPage: ignore AlternativeImage if not retrievable #37

Closed
bertsky opened this issue Jan 18, 2022 · 6 comments
Closed

ViewPage: ignore AlternativeImage if not retrievable #37

bertsky opened this issue Jan 18, 2022 · 6 comments

Comments

@bertsky
Copy link
Contributor

bertsky commented Jan 18, 2022

Currently, ocrd_browser.view.page tries to add an image version for each AlternativeImage referenced in the page. But that can lead to an uncaught FileNotFoundError during

Image.open(path).size,
if the file happens to be missing.

Furthermore, even when I catch this, if the missing path is the last image, it resurfaces as preferred version during

page_image, page_coords, page_image_info = ws.image_from_page(self.page, self.id, transparency=True, feature_selector=feature_selector, feature_filter=feature_filter)
, because the selection mechanism for image_from_page is not by file path but by features.

IMHO this should be more robust: An image file reference (irrespective if it is derived or original) that is nowhere to be found in the filesystem should simply be rendered with an all-white canvas of the same size.

@bertsky
Copy link
Contributor Author

bertsky commented Jan 18, 2022

To the latter: perhaps in core we should offer a filter based on the filename directly (instead of the image features). Something like Workspace.image_from_page(... filename=None ...) and similar for image_from_segment. Of course, both parameters might interact, so if both filename and feature_filter / feature_selector are given, then one would have to verify that the image with that filename also fulfills these features.

What do you think, @kba?

@bertsky
Copy link
Contributor Author

bertsky commented Jan 18, 2022

Oh, it's not as easy though: Even with a @filename filter, we cannot just use that to configure our ImageVersionSelector, because obviously the path name changes for each page. @hnesk IIUC your approach was to use the selector only to retrieve its image features, and then pass that on to get_image. But what if the image with the right features does not exist and cannot be downloaded? For core, we would have to try and download each AlternativeImage to rule that out (which seems like a waste of resources)...

@hnesk
Copy link
Owner

hnesk commented Jan 21, 2022

To the latter: perhaps in core we should offer a filter based on the filename directly (instead of the image features).

I build the ImageVersionSelector around the notion of features, because Workspace.image_from_page expects it that way, if there would a filename filter, I would happily use that.

Oh, it's not as easy though: Even with a @filename filter, we cannot just use that to configure our ImageVersionSelector, because obviously the path name changes for each page.

No problem here: The ImageVersionSelector gets configured for each page individually with the current AlternativeImages and their filenames.

I think your fix in #38 fixes the problem already. Do you have an example workspace?

@bertsky
Copy link
Contributor Author

bertsky commented Jan 21, 2022

if there would a filename filter, I would happily use that.

No problem here: The ImageVersionSelector gets configured for each page individually with the current AlternativeImages and their filenames.

Oh, in that case – we should go for it. ocrd_browser would not have to mess with the image feature mechanism. See https://github.com/bertsky/core/tree/workspace-altimg-retrieve-existing for the preliminaries in core. The next step would be to (wait for approval+merge in core) and then utilize it in get_image.

I think your fix in #38 fixes the problem already. Do you have an example workspace?

No, only the first half – see above. For an example, it suffices to take any valid PAGE-XML and add an AlternativeImage to it with no @comments and a non-existing @filename.

@hnesk
Copy link
Owner

hnesk commented Apr 28, 2022

Ok, I got it now! It really is necessary to distinguish the images by filename, so it would be great if your branch gets merged.

@bertsky
Copy link
Contributor Author

bertsky commented Apr 28, 2022

Ok, so let's wait for OCR-D/core#845

@hnesk hnesk closed this as completed in b19ae33 May 10, 2022
hnesk added a commit that referenced this issue May 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants