Azure Image OCR #228

Merged
merged 49 commits into from
Nov 2, 2020
49 commits
cd804db
Initial work on OCR processing. Add a new function, hooked to the gen…
Oct 1, 2020
a73f017
Add in setting to turn OCR on/off. Add recan functionality to media m…
Oct 1, 2020
a53b66d
Add helper function to find image that matches both file size and dim…
Oct 1, 2020
5ffc932
Don't show OCR rescan options if that setting is turned off. Comment …
Oct 2, 2020
84aac91
Better error handling for failed requests. Add callback to OCR rescan…
Oct 2, 2020
93c8693
Make sure we return proper values from our generate functions, to avo…
Oct 2, 2020
cca2e0b
Move OCR scanning into the same function that does other image scanni…
Oct 2, 2020
10c31a8
Don't store full response unless we have actual text to save
Oct 2, 2020
ccf9760
Update docblocks
Oct 15, 2020
02425b5
Don't remove the OCR scan button if OCR is turned off. Turning OCR of…
Oct 15, 2020
e2bff1e
Revert the extra error handling added here, as that's now in a separa…
Oct 16, 2020
4f2b000
feat: auto insert description generated by ocr after image block
dinhtungdu Oct 17, 2020
3894f2f
fix: only enqueue editor script if ocr is enabled
dinhtungdu Oct 20, 2020
c3ac120
Merge branch 'develop' into feature/azure-image-ocr
Oct 20, 2020
fe46509
Update OCR endpoint to the newly release v3.1
Oct 20, 2020
b6df398
feat: OCR modal and OCR sidebar button
dinhtungdu Oct 23, 2020
1d7aa99
Fix tests
Oct 23, 2020
dc2ef3a
If we modify the content using DOMDocument, make sure we account for …
Oct 23, 2020
ab9705b
Switch to a different approach to handle encoding issues, to match wh…
Oct 23, 2020
b7d4e34
fix: typo
dinhtungdu Oct 24, 2020
ceb8c7b
fix: only show modal if image is scanned by OCR
dinhtungdu Oct 24, 2020
9094d20
fix: only show sidebar button if image has ocr text
dinhtungdu Oct 24, 2020
4682ddd
try: replace sidebar button by toolbar button
dinhtungdu Oct 24, 2020
98f17fb
fix: update toolbar icon
dinhtungdu Oct 27, 2020
81f7874
Merge branch 'develop' into feature/azure-image-ocr
jeffpaul Oct 27, 2020
9915bed
fix: ocr status in media api response
dinhtungdu Oct 28, 2020
88fac65
Merge branch 'feature/azure-image-ocr' of github.com:10up/classifai i…
dinhtungdu Oct 28, 2020
0f536b2
Merge branch 'develop' into feature/azure-image-ocr
helen Oct 29, 2020
3b13b2b
Make sure the OCR manual scan button is added independently of the sm…
Oct 29, 2020
7e5176a
Make sure we don't show the scan buttons on the single edit view, onl…
Oct 29, 2020
1a85611
Insert a verse block instead of a paragraph.
helen Oct 29, 2020
0e50784
Go back to paragraph block for now.
helen Oct 29, 2020
dd0e8bb
fix: use group block for scanned text block
dinhtungdu Oct 29, 2020
b44ca65
ci: use composer v1
dinhtungdu Oct 29, 2020
c07a1ed
fix: only allow one ocr block per image
dinhtungdu Oct 29, 2020
d69dd98
Create block style for group for editor styling purposes
helen Oct 29, 2020
be0ab73
Commas, sigh.
helen Oct 29, 2020
7ee5b72
Better CSS for highlight border
helen Oct 30, 2020
94abde8
Highlight related image/OCR block when editor is focused on the other.
helen Oct 30, 2020
9afc529
Add the classnames utility in order to merge existing classnames with…
Oct 30, 2020
c2778dc
Remove the ocr-related-block class when elements aren't selected anymore
Oct 30, 2020
f8b5eb2
Add `classifai_ocr_text_post_args` filter
helen Oct 30, 2020
be312f7
Make sure we don't constantly set and remove the class if an image bl…
Oct 30, 2020
ccfa4ea
Merge branch 'feature/azure-image-ocr' of github.com:10up/classifai i…
Oct 30, 2020
2ac56fe
Filter post content before it's saved to remove our OCR class that is…
Oct 30, 2020
db0a272
fix: switch to use internal style
dinhtungdu Nov 2, 2020
df91eb2
fix: more sensible timeout
dinhtungdu Nov 2, 2020
af6b4b0
fix: deal with different backgrounds
dinhtungdu Nov 2, 2020
e477cd2
Merge pull request #257 from 10up/try/ocr-internal-style
dinhtungdu Nov 2, 2020
6 changes: 6 additions & 0 deletions .github/workflows/lint.yml
@@ -31,6 +31,12 @@ jobs:
steps:
- name: Checkout
uses: actions/checkout@v2
- name: Set PHP version
uses: shivammathur/setup-php@v2
with:
php-version: '7.2'
coverage: none
tools: composer:v1
- name: composer install
run: composer install
- name: PHPCS check
47 changes: 47 additions & 0 deletions includes/Classifai/Helpers.php
@@ -436,3 +436,50 @@ function get_largest_acceptable_image_url( $full_image, $full_url, $sizes, $max

return null;
}

/**
* Retrieves the URL of the largest image that matches filesize and dimensions.
*
* This checks that the filesize of an image meets our requirements and,
* if so, that its dimensions meet our requirements as well. If either
* check fails, it moves on to the next largest image size.
*
* @param string $full_image The path to the full-sized image source file.
* @param string $full_url The URL of the full-sized image.
* @param array $metadata Attachment metadata, including intermediate sizes.
* @param array $width Array of minimum and maximum width values. Default 0, 4200.
* @param array $height Array of minimum and maximum height values. Default 0, 4200.
* @param int $max_size The maximum acceptable filesize. Default 1MB.
* @return string|null The image URL, or null if no acceptable image found.
*/
function get_largest_size_and_dimensions_image_url( $full_image, $full_url, $metadata, $width = [ 0, 4200 ], $height = [ 0, 4200 ], $max_size = MB_IN_BYTES ) {
// Check if the full size image meets our filesize and dimension requirements
$file_size = @filesize( $full_image ); // phpcs:ignore WordPress.PHP.NoSilencedErrors.Discouraged
if (
( $file_size && $max_size >= $file_size )
&& ( $metadata['width'] >= $width[0] && $metadata['width'] <= $width[1] )
&& ( $metadata['height'] >= $height[0] && $metadata['height'] <= $height[1] )
) {
return $full_url;
}

// If the full size doesn't match, run the same checks on our resized images
if ( isset( $metadata['sizes'] ) && is_array( $metadata['sizes'] ) ) {
usort( $metadata['sizes'], __NAMESPACE__ . '\sort_images_by_size_cb' );

foreach ( $metadata['sizes'] as $size ) {
$sized_file = str_replace( basename( $full_image ), $size['file'], $full_image );
$file_size = @filesize( $sized_file ); // phpcs:ignore WordPress.PHP.NoSilencedErrors.Discouraged

if (
( $file_size && $max_size >= $file_size )
&& ( $size['width'] >= $width[0] && $size['width'] <= $width[1] )
&& ( $size['height'] >= $height[0] && $size['height'] <= $height[1] )
) {
return str_replace( basename( $full_url ), $size['file'], $full_url );
}
}
}

return null;
}
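
The selection rule described in the docblock above can be tried in isolation. The sketch below is not the plugin's code: `pick_first_acceptable()` and the candidate arrays are hypothetical, and file sizes are passed in as plain numbers instead of being read from disk with `filesize()`. It assumes the candidates are already sorted largest first.

```php
<?php
/**
 * Hypothetical stand-alone illustration of the filesize-and-dimension check.
 * Returns the URL of the first candidate that passes all three tests.
 */
function pick_first_acceptable( array $candidates, array $width, array $height, int $max_size ): ?string {
	foreach ( $candidates as $candidate ) {
		$size_ok   = $candidate['bytes'] > 0 && $candidate['bytes'] <= $max_size;
		$width_ok  = $candidate['width'] >= $width[0] && $candidate['width'] <= $width[1];
		$height_ok = $candidate['height'] >= $height[0] && $candidate['height'] <= $height[1];

		if ( $size_ok && $width_ok && $height_ok ) {
			return $candidate['url'];
		}
	}

	return null;
}

// Made-up data: the full-size image is too large on disk, the resized one fits.
$candidates = [
	[ 'url' => 'https://example.test/photo.jpg', 'bytes' => 6000000, 'width' => 5000, 'height' => 3000 ],
	[ 'url' => 'https://example.test/photo-1024x614.jpg', 'bytes' => 300000, 'width' => 1024, 'height' => 614 ],
];

echo pick_first_acceptable( $candidates, [ 50, 4200 ], [ 50, 4200 ], 4 * 1024 * 1024 ), "\n";
```

With these made-up values the first candidate fails on both filesize and width, so the resized URL is printed.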
222 changes: 209 additions & 13 deletions includes/Classifai/Providers/Azure/ComputerVision.php
@@ -6,6 +6,7 @@
namespace Classifai\Providers\Azure;

use Classifai\Providers\Provider;
use DOMDocument;
use WP_Error;

use function Classifai\computer_vision_max_filesize;
@@ -66,6 +67,106 @@ public function register() {
add_filter( 'wp_generate_attachment_metadata', [ $this, 'smart_crop_image' ], 8, 2 );
add_filter( 'wp_generate_attachment_metadata', [ $this, 'generate_image_alt_tags' ], 10, 2 );
add_filter( 'posts_clauses', [ $this, 'filter_attachment_query_keywords' ], 10, 1 );

$settings = $this->get_settings();
$enable_ocr = isset( $settings['enable_ocr'] ) && '1' === $settings['enable_ocr'];

if ( $enable_ocr ) {
add_action( 'enqueue_block_editor_assets', [ $this, 'enqueue_editor_assets' ] );
add_filter( 'the_content', [ $this, 'add_ocr_aria_describedby' ] );
add_filter( 'rest_api_init', [ $this, 'add_ocr_data_to_api_response' ] );
}
}

/**
* Expose whether an attachment has stored OCR text (the classifai_computer_vision_ocr meta) in the API response as classifai_has_ocr.
*/
public function add_ocr_data_to_api_response() {
register_rest_field(
'attachment',
'classifai_has_ocr',
[
'get_callback' => function( $params ) {
return ! empty( get_post_meta( $params['id'], 'classifai_computer_vision_ocr', true ) );
},
'schema' => [
'type' => 'boolean',
'context' => [ 'view' ],
],
]
);
}

/**
* Enqueue the editor scripts.
*/
public function enqueue_editor_assets() {
wp_enqueue_script(
'editor-ocr',
CLASSIFAI_PLUGIN_URL . 'dist/js/editor-ocr.min.js',
[
'wp-block-editor',
'wp-blocks',
],
CLASSIFAI_PLUGIN_VERSION,
true
);
}

/**
* Filter the post content to inject aria-describedby attribute.
*
* @param string $content Post content.
*
* @return string
*/
public function add_ocr_aria_describedby( $content ) {
$modified = false;

if ( ! is_singular() || empty( $content ) ) {
return $content;
}

$dom = new DOMDocument();

// Suppress warnings generated by loadHTML.
$errors = libxml_use_internal_errors( true );
$dom->loadHTML(
sprintf(
'<!DOCTYPE html><html><head><meta charset="%s"></head><body>%s</body></html>',
esc_attr( get_bloginfo( 'charset' ) ),
$content
)
);
libxml_use_internal_errors( $errors );

foreach ( $dom->getElementsByTagName( 'img' ) as $image ) {
foreach ( $image->attributes as $attribute ) {
if ( 'aria-describedby' === $attribute->name ) {
break;
}

if ( 'class' !== $attribute->name ) {
continue;
}

$image_id = preg_match( '~wp-image-\K\d+~', $image->getAttribute( 'class' ), $out ) ? $out[0] : 0;
$ocr_scanned_text_id = "classifai-ocr-$image_id";
$ocr_scanned_text = $dom->getElementById( $ocr_scanned_text_id );

if ( ! empty( $ocr_scanned_text ) ) {
$image->setAttribute( 'aria-describedby', $ocr_scanned_text_id );
$modified = true;
}
}
}

if ( $modified ) {
$body = $dom->getElementsByTagName( 'body' )->item( 0 );
return trim( $dom->saveHTML( $body ) );
}

return $content;
}
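
The image ID lookup in `add_ocr_aria_describedby()` relies on the `\K` escape in the pattern. As a minimal sketch of just that step (the helper name `extract_wp_image_id()` is hypothetical and does not exist in the plugin), the extraction can be run on its own:

```php
<?php
/**
 * Hypothetical helper: pull the attachment ID out of an <img> class attribute.
 * \K discards everything matched so far, so $match[0] holds only the digits.
 */
function extract_wp_image_id( string $class_attribute ): int {
	return preg_match( '~wp-image-\K\d+~', $class_attribute, $match ) ? (int) $match[0] : 0;
}

echo extract_wp_image_id( 'alignnone size-full wp-image-123' ), "\n"; // 123
echo extract_wp_image_id( 'alignnone size-full' ), "\n";              // 0
```

For an ID of 123, the filter above would then look for an element with the id `classifai-ocr-123` and, if it exists, point the image's `aria-describedby` at it.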

/**
@@ -91,6 +192,7 @@ public function attachment_data_meta_box( $post ) {
$settings = get_option( 'classifai_computer_vision' );
$captions = get_post_meta( $post->ID, '_wp_attachment_image_alt', true ) ? __( 'Rescan Alt Text', 'classifai' ) : __( 'Scan Alt Text', 'classifai' );
$tags = ! empty( wp_get_object_terms( $post->ID, 'classifai-image-tags' ) ) ? __( 'Rescan Tags', 'classifai' ) : __( 'Generate Tags', 'classifai' );
$ocr = get_post_meta( $post->ID, 'classifai_computer_vision_ocr', true ) ? __( 'Rescan Text', 'classifai' ) : __( 'Scan Text', 'classifai' );
$smart_crop = get_transient( 'classifai_azure_computer_vision_smart_cropping_latest_response' ) ? __( 'Regenerate Smart Thumbnail', 'classifai' ) : __( 'Generate Smart Thumbnail', 'classifai' );
?>
<div class="misc-publishing-actions">
@@ -106,14 +208,20 @@ public function attachment_data_meta_box( $post ) {
<?php echo esc_html( $tags ); ?>
</label>
</div>
- <?php if ( $settings && isset( $settings['enable_smart_cropping'] ) && '1' === $settings['enable_smart_cropping'] ) : ?>
- <div class="misc-pub-section">
- <label for="rescan-smart-crop">
- <input type="checkbox" value="yes" id="rescan-smart-crop" name="rescan-smart-crop"/>
- <?php echo esc_html( $smart_crop ); ?>
- </label>
- </div>
- <?php endif; ?>
+ <div class="misc-pub-section">
+ <label for="rescan-ocr">
+ <input type="checkbox" value="yes" id="rescan-ocr" name="rescan-ocr"/>
+ <?php echo esc_html( $ocr ); ?>
+ </label>
+ </div>
+ <?php if ( $settings && isset( $settings['enable_smart_cropping'] ) && '1' === $settings['enable_smart_cropping'] ) : ?>
+ <div class="misc-pub-section">
+ <label for="rescan-smart-crop">
+ <input type="checkbox" value="yes" id="rescan-smart-crop" name="rescan-smart-crop"/>
+ <?php echo esc_html( $smart_crop ); ?>
+ </label>
+ </div>
+ <?php endif; ?>
</div>
<?php
}
@@ -128,7 +236,8 @@ public function maybe_rescan_image( $attachment_id ) {
$image_url = get_largest_acceptable_image_url(
get_attached_file( $attachment_id ),
wp_get_attachment_url( $attachment_id ),
- $metadata['sizes']
+ $metadata['sizes'],
+ computer_vision_max_filesize()
);

if ( filter_input( INPUT_POST, 'rescan-captions' ) ) {
@@ -158,6 +267,11 @@ public function maybe_rescan_image( $attachment_id ) {
}
}
}

// Are we updating the OCR text?
if ( filter_input( INPUT_POST, 'rescan-ocr' ) ) {
$this->ocr_processing( wp_get_attachment_metadata( $attachment_id ), $attachment_id, true );
}
}

/**
@@ -220,7 +334,8 @@ public function smart_crop_image( $metadata, $attachment_id ) {
*/
public function generate_image_alt_tags( $metadata, $attachment_id ) {

- $settings = $this->get_settings();
+ $image_scan = false;
+ $settings = $this->get_settings();
if (
'no' !== $settings['enable_image_tagging'] ||
'no' !== $settings['enable_image_captions']
@@ -259,6 +374,57 @@ public function generate_image_alt_tags( $metadata, $attachment_id ) {
}
}

// OCR processing
$this->ocr_processing( $metadata, $attachment_id, false, is_wp_error( $image_scan ) ? false : $image_scan );

return $metadata;
}

/**
* Runs text recognition on the attachment.
*
* @since 1.6.0
*
* @filter wp_generate_attachment_metadata
*
* @param array $metadata Attachment metadata.
* @param int $attachment_id Attachment ID.
* @param boolean $force Whether to force processing or not. Default false.
* @param bool|object $scan Previously run image scan. Default false.
* @return array Filtered attachment metadata.
*/
public function ocr_processing( array $metadata = [], int $attachment_id = 0, bool $force = false, $scan = false ) {
$settings = $this->get_settings();

if ( ! is_array( $metadata ) || ! is_array( $settings ) ) {
return $metadata;
}

$should_ocr_scan = isset( $settings['enable_ocr'] ) && '1' === $settings['enable_ocr'];

/**
* Filters whether to run OCR scanning on the current image.
*
* @since 1.6.0
* @hook classifai_should_ocr_scan_image
*
* @param bool $should_ocr_scan Whether to run OCR scanning. The default value is set in ComputerVision settings.
* @param array $metadata Image metadata.
* @param int $attachment_id The attachment ID.
*
* @return bool Whether to run OCR scanning.
*/
if ( ! $force && ! apply_filters( 'classifai_should_ocr_scan_image', $should_ocr_scan, $metadata, $attachment_id ) ) {
return $metadata;
}

$ocr = new OCR( $settings, $scan, $force );
$response = $ocr->generate_ocr_data( $metadata, $attachment_id );

if ( $force ) {
return $response;
}

return $metadata;
}
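
The `classifai_should_ocr_scan_image` filter documented above lets a site opt individual images out of automatic OCR. The following is a minimal sketch of a consumer, not part of this PR: the callback name `myplugin_skip_ocr_for_small_images()` and the 200px threshold are assumptions chosen for illustration.

```php
<?php
/**
 * Hypothetical callback for the `classifai_should_ocr_scan_image` filter.
 * Skips OCR for images narrower than 200px (an arbitrary example threshold).
 */
function myplugin_skip_ocr_for_small_images( $should_ocr_scan, $metadata, $attachment_id ) {
	// Respect an upstream "off" decision (setting disabled or another filter).
	if ( ! $should_ocr_scan ) {
		return false;
	}

	$width = isset( $metadata['width'] ) ? (int) $metadata['width'] : 0;

	return $width >= 200;
}

// Register only when WordPress is loaded, so the file also runs stand-alone.
if ( function_exists( 'add_filter' ) ) {
	add_filter( 'classifai_should_ocr_scan_image', 'myplugin_skip_ocr_for_small_images', 10, 3 );
}
```

Because `ocr_processing()` tests `! $force` before calling `apply_filters()`, a manual rescan (where `$force` is true) never consults this filter.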

@@ -547,6 +713,23 @@ public function setup_fields_sections() {
),
]
);

add_settings_field(
'enable-ocr',
esc_html__( 'Enable OCR', 'classifai' ),
[ $this, 'render_input' ],
$this->get_option_name(),
$this->get_option_name(),
[
'label_for' => 'enable_ocr',
'input_type' => 'checkbox',
'default_value' => false,
'description' => __(
'Detect text in an image and store that as post content',
'classifai'
),
]
);
}

/**
@@ -580,6 +763,7 @@ public function sanitize_settings( $settings ) {
'enable_image_captions',
'enable_image_tagging',
'enable_smart_cropping',
'enable_ocr',
];

foreach ( $checkbox_settings as $checkbox_setting ) {
@@ -680,6 +864,7 @@ public function get_provider_debug_information( $settings = null ) {
__( 'Caption threshold', 'classifai' ) => $settings['caption_threshold'] ?? null,
__( 'Latest response - Image Scan', 'classifai' ) => $this->get_formatted_latest_response( get_transient( 'classifai_azure_computer_vision_image_scan_latest_response' ) ),
__( 'Latest response - Smart Cropping', 'classifai' ) => $this->get_formatted_latest_response( get_transient( 'classifai_azure_computer_vision_smart_cropping_latest_response' ) ),
__( 'Latest response - OCR', 'classifai' ) => $this->get_formatted_latest_response( get_transient( 'classifai_azure_computer_vision_ocr_latest_response' ) ),
];
}

@@ -741,12 +926,23 @@ public function filter_attachment_query_keywords( $clauses ) {
* @return array|string|WP_Error
*/
public function rest_endpoint_callback( $post_id, $route_to_call ) {
- $metadata = wp_get_attachment_metadata( $post_id );
- $image_url = get_largest_acceptable_image_url(
+ $metadata = wp_get_attachment_metadata( $post_id );

+ if ( 'ocr' === $route_to_call ) {
+ return $this->ocr_processing( $metadata, $post_id, true );
+ }

+ $image_url = get_largest_acceptable_image_url(
get_attached_file( $post_id ),
wp_get_attachment_url( $post_id ),
- $metadata['sizes']
+ $metadata['sizes'],
+ computer_vision_max_filesize()
);

+ if ( empty( $image_url ) ) {
+ return new WP_Error( 'error', esc_html__( 'Valid image size not found. Make sure the image is less than 4MB.' ) );
+ }

$image_scan_results = $this->scan_image( $image_url, [ $route_to_call ] );

if ( is_wp_error( $image_scan_results ) ) {