Add font fallback + Support for font IDs containing hyphens (#614)
* Add font choice fallback + fonts w/ hyphen

If a text stream is "decoded" and contains UTF-8 control characters, it probably wasn't decoded using the proper font code page. Add a loop that cycles through all the available fonts to see if there's a better decode choice. Resolves Issue 586.

Also add the ability to parse font IDs containing hyphens (-). Resolves Issue 145.

* Update PDFObjectTest.php

Simplify these tests in case future edits change spacing rules.

* Refactor duplicate code into a function

* Use single quoted regexp

Let PCRE handle the conversion rather than PHP. Hopefully fixes PHPStan complaints about null byte.

* Add @param for $command and ?Page

* Proper indentation.

* fixing coding style issues in PDFObject.php

ref: https://cs.symfony.com/doc/rules/function_notation/nullable_type_declaration_for_default_null_value.html

* reverted coding style adaptions

* Remove test case PDF

Remove the Font ID with hyphen test case PDF as we could not contact the submitter to get permission to use it.
Change the unit test to directly test if a Font ID with a hyphen is correctly parsed.

* Add one extra test for font-fallback

Add one more test for font-fallback. This addition also resolves #495.
It catches situations where a null byte \x00 may not be found by preg_match in a Unicode context.
Null bytes in the text string usually mean that a CIDMap-encoded string has been passed through as UTF-8 bytes without being translated by any matching CIDMap pairs.
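
A minimal sketch of the detection step described above, assuming the intent is to flag any decoded string that still holds control characters or null bytes. `looksMisdecoded` is a hypothetical helper name, not part of PdfParser, and it swaps the commit's `bin2hex` search for a direct byte search with `strpos`. It relies on `preg_match` returning `false` rather than `1` under the `/u` modifier when the subject is not valid UTF-8, which is one way a null byte can go unnoticed.

```php
<?php

/**
 * Hypothetical helper (not part of the library): reports whether a decoded
 * string still looks like it went through the wrong font code page.
 */
function looksMisdecoded(string $text): bool
{
    // Single-quoted pattern: PCRE, not PHP, interprets the \x escapes.
    // With /u, preg_match() returns false when $text is not valid UTF-8,
    // so only an explicit 1 counts as "control character found".
    $hasControlChar = 1 === preg_match('/[\x00-\x1f\x7f]/u', $text);

    // Byte-level search that works even when $text is not valid UTF-8.
    // Assumption: this stands in for the commit's bin2hex() check.
    $hasNullByte = false !== strpos($text, "\x00");

    return $hasControlChar || $hasNullByte;
}

$clean = looksMisdecoded('сделал');        // false: valid UTF-8, no control characters
$withNull = looksMisdecoded("a\x00b");     // true: null byte in valid UTF-8
$invalidUtf8 = looksMisdecoded("\xff\x00"); // true: regex fails, byte search still finds \x00
```

Note that the character class also covers tab and line-break characters, so a string containing them would be treated as suspect under this check.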

---------

Co-authored-by: Konrad Abicht <hi@inspirito.de>
GreyWyvern and k00ni committed Jul 31, 2023
1 parent 5c82748 commit ce434c1
Showing 4 changed files with 80 additions and 7 deletions.
Binary file added samples/ImproperFontFallback.pdf
Binary file not shown.
2 changes: 1 addition & 1 deletion src/Smalot/PdfParser/Encoding/PDFDocEncoding.php
@@ -178,7 +178,7 @@ public static function getCodePage(): array
             "\xfc" => "\u{00fc}", // udieresis
             "\xfd" => "\u{00fd}", // yacute
             "\xfe" => "\u{00fe}", // thorn
-            "\xff" => "\u{00ff}", // ydieresis
+            "\xff" => "\u{00ff}", // ydieresis
         ];
     }

51 changes: 45 additions & 6 deletions src/Smalot/PdfParser/PDFObject.php
@@ -246,6 +246,39 @@ private function getDefaultFont(Page $page = null): Font
         return new Font($this->document, null, null, $this->config);
     }

+    /**
+     * @param array<int,array<string,string|bool>> $command
+     */
+    private function getTJUsingFontFallback(Font $font, array $command, Page $page = null): string
+    {
+        $orig_text = $font->decodeText($command);
+        $text = $orig_text;
+
+        // If we make this a Config option, we can add a check if it's
+        // enabled here.
+        if (null !== $page) {
+            $font_ids = array_keys($page->getFonts());
+
+            // If the decoded text contains UTF-8 control characters
+            // then the font page being used is probably the wrong one.
+            // Loop through the rest of the fonts to see if we can get
+            // a good decode.
+            while (preg_match('/[\x00-\x1f\x7f]/u', $text) || false !== strpos(bin2hex($text), '00')) {
+                // If we're out of font IDs, then give up and use the
+                // original string
+                if (0 == \count($font_ids)) {
+                    return $orig_text;
+                }
+
+                // Try the next font ID
+                $font = $page->getFont(array_shift($font_ids));
+                $text = $font->decodeText($command);
+            }
+        }
+
+        return $text;
+    }
+
     /**
      * @throws \Exception
      */
@@ -339,8 +372,11 @@ public function getText(Page $page = null): string
                         $command[self::COMMAND] = [$command];
                         // no break
                     case 'TJ':
-                        $sub_text = $current_font->decodeText($command[self::COMMAND]);
-                        $text .= $sub_text;
+                        $text .= $this->getTJUsingFontFallback(
+                            $current_font,
+                            $command[self::COMMAND],
+                            $page
+                        );
                         break;

                     // set leading
@@ -492,8 +528,11 @@ public function getTextArray(Page $page = null): array
                         $command[self::COMMAND] = [$command];
                         // no break
                     case 'TJ':
-                        $sub_text = $current_font->decodeText($command[self::COMMAND]);
-                        $text[] = $sub_text;
+                        $text[] = $this->getTJUsingFontFallback(
+                            $current_font,
+                            $command[self::COMMAND],
+                            $page
+                        );
                         break;

                     // set leading
@@ -592,7 +631,7 @@ public function getCommandsText(string $text_part, int &$offset = 0): array
                 case '/':
                     $type = $char;
                     if (preg_match(
-                        '/\G\/([A-Z0-9\._,\+]+\s+[0-9.\-]+)\s+([A-Z]+)\s*/si',
+                        '/\G\/([A-Z0-9\._,\+-]+\s+[0-9.\-]+)\s+([A-Z]+)\s*/si',
                         $text_part,
                         $matches,
                         0,
@@ -603,7 +642,7 @@ public function getCommandsText(string $text_part, int &$offset = 0): array
                         $command = $matches[1];
                         $offset += \strlen($matches[0]);
                     } elseif (preg_match(
-                        '/\G\/([A-Z0-9\._,\+]+)\s+([A-Z]+)\s*/si',
+                        '/\G\/([A-Z0-9\._,\+-]+)\s+([A-Z]+)\s*/si',
                         $text_part,
                         $matches,
                         0,
34 changes: 34 additions & 0 deletions tests/PHPUnit/Integration/PDFObjectTest.php
@@ -256,4 +256,38 @@ public function testReversedChars(): void

         $this->assertStringContainsString('שלומי טסט', $pages[0]->getText());
     }
+
+    /**
+     * Tests that a text stream with an improperly selected font code
+     * page falls back to one that maps all characters.
+     *
+     * @see: https://github.com/smalot/pdfparser/issues/586
+     */
+    public function testImproperFontFallback(): void
+    {
+        $filename = $this->rootDir.'/samples/ImproperFontFallback.pdf';
+
+        $parser = $this->getParserInstance();
+        $document = $parser->parseFile($filename);
+        $pages = $document->getPages();
+
+        $this->assertStringContainsString('сделал', $pages[0]->getText());
+    }
+
+    /**
+     * Tests that a font ID containing a hyphen / dash character was
+     * correctly parsed
+     *
+     * @see: https://github.com/smalot/pdfparser/issues/145
+     */
+    public function testFontIDWithHyphen(): void
+    {
+        $pdfObject = $this->getPdfObjectInstance(new Document());
+
+        $fontCommandHyphen = $pdfObject->getCommandsText('/FID-01 15.00 Tf');
+
+        $this->assertEquals('/', $fontCommandHyphen[0]['t']);
+        $this->assertEquals('Tf', $fontCommandHyphen[0]['o']);
+        $this->assertEquals('FID-01 15.00', $fontCommandHyphen[0]['c']);
+    }
 }
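
To see the effect of the character-class change in isolation, the following sketch runs the old and new patterns from the first `getCommandsText()` branch against the same input the unit test uses. `matchFontCommand` and its return shape are hypothetical and exist only for this illustration; the two pattern strings are copied from the diff.

```php
<?php

/**
 * Hypothetical helper: applies one of the patterns from the diff at a given
 * offset and returns the captured command and operator, or null on no match.
 */
function matchFontCommand(string $pattern, string $textPart, int $offset = 0): ?array
{
    if (1 !== preg_match($pattern, $textPart, $matches, 0, $offset)) {
        return null;
    }

    return ['command' => $matches[1], 'operator' => $matches[2]];
}

// Patterns copied from the diff: before and after adding "-" to the class.
$oldPattern = '/\G\/([A-Z0-9\._,\+]+\s+[0-9.\-]+)\s+([A-Z]+)\s*/si';
$newPattern = '/\G\/([A-Z0-9\._,\+-]+\s+[0-9.\-]+)\s+([A-Z]+)\s*/si';

$input = '/FID-01 15.00 Tf';

// null: the old class stops at the hyphen, so no whitespace follows "FID".
$before = matchFontCommand($oldPattern, $input);

// ['command' => 'FID-01 15.00', 'operator' => 'Tf']
$after = matchFontCommand($newPattern, $input);
```

Placing the hyphen last in the character class makes PCRE read it as a literal rather than as a range operator, which is why it needs no escaping there.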
