getPages() should only ever return elements of type 'Page' #350

Connum · 2020-10-03T18:05:29Z

While investigeting #331, I found that Pages->getPages() will return anything inside the Kids of a Pages object, regardless of type. However, one would expect it to only return actual Page objects, from which text can be parsed and which count towards the page count in the Document's details.

In case another Pages object is encountered, its Pages are already being merged into the resulting array. Other types might need to be processed in a specific way to get Page objects out of them, which is probably the cause for #331. This does not fix the issue, but is a move in the right direction

Reqrefusion · 2020-10-03T22:29:38Z

I tried the change. It doesn't fix #331 but it doesn't create it in another problem.

Connum · 2020-10-04T02:39:15Z

By the way, this does fix a very real situation. When looping over the results of getPages() with the document provided in #331 and calling getDataTm()

$parser = new \Smalot\PdfParser\Parser();
$pdf    = $parser->parseFile($file);
$pages  = $pdf->getPages();

foreach ($pages as $index => $page) {
    var_dump($page->getDataTm());
}

That will throw

Fatal error: Uncaught Error: Call to undefined method Smalot\PdfParser\PDFObject::getDataTm() in *** Stack trace: #0 ***: parsePDF('test/Issue331.p...') #1 {main} thrown in *** on line 95
As end user, I shouldn't have to implement type checks here, but should reliably receive Page elements.

@Reqrefusion could you please provide a modified version of the pdf file with a all the probably copyrighted content removed (or replaced with some dummy text), keeping the three pages otherwise intact? That way, we could use that file to create a unit test!

Reqrefusion · 2020-10-04T03:18:02Z

@Connum No need to worry about copyright. Related pdf file is a part of Turkey Official Gazette. This means the following. Official Gazette exist for; laws, regulations, communiqués and similar publications public disclosure. Due to its content, the Official Gazette is not subject to copyright. There are many court orders and regulations about this.

The official newspaper is published daily to serve the public. I am also adding the current issue of the Official Gazette.
https://www.resmigazete.gov.tr/eskiler/2020/10/20201004.pdf

Connum · 2020-10-04T09:27:41Z

Added the sample and a test case. Regarding my comment concerning the page count, it might be a bit unconventional to assert something that we actually know to be incorrect. However, I spent several hours already trying to get to the bottom of this, and I think the cause is rooted quite deep in the parsing process... So I think we don't want to miss out in case this is "accidentally" fixed by some other code changes. In case that happens, the failing test will act as a reminder to resolve the issue, and it will prevent that a fix as a by-effect of some other changes will then be reverted without anyone noticing.

k00ni

Good work on hardening the library. Will merge it next week if there are no objections.

@todo

* getPages() should only ever return elements of type 'Pages' or 'Page' * added @todo to getPages() * added sample file and test case

Connum added 2 commits October 3, 2020 19:47

getPages() should only ever return elements of type 'Pages' or 'Page'

bf23863

added @todo to getPages()

a84e2c1

added sample file and test case

4ee2000

k00ni approved these changes Oct 6, 2020

View reviewed changes

k00ni self-assigned this Oct 6, 2020

k00ni added enhancement unit tests / CI labels Oct 6, 2020

k00ni merged commit c9c2bed into smalot:master Oct 12, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

getPages() should only ever return elements of type 'Page' #350

getPages() should only ever return elements of type 'Page' #350

Connum commented Oct 3, 2020

Reqrefusion commented Oct 3, 2020

Connum commented Oct 4, 2020

Reqrefusion commented Oct 4, 2020

Connum commented Oct 4, 2020

k00ni left a comment •

edited

Loading

getPages() should only ever return elements of type 'Page' #350

getPages() should only ever return elements of type 'Page' #350

Conversation

Connum commented Oct 3, 2020

Reqrefusion commented Oct 3, 2020

Connum commented Oct 4, 2020

Reqrefusion commented Oct 4, 2020

Connum commented Oct 4, 2020

k00ni left a comment • edited Loading

Choose a reason for hiding this comment

k00ni left a comment •

edited

Loading