Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

getPages() should only ever return elements of type 'Page' #350

Merged
merged 3 commits into from
Oct 12, 2020

Conversation

Connum
Copy link
Contributor

@Connum Connum commented Oct 3, 2020

While investigeting #331, I found that Pages->getPages() will return anything inside the Kids of a Pages object, regardless of type. However, one would expect it to only return actual Page objects, from which text can be parsed and which count towards the page count in the Document's details.

In case another Pages object is encountered, its Pages are already being merged into the resulting array. Other types might need to be processed in a specific way to get Page objects out of them, which is probably the cause for #331. This does not fix the issue, but is a move in the right direction

@Reqrefusion
Copy link

I tried the change. It doesn't fix #331 but it doesn't create it in another problem.

@Connum
Copy link
Contributor Author

Connum commented Oct 4, 2020

By the way, this does fix a very real situation. When looping over the results of getPages() with the document provided in #331 and calling getDataTm()

$parser = new \Smalot\PdfParser\Parser();
$pdf    = $parser->parseFile($file);
$pages  = $pdf->getPages();

foreach ($pages as $index => $page) {
    var_dump($page->getDataTm());
}

That will throw

Fatal error: Uncaught Error: Call to undefined method Smalot\PdfParser\PDFObject::getDataTm() in *** Stack trace: #0 ***: parsePDF('test/Issue331.p...') #1 {main} thrown in *** on line 95
As end user, I shouldn't have to implement type checks here, but should reliably receive Page elements.

@Reqrefusion could you please provide a modified version of the pdf file with a all the probably copyrighted content removed (or replaced with some dummy text), keeping the three pages otherwise intact? That way, we could use that file to create a unit test!

@Reqrefusion
Copy link

@Connum No need to worry about copyright. Related pdf file is a part of Turkey Official Gazette. This means the following. Official Gazette exist for; laws, regulations, communiqués and similar publications public disclosure. Due to its content, the Official Gazette is not subject to copyright. There are many court orders and regulations about this.

The official newspaper is published daily to serve the public. I am also adding the current issue of the Official Gazette.
https://www.resmigazete.gov.tr/eskiler/2020/10/20201004.pdf

@Connum
Copy link
Contributor Author

Connum commented Oct 4, 2020

Added the sample and a test case. Regarding my comment concerning the page count, it might be a bit unconventional to assert something that we actually know to be incorrect. However, I spent several hours already trying to get to the bottom of this, and I think the cause is rooted quite deep in the parsing process... So I think we don't want to miss out in case this is "accidentally" fixed by some other code changes. In case that happens, the failing test will act as a reminder to resolve the issue, and it will prevent that a fix as a by-effect of some other changes will then be reverted without anyone noticing.

Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work on hardening the library. Will merge it next week if there are no objections.

@k00ni k00ni self-assigned this Oct 6, 2020
@k00ni k00ni merged commit c9c2bed into smalot:master Oct 12, 2020
partulaj pushed a commit to partulaj/pdfparser that referenced this pull request Dec 21, 2020
* getPages() should only ever return elements of type 'Pages' or 'Page'

* added @todo to getPages()

* added sample file and test case
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants