-
Notifications
You must be signed in to change notification settings - Fork 538
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
getPages() should only ever return elements of type 'Page' #350
Conversation
I tried the change. It doesn't fix #331 but it doesn't create it in another problem. |
By the way, this does fix a very real situation. When looping over the results of $parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile($file);
$pages = $pdf->getPages();
foreach ($pages as $index => $page) {
var_dump($page->getDataTm());
} That will throw
@Reqrefusion could you please provide a modified version of the pdf file with a all the probably copyrighted content removed (or replaced with some dummy text), keeping the three pages otherwise intact? That way, we could use that file to create a unit test! |
@Connum No need to worry about copyright. Related pdf file is a part of Turkey Official Gazette. This means the following. Official Gazette exist for; laws, regulations, communiqués and similar publications public disclosure. Due to its content, the Official Gazette is not subject to copyright. There are many court orders and regulations about this. The official newspaper is published daily to serve the public. I am also adding the current issue of the Official Gazette. |
Added the sample and a test case. Regarding my comment concerning the page count, it might be a bit unconventional to assert something that we actually know to be incorrect. However, I spent several hours already trying to get to the bottom of this, and I think the cause is rooted quite deep in the parsing process... So I think we don't want to miss out in case this is "accidentally" fixed by some other code changes. In case that happens, the failing test will act as a reminder to resolve the issue, and it will prevent that a fix as a by-effect of some other changes will then be reverted without anyone noticing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work on hardening the library. Will merge it next week if there are no objections.
* getPages() should only ever return elements of type 'Pages' or 'Page' * added @todo to getPages() * added sample file and test case
While investigeting #331, I found that
Pages->getPages()
will return anything inside theKids
of aPages
object, regardless of type. However, one would expect it to only return actualPage
objects, from which text can be parsed and which count towards the page count in theDocument
's details.In case another
Pages
object is encountered, itsPages
are already being merged into the resulting array. Other types might need to be processed in a specific way to getPage
objects out of them, which is probably the cause for #331. This does not fix the issue, but is a move in the right direction