Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

revert 4f4fd10 preserving fix for #260, fixing #319, #322 and #334 #342

Merged
merged 3 commits into from
Sep 30, 2020

Conversation

Connum
Copy link
Contributor

@Connum Connum commented Sep 22, 2020

This reverts the change from 4f4fd10 that caused at least issues #319, #322 and #334 (maybe even more...).
With a tiny change to the original code, it was confirmed in #260 that it still fixes that issue.

I also wrote a test for #319 locally, but wasn't sure whether we could use the sample PDF or not:

/**
 * Test that issue related pdf can now be parsed
 *
 * @see https://github.com/smalot/pdfparser/issues/319
 */
public function testIssue319()
{
    $filename = $this->rootDir.'/samples/bugs/Issue319.pdf';

    $document = $this->fixture->parseFile($filename);

    $this->assertStringContainsString('HS technology s.r.o., IČO: 52081061, Sídlo: Fraňa Mojtu 22, 94901 Nitra,\n
    Slovenská republika\n
    \n
    \n
    99,10', $document->getText());
}

@k00ni k00ni added the fix label Sep 23, 2020
@k00ni k00ni self-assigned this Sep 23, 2020
@k00ni k00ni linked an issue Sep 23, 2020 that may be closed by this pull request
Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution! Your patch seems reasonable.

I also wrote a test for #319 locally, but wasn't sure whether we could use the sample PDF or not:

  • What does that mean? Is it available now?

  • I saw that you asked around in the issues you referenced. Can you give me a short summary about what you found out? That would be helpful.

  • Please address my in-code comments.

EDIT: Solved in next review.

@k00ni
Copy link
Collaborator

k00ni commented Sep 23, 2020

@romanmatyus (#319), @Karmel0x (#322) and @CharmaineLohSiYing (#334): It would be great if you could give us a short feedback if this PR fixes your problems. Thank you in advance!

@Connum
Copy link
Contributor Author

Connum commented Sep 23, 2020

I also wrote a test for #319 locally, but wasn't sure whether we could use the sample PDF or not:

* [X]  What does that mean? Is it available now?

I can't include that test without a sample file to test against. It means I was able to verify that my changes fix the parsing of the file provided in #319, but we can't simply use that file without consent. I can try to reduce the file, but I fear changing and re-saving it could result in the original issue not occuring in the first place...

* [X]  I saw that you asked around in the issues you referenced. Can you give me a short summary about what you found out? That would be helpful.

I basically just confirmed that the issues were caused by the fix in 4f4fd10. In #260, the issue related to that fix, the author of the fix confirmed that with my changes their (not publically available) PDF still worked fine, affirming that we can have all three new issues fixed at the same time without reintroducing the issue raised in #260.

* [X]  Please address my in-code comments.

Done, I will address those in an additional commit.

@CharmaineLohSiYing
Copy link

@romanmatyus (#319), @Karmel0x (#322) and @CharmaineLohSiYing (#334): It would be great if you could give us a short feedback if this PR fixes your problems. Thank you in advance!

Hello! This PR fixes my problems. All the characters in my pdf file are now output properly just like in the demo. Thank you!

@Connum Connum requested a review from k00ni September 23, 2020 14:43
@Connum
Copy link
Contributor Author

Connum commented Sep 23, 2020

@k00ni: The sample PDF from #319 is signed, so it cannot be edited directly. Once I print it to a new PDF or import and export it with Illustrator, it does seem to introduce some new issues, but not necessarily the one we need to reproduce...

Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sample PDF from #319 is signed, so it cannot be edited directly. Once I print it to a new PDF or import and export it with Illustrator, it does seem to introduce some new issues, but not necessarily the one we need to reproduce...

We can't use it, because it contains a logo, but its optional for this PR so we can ignore it.

  • I saw that you asked around in the issues you referenced. Can you give me a short summary about what you found out? That would be helpful.

Solved 👍

@k00ni
Copy link
Collaborator

k00ni commented Sep 25, 2020

I will keep that open for a ~week to allow others to comment. If there are no objections, I will merge.

@MrMadClown
Copy link

Hi,
might be related to 4f4fd10.
The text retrieved from This PDF is pretty scrambled in the current version (0.16.2). I reverted to 0.14.0 (pre 4f4fd10) no more scrambling.
I'm still seeing way more spaces in the parsed Text than I can see in the PDF but those might actually be zero-width-spaces not sure.

@k00ni
Copy link
Collaborator

k00ni commented Sep 28, 2020

Can you please check if you have to same behavior with the version in this branch (https://github.com/Connum/pdfparser/tree/fix-260-319-322-334)?

@Connum
Copy link
Contributor Author

Connum commented Sep 29, 2020

I can confirm that the parsing of the PDF provided by @MrMadClown is also fixed with this PR! One example:

Before

Gestaltung und Sat��4GIGNKPFKU�9GUVRJCN�)TC��M�&GUKIn / Norbert Lauterbach

After

Gestaltung und Satz: Regelindis Westphal Grafik-Design / Norbert Lauterbach

All other instances of scrambled text are fixed as well, no more "strange" characters!

@k00ni
Copy link
Collaborator

k00ni commented Sep 30, 2020

Thank you @Connum and the others for your effort.

@k00ni k00ni merged commit 1862686 into smalot:master Sep 30, 2020
@Connum Connum deleted the fix-260-319-322-334 branch October 1, 2020 06:33
partulaj pushed a commit to partulaj/pdfparser that referenced this pull request Dec 21, 2020
…ot#322 and smalot#334 (smalot#342)

* Revert "Fix \f crush"

This reverts commit 4f4fd10.

* revert 4f4fd10 preserving fix for smalot#260, fixing smalot#319, smalot#322 and smalot#334

* reduced sample PDF and added more descriptive comment to the two new test cases
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants