Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for PDF attachments. #177

Merged
merged 28 commits into from
Apr 27, 2014
Merged

Conversation

cleitner
Copy link
Contributor

This commit adds support for PDF (global) attachments, which can be used from the command line tool with the -a option, the HTML constructor and <link rel=attachment> elements.

The attachment's data is compressed and a MD5 checksum included in the object stream. The implementation avoids seeking in the PDF stream and copies the data directly without reading the whole resource into memory.

I have tested the feature to work with evince and Adobe Reader 9 on Linux.

Things that need testing include

  • python2 support
  • correct documentation (I'm not used to Sphinx)
  • more readers

TODOs:

  1. I couldn't figure out a way to construct file names for data: URLs. Unfortunately data- attributes are reserved for document authors and <link> has no other suitable attribute. This is of course an obscure use case.
  2. It may suffice to use the basename of the path component for hierarchical URLs like http: and file:. If the filename can be deduced from the URL, the attachment tuples could be simplified to (url, description). This should probably be changed.
  3. There might be a better way to pass the url_fetcher to the PDF writer.
  4. The checksum feature couldn't be tested because either the implementation is wrong and both readers ignore the entry or they just ignore invalid MD5 sums altogether (which renders this feature kind of useless).
  5. It would be great to also support <a rel=attachment> and <area rel=attachment> elements. This would require some bookkeeping and special handling inside the post fixup of links, but shouldn't be too hard to do. I'll probably take a shot at it if this feature is accepted.
  6. The command line option doesn't support a description. Probably not worth the effort.
  7. The command line currently only works with file names, not URLs. This would be fixed with the second TODO.

@cleitner
Copy link
Contributor Author

cleitner commented Apr 2, 2014

Is there any interest in this feature?

@SimonSapin
Copy link
Member

Hi Colin. Thanks a lot for contributing! The code looks good at a quick glance (though I’ll need more time to review it.) But first, about the feature itself.

Can you explain a bit the use case? Why do you want this, how is it useful? What kind of files do you expect to attach?

I’ve never heard of <link rel=attachment> before, and I couldn’t find a specification or standard for it. Did I miss it? Are there existing tools generating HTML like this, or is it HTML code you’re generating for WeasyPrint?

@cleitner
Copy link
Contributor Author

cleitner commented Apr 2, 2014

Hi Simon, thanks for this great tool 👍 !

PDF files can embedded arbitrary files which are accessible either through clickable annotations or a global file list (the paperclip). We use it to add data files to PDF reports at the office.

For example Adobe Reader allows you to attach files as annotations in the toolbox, Prince provides an --attach command line option (http://www.princexml.com/doc/9.0/command-line/) and pdfLaTeX has the \attachfile package. I'm sure most PDF creators support file attachments.

I chose the attachment relationship to mark links inside a HTML document, because it feels like a natural match for annotating file attachments. <link> for global, <a> and <area> for annotated attachments. Interesting enough the attachment relationship has been proposed to be added in some future update to the official list (http://microformats.org/wiki/existing-rel-values#HTML5_link_type_extensions), seemingly from WordPress or one of its plugins. However WeasyPrint would probably be the first converter to support this relationship - I have no idea how other HTML to PDF converters support file attachment annotations, if they support them at all.

Adding support for attachment annotations seems easy enough and I'd update the patch if you're considering it for inclusion, the use case being files related to a section of a document.

If there are any problems with the patch, I'd be happy to change what's necessary.

@SimonSapin
Copy link
Member

I think adding global attachments from "out of band" parameters (ie. in the Python API or with comand-line flags) is fine, but I’m less certain about the HTML links. Are the links something you need, or just something that seemed nice/easy to add?

As to "annotation attachments" do they appear in a specific position in the document? In that case anything but HTML links might be hard. <a href> already generates a clickable link to an URL. Would <a rel=attachment href> generate an attachment instead or in addition to that?

@cleitner
Copy link
Contributor Author

cleitner commented Apr 2, 2014

The OOB attachments are the most important of course. I wouldn't mind leaving the <link> support out.

Annotation attachments should be rendered like any other link, but clicking one of these links would open the viewers "Save as" dialog. It would require similiar treatment as the internal links on the PDF level. They have the /Subtype /FileAttachment instead of /Link. I'd have to read the spec in detail however, especially the /Name parameter seems to require a little bit of extra treatment.

@SimonSapin
Copy link
Member

The OOB attachments are the most important of course.

What’s OOB?

Annotation attachments should be rendered like any other link, but clicking one of these links would open the viewers "Save as" dialog.

Ok, so instead of linking to the URL.

@cleitner
Copy link
Contributor Author

cleitner commented Apr 2, 2014

What’s OOB?

out-of-band

@cleitner
Copy link
Contributor Author

cleitner commented Apr 4, 2014

I fixed the important TODOs and a couple of bugs, enhanced the testcase, tested with Python 2 and 3 (MD5 sums of the generated PDFs match) and checked an advanced output with the following readers:

  • Adobe Reader
  • Adobe Acrobat X. Preflight reports no syntax errors
  • evince
  • Foxit reader
  • Firefox (shows no errors or warnings, but has no global attachment support)
  • Windows 8 reader (shows no errors or warnings, but has no global attachment support)
  • PDFAnnotate (shows no errors or warnings, but has no global attachment support)
  • Mac OS Preview (shows no errors or warnings, but has no global attachment support)

What I didn't test is how the filesystem encoding might influence the filenames (all test files have been generated on Linux), but as they all go through path2url and unquote, any problems in this area should be easily fixed.

I kept the <link> support, because it feels just right to do something like this and work as expected:

python3 -m weasyprint http://colin.de/test.html output.pdf

I'll add the annotation attachments in a different commit.

@cleitner
Copy link
Contributor Author

cleitner commented Apr 4, 2014

I think the patches are ready for review 😓 .

@SimonSapin
Copy link
Member

Looks like great work, thanks! I still need to take time for the review, sorry…

@@ -70,11 +70,15 @@ class HTML(object):
Defaults to ``'print'``. **Note:** In some cases like
``HTML(string=foo)`` relative URLs will be invalid if ``base_url``
is not provided.
:param attachments: A list of tuples, where each element describes an
attachment to the document. The tuple contains a URL and a description,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please mention PDF in this description.

@SimonSapin
Copy link
Member

I kept the <link> support, because it feels just right to do something like this and work as expected

Correct me if I’m wrong, but this sounds like you’re not gonna use this bit of feature. I’m still uncomfortable with WeasyPrint support non-standard HTML that not only is not in the HTML spec, but is not described in any spec anywhere. So unless you actually want to use this feature, "it feels just right" is not good enough.

Please remove the rel=attachment support (leaving global attachment set from the Python API or the command line). We’ll reconsider if someone actually wants the HTML support.

This patch honors the filename key of a fetched resource, which can be set by
the `Content-Disposition` or `Content-Type` headers and uses
`mimetypes.guess_extension` for resources that lack any indication of a
filename.
@SimonSapin
Copy link
Member

Alright, let’s do this. Starting a review of a9fd32c and earlier commits:

In the Python API (weasyprint/__init__.py), rather than a tuple, each attachment should be represented with a new Attachment class that is initialized with _select_source, similar to the HTML and CSS classes. This allows adding attachments e.g. form a byte-string in memory. Values from the user that are not already Attachment instances get passed as a guess argument, like in stylesheets. (Admittedly, I’m not sure how to do this and preserve the fact that the content is streamed into the compressed PDF output, and keep making sure file-like objects are closed appropriately.)

This also does the right thing for command-line arguments: a string is interpreted as an URL if it looks like an absolute URL, a filename otherwise. With that, you can remove the URL manipulation code in __main__.py.

I have a slight preference for attachments to be an argument to the HTML.render and HTML.write* methods rather than HTML.__init__, but I could be convinced if there is a reason to not change this.

I’m still not convinced by unquote in compat.py. URL percent-escaping should really be a byte-only affair. Well, the whole handling on URL parsing and bytes vs. Unicode in URLs in WeasyPrint probably should be rewritten, but that’s out of scope for this PR. For now, please leave compat.py unchanged and do what you need to do in pdf.py.

Rather than having a rel attribute on boxes, have a boolean is_attachment and keep the parsing in html.py. About this parsing, the rel HTML attribute is a "set of space-separated tokens", but the HTML spec has a very precised idea of what is whitespace, and what case-insensitive means. Please use the element_has_link_type function that I just added. (I’ll need to rebase on not of master.)

In pdf.py, perhaps you don’t need the hexlify-based conversion flag if you use .hexdigest() instead of .digest().

Regarding the issue of multiple links with the same URL but different descriptions: I think that’s OK.

Regarding the rectangles for links and CSS transforms, please open a separate issue. I don’t know what /AP is, and I’m interested to know if we can do better than axis-aligned rectangles.

In tests, why os.fdopen rather than open?

@cleitner
Copy link
Contributor Author

I'll have to take some time to understand the implications of converting the tuples to a guessed source.

I have a slight preference for attachments to be an argument to the HTML.render and HTML.write* methods rather than HTML.__init__, but I could be convinced if there is a reason to not change this.

Sounds good. I removed the attachments attribute and added a argument to write_pdf. DocumentMetadata will only contain attachments collected from the document itself.

URL percent-escaping should really be a byte-only affair

I moved the special handling to pdf.py. There might actually be a bug if the FILESYSTEM_ENCODING is not UTF-8, as noted in the TODO, but I'm too tired to resolve it right now.

Please use the element_has_link_type function that I just added. (I’ll need to rebase on not of master.)

My git-fu is not good enough to see this through. Is it OK to cherry-pick that commit into this branch and remerging it with the rest of this patch?

In tests, why os.fdopen rather than open?

tempfile.mkstemp returns an OS-level handle, presumly for security reasons. I could probably reuse the temp_directory context from test_api if you prefer that.

@cleitner
Copy link
Contributor Author

I had to fix the filename logic for Python 2 in a8a951b. Should I move that logic into urlopen_contenttype instead (returning a 4 tuple)?

@SimonSapin
Copy link
Member

I moved the special handling to pdf.py. There might actually be a bug if the FILESYSTEM_ENCODING is not UTF-8, as noted in the TODO, but I'm too tired to resolve it right now.

That’s ok. As said earlier, encoding of URLs and filenames in WeasyPrint overall is busted and need to be rethought. This works for this PR.

My git-fu is not good enough to see this through. Is it OK to cherry-pick that commit into this branch and remerging it with the rest of this patch?

Yeah, that works. We’ll end up with a duplicate commit in the history, which is not ideal but meh.

tempfile.mkstemp returns an OS-level handle, presumly for security reasons. I could probably reuse the temp_directory context from test_api if you prefer that.

Yeah, actually regardless of handles vs. names, you should use temp_directory (maybe move it to testing_utils.py) to make sure the temporary files get cleaned up, even if a test fails.

I had to fix the filename logic for Python 2 in a8a951b. Should I move that logic into urlopen_contenttype instead (returning a 4 tuple)?

Yeah, the idea of urlopen_contenttype doesn’t really work anymore if we keep adding stuff to it. Ideally:

  • Add urllib_get_content_type, urllib_get_charset, and urllib_get_filename functions to compat.py that take the return value of urlopen() as a parameter. urllib_get_filename can simply return None on Python 2.
  • Remove urlopen_contenttype, and have default_url_fetcher use urlopen and the above functions instead.

@cleitner
Copy link
Contributor Author

I hope the changes to support the Attachment class go in the right direction.

Unfortuneatly I had to sacrifice the filename detection because _select_source doesn't return the URL fetch result. Fixing this is probably not a huge deal but I didn't want to change that method - it scared me 😇.

@SimonSapin SimonSapin merged commit 96dd798 into Kozea:master Apr 27, 2014
SimonSapin added a commit that referenced this pull request Apr 27, 2014
@SimonSapin
Copy link
Member

Good job!

I pushed the merged commit now that I had it after resolving conflicts, but one remaining issue is that relative_tmp_dir is never used in your tests. Should it be removed, or the test changed to use it?

I also fixed some minor stylistic issues:

  • Format indentation, blank lines, and other whitespace per PEP 8
  • Use X not in Y rather than not X in Y
  • Use X is not Y rather than not X is Y
  • assert does not need parentheses. I sometimes use parentheses for multi-line expressions, I don’t like \
  • Remove unused imports

Flake 8 detects most of this automatically.

@SimonSapin
Copy link
Member

relative_tmp_dir is never used in your tests

Fixed in 9b0488c.

@cleitner
Copy link
Contributor Author

Fixed in 9b0488c.

Sorry, I totally missed that during the refactoring.

Thanks for merging this feature!

@SimonSapin
Copy link
Member

Let me know if you want to have this in a PyPI release.

@cleitner
Copy link
Contributor Author

That's a kind offer, but until a fix for #132 has been merged I still have to use a patched version of WeasyPrint anyway so no reason to hurry 😄 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants