Added support for PDF attachments. #177

cleitner · 2014-03-30T14:30:10Z

This commit adds support for PDF (global) attachments, which can be used from the command line tool with the -a option, the HTML constructor and <link rel=attachment> elements.

The attachment's data is compressed and an MD5 checksum is included in the object stream. The implementation avoids seeking in the PDF stream and copies the data directly without reading the whole resource into memory.
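As an illustration of the streaming approach described above, here is a minimal sketch that compresses data chunk by chunk while computing the MD5 checksum of the original bytes. It is not the code from this patch: the function name `compress_stream` and the chunk size are assumptions made for the example, and it leaves out how the result is written into the PDF object.

```python
import hashlib
import io
import zlib


def compress_stream(source, sink, chunk_size=4096):
    """Copy ``source`` to ``sink`` deflate-compressed, hashing as we go.

    Hypothetical helper for illustration only. Returns the uncompressed
    size and the MD5 hex digest of the original (uncompressed) data.
    """
    md5 = hashlib.md5()
    compressor = zlib.compressobj()
    size = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        size += len(chunk)
        md5.update(chunk)
        # Only one chunk is held in memory at a time.
        sink.write(compressor.compress(chunk))
    sink.write(compressor.flush())
    return size, md5.hexdigest()


source = io.BytesIO(b'attachment data ' * 1000)
sink = io.BytesIO()
size, checksum = compress_stream(source, sink)
```

Both file-like objects are only read from or written to sequentially, so neither needs to support seeking.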

I have verified that the feature works with evince and Adobe Reader 9 on Linux.

Things that need testing include

python2 support
correct documentation (I'm not used to Sphinx)
more readers

TODOs:

I couldn't figure out a way to construct file names for data: URLs. Unfortunately data- attributes are reserved for document authors and <link> has no other suitable attribute. This is of course an obscure use case.
It may suffice to use the basename of the path component for hierarchical URLs like http: and file:. If the filename can be deduced from the URL, the attachment tuples could be simplified to (url, description). This should probably be changed.
There might be a better way to pass the url_fetcher to the PDF writer.
The checksum feature couldn't be tested because either the implementation is wrong and both readers ignore the entry or they just ignore invalid MD5 sums altogether (which renders this feature kind of useless).
It would be great to also support <a rel=attachment> and <area rel=attachment> elements. This would require some bookkeeping and special handling inside the post fixup of links, but shouldn't be too hard to do. I'll probably take a shot at it if this feature is accepted.
The command line option doesn't support a description. Probably not worth the effort.
The command line currently only works with file names, not URLs. This would be fixed with the second TODO.
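The basename idea from the TODOs above could look roughly like the following. This is a sketch under assumptions, not the patch's code: the helper name `filename_from_url` is hypothetical, and returning `None` for data: URLs simply mirrors the open question about them.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the last path segment of a hierarchical URL, or None.

    Hypothetical helper: data: URLs have no meaningful path component,
    so no filename is deduced for them.
    """
    parts = urlsplit(url)
    if parts.scheme == 'data':
        return None
    # Take the basename first, then decode percent-escapes, so that an
    # escaped slash (%2F) is not mistaken for a path separator.
    name = unquote(posixpath.basename(parts.path))
    return name or None
```

For example, `filename_from_url('file:///tmp/report.xls')` yields `'report.xls'`, while a URL whose path ends in a slash yields `None`.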

cleitner · 2014-04-02T20:36:02Z

Is there any interest in this feature?

SimonSapin · 2014-04-02T20:55:22Z

Hi Colin. Thanks a lot for contributing! The code looks good at a quick glance (though I’ll need more time to review it.) But first, about the feature itself.

Can you explain the use case a bit? Why do you want this, how is it useful? What kind of files do you expect to attach?

I’ve never heard of <link rel=attachment> before, and I couldn’t find a specification or standard for it. Did I miss it? Are there existing tools generating HTML like this, or is it HTML code you’re generating for WeasyPrint?

cleitner · 2014-04-02T21:27:17Z

Hi Simon, thanks for this great tool 👍 !

PDF files can embed arbitrary files which are accessible either through clickable annotations or a global file list (the paperclip). We use it to add data files to PDF reports at the office.

For example Adobe Reader allows you to attach files as annotations in the toolbox, Prince provides an --attach command line option (http://www.princexml.com/doc/9.0/command-line/) and pdfLaTeX has the \attachfile package. I'm sure most PDF creators support file attachments.

I chose the attachment relationship to mark links inside an HTML document, because it feels like a natural match for annotating file attachments. <link> for global, <a> and <area> for annotated attachments. Interestingly enough, the attachment relationship has been proposed to be added in some future update to the official list (http://microformats.org/wiki/existing-rel-values#HTML5_link_type_extensions), seemingly from WordPress or one of its plugins. However, WeasyPrint would probably be the first converter to support this relationship - I have no idea how other HTML to PDF converters support file attachment annotations, if they support them at all.

Adding support for attachment annotations seems easy enough and I'd update the patch if you're considering it for inclusion, the use case being files related to a section of a document.

If there are any problems with the patch, I'd be happy to change what's necessary.

SimonSapin · 2014-04-02T21:37:09Z

I think adding global attachments from "out of band" parameters (i.e. in the Python API or with command-line flags) is fine, but I’m less certain about the HTML links. Are the links something you need, or just something that seemed nice/easy to add?

As to "annotation attachments" do they appear in a specific position in the document? In that case anything but HTML links might be hard. <a href> already generates a clickable link to a URL. Would <a rel=attachment href> generate an attachment instead or in addition to that?

cleitner · 2014-04-02T21:47:20Z

The OOB attachments are the most important of course. I wouldn't mind leaving the <link> support out.

Annotation attachments should be rendered like any other link, but clicking one of these links would open the viewer's "Save as" dialog. It would require similar treatment to the internal links on the PDF level. They have the /Subtype /FileAttachment instead of /Link. I'd have to read the spec in detail however, especially the /Name parameter seems to require a little bit of extra treatment.

SimonSapin · 2014-04-02T21:49:16Z

The OOB attachments are the most important of course.

What’s OOB?

Annotation attachments should be rendered like any other link, but clicking one of these links would open the viewer's "Save as" dialog.

Ok, so instead of linking to the URL.

cleitner · 2014-04-02T21:52:23Z

What’s OOB?

out-of-band

cleitner · 2014-04-04T11:52:15Z

I fixed the important TODOs and a couple of bugs, enhanced the testcase, tested with Python 2 and 3 (MD5 sums of the generated PDFs match) and checked an advanced output with the following readers:

Adobe Reader
Adobe Acrobat X. Preflight reports no syntax errors
evince
Foxit reader
Firefox (shows no errors or warnings, but has no global attachment support)
Windows 8 reader (shows no errors or warnings, but has no global attachment support)
PDFAnnotate (shows no errors or warnings, but has no global attachment support)
Mac OS Preview (shows no errors or warnings, but has no global attachment support)

What I didn't test is how the filesystem encoding might influence the filenames (all test files have been generated on Linux), but as they all go through path2url and unquote, any problems in this area should be easily fixed.

I kept the <link> support, because it feels just right to do something like this and work as expected:

python3 -m weasyprint http://colin.de/test.html output.pdf

I'll add the annotation attachments in a different commit.

cleitner · 2014-04-04T17:57:37Z

I think the patches are ready for review 😓 .

SimonSapin · 2014-04-04T18:18:02Z

Looks like great work, thanks! I still need to take time for the review, sorry…

SimonSapin · 2014-04-07T00:57:32Z

weasyprint/__init__.py

@@ -70,11 +70,15 @@ class HTML(object):
        Defaults to ``'print'``. **Note:** In some cases like
        ``HTML(string=foo)`` relative URLs will be invalid if ``base_url``
        is not provided.
+    :param attachments: A list of tuples, where each element describes an
+        attachment to the document. The tuple contains a URL and a description,


Please mention PDF in this description.

SimonSapin · 2014-04-07T01:45:53Z

I kept the <link> support, because it feels just right to do something like this and work as expected

Correct me if I’m wrong, but this sounds like you’re not gonna use this bit of feature. I’m still uncomfortable with WeasyPrint supporting non-standard HTML that not only is not in the HTML spec, but is not described in any spec anywhere. So unless you actually want to use this feature, "it feels just right" is not good enough.

Please remove the rel=attachment support (leaving global attachment set from the Python API or the command line). We’ll reconsider if someone actually wants the HTML support.

SimonSapin · 2014-04-22T00:56:44Z

Alright, let’s do this. Starting a review of a9fd32c and earlier commits:

In the Python API (weasyprint/__init__.py), rather than a tuple, each attachment should be represented with a new Attachment class that is initialized with _select_source, similar to the HTML and CSS classes. This allows adding attachments e.g. from a byte string in memory. Values from the user that are not already Attachment instances get passed as a guess argument, like in stylesheets. (Admittedly, I’m not sure how to do this and preserve the fact that the content is streamed into the compressed PDF output, and keep making sure file-like objects are closed appropriately.)

This also does the right thing for command-line arguments: a string is interpreted as a URL if it looks like an absolute URL, a filename otherwise. With that, you can remove the URL manipulation code in __main__.py.
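To make the "URL if it looks like an absolute URL, filename otherwise" rule concrete, here is a small sketch. It is not WeasyPrint's _select_source: the helper name `guess_to_url` and the exact regular expression for "looks like an absolute URL" are assumptions made for illustration.

```python
import re
from pathlib import Path

# Assumption: a string "looks like an absolute URL" when it starts with a
# scheme of at least two characters followed by a colon. Requiring two
# characters keeps Windows drive letters such as C:\ out of this branch.
ABSOLUTE_URL = re.compile(r'^[a-zA-Z][a-zA-Z0-9.+-]+:')


def guess_to_url(guess):
    """Return ``guess`` unchanged if it is a URL, else a file: URL for it."""
    if ABSOLUTE_URL.match(guess):
        return guess
    # Anything else is treated as a filename relative to the current directory.
    return Path(guess).resolve().as_uri()
```

With a rule like this the command-line tool can hand every -a argument to the same code path as the Python API, which is why the URL handling in __main__.py becomes unnecessary.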

I have a slight preference for attachments to be an argument to the HTML.render and HTML.write* methods rather than HTML.__init__, but I could be convinced if there is a reason to not change this.

I’m still not convinced by unquote in compat.py. URL percent-escaping should really be a byte-only affair. Well, the whole handling of URL parsing and bytes vs. Unicode in URLs in WeasyPrint probably should be rewritten, but that’s out of scope for this PR. For now, please leave compat.py unchanged and do what you need to do in pdf.py.

Rather than having a rel attribute on boxes, have a boolean is_attachment and keep the parsing in html.py. About this parsing, the rel HTML attribute is a "set of space-separated tokens", but the HTML spec has a very precise idea of what is whitespace, and what case-insensitive means. Please use the element_has_link_type function that I just added. (I’ll need to rebase on top of master.)
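For readers unfamiliar with the HTML rules mentioned above, the following sketch shows what "space-separated tokens" and ASCII case-insensitivity mean in practice. It is not the element_has_link_type function referred to in the review: `rel_has_link_type` is a hypothetical stand-in that works on the attribute string rather than on an element.

```python
import re

# HTML defines whitespace as exactly these five characters: space, tab,
# line feed, form feed and carriage return. str.split() without arguments
# would also split on other Unicode whitespace such as U+00A0.
HTML_WHITESPACE_RE = re.compile('[ \t\n\f\r]+')


def ascii_lower(string):
    """Lowercase only A-Z, leaving non-ASCII characters untouched."""
    return string.translate(
        {code: code + 32 for code in range(ord('A'), ord('Z') + 1)})


def rel_has_link_type(rel, link_type):
    """Return whether ``link_type`` is one of the tokens in a rel value."""
    tokens = {ascii_lower(token)
              for token in HTML_WHITESPACE_RE.split(rel) if token}
    return ascii_lower(link_type) in tokens
```

A plain substring test such as `'attachment' in rel` would wrongly match `rel="attachments"`, which is one reason to split into tokens first.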

In pdf.py, perhaps you don’t need the hexlify-based conversion flag if you use .hexdigest() instead of .digest().
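A quick check of the equivalence behind this suggestion, using only the standard library (the sample bytes are arbitrary):

```python
import binascii
import hashlib

data = b'example attachment bytes'

# Two ways to get the hexadecimal form of an MD5 checksum.
via_hexlify = binascii.hexlify(hashlib.md5(data).digest()).decode('ascii')
via_hexdigest = hashlib.md5(data).hexdigest()
```

Both variables hold the same 32-character string, so the binascii import is only needed if the raw digest bytes are required elsewhere.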

Regarding the issue of multiple links with the same URL but different descriptions: I think that’s OK.

Regarding the rectangles for links and CSS transforms, please open a separate issue. I don’t know what /AP is, and I’m interested to know if we can do better than axis-aligned rectangles.

In tests, why os.fdopen rather than open?

cleitner · 2014-04-22T19:56:12Z

I'll have to take some time to understand the implications of converting the tuples to a guessed source.

I have a slight preference for attachments to be an argument to the HTML.render and HTML.write* methods rather than HTML.__init__, but I could be convinced if there is a reason to not change this.

Sounds good. I removed the attachments attribute and added an argument to write_pdf. DocumentMetadata will only contain attachments collected from the document itself.

URL percent-escaping should really be a byte-only affair

I moved the special handling to pdf.py. There might actually be a bug if the FILESYSTEM_ENCODING is not UTF-8, as noted in the TODO, but I'm too tired to resolve it right now.

Please use the element_has_link_type function that I just added. (I’ll need to rebase on top of master.)

My git-fu is not good enough to see this through. Is it OK to cherry-pick that commit into this branch and remerging it with the rest of this patch?

In tests, why os.fdopen rather than open?

tempfile.mkstemp returns an OS-level handle, presumably for security reasons. I could probably reuse the temp_directory context from test_api if you prefer that.

cleitner · 2014-04-22T19:58:38Z

I had to fix the filename logic for Python 2 in a8a951b. Should I move that logic into urlopen_contenttype instead (returning a 4 tuple)?

SimonSapin · 2014-04-23T13:48:36Z

I moved the special handling to pdf.py. There might actually be a bug if the FILESYSTEM_ENCODING is not UTF-8, as noted in the TODO, but I'm too tired to resolve it right now.

That’s ok. As said earlier, encoding of URLs and filenames in WeasyPrint overall is busted and needs to be rethought. This works for this PR.

My git-fu is not good enough to see this through. Is it OK to cherry-pick that commit into this branch and remerging it with the rest of this patch?

Yeah, that works. We’ll end up with a duplicate commit in the history, which is not ideal but meh.

tempfile.mkstemp returns an OS-level handle, presumably for security reasons. I could probably reuse the temp_directory context from test_api if you prefer that.

Yeah, actually regardless of handles vs. names, you should use temp_directory (maybe move it to testing_utils.py) to make sure the temporary files get cleaned up, even if a test fails.
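The two points discussed here (mkstemp returning a file descriptor, and cleanup that survives a failing test) can be sketched as follows. The `temp_directory` below is an assumed minimal implementation written for this example, not a copy of the helper in WeasyPrint's test suite.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def temp_directory():
    """Yield a fresh directory and remove it afterwards, even on errors."""
    directory = tempfile.mkdtemp()
    try:
        yield directory
    finally:
        shutil.rmtree(directory)


with temp_directory() as directory:
    # mkstemp returns an already-open OS-level file descriptor plus the
    # path, which is why os.fdopen (not open) wraps it in a file object.
    fd, path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, 'wb') as handle:
        handle.write(b'attachment')
    with open(path, 'rb') as handle:
        content = handle.read()
```

Because the removal sits in a `finally` clause, an assertion error raised inside the `with` block still leaves no temporary files behind.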

I had to fix the filename logic for Python 2 in a8a951b. Should I move that logic into urlopen_contenttype instead (returning a 4 tuple)?

Yeah, the idea of urlopen_contenttype doesn’t really work anymore if we keep adding stuff to it. Ideally:

Add urllib_get_content_type, urllib_get_charset, and urllib_get_filename functions to compat.py that take the return value of urlopen() as a parameter. urllib_get_filename can simply return None on Python 2.
Remove urlopen_contenttype, and have default_url_fetcher use urlopen and the above functions instead.
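On Python 3, the three helpers proposed above could be as small as the following sketch. The function names come from the comment; the bodies are an assumption based on the standard library's `email.message.Message` API, which is what `urlopen(...).info()` returns on Python 3. The Python 2 branch (where `urllib_get_filename` would return None) is left out.

```python
from urllib.request import urlopen


def urllib_get_content_type(response):
    """Return the MIME type of a urlopen() result, without parameters."""
    return response.info().get_content_type()


def urllib_get_charset(response):
    """Return the charset parameter of Content-Type, or None."""
    return response.info().get_param('charset')


def urllib_get_filename(response):
    """Return the filename from Content-Disposition or Content-Type, or None."""
    return response.info().get_filename()


# A data: URL keeps the example free of network access.
response = urlopen('data:text/csv;charset=utf-8,a%2Cb')
mime_type = urllib_get_content_type(response)
charset = urllib_get_charset(response)
filename = urllib_get_filename(response)
```

Keeping each helper a one-liner that takes the response object means default_url_fetcher can call urlopen once and pick whichever headers it needs.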

cleitner · 2014-04-25T22:40:26Z

I hope the changes to support the Attachment class go in the right direction.

Unfortunately I had to sacrifice the filename detection because _select_source doesn't return the URL fetch result. Fixing this is probably not a huge deal but I didn't want to change that method - it scared me 😇.

SimonSapin · 2014-04-27T17:17:22Z

Good job!

I pushed the merged commit now that I had it after resolving conflicts, but one remaining issue is that relative_tmp_dir is never used in your tests. Should it be removed, or the test changed to use it?

I also fixed some minor stylistic issues:

Format indentation, blank lines, and other whitespace per PEP 8
Use X not in Y rather than not X in Y
Use X is not Y rather than not X is Y
assert does not need parentheses. I sometimes use parentheses for multi-line expressions because I don’t like \ line continuations
Remove unused imports

Flake8 detects most of this automatically.

SimonSapin · 2014-04-28T11:11:10Z

relative_tmp_dir is never used in your tests

Fixed in 9b0488c.

cleitner · 2014-04-29T05:49:23Z

Fixed in 9b0488c.

Sorry, I totally missed that during the refactoring.

Thanks for merging this feature!

SimonSapin · 2014-04-29T08:56:52Z

Let me know if you want to have this in a PyPI release.

cleitner · 2014-04-29T21:01:52Z

That's a kind offer, but until a fix for #132 has been merged I still have to use a patched version of WeasyPrint anyway so no reason to hurry 😄 .

Added support for PDF attachments (v2)

e458380

Added support for PDF file annotations.

7ac01f0

Fixed an expression which led to a KeyError for internal links.

1273432

SimonSapin reviewed Apr 7, 2014

cleitner added 3 commits April 18, 2014 15:11

Refactored the url_fetcher argument for write_pdf to an attribute of the `Document` class

846a5be

Added optional filename key to the URL fetcher result

05ec8df

Change filename logic for PDF attachments

a9fd32c

This patch honors the filename key of a fetched resource, which can be set by the `Content-Disposition` or `Content-Type` headers and uses `mimetypes.guess_extension` for resources that lack any indication of a filename.
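The fallback described in this commit message can be sketched as below. Only the `filename` key and the use of `mimetypes.guess_extension` come from the commit message; the `mime_type` key, the helper name `attachment_filename` and the default stem are assumptions made for illustration.

```python
import mimetypes


def attachment_filename(result, default_stem='attachment'):
    """Pick a filename for a fetched resource described by a dict.

    Hypothetical helper: ``result`` stands in for the URL fetcher result,
    with an optional 'filename' key and an assumed 'mime_type' key.
    """
    filename = result.get('filename')
    if filename:
        return filename
    # No filename was indicated: derive an extension from the MIME type,
    # falling back to no extension when the type is unknown or missing.
    extension = mimetypes.guess_extension(result.get('mime_type', '')) or ''
    return default_stem + extension
```

For instance, a result of `{'mime_type': 'application/pdf'}` gives `'attachment.pdf'`, while an explicit `filename` always wins.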

cleitner added 5 commits April 22, 2014 19:29

Removed usage of unnecessary binascii module in favor of hexdigest

851167f

Small whitespace fix

486834a

Refactored attachments attribute from the HTML class to an argument for `write_pdf`

a084a5b

Fixed the default_url_fetcher for Python 2, which returns a message type with no `get_filename` method

a8a951b

Moved the UTF-8 decoding logic from compat.py to pdf.py, where it's actually necessary to special case the unquoted result

b7a5c46

cleitner and others added 9 commits April 23, 2014 16:24

Replaced urlopen_contenttype with urllib_get_content_type, `urllib_get_charset` and `urllib_get_filename`.

2be5945

Moved temp_directory to testing_utils to allow reuse in other test files.

e66bb00

Use temp_directory in favor of tempfile in the PDF embedded files testcase.

4f3e48d

Fix parsing of <link rel>

a3ef9cc

Use the new element_has_link_type instead of parsing the rel attribute manually.

8c06243

Renamed is_internal to link_type, which is less confusing

e244b81

Use element_has_link_type for parsing the rel attribute in the HTML metadata.

86e67e5

Added an Attachment class for attachments provided through the API instead of the URL/description tuples

da916a3

Updated the documentation on the attachment feature.

96dd798

SimonSapin merged commit 96dd798 into Kozea:master Apr 27, 2014

SimonSapin added a commit that referenced this pull request Apr 27, 2014

Merge branch 'pdf-attachments' from PR #177

830598c

liZe mentioned this pull request Sep 25, 2022

Keep relative links with rel="relative" attribute #1728

Closed
