Epub exporter does not include embedded attachments; proposal for an output agnostic mechanism #473

Open
mpacer opened this issue Nov 18, 2016 · 5 comments

@mpacer
Member

mpacer commented Nov 18, 2016

Images are being embedded in attachments as base64-encoded strings.

Right now the epub exporter seems to hand pandoc an attachment-style link, which pandoc then tries to resolve as some kind of file system path, e.g.:

[NbConvertApp] Converting notebook hide_cells_based_on_tags.ipynb to epub
[NbConvertApp] Writing 9663 bytes to notebook.md
[NbConvertApp] Building Epub
pandoc: Could not find media 'attachment://ScreenShot2016-10-12at19.20.34.png', skipping…
[NbConvertApp] Epub successfully created
[NbConvertApp] Writing 7101 bytes to hide_cells_based_on_tags.epub

(NB: in that ↑ I changed the ` to a ' and the ... to a … for better highlighting)

This makes me think that something similar might be happening (or not happening) elsewhere, specifically in #328. Some of the discussion there partly inspired the proposal below.

I think there may be an output-agnostic way to approach this, as a three-step process: gather, tap, and (optionally) clean. First, we gather and organise all of the relevant resources into a single location with a known relative directory structure. Second, we use format-specific mechanisms to include these images. Third, we optionally clean up everything to return it to the state it was in (if we want the result to be a single file, per #328 (comment)).

To encapsulate these steps, creating a new directory in which to work will be useful. We can treat everything as happening from the root of that directory and build up the structure from there, which means we can give things canonical locations in a known structure. Because those locations are expressed as relative paths, the code that stores and finds files can rely on a common file path function defined in terms of the canonical structure. That takes care of step 1. Format-specific handling can then be developed on top of these common locations, which will take a while but will take care of step 2. And by optionally making the working directory a temporary one, we get easy cleanup, which takes care of step 3.
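The gather, tap, and clean steps above could be sketched roughly as follows. This is only an illustration built on the standard library, not nbconvert's actual machinery: the `LAYOUT` mapping, `canonical_path`, and `working_directory` are hypothetical names invented for this sketch.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path

# Hypothetical canonical layout, relative to the root of the working directory.
LAYOUT = {"attachment": Path("attachment"), "output": Path("output")}


def canonical_path(root, kind, filename):
    """Return where a file of the given kind lives under `root`.

    Only the base name of `filename` is used, so callers cannot escape the layout.
    """
    return Path(root) / LAYOUT[kind] / Path(filename).name


@contextmanager
def working_directory(keep=False):
    """Yield a fresh directory with the canonical layout already created.

    The directory is removed on exit unless `keep` is true (the optional clean step).
    """
    root = Path(tempfile.mkdtemp(prefix="nbconvert-"))
    try:
        for sub in LAYOUT.values():
            (root / sub).mkdir(parents=True)
        yield root
    finally:
        if not keep:
            shutil.rmtree(root, ignore_errors=True)


if __name__ == "__main__":
    with working_directory() as root:
        # Gather: store a resource at its canonical location.
        target = canonical_path(root, "attachment", "figure.png")
        target.write_bytes(b"placeholder bytes")
        # Tap: a format-specific converter would run here, relative to `root`.
        print(target.relative_to(root))
    # Clean: the working directory no longer exists at this point.
```

Passing `keep=True` would leave the directory in place, which matches the case where the extracted files need to survive alongside the exported document.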

For example, the epub exporter uses the markdown exporter as an intermediate step, producing the file in a temporary directory. This works because the markdown exporter writes out any output images as separate media files and references them from the markdown. Pandoc's epub writer can find these files and include them in its native format. However, we do not do this for attached files; instead those are embedded as ![ScreenShot2016-10-12at19.20.34.png](attachment://ScreenShot2016-10-12at19.20.34.png), which does not point to a file system location. If we treat input attachments as we do outputs, the markdown exporter will be able to make attachments visible as easily as it does the output images. Changing the link to a more appropriate location, such as ![ScreenShot2016-10-12at19.20.34.png](./attachment/ScreenShot2016-10-12at19.20.34.png), would be sufficient for pandoc to find the attached files.
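To make the link rewriting concrete, here is a minimal sketch that works on a notebook loaded as a plain dict (for example via `json.load`). It relies on the notebook format storing attachments per cell as a mapping from file name to a MIME bundle of base64 strings. The function name `extract_attachments` and the `attachment` subdirectory are assumptions for illustration, not an existing nbconvert API. The pattern accepts both `attachment://name`, as shown in the log above, and `attachment:name`.

```python
import base64
import re
from pathlib import Path

# Matches "(attachment:name)" and "(attachment://name)" inside a markdown link.
ATTACHMENT_LINK = re.compile(r"\(attachment:(?://)?([^)\s]+)\)")


def extract_attachments(nb, root, subdir="attachment"):
    """Write each cell attachment to root/subdir and point links at the files.

    `nb` is modified in place. Returns the list of paths that were written.
    """
    out_dir = Path(root) / subdir
    written = []
    for cell in nb.get("cells", []):
        attachments = cell.get("attachments") or {}
        if not attachments:
            continue
        out_dir.mkdir(parents=True, exist_ok=True)
        for name, bundle in attachments.items():
            # A bundle maps MIME types to base64 payloads; take the first entry.
            payload = next(iter(bundle.values()))
            target = out_dir / Path(name).name
            target.write_bytes(base64.b64decode(payload))
            written.append(target)

        def relink(match, attachments=attachments):
            name = match.group(1)
            if name in attachments:
                return f"(./{subdir}/{Path(name).name})"
            # Leave links to unknown attachments untouched.
            return match.group(0)

        source = cell["source"]
        if isinstance(source, list):  # on-disk notebooks may store a list of lines
            source = "".join(source)
        cell["source"] = ATTACHMENT_LINK.sub(relink, source)
    return written
```

One limitation of this sketch: two cells that attach different images under the same file name would overwrite each other, so a real implementation would need some per-cell naming scheme.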

We can likely use similar means for the markdown to html conversion, so that it can include these images either as embedded data or as separate files. In one case you just include the data URI; in the other you maintain the same mechanism as described above for epub. The same machinery can support both versions. Then, instead of trying to figure out how to pass the images in independently, we create them as separate files and then read them back in. Yes, it will be less efficient, but then we will have a common mechanism for achieving all of this.
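The "read them back in" variant could look something like the sketch below, which swaps links to local files for data URIs. It operates on the markdown text rather than on the html output, purely to keep the example short. `to_data_uri` and `embed_images` are hypothetical helper names, and the MIME type is guessed from the file extension.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Matches markdown images of the form ![alt](target), without titles.
IMAGE_LINK = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def to_data_uri(path):
    """Encode a file as a base64 data URI, guessing the MIME type from its name."""
    mime, _ = mimetypes.guess_type(str(path))
    mime = mime or "application/octet-stream"
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def embed_images(markdown, root):
    """Replace image links that resolve to files under `root` with data URIs."""

    def inline(match):
        alt, target = match.group(1), match.group(2)
        candidate = Path(root) / target
        # Skip URLs with a scheme and anything that is not a file on disk.
        if "://" in target or not candidate.is_file():
            return match.group(0)
        return f"![{alt}]({to_data_uri(candidate)})"

    return IMAGE_LINK.sub(inline, markdown)
```

Leaving the links untouched gives the separate-files behaviour, so a single flag around the call to `embed_images` would be enough to choose between the two versions.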

From there we can work backwards and figure out ways to solve the problem more efficiently. But this should be doable without a postprocessor, as a standard default option built on fairly common output-agnostic machinery.

I'm going to try to make this work for the epub exporter regardless, because we're already using a TemporaryWorkingDirectory there, so it'll make for a good test case. The way I'll approach it is by adding a hook for this in the markdown exporter itself: since it's already handling the correct file placement for the outputs, I figure I can mirror that for the attachments.

Tips on how to make it generalisable are extremely welcome. However, I may pursue a local optimum for the epub solution and then try to abstract away from that, rather than transform every piece of advice on proper generalisation into code from the get-go. If in a week that hasn't gone anywhere, then I'll know I'm barking up the wrong tree because, as far as I can tell, this shouldn't be too hard a modification to make.

Relates to #467.

@Analect

Analect commented Mar 21, 2017

@mpacer
Was just wondering if you had any more thoughts around the implementation of this. I love having the ability to embed attachments and self-contain notebooks, but facilitating conversion to markdown necessitates either pushing these images to the cloud (s3?), where they can be referenced via a URL, or saving them back to a dedicated folder alongside the notebook.

On further reflection here ... maybe I'm mixing up your concepts of epub and markdown. It seems it's possible to embed base64 images within a markdown document using the approach illustrated here.

This red-dot test <img src="" /> won't work on GitHub (they don't permit it, from here), but it does work in a markdown cell in the notebook (in both notebook classic and jupyterlab). However, for some reason it doesn't work when you try to render a md file as markdown in jupyterlab.

@takluyver
Member

I think the attachment urls in the markdown output are simply a missing feature - data embedded in attachments should be extracted to separate files, like we already do with images embedded in outputs.

+1 to designing a good mechanism for an intermediate step with files stored in a temporary directory. This is what nbconvert to PDF does as well (with LaTeX as the intermediate), and there are problems with that (see #552), so we have at least two cases for it.

@Analect

Analect commented Mar 21, 2017

@takluyver ... thanks for your input here. I may be misunderstanding things, but in the case of markdown, the files generated from embedded images will be in a directory that can't just be temporary and thrown away, as with the PDF solution, since the md file will be referencing them ... unless you use something like:
![an example red dot]()
or
<img src="" /> to embed the image in the markdown. But, from above, not all markdown renderers are willing to handle this, although I see http://dillinger.io/ does handle the first format above.

@takluyver
Member

For Markdown export, it wouldn't use a temporary directory - we already have a way to create a permanent directory on export, which is used when extracting images from outputs.

The temporary directory would be for converting to epub, which makes a markdown intermediate and then converts that to epub. A quick Google suggests that images etc. are contained inside the epub file (which is actually a zip archive), so it doesn't need to reference images in a separate file.
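One quick way to check that claim for a given file is to open the epub with Python's `zipfile` module and list the image entries it contains. This is a small sketch: `list_epub_images` is a hypothetical helper, and the set of suffixes is an arbitrary assumption about which files count as images.

```python
import zipfile
from pathlib import PurePosixPath

# Assumed set of image suffixes; extend as needed.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".svg"}


def list_epub_images(epub_path):
    """Return the sorted names of image entries stored inside an epub archive."""
    with zipfile.ZipFile(epub_path) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if PurePosixPath(name).suffix.lower() in IMAGE_SUFFIXES
        )
```

If an attachment was picked up during conversion, its file name should appear in the returned list; an empty list for a notebook with attachments would point to the skipped-media problem described at the top of this issue.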

@Analect

Analect commented Mar 21, 2017

Thanks for clarifying.
