-
Notifications
You must be signed in to change notification settings - Fork 0
/
email2dir.py.nw
249 lines (206 loc) · 7.11 KB
/
email2dir.py.nw
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
\documentclass[a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[british]{babel}
\usepackage[hyphens]{url}
\usepackage{hyperref}
\usepackage{noweb}
% Needed to relax penalty for breaking code chunks across pages, otherwise
% there might be a lot of space following a code chunk.
\def\nwendcode{\endtrivlist \endgroup}
\let\nwdocspar=\smallbreak
\usepackage[capitalize]{cleveref}
\title{Viewing HTML emails properly from Mutt}
\author{Daniel Bosk}
\begin{document}
\maketitle
\tableofcontents
\clearpage
@
\section{Introduction, command-line usage}
This is the documentation of the [[<<email2dir.py>>]] Python 3 script.
It takes an HTML message, extracts the images and HTML into separate files that
can be handled by a web browser.
It returns the main HTML file.
The main purpose is to run it from within an email client like Mutt.
To open an email for viewing in a web browser, the following usage is
recommended:
\begin{verbatim}
email2dir.py email.eml | xargs ${BROWSER}
\end{verbatim}
This originally started as a modification to Akkana Peck's
[[viewhtmlemail]]\footnote{%
URL: \url{https://github.com/akkana/scripts/blob/master/viewhtmlmail}.
}, but ended up as a complete rewrite.
We use the standard structure for the script.
We have import statements at the top in the [[<<imports>>]] code block.
Then we have functions in the [[<<functions>>]] code block.
Finally, we have a [[main]] function which is run if the script is run directly
but not if it is imported.
<<email2dir.py>>=
#!/usr/bin/env python3
<<imports>>
<<functions>>
def main(argv):
<<main body>>
if __name__ == '__main__':
sys.exit(main(sys.argv[1:]))
@
Since we rely on the [[sys]] module we need
<<imports>>=
import sys
@
\subsection{Parsing command-line arguments}
The main body is
<<main body>>=
<<parse command-line arguments>>
<<process command-line arguments and do main execution>>
@
To parse command-line arguments we will make use of [[argparse]]:
<<imports>>=
import argparse
@
We first create a parser, then set up the valid arguments and finally provide
the dictionary [[args]] containing each argument that was passed on the command
line.
<<parse command-line arguments>>=
argp = argparse.ArgumentParser(description="Transform email to web folder.")
<<valid argparse arguments>>
args = vars(argp.parse_args(argv[1:]))
@
\subsection{Processing command-line arguments, the main functionality}
We expect a list of email files on the command line.
<<valid argparse arguments>>=
argp.add_argument("file", nargs="+", help="An email file to parse")
@
We want to process each email and put its contents in a temporary directory.
We will use the same temporary directory for each file given, this allows the
following usage:
\begin{verbatim}
ls *.eml | xargs email2dir.py
\end{verbatim}
will put all emails in the same temporary directory.
Whereas,
\begin{verbatim}
for f in $(ls *.eml); do email2dir.py $f; done
\end{verbatim}
will put each email into a separate temporary directory.
We first create a temporary directory and then process each file using the
function [[email2dir]] (defined in \cref{email2dir}).
We will use the [[tempfile]] module to create the temporary directory.
<<imports>>=
import tempfile
@ [[email2dir]] returns a list of file names that are HTML files, we want to
print those to [[stdout]].
<<process command-line arguments and do main execution>>=
tmpdir = tempfile.mkdtemp()
for filename in args["file"]:
try:
file_stream = open(filename)
for html_name in email2dir(file_stream, tmpdir):
print("{0}".format(html_name))
except Exception as err:
print("{0}: {1}".format(argv[0], err), file=sys.stderr)
return 1
finally:
if file_stream:
file_stream.close()
return 0
@
\section{Transforming email files to HTML folders, module usage}
The main processing is done in the function [[email2dir]].
It takes two arguments: [[in_stream]], which is the input email file (as an
open file or byte stream of any kind); and [[out_dir]], which is the output
directory.
We will use the [[email]] module for parsing.
<<imports>>=
import email
@
\label{email2dir}
<<functions>>=
def email2dir(in_stream, out_dir):
with email.message_from_file(in_stream) as msg:
<<read out files from msg, return html files>>
@ This requires the [[email]] module.
We divide the parts of an email into two classes: those parts that are HTML and
those that are not.
Then we write them to file and map the content-IDs to the file names.
<<read out files from msg, return html files>>=
html_parts = {}
other_parts = {}
for part in msg.walk():
# multipart/* are just containers
if part.is_multipart() or part.get_content_type == 'message/rfc822':
continue
elif part.get_content_subtype() == 'html':
content_id, filename = write_part_to_file(part, out_dir)
html_parts[content_id] = filename
else:
content_id, filename = write_part_to_file(part, out_dir)
other_parts[content_id] = filename
<<replace content-IDs with filenames in HTML>>
return list(html_parts.values())
@
The function [[write_part_to_file]] takes a part, chooses a suitable file name
and writes the part into that file in the given output directory, [[out_dir]].
<<functions>>=
def write_part_to_file(part, out_dir):
content_id = None
filename = None
<<get the part's content-ID>>
<<generate a suitable file name from part>>
<<write the part to the file in out-dir>>
return content_id, filename
@
Each part has a set of MIME keys, one of which is the content-ID.
<<get the part's content-ID>>=
for key in list(part.keys()):
if key.lower() == 'content-id':
content_id = part[key]
if content_id.startswith('<') and content_id.endswith('>'):
content_id = content_id[1:-1]
break
else:
continue
if not content_id:
content_id = hash(part)
@
For now, we are not interested in human readable file names.
So we can just use the content-ID:
<<generate a suitable file name from part>>=
filename = content_id
@
Finally, we can write the payload of the part to the file in the output
directory, [[out_dir]].
Note that we require that the file does not exist ([[x]]).
<<write the part to the file in out-dir>>=
with open(os.path.join(out_dir, filename), "xb") as out_stream:
out_stream.write(part.get_payload(decode=True))
out_stream.close()
@ This requires the [[os]] module:
<<imports>>=
import os
@
We must now replace the content-ID references in the HTML with the
corresponding file name.
We have already written the HTML files to disk.
So now we must read them, make changes and then write them back.
<<replace content-IDs with filenames in HTML>>=
for html_file in html_parts.values():
html_src = None
with open(html_file) as html_in:
html_src = html_in.read()
<<set up regex and replace content-ID in html-src>>
with open(html_file) as html_out:
html_out.write(html_src)
@
We will use regular expressions to search for content-ID and replace it with
[[other_parts[content-ID]]].
<<imports>>=
import re
<<set up regex and replace content-ID in html-src>>=
for content_id in other_parts:
html_src = re.sub("cid: ?" + content_id,
"file://" + other_parts[content_id],
html_src, flags=re.IGNORECASE)
@
\end{document}