Add Gmail takeout mbox import (v2) #8

maxhawkins · 2021-07-28T07:05:32Z

WIP

This PR builds on #5 to continue implementing Gmail import support.

Building on @UtahDave's work, these commits add a few performance and bug fixes:

- Decreased memory overhead for import by manually parsing mbox headers.
- Fixed error where some messages in the mbox would yield a row with NULL in all columns.

I will send more commits to fix any errors I encounter as I run the importer on my personal takeout data.

Some messages (like chats) don't have a Message-Id mime header, so the message is saved without a primary key. A previous commit used the thread id in this situation, but the same thread id can be used for multiple messages. This id, which is the message id used by the gmail api, should be unique across all messages.

maxhawkins · 2021-08-07T00:57:48Z

Just added two more fixes:

- Added parsing for RFC 2047-encoded Unicode headers.
- The body is now stored as TEXT rather than a BLOB, regardless of the order in which messages are parsed.

I was able to run this on my Takeout export and everything seems to work fine. @simonw let me know if this looks good to merge.

maxhawkins · 2021-08-10T23:28:45Z

I added parsing of text/html emails using BeautifulSoup.

Around half of the emails in my archive don't include a text/plain payload, so adding HTML parsing makes a good chunk of them searchable.

Btibert3 · 2021-12-29T18:58:23Z

@maxhawkins how hard would it be to add an entry to the table that includes the HTML version of the email, if it exists? I just attempted your PR branch on a very small mbox file, and it worked great. My use case is a research project and I need to access more than just the body plain text.

maxhawkins · 2021-12-31T19:06:20Z

> @maxhawkins how hard would it be to add an entry to the table that includes the HTML version of the email, if it exists? I just attempted your PR branch on a very small mbox file, and it worked great. My use case is a research project and I need to access more than just the body plain text.

Shouldn't be hard. The easiest way is probably to remove the if body.content_type == "text/html" clause from utils.py:254 and just return content directly without parsing.
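As a rough illustration of that idea, the sketch below pulls the plain-text and HTML parts separately so both could be stored. It is not the code in utils.py; the function name `extract_bodies` and the `body` / `body_html` keys are hypothetical, and it assumes the message was parsed with `email.policy.default`.

```python
import email
import email.policy
from email.message import EmailMessage


def extract_bodies(msg: EmailMessage) -> dict:
    """Return the plain-text and raw HTML bodies, each None when absent."""
    plain = msg.get_body(preferencelist=("plain",))
    html = msg.get_body(preferencelist=("html",))
    return {
        # Hypothetical column names, chosen only for this example.
        "body": plain.get_content() if plain is not None else None,
        "body_html": html.get_content() if html is not None else None,
    }


# Build a message with both alternatives and parse it back from bytes.
demo = EmailMessage()
demo.set_content("plain version")
demo.add_alternative("<p>html version</p>", subtype="html")
parsed_demo = email.message_from_bytes(demo.as_bytes(), policy=email.policy.default)
bodies = extract_bodies(parsed_demo)
```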

iloveitaly · 2023-09-06T19:12:33Z

@maxhawkins curious why you didn't use the stdlib mailbox to parse the mbox files?

maxhawkins · 2023-09-07T15:39:59Z

> @maxhawkins curious why you didn't use the stdlib mailbox to parse the mbox files?

Mailbox parses the entire mbox into memory. Using the lower level library lets us stream the emails in one at a time to support larger archives. Both libraries are in the stdlib.

iloveitaly · 2023-09-08T01:22:49Z

Makes sense, thanks for explaining!

UtahDave and others added 12 commits February 22, 2021 12:56

Add ability to import Gmail Takeout mbox

8008357

Add some tests

50e7e8d

Format with Black

a3de045

Manually parse mbox format

72802a8

Parsing the mbox file manually instead of using Python's built-in parser allows us to process large files without loading them into memory all at once.
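A minimal sketch of what streaming an mbox file can look like, assuming messages are delimited by lines that start with `From `. The generator name `iter_mbox` is made up for this example and is not taken from the commit; it also does not undo `>From ` escaping inside message bodies.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def iter_mbox(fileobj: BinaryIO) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (from_line, raw_message) pairs, holding one message at a time."""
    from_line = None
    body_lines = []
    for line in fileobj:
        # A new message starts at a line beginning with "From " (with the space).
        if line.startswith(b"From "):
            if from_line is not None:
                yield from_line, b"".join(body_lines)
            from_line = line.rstrip(b"\r\n")
            body_lines = []
        elif from_line is not None:
            body_lines.append(line)
    # Flush the final message once the file is exhausted.
    if from_line is not None:
        yield from_line, b"".join(body_lines)


sample_mbox = io.BytesIO(
    b"From 1@xxx Tue Aug 31 17:11:33 +0000 2021\n"
    b"Subject: a\n\nbody a\n\n"
    b"From 2@xxx Tue Aug 31 17:12:00 +0000 2021\n"
    b"Subject: b\n\nbody b\n"
)
messages = list(iter_mbox(sample_mbox))
```

Because the generator only buffers the lines of the current message, memory use depends on the largest single message rather than on the size of the whole archive.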

Fix import for messages that don't have a Date

4bc7010

This fixes a regression introduced by the previous commit where messages no longer fetch the date from the mbox 'From ' line. For messages without a Date header this means we lose information about the delivery date.

Use thread id as pkey if missing message id.

8ee555c

Some messages (like gchat logs) don't have message ids and therefore don't save properly. This commit uses the gmail X-GM-THRID if the Message-Id is missing.

Fix parse exception: convert delivery_date to a str

e1fdef7

The function email.utils.parsedate_tz expects a str, but we were passing bytes. Casting to str fixes an exception in messages where the Date header is missing and the delivery time must be inferred from the mbox header.
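For illustration, here is one way the delivery time could be recovered from the separator line. This is a sketch, not the commit's code: the helper name `delivery_date` and the assumed layout `From <sender> <date>` are assumptions, and the bytes are decoded before being handed to `parsedate_tz`.

```python
from datetime import datetime, timezone
from email.utils import mktime_tz, parsedate_tz
from typing import Optional


def delivery_date(from_line: bytes) -> Optional[datetime]:
    """Parse the timestamp from an mbox 'From ' separator line."""
    # Assumed layout: b"From <sender> <date ...>"; keep everything after the sender.
    parts = from_line.split(b" ", 2)
    if len(parts) < 3:
        return None
    # parsedate_tz works on str, so decode the bytes first.
    date_str = parts[2].decode("ascii", errors="replace").strip()
    parsed = parsedate_tz(date_str)
    if parsed is None:
        return None
    return datetime.fromtimestamp(mktime_tz(parsed), tz=timezone.utc)


example_date = delivery_date(b"From 1234@xxx Tue Aug 31 17:11:33 +0000 2021")
```

An importer could call something like this only when the message has no usable Date header.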

Explicitly parse email with compat32 policy.

953e7eb

The docs note: "The policy keyword should always be specified; The default will change to email.policy.default in a future version of Python."
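A short sketch of what passing the policy explicitly looks like with the standard library. The sample message is invented for the example.

```python
import email
import email.policy

raw = b"Subject: hello\r\nFrom: a@example.com\r\n\r\nbody\r\n"

# Naming the policy pins the legacy behaviour instead of relying on the default.
legacy_msg = email.message_from_bytes(raw, policy=email.policy.compat32)

# For comparison, the modern policy produces an EmailMessage with richer accessors.
modern_msg = email.message_from_bytes(raw, policy=email.policy.default)
```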

Simplify handling of headers with binary data.

50cc883

This shouldn't happen in RFC-abiding messages, but raw unicode or other non-ascii content will cause the header parser to return a Header object rather than a str. Improve handling of this case and add a simple unit test.
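To make the case concrete, the sketch below parses a header that carries raw UTF-8 bytes under the compat32 policy and normalizes the result to a str. The helper `header_to_str` is hypothetical and may differ from what the commit does.

```python
import email
import email.policy
from email.header import Header


def header_to_str(value) -> str:
    """Return a plain str for a header value fetched under compat32."""
    if value is None:
        return ""
    if isinstance(value, Header):
        # Bytes that cannot be decoded as ASCII come back as replacement characters.
        return str(value)
    return value


# A non-compliant header: raw UTF-8 bytes instead of an RFC 2047 encoded word.
raw = b"Subject: caf\xc3\xa9\r\n\r\nbody\r\n"
msg = email.message_from_bytes(raw, policy=email.policy.compat32)
subject = msg["Subject"]
subject_text = header_to_str(subject)
```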

Add RFC 2047 parsing to deal with unicode headers.

770bc0e

Deal with invalid rfc 2047 strings.

4f50ff4

If the string is invalid, the undecoded string is returned instead.
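The decode-with-fallback behaviour can be sketched with the standard library's `email.header` helpers. The function name `decode_rfc2047` and the exact set of exceptions caught are assumptions for this example, not the commit's implementation.

```python
from email.errors import HeaderParseError
from email.header import decode_header, make_header


def decode_rfc2047(value: str) -> str:
    """Decode RFC 2047 encoded words, returning the input unchanged on failure."""
    try:
        return str(make_header(decode_header(value)))
    except (HeaderParseError, LookupError, UnicodeError):
        # Unknown charset or malformed payload: keep the undecoded string.
        return value


decoded = decode_rfc2047("=?utf-8?q?caf=C3=A9?=")
fallback = decode_rfc2047("=?bogus-charset?q?abc?=")
```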

maxhawkins force-pushed the mbox_parse branch from c741b9b to 4f50ff4 Compare August 7, 2021 00:45

Make [body] a TEXT column instead of a BLOB.

abb4dfd

maxhawkins changed the title ~~WIP: Add Gmail takeout mbox import (v2)~~ Add Gmail takeout mbox import (v2) Aug 7, 2021

maxhawkins added 8 commits August 8, 2021 13:48

Don't default pkey to thread ID if missing.

2a31dd4

Format with black

d3cf088

Remove dependency on rich.

6a3832c

Create table before inserting, ensuring proper column types.

25ee0a2

In some instances tables would be created with the wrong column types if the initial records had unexpected types. This fixes the issue by explicitly creating the table and specifying types.
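The idea can be shown with the standard-library `sqlite3` module, used here as a stand-in for whatever database layer the importer uses. The table name `mbox_emails` and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Declare the column types up front instead of inferring them from the first rows.
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS mbox_emails (
        id TEXT PRIMARY KEY,
        subject TEXT,
        body TEXT
    )
    """
)

# The first record has no body; the declared type is unaffected by that.
conn.execute("INSERT INTO mbox_emails VALUES (?, ?, ?)", ("msg-1", "hi", None))
conn.execute("INSERT INTO mbox_emails VALUES (?, ?, ?)", ("msg-2", "re: hi", "text body"))

column_types = {row[1]: row[2] for row in conn.execute("PRAGMA table_info(mbox_emails)")}
```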

Add back progress tracking with Rich.

0bfe031

Use Python 3.6 email parser.

98d89bf

Using this newer email parsing code enables parsing of attachments and easier parsing of html emails in the future.

Use EmailMessage.get_body.

c081ed3

This may be more robust than the tree-walking method we were using earlier, and will enable parsing of html email contents in a future commit.
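As a small illustration of `EmailMessage.get_body`, the sketch below builds a multipart/alternative message, parses it back, and asks for the text/plain part. The sample content is invented.

```python
import email
import email.policy
from email.message import EmailMessage

# Build a small multipart/alternative message to parse back.
original = EmailMessage()
original["Subject"] = "demo"
original.set_content("plain hello")
original.add_alternative("<p>html hello</p>", subtype="html")

parsed = email.message_from_bytes(original.as_bytes(), policy=email.policy.default)

# get_body selects a part by preference, so the MIME tree is not walked by hand.
plain_part = parsed.get_body(preferencelist=("plain",))
plain_text = plain_part.get_content() if plain_part is not None else None
```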

Parse html emails to plaintext.

8e6d487

(Only if no text/plain alternative exists)
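A sketch of the fallback order, assuming the message was parsed with `email.policy.default`. The PR uses BeautifulSoup for the HTML step; this example swaps in the standard library's `html.parser` so it runs without extra dependencies, and the names `_TextExtractor` and `body_as_text` are hypothetical.

```python
import email
import email.policy
from email.message import EmailMessage
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes; a naive stdlib stand-in for an HTML-to-text library."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def body_as_text(msg: EmailMessage) -> str:
    # Prefer text/plain; fall back to text/html only when no plain part exists.
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return ""
    content = part.get_content()
    if part.get_content_type() == "text/html":
        extractor = _TextExtractor()
        extractor.feed(content)
        extractor.close()
        # Collapse runs of whitespace left behind by the markup.
        return " ".join(" ".join(extractor.chunks).split())
    return content


html_only = EmailMessage()
html_only.set_content("<p>Hello <b>world</b></p>", subtype="html")
parsed_html = email.message_from_bytes(html_only.as_bytes(), policy=email.policy.default)
text = body_as_text(parsed_html)
```

Note that this naive extractor also keeps the contents of script and style elements, which a real HTML-to-text step would likely drop.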

marSar29 mentioned this pull request Aug 2, 2022

Just added two more fixes: marSar29/mbox#1

Open
