Add Gmail takeout mbox import (v2) #8
Parsing the mbox file manually instead of using Python's built-in parser allows us to process large files without loading them into memory all at once.
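For illustration, a minimal sketch of the streaming idea, not the code from this PR. It assumes every line starting with `From ` begins a new message; the function name `iter_mbox_messages` is hypothetical. Only one message is held in memory at a time.

```python
from typing import BinaryIO, Iterator, List, Optional, Tuple


def iter_mbox_messages(fp: BinaryIO) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (from_line, raw_message) pairs, one message at a time."""
    from_line: Optional[bytes] = None
    body_lines: List[bytes] = []
    for line in fp:
        if line.startswith(b"From "):
            # A new separator line: emit the message collected so far.
            if from_line is not None:
                yield from_line, b"".join(body_lines)
            from_line = line.rstrip(b"\r\n")
            body_lines = []
        elif from_line is not None:
            body_lines.append(line)
    # Emit the final message, which has no separator after it.
    if from_line is not None:
        yield from_line, b"".join(body_lines)
```

This sketch does not undo `>From ` escaping inside message bodies, and it does not check that the separator is preceded by a blank line, so a real importer would likely need more care here.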
This fixes a regression introduced by the previous commit where messages no longer fetch the date from the mbox 'From ' line. For messages without a Date header this means we lose information about the delivery date.
Some messages (like gchat logs) don't have message ids and therefore don't save properly. This commit uses the gmail X-GM-THRID if the Message-Id is missing.
The function email.utils.parsedate_tz expects a str, but we were passing bytes. Casting to str fixes an exception in messages where the Date header is missing and the delivery time must be inferred from the mbox header.
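A rough sketch of inferring the delivery time from the separator line, assuming it looks like `From <sender> <date>`. The helper name `delivery_date_from_from_line` is hypothetical. The bytes are decoded before being passed to `email.utils.parsedate_tz`, which expects a str.

```python
import email.utils
from datetime import datetime, timezone
from typing import Optional


def delivery_date_from_from_line(from_line: bytes) -> Optional[datetime]:
    """Parse the date part of an mbox 'From ' line, or return None."""
    # parsedate_tz works on str, so decode the raw bytes first.
    text = from_line.decode("ascii", errors="replace")
    parts = text.split(None, 2)  # ["From", sender, date]
    if len(parts) < 3:
        return None
    parsed = email.utils.parsedate_tz(parts[2])
    if parsed is None:
        return None
    # mktime_tz applies the parsed offset and returns a UTC timestamp.
    return datetime.fromtimestamp(email.utils.mktime_tz(parsed), tz=timezone.utc)
```

A caller would only fall back to this value when the message has no usable Date header.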
Some messages (like chats) don't have a Message-Id mime header, so the message is saved without a primary key. A previous commit used the thread id in this situation, but the same thread id can be used for multiple messages. This id, which is the message id used by the gmail api, should be unique across all messages.
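The fallback could be sketched roughly like this. The function name `pick_message_id` and the `fallback_id` parameter are hypothetical; how the Gmail message id is obtained is left to the caller, since that detail is not shown here.

```python
from email.message import Message
from typing import Optional


def pick_message_id(msg: Message, fallback_id: Optional[str] = None) -> Optional[str]:
    """Prefer the Message-Id header; otherwise use a caller-supplied id."""
    message_id = msg.get("Message-Id")  # header lookup is case-insensitive
    if message_id:
        return str(message_id).strip()
    # The fallback must be unique per message (a thread id is not),
    # otherwise rows sharing a thread would collide on the primary key.
    return fallback_id
```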
The docs note: "The policy keyword should always be specified; The default will change to email.policy.default in a future version of Python."
This shouldn't happen in RFC-abiding messages, but raw unicode or other non-ascii content will cause the header parser to return a Header object rather than a str. Improve handling of this case and add a simple unit test.
If the string is invalid, the undecoded string is returned instead.
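One way this header handling might look, as a sketch under assumptions rather than the code in this PR. The helper name `header_to_str` is hypothetical. It accepts either a str or a `Header` object and returns the undecoded value if decoding fails.

```python
from email.header import Header, decode_header, make_header
from typing import Optional, Union


def header_to_str(value: Union[str, Header, None]) -> Optional[str]:
    """Return a plain str for a header value, decoding encoded words."""
    if value is None:
        return None
    if isinstance(value, Header):
        # Raw non-ascii content can make the parser return a Header object.
        return str(value)
    try:
        # Decode RFC 2047 encoded words such as =?utf-8?q?...?=
        return str(make_header(decode_header(value)))
    except (LookupError, UnicodeError, ValueError):
        # Invalid input: keep the undecoded string rather than fail.
        return value
```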
Just added two more fixes:
I was able to run this on my Takeout export and everything seems to work fine. @simonw let me know if this looks good to merge.
In some instances tables would be created with the wrong column types if the initial records had unexpected types. This fixes the issue by explicitly creating the table and specifying types.
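To illustrate the idea only, here is a sketch that uses the stdlib `sqlite3` module rather than whatever this project uses. The table name `messages` and its columns are hypothetical. The table is created with explicit column types before any rows are inserted, so the schema does not depend on the first records seen.

```python
import sqlite3


def create_messages_table(conn: sqlite3.Connection) -> None:
    """Create the table up front with explicit column types."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS messages (
            message_id TEXT PRIMARY KEY,
            subject    TEXT,
            date       TEXT,
            body       TEXT
        )
        """
    )
    conn.commit()
```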
Using this newer email parsing code enables parsing of attachments and easier parsing of html emails in the future.
This may be more robust than the tree-walking method we were using earlier, and will enable parsing of html email contents in a future commit.
(Only if no text/plain alternative exists)
I added parsing of text/html emails using BeautifulSoup. Around half of the emails in my archive don't include a text/plain payload so adding html parsing makes a good chunk of them searchable.
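A rough sketch of the plain-text-first fallback. To stay self-contained it uses the stdlib `html.parser` in place of BeautifulSoup, and it assumes `email.policy.default` and `get_body()`, which may not match this PR. The names `searchable_body` and `html_to_text` are hypothetical.

```python
import email
import email.policy
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collect text nodes; a crude stand-in for BeautifulSoup's get_text()."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        # Note: this also receives the contents of <script> and <style>.
        self.chunks.append(data)


def html_to_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by the markup.
    return " ".join(" ".join(parser.chunks).split())


def searchable_body(raw: bytes) -> str:
    msg = email.message_from_bytes(raw, policy=email.policy.default)
    plain = msg.get_body(preferencelist=("plain",))
    if plain is not None:
        return plain.get_content()
    # Only fall back to HTML when no text/plain alternative exists.
    html = msg.get_body(preferencelist=("html",))
    if html is not None:
        return html_to_text(html.get_content())
    return ""
```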
@maxhawkins how hard would it be to add an entry to the table that includes the HTML version of the email, if it exists? I just tried your PR branch on a very small mbox file, and it worked great. My use case is a research project and I need to access more than just the plain-text body.
Shouldn't be hard. The easiest way is probably to remove the
@maxhawkins curious why you didn't use the stdlib
Mailbox parses the entire mbox into memory. Using the lower level library lets us stream the emails in one at a time to support larger archives. Both libraries are in the stdlib.
Makes sense, thanks for explaining!
WIP
This PR builds on #5 to continue implementing gmail import support.
Building on @UtahDave's work, these commits add a few performance and bug fixes:
I will send more commits to fix any errors I encounter as I run the importer on my personal takeout data.