Mbox format #35

Closed
ghost opened this issue Jan 24, 2020 · 5 comments
ghost commented Jan 24, 2020

Hi,

First off, thanks for these scripts; they were a huge help. This question is possibly out of scope for the project. I downloaded a group with around 90k messages with no issues, after slightly adapting the generated wget.sh script with the modification provided in #32. All the messages are now in $GROUP/mbox as individual RFC 822 formatted files.

I am looking to convert these into a single, actual mbox file, but I can't get the format right. I have tried simply joining the individual files together, but that does not produce a valid mbox file.

find $GROUP/mbox/ -type f | while read f; do cat $f >> tmp.mbox; done

I have also tried formatting them with procmail's formail.

for f in $GROUP/mbox/*; do formail -b < "$f" >> test2.mbox; done

While this command does work, it puts the current time in the From line instead of the posted date, so when you open the file in, say, mutt, it shows the wrong date.

for f in $GROUP/mbox/*; do formail -a "Date:" < "$f" >> test2.mbox; done

This command creates an invalid mbox file:

mutt -f test2.mbox
Invalid mbox format

Any ideas how I can get this to a valid mbox format?

@pmwheatley

Unfortunately, I've found it's not quite as simple as just concatenating the messages. mbox seems to depend on the first line of every message being the From: line. The crawler currently just downloads the raw message (which may or may not have the From: line as the first line).

Something similar to this bash command, (echo '/^[Ff][Rr][Oo][Mm]:/m0'; echo w; echo q ) | ed "$file", will move the first matching From line to the top of the message in preparation for concatenating the messages into an mbox.

I'll maybe submit a PR for this if I get to a point where it seems to be working consistently.
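For reference, the mbox variants delimit messages with a separator line that begins with `From ` (the word, then a space, with no colon), followed by an envelope sender and an asctime-style timestamp. That line is separate from the `From:` header. Below is a minimal sketch of deriving such a separator from a raw message's own headers, so the posted date is used rather than the current time. The function name `mbox_from_line` is hypothetical, and so are the fallbacks (`MAILER-DAEMON` and the epoch timestamp). The sketch assumes the `Date:` header carries a day name ("Fri, 24 Jan 2020 10:15:00 +0000"), ignores folded headers, and drops the time zone offset instead of converting it.

```shell
# Sketch only: print an mbox "From " separator line for one raw message file.
# $1: path to a raw RFC 822 message (hypothetical input layout).
mbox_from_line() {
  awk '
    # Headers end at the first blank line; jump to END from there.
    /^\r?$/ { exit }

    # First From: header: prefer the address inside <...>, else the last word.
    sender == "" && tolower($0) ~ /^from:/ {
      line = $0
      sub(/[ \t\r]+$/, "", line)
      sub(/^[^:]*:[ \t]*/, "", line)
      if (match(line, /<[^>]+>/)) {
        sender = substr(line, RSTART + 1, RLENGTH - 2)
      } else {
        n = split(line, w, /[ \t]+/)
        sender = w[n]
      }
    }

    # First Date: header: reorder "Fri, 24 Jan 2020 10:15:00 +0000"
    # into "Fri Jan 24 10:15:00 2020". The zone offset is dropped (assumption).
    stamp == "" && tolower($0) ~ /^date:/ {
      line = $0
      sub(/[ \t\r]+$/, "", line)
      sub(/^[^:]*:[ \t]*/, "", line)
      n = split(line, d, /[ \t]+/)
      if (n >= 5 && d[1] ~ /,$/) {
        hms = d[5]
        if (hms ~ /^[0-9]+:[0-9]+$/) hms = hms ":00"
        stamp = sprintf("%s %s %2d %s %s", substr(d[1], 1, 3), d[3], d[2], hms, d[4])
      }
    }

    END {
      # Placeholder fallbacks, chosen for this sketch only.
      if (sender == "") sender = "MAILER-DAEMON"
      if (stamp == "") stamp = "Thu Jan  1 00:00:00 1970"
      printf "From %s %s\n", sender, stamp
    }
  ' "$1"
}
```

Called as `mbox_from_line "$f"`, this prints one line that could be written ahead of the message body when building the combined file. Whether a given reader such as mutt accepts the result has not been verified here.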

@pmwheatley

There are also potential issues around escaping other occurrences of From to look at: http://fileformats.archiveteam.org/wiki/Mbox
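As an illustration of the escaping concern, here is a small sketch of "mboxrd"-style quoting, one of the mbox variants: any line that consists of zero or more `>` characters followed by `From ` gets one extra `>` prepended, so it cannot be mistaken for a message separator. The function names `mboxrd_escape` and `mbox_entry` are hypothetical. The sketch assumes the raw files do not already begin with their own `From ` separator line, since that line would be quoted too.

```shell
# Sketch only: mboxrd-style quoting of lines that look like separators.
mboxrd_escape() {
  # awk always ends each output line with a newline, even if the input
  # file lacks a final one.
  awk '/^>*From / { print ">" $0; next } { print }'
}

# Emit one mbox entry on stdout.
# $1: separator line, expected to start with "From " (supplied by the caller)
# $2: path to a raw message file
mbox_entry() {
  printf '%s\n' "$1"
  mboxrd_escape < "$2"
  # A blank line closes the entry before the next separator.
  printf '\n'
}
```

A loop such as `for f in "$GROUP"/mbox/*; do mbox_entry "From MAILER-DAEMON Thu Jan  1 00:00:00 1970" "$f"; done > all.mbox` would then build a single file. The fixed separator shown is only a placeholder, and a real conversion would want a per-message sender and date.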

Owner

icy commented Apr 9, 2020

Hi @beardyjay and @pmwheatley, I'll have a look at how I handled the mbox files. I didn't notice there was an issue open in my project. I shot myself in the foot with my notification settings :(

icy commented Apr 12, 2020

I think I took the same approach that @pmwheatley suggested in #15 (comment). I believe I did that with a little Ruby script, which I have since lost among my files.

icy commented Apr 12, 2020

Oh, I just found that I also have a small conversion script, https://github.com/icy/bashy/blob/master/libs/raw2mbox.sh, but I haven't used it for a long time. It's good for reference purposes.

icy closed this as completed Apr 12, 2020