Freezes on ð character in subject line #48

Open
benfrancis opened this issue May 2, 2022 · 9 comments

Comments

@benfrancis

Thanks for this tool!

I just ran the script with the following command, on an .mbox file from Google Takeout containing approximately 7,000 emails:

$ python3 imap_upload.py --gmail --box imported takeout.mbox

It seems to have got stuck on an email with a subject line containing an ð ("eth") character. The full subject line is "FW: Youtube Job Wants You 👉 $20K/Month Potential! 80272150" (yes, it appears to be a spam message).

Is there anything I can do to recover from this? If I run the script a second time, will it upload duplicate emails?

@benfrancis
Author

I tried cleaning up some of the spam emails and re-exporting. This time it hung on a "â" character.
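
A possible explanation, offered as an assumption rather than a confirmed diagnosis: both characters may be the first byte of a multi-byte UTF-8 sequence displayed as Latin-1. The pointing-hand emoji in the subject line above starts with byte 0xF0, which is "ð" in Latin-1, and many punctuation marks and symbols start with 0xE2, which is "â". A minimal sketch (the helper name `first_byte_as_latin1` is made up for illustration and is not part of imap_upload.py):

```python
def first_byte_as_latin1(text: str) -> str:
    """Return the first UTF-8 byte of `text`, decoded as a Latin-1 character."""
    return text.encode("utf-8")[:1].decode("latin-1")


if __name__ == "__main__":
    # U+1F449 is the pointing-hand emoji from the subject line;
    # U+2019 (right single quotation mark) is only an example of a
    # character whose UTF-8 encoding starts with 0xE2.
    for sample in ("\U0001F449", "\u2019"):
        shown = first_byte_as_latin1(sample)
        print(f"U+{ord(sample):04X} -> U+{ord(shown):04X}")
```

U+00F0 is "ð" and U+00E2 is "â", so output that stops right after one of those characters could mean the script stopped partway through decoding or printing such a sequence.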

@rgladwell
Owner

rgladwell commented May 4, 2022

> It seems to have got stuck on an email with a subject line containing an ð ("eth") character.

Sounds like a character encoding issue. Do you have the exact error message? Including the line of code throwing the error?

> If I run the script a second time, will it upload duplicate emails?

I believe so. You might be able to recover by searching for emails added on a specific date. You could also check whether Gmail has added any specific labels to the uploaded emails.

If this doesn't work, maybe we could add a feature to delete emails that exist in a specific mbox.

@benfrancis
Author

benfrancis commented May 4, 2022

> Sounds like a character encoding issue. Do you have the exact error message? Including the line of code throwing the error?

I'm afraid it didn't actually print an error to the console, it just stopped printing any output to the console after that character in the subject line.

It might be possible to reproduce by exporting an .mbox file with an email containing a ð or â character in its subject line. I think that was a real subject line designed to evade spam filters, not garbled output caused by your script. But I agree it seems like a character encoding issue in that something is crashing on certain UTF-8 characters.
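
For anyone who wants to try reproducing this without a Takeout export, here is a minimal sketch that builds a tiny .mbox with such subject lines using Python's standard `mailbox` module. The function name, addresses and file name are placeholders. One caveat: this writes RFC 2047 encoded subject headers, which may not match the raw bytes in the original spam, so it may or may not trigger the same behaviour.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def write_test_mbox(path, subjects):
    """Write one small message per subject to a new mbox file at `path`."""
    box = mailbox.mbox(path, create=True)
    try:
        for number, subject in enumerate(subjects, start=1):
            msg = EmailMessage()
            msg["From"] = "sender@example.com"     # placeholder address
            msg["To"] = "recipient@example.com"    # placeholder address
            msg["Subject"] = subject
            msg.set_content(f"Test message {number}\n")
            box.add(msg)
    finally:
        box.close()  # flushes pending writes to disk
    return path


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "repro.mbox")
    write_test_mbox(target, ["Test \u00f0 subject", "Test \u00e2 subject"])
    print("Wrote", target)
```

The resulting file can then be passed to the script in place of takeout.mbox, ideally against a throwaway mailbox.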

> I would believe so. You might be able to recover by searching for emails added at a specific date. Also, you could check to see if Gmail has added any specific labels for uploaded emails.
> If this doesn't work, maybe we could add a feature to delete emails that exist in a specific mbox.

In the end I fixed the problem by re-exporting the .mbox without the offending emails, but it took a while to get rid of them all, and I had to delete several thousand emails each time I ran the script to avoid duplication. Fortunately the uploaded emails were labelled as "imported" by Gmail, which made that easy to do.

It would be useful if the script could de-duplicate emails when uploading, but I don't know how hard that is and how it would affect performance.
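
As a rough idea of what de-duplication could look like, here is a minimal sketch that skips messages whose Message-ID header has already been seen. This is an assumption about one possible approach, not how imap_upload.py works; the function name is hypothetical, and it only de-duplicates against IDs held in memory, not against what is already on the server.

```python
def unique_messages(messages, seen_ids=None):
    """Yield messages whose Message-ID has not been seen before.

    `messages` is any iterable of email.message.Message objects, for
    example a mailbox.mbox instance. Messages without a Message-ID are
    always yielded, because there is nothing to compare them by.
    """
    seen = set() if seen_ids is None else seen_ids
    for msg in messages:
        raw_id = msg.get("Message-ID")
        if raw_id is None:
            yield msg
            continue
        msg_id = str(raw_id).strip()
        if msg_id in seen:
            continue  # duplicate: skip it
        seen.add(msg_id)
        yield msg
```

The set lookup is constant time per message, so the cost should be small next to the upload itself. Passing in a `seen_ids` set that was saved from an earlier run would let a second run skip what the first run already handled.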

I've since discovered that Google have a couple of tools for this called mail importer and import-mailbox-to-gmail. They both look harder to use than your script, but the former features de-duplication and the latter has a --from_message parameter to re-start from a certain message number in the mailbox if something goes wrong.
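
A restart option along the lines of --from_message could be sketched roughly as below. This is only an illustration of the idea: the function name is hypothetical, and the 1-based numbering is my assumption, not something taken from either Google tool or from imap_upload.py.

```python
def messages_from(messages, start=1):
    """Yield (number, message) pairs, skipping messages numbered below `start`.

    `messages` is any iterable, for example a mailbox.mbox instance.
    Numbering is 1-based (an assumption made for this sketch).
    """
    if start < 1:
        raise ValueError("start must be 1 or greater")
    for number, msg in enumerate(messages, start=1):
        if number >= start:
            yield number, msg
```

If the script printed the message number alongside each subject, the last number shown before a failure would be the value to pass as `start` on the next run.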

@rgladwell
Owner

> I'm afraid it didn't actually print an error to the console, it just stopped printing any output to the console after that character in the subject line.

The script might successfully terminate without logging a message to the console. Can you confirm only a subset of your emails were uploaded?

Also, if you're using Google Takeout, have you tried using the --google-takeout-* arguments?

@benfrancis
Author

> The script might successfully terminate without logging a message to the console. Can you confirm only a subset of your emails were uploaded?

Yes, it terminated after about 4,000 of 23,000 emails had been uploaded.

> Also, if you're using Google Takeout, have you tried using the --google-takeout-* arguments?

No, I didn't use that option because I didn't need to preserve labels for this particular upload. I may need it for future uploads though, so I will try it next time, thanks.

@rgladwell
Owner

Could issue #49 be the cause of the freezing?

@benfrancis
Author

I think it's unlikely: the .mbox file was about 600 MB and the PC the script was running on has 16 GB of RAM. It would also be a bit of a coincidence that it appeared to stop at unusual characters every time.

@rgladwell
Owner

You could try running the script in the new dry-run mode, and capturing all the output.

@adriangibanelbtactic
Contributor

Our newest branch, https://github.com/btactic/imap-upload/tree/google_takeout_codepages_fixes_v1, which we will merge soon thanks to #54, handles incorrectly encoded subject lines better.

Before this improvement, I never saw the program terminate because of a problem with the subject encoding. In my tests, the only effect was that the status line for that particular email was not written; the email was skipped and the next email was processed.
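
To illustrate the general idea of tolerating a badly encoded subject (this is a generic sketch using the standard `email.header` module, not the code in the branch, and the function name is hypothetical):

```python
from email.errors import HeaderParseError
from email.header import decode_header


def decode_subject(raw_subject):
    """Decode a Subject header to text without raising on bad encodings."""
    text = str(raw_subject)
    try:
        chunks = decode_header(text)
    except HeaderParseError:
        return text  # malformed encoded word: fall back to the raw header
    parts = []
    for chunk, charset in chunks:
        if isinstance(chunk, bytes):
            try:
                # Invalid byte sequences become U+FFFD instead of raising.
                parts.append(chunk.decode(charset or "ascii", errors="replace"))
            except LookupError:
                # Unknown charset name: Latin-1 can decode any byte.
                parts.append(chunk.decode("latin-1"))
        else:
            parts.append(chunk)
    return "".join(parts)
```

With something like this, a subject that cannot be decoded cleanly still yields a printable string, so the status line can be written rather than dropped.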

So... why don't you give it a go with the google_takeout_codepages_fixes_v1 branch and give us feedback?
