Freezes on ð character in subject line #48
Comments
I tried cleaning up some of the spam emails and re-exporting. This time it hung on a "â" character.
Sounds like a character encoding issue. Do you have the exact error message? Including the line of code throwing the error?
I believe so. You might be able to recover by searching for emails added on a specific date. Also, you could check whether Gmail has added any specific labels to uploaded emails. If this doesn't work, maybe we could add a feature to delete emails that exist in a specific mbox.
I'm afraid it didn't actually print an error to the console, it just stopped printing any output to the console after that character in the subject line. It might be possible to reproduce by exporting an .mbox file with an email containing a ð or â character in its subject line. I think that was a real subject line designed to evade spam filters, not garbled output caused by your script. But I agree it seems like a character encoding issue in that something is crashing on certain UTF-8 characters.
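For anyone trying to reproduce this, a small test .mbox can be generated with Python's standard `mailbox` module. This is only a sketch: the helper name `write_test_mbox`, the addresses, and the choice to store the subject as raw, unencoded UTF-8 bytes are all assumptions, since the thread does not show what the offending message looked like on disk.

```python
import mailbox


def write_test_mbox(path, subjects):
    """Write one tiny message per subject into an mbox file at `path`.

    Hypothetical helper for reproduction only. Subjects are written as raw
    UTF-8 bytes (no RFC 2047 encoding), which is an assumption about how the
    problematic spam message was stored.
    """
    box = mailbox.mbox(path)  # creates the file if it does not exist
    try:
        for i, subject in enumerate(subjects):
            raw = (
                "From: sender@example.com\n"
                "To: recipient@example.com\n"
                f"Message-ID: <test-{i}@example.com>\n"
                f"Subject: {subject}\n"
                "\n"
                f"Test body {i}\n"
            ).encode("utf-8")
            box.add(raw)  # bytes are accepted; a default "From " line is added
    finally:
        box.close()  # close() flushes pending writes
```

Calling `write_test_mbox("repro.mbox", ["plain subject", "FW: test \U0001F449 123"])` gives a two-message file that can be fed to the upload script to see whether the second message triggers the hang.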
In the end I fixed the problem by re-exporting the .mbox without the offending emails, but it took a while to get rid of them all and I had to delete several thousand emails each time I ran the script to avoid duplication. Fortunately the uploaded emails were labelled as "imported" by Gmail, which made that easy to do. It would be useful if the script could de-duplicate emails when uploading, but I don't know how hard that is or how it would affect performance. I've since discovered that Google have a couple of tools for this called mail importer and import-mailbox-to-gmail. They both look harder to use than your script, but the former features de-duplication and the latter has a
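As a rough idea of what de-duplication could look like, here is a sketch that skips messages whose `Message-ID` has already been seen. It is not how imap_upload.py or either Google tool works; the function name `iter_new_messages`, the `already_uploaded_ids` parameter, and the choice of `Message-ID` as the key are assumptions made for illustration.

```python
import mailbox


def iter_new_messages(mbox_path, already_uploaded_ids=()):
    """Yield messages from an mbox, skipping duplicate Message-IDs.

    Hypothetical sketch: `already_uploaded_ids` would have to come from
    somewhere else (for example a log kept by a previous run). Messages
    without a Message-ID header are always yielded, because there is
    nothing to compare them by.
    """
    seen = set(already_uploaded_ids)
    box = mailbox.mbox(mbox_path)
    try:
        for msg in box:
            msg_id = msg.get("Message-ID")
            if msg_id is not None:
                msg_id = str(msg_id).strip()
                if msg_id in seen:
                    continue  # uploaded before, or repeated in this file
                seen.add(msg_id)
            yield msg
    finally:
        box.close()
```

Keeping only a set of IDs in memory should be cheap even for tens of thousands of emails, though a real implementation would also need a way to learn which IDs are already on the server.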
The script might successfully terminate without logging a message to the console. Can you confirm only a subset of your emails were uploaded? Also, if you're using Google Takeout, have you tried using the
Yes, it terminated after about 4,000 of 23,000 emails had been uploaded.
No, I didn't use that option because I didn't need to preserve labels for this particular upload. I may need that for future uploads though, so I will try it next time, thanks.
Could this issue be the cause of the freezing: #49?
I think it's unlikely: the .mbox file was about 600MB and the PC the script was running on has 16GB RAM. It would also be a bit of a coincidence that it appeared to stop at unusual characters every time.
You could try running the script in the new dry-run mode, and capturing all the output.
Our newest branch, https://github.com/btactic/imap-upload/tree/google_takeout_codepages_fixes_v1, which we will merge soon thanks to #54, deals better with wrong encoding in subject lines. Before this improvement, I never saw the program exit because of a problem with the subject encoding. The only thing that happened in my tests was that the status line for that particular email was not written and the email was skipped; the next email was then processed. So... why don't you give it a go with the
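To illustrate the general idea of tolerating badly encoded subject lines (this is not the code from that branch), here is a sketch using the standard `email.header` module. The function name `safe_subject` and the fallback choices are assumptions.

```python
from email.errors import HeaderParseError
from email.header import decode_header


def safe_subject(raw_subject):
    """Decode a Subject header for display without ever raising.

    Hypothetical helper: unknown charsets fall back to UTF-8 and
    undecodable bytes are replaced with U+FFFD.
    """
    if raw_subject is None:
        return ""
    text = str(raw_subject)
    try:
        chunks = decode_header(text)
    except HeaderParseError:
        return text  # malformed encoded-word: show the raw header instead
    parts = []
    for chunk, charset in chunks:
        if isinstance(chunk, bytes):
            try:
                parts.append(chunk.decode(charset or "utf-8", errors="replace"))
            except LookupError:
                # charset name not known to Python
                parts.append(chunk.decode("utf-8", errors="replace"))
        else:
            parts.append(chunk)  # header contained no encoded-words
    return "".join(parts)
```

With something like this, a status line can still be printed for a message whose subject is mis-encoded, instead of skipping the line or failing on it.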
Thanks for this tool!
I just ran the script with the following command, on an .mbox file from Google Takeout containing approximately 7,000 emails:
$ python3 imap_upload.py --gmail --box imported takeout.mbox
It seems to have got stuck on an email with a subject line containing an ð ("eth") character. The full subject line is "FW: Youtube Job Wants You 👉 $20K/Month Potential! 80272150" (yes, it appears to be a spam message).
Is there anything I can do to recover from this? If I run the script a second time, will it upload duplicate emails?
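One possible explanation for seeing "ð" where the subject actually contains 👉: the emoji's UTF-8 encoding starts with the byte 0xF0, which is "ð" when read as Latin-1. Similarly, "â" is 0xE2, the first byte of many three-byte UTF-8 sequences such as "’". The snippet below only demonstrates that byte-level relationship; the helper name is hypothetical, and whether this is what the console was doing here is a guess.

```python
def first_byte_as_latin1(text):
    """Return the first UTF-8 byte of `text`, interpreted as Latin-1."""
    return text.encode("utf-8")[:1].decode("latin-1")


pointing = "\U0001F449"  # the 👉 emoji from the subject line
print(pointing.encode("utf-8"))          # the four UTF-8 bytes of the emoji
print(first_byte_as_latin1(pointing))    # first byte read as Latin-1
print(first_byte_as_latin1("\u2019"))    # same for a curly apostrophe
```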