Sometimes the beginning of an email is not recognised in getMessagePositions() in /fortuna/mstor/data/MboxFile.java. #27

jfarwer · 2022-01-03T19:34:09Z

Hi,

I think there is a slight issue in getMessagePositions() in /fortuna/mstor/data/MboxFile.java.

An Mbox file is split into managable chunks of DEFAULT_BUFFER_SIZE. For each chunk a search is done for the pattern for the beginning of an email: (\\A|\\n{2}|(\\r\\n){2})^From .*$).

When creating the chunks, the last seven characters (FROM__PREFIX.length() + 2) from each chunk are added to the beginning of the next chunk. This is to avoid losing a From_ match due to splitting through the pattern when creating the chunks. Seven characters cater for any split within \n\nFrom_.

However when \r\n\r\nFrom_ is split between the 'm' and the blank then we are left with \n\r\nFrom_. The pattern \\A|\\n{2}|(\\r\\n){2})^From .*$ doesn't match this and the beginning of the email is skipped. Maybe adding \n\r\nFrom_ to the pattern would fix it? This would also cater for the case where an Mbox file contains mixed newline styles (and possibly \n\r\nFrom_ at the beginning of an email).

The text was updated successfully, but these errors were encountered:

benfortuna · 2022-01-04T00:21:00Z

Thanks for the feedback. I'll review the pattern and try to create a test to reproduce this issue.

jfarwer · 2022-01-04T14:20:50Z

Thanks for the quick response. We are working with sets of a few hundred thousand emails; it would be great to have it fixed. Unfortunately, I can't give you my files for testing for privacy reasons.

benfortuna · 2022-01-06T01:25:18Z

I had a closer look and there is already support for overlapping buffers to prevent loss of the From_ header.

However there was a bug that it wasn't accounting for enough overlap bytes to include the whitespace. I've applied a fix and released as v1.0.1.

As this is the first release in quite a while, there may be some unexpected behaviour changes, but if you can try and provide feedback I'll address as needed.

jfarwer · 2022-01-06T09:49:28Z

Great, that was quick. I will give feedback to you when I have done some tests.

jfarwer · 2022-01-07T11:35:45Z

That is looking good. In mstor-1.0.1.pom I had to change the occurrences of <scope>runtime</scope> to <scope>compile</scope>, but that is just because my code uses the dependencies itself without having them in its own pom file.

For using the MStorStore class I had to change net.fortuna.mstor.MStorStore to net.fortuna.mstor.model.MStorStore

I will let you know if I experience anything unexpected.

jfarwer · 2022-01-07T18:57:22Z

Hi,

Sorry for being a pain. I think the pattern (\\A|\\n{2}|(\\r\\n){2})^From .*$
looks for either the beginning of the string or two Unix newlines or two Windows newlines. I guess the beginning of the string is supposed to cater for the beginning of the Mbox file where there are no newlines in front of the From_. However, reading is done in chunks of 8192 characters, and if a chunk starts with From_, that matches the pattern. I have a large Mbox file with a From_ in a line of email text which is taken as the start of an email. I think this is not a new issue. It didn't crop up before because a different overlap was used, meaning the beginning of the chunks was different, and it didn't happen in the Mbox file I was looking at.

stleusc · 2022-01-21T23:10:31Z

I am using the current version and I get this issue:
One of the message objects has null ID, null Subject, etc.
Enabled logging and found this:

[main] DEBUG net.fortuna.mstor.data.MboxFile - Found match at [16633060]
[main] DEBUG net.fortuna.mstor.data.MboxFile - Found match at [16644222]
[main] DEBUG net.fortuna.mstor.data.MboxFile - Found match at [16644222]
[main] DEBUG net.fortuna.mstor.data.MboxFile - Found match at [16650269]

see the duplicate index?! For some reason the same index is used twice and the second time the message object is broken.
The second should not exist at all...
Pretty sure related to this...

benfortuna · 2022-01-21T23:19:32Z

Ok that is starting to make sense. I guess if two concurrent messages are less than 8192 bytes then you could potentially have the same message in consecutive buffers. I'll add a fix to ensure duplicate indexes aren't added.

stleusc · 2022-01-21T23:27:28Z

I just debugged it a bit....
The first time it finds the message is at the end of the buffer and the buffer ends in: \r\n\r\nFrom_
It ends with and EXACT hit. Then the next loop you add that EXACT same text to the beginning of the next buffer which results in another hit.

I believe you detect the length of msg by looking at next or prior offset from posList and take the diff. That means for one message the difference is simply 0! This should be the case for the FIRST 16644222 since the next starts at the same index the length for this object is 0.

That means, once you fix the issue of the double match the other issue cannot happen anymore.
I would not even consider it a bug seeing what I see.

So, fixing the issue with match at the very end is all I would need.

stleusc · 2022-01-21T23:31:43Z

I think I might even have a possible fix...

Current code

private static final Pattern FROM__LINE_PATTERN = Pattern.compile("(\\A|\\n{2}|(\\r\\n){2})^From .*$", 8);

Proposal

private static final Pattern FROM__LINE_PATTERN = Pattern.compile("(\\A|\\n{2}|(\\r\\n){2})^From .+$", 8);

Your pattern matches \r\n\r\nFrom_ with NOTHING more. But since you add that exact pattern for another round you should not match it here. This can be fixed by requiring at LEAST 1 more character. That way, you NEVER can have a double hit and the above case would NOT match at the end of the buffer but will be caught the next round.

stleusc · 2022-01-21T23:35:28Z

With my change:
[main] DEBUG net.fortuna.mstor.data.MboxFile - Found match at [16633060]
[main] DEBUG net.fortuna.mstor.data.MboxFile - Found match at [16644222]
[main] DEBUG net.fortuna.mstor.data.MboxFile - Found match at [16650269]

:-)

benfortuna · 2022-01-22T00:15:08Z

Yup I will add this change. In the meantime i pushed a crude fix to just check the last message position to avoid adding it to the index a second time. Release is just pushed should be on maven central shortly.

benfortuna self-assigned this Jan 4, 2022

benfortuna closed this as completed Jan 6, 2022

benfortuna reopened this Jan 20, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sometimes the beginning of an email is not recognised in getMessagePositions() in /fortuna/mstor/data/MboxFile.java. #27

Sometimes the beginning of an email is not recognised in getMessagePositions() in /fortuna/mstor/data/MboxFile.java. #27

jfarwer commented Jan 3, 2022

benfortuna commented Jan 4, 2022

jfarwer commented Jan 4, 2022

benfortuna commented Jan 6, 2022

jfarwer commented Jan 6, 2022

jfarwer commented Jan 7, 2022

jfarwer commented Jan 7, 2022

stleusc commented Jan 21, 2022

benfortuna commented Jan 21, 2022

stleusc commented Jan 21, 2022

stleusc commented Jan 21, 2022

stleusc commented Jan 21, 2022

benfortuna commented Jan 22, 2022

Sometimes the beginning of an email is not recognised in getMessagePositions() in /fortuna/mstor/data/MboxFile.java. #27

Sometimes the beginning of an email is not recognised in getMessagePositions() in /fortuna/mstor/data/MboxFile.java. #27

Comments

jfarwer commented Jan 3, 2022

benfortuna commented Jan 4, 2022

jfarwer commented Jan 4, 2022

benfortuna commented Jan 6, 2022

jfarwer commented Jan 6, 2022

jfarwer commented Jan 7, 2022

jfarwer commented Jan 7, 2022

stleusc commented Jan 21, 2022

benfortuna commented Jan 21, 2022

stleusc commented Jan 21, 2022

stleusc commented Jan 21, 2022

stleusc commented Jan 21, 2022

benfortuna commented Jan 22, 2022