-
-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sometimes the beginning of an email is not recognised in getMessagePositions() in /fortuna/mstor/data/MboxFile.java. #27
Comments
Thanks for the feedback. I'll review the pattern and try to create a test to reproduce this issue. |
Thanks for the quick response. We are working with sets of a few hundred thousand emails; it would be great to have it fixed. Unfortunately, I can't give you my files for testing for privacy reasons. |
I had a closer look and there is already support for overlapping buffers to prevent loss of the However there was a bug that it wasn't accounting for enough overlap bytes to include the whitespace. I've applied a fix and released as v1.0.1. As this is the first release in quite a while, there may be some unexpected behaviour changes, but if you can try and provide feedback I'll address as needed. |
Great, that was quick. I will give feedback to you when I have done some tests. |
That is looking good. In mstor-1.0.1.pom I had to change the occurrences of For using the MStorStore class I had to change net.fortuna.mstor.MStorStore to net.fortuna.mstor.model.MStorStore I will let you know if I experience anything unexpected. |
Hi, Sorry for being a pain. I think the pattern |
I am using the current version and I get this issue: [main] DEBUG net.fortuna.mstor.data.MboxFile - Found match at [16633060] see the duplicate index?! For some reason the same index is used twice and the second time the message object is broken. |
Ok that is starting to make sense. I guess if two concurrent messages are less than 8192 bytes then you could potentially have the same message in consecutive buffers. I'll add a fix to ensure duplicate indexes aren't added. |
I just debugged it a bit.... I believe you detect the length of msg by looking at next or prior offset from posList and take the diff. That means for one message the difference is simply 0! This should be the case for the FIRST 16644222 since the next starts at the same index the length for this object is 0. That means, once you fix the issue of the double match the other issue cannot happen anymore. So, fixing the issue with match at the very end is all I would need. |
I think I might even have a possible fix... Current code private static final Pattern FROM__LINE_PATTERN = Pattern.compile("(\\A|\\n{2}|(\\r\\n){2})^From .*$", 8); Proposal private static final Pattern FROM__LINE_PATTERN = Pattern.compile("(\\A|\\n{2}|(\\r\\n){2})^From .+$", 8); Your pattern matches \r\n\r\nFrom_ with NOTHING more. But since you add that exact pattern for another round you should not match it here. This can be fixed by requiring at LEAST 1 more character. That way, you NEVER can have a double hit and the above case would NOT match at the end of the buffer but will be caught the next round. |
With my change: :-) |
Yup I will add this change. In the meantime i pushed a crude fix to just check the last message position to avoid adding it to the index a second time. Release is just pushed should be on maven central shortly. |
Hi,
I think there is a slight issue in getMessagePositions() in /fortuna/mstor/data/MboxFile.java.
An Mbox file is split into managable chunks of
DEFAULT_BUFFER_SIZE
. For each chunk a search is done for the pattern for the beginning of an email:(\\A|\\n{2}|(\\r\\n){2})^From .*$).
When creating the chunks, the last seven characters (
FROM__PREFIX.length() + 2
) from each chunk are added to the beginning of the next chunk. This is to avoid losing aFrom_
match due to splitting through the pattern when creating the chunks. Seven characters cater for any split within\n\nFrom_.
However when
\r\n\r\nFrom_
is split between the 'm' and the blank then we are left with\n\r\nFrom_
. The pattern\\A|\\n{2}|(\\r\\n){2})^From .*$
doesn't match this and the beginning of the email is skipped. Maybe adding\n\r\nFrom_
to the pattern would fix it? This would also cater for the case where an Mbox file contains mixed newline styles (and possibly\n\r\nFrom_
at the beginning of an email).The text was updated successfully, but these errors were encountered: