Incorrect validation for a variety of emails #22
Thanks Rohan -- first, I've been out of the email world for a long time, so I'm probably stuck in 2822 and haven't paid attention to the new RFCs, so apologies in advance if any of this is wrong. Nice to see someone taking email validation seriously! Your results do seem to show a number of problems with our library. I'm not involved in maintaining it these days, so maybe @bbottema will have time to take a look at some of it. Some things I noticed or wondered about in the list of demo addresses:
Did you implement with a lexer? Thanks for reaching out.
@chconnor Thanks for taking a look! I'll try to answer each of these:
That surprised me as well, seems like it should have been caught.
Not sure if this was defined in RFC 2822, but in RFC 5321, the local part cannot be more than 64 octets (this email is 66).
Definitely makes sense, and strictly enforcing any of these email address RFCs is hard!
This may be true. My library failed it because the IPv4 address was invalid, specified in RFC 5321 here. However, I wasn't able to find anything explicitly denying or allowing anything within brackets in the domain. I'll have to dig into that further.
I believe this is also because of RFC 5321, restricting the size of the domain to 255 octets.
Yes, again with the octet limit in RFC 5321 each domain label can only be 63 octets.
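The octet limits cited in the replies above can be sketched as a standalone check. This is only an illustration, not how JMail or this library implements it: the class name `LengthLimits` and its methods are hypothetical, and splitting on the last `@` is a simplifying assumption that ignores comments and other edge cases.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not from JMail or this library): checks size limits only. */
final class LengthLimits {
    static final int MAX_LOCAL_PART_OCTETS = 64; // RFC 5321, section 4.5.3.1.1
    static final int MAX_DOMAIN_OCTETS = 255;    // RFC 5321, section 4.5.3.1.2
    static final int MAX_LABEL_OCTETS = 63;      // DNS label limit, RFC 1035, section 2.3.4

    private LengthLimits() {}

    /** Returns true if the local part, the domain, and every domain label fit the limits. */
    static boolean withinLimits(String address) {
        // Assumption: the last '@' separates local part and domain.
        int at = address.lastIndexOf('@');
        if (at < 0) {
            return false;
        }
        String local = address.substring(0, at);
        String domain = address.substring(at + 1);
        if (octets(local) > MAX_LOCAL_PART_OCTETS || octets(domain) > MAX_DOMAIN_OCTETS) {
            return false;
        }
        // The -1 limit keeps trailing empty labels instead of silently dropping them.
        for (String label : domain.split("\\.", -1)) {
            if (octets(label) > MAX_LABEL_OCTETS) {
                return false;
            }
        }
        return true;
    }

    /** Limits are expressed in octets, so measure encoded bytes rather than chars. */
    static int octets(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

A 66-octet local part, like the demo address discussed above, would fail the first check. This sketch says nothing about whether the characters themselves are valid.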
Thanks, after looking into it I think you're actually right and I should fix this.
That makes sense as to the difference there. Thanks!
Got it! I agree that nobody should want that :)
Comments should be allowed outside of quoted strings. See RFC 2822: https://datatracker.ietf.org/doc/html/rfc2822#section-3.2.3 Excerpt:
In RFC 5322 there are clear examples of comments allowed outside of quotes.
Yes, that definitely explains the difference in those cases. With internationalized email addresses, non-ASCII characters can be allowed. See RFC 6530.
👍🏽
Yes, I did! Regex for email address validation seemed really ugly and difficult to understand for me, so I wanted to give it a shot. I essentially iterate through the email address once, invalidating if I encounter something invalid, and building a meaningful Also, going through this made me realize I want to add descriptions for each of these demo addresses on the website so it's easier to figure out why something should be valid/invalid 🙂
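To make the single-pass idea concrete, here is a deliberately small sketch. It is not JMail's implementation: `SinglePassScanner` and `Parts` are hypothetical names, and it assumes only dot-atom local parts and plain hostname domains, so quoted strings, comments, domain literals, and internationalized addresses are all rejected.

```java
import java.util.Optional;

/** Hypothetical single-pass scanner; covers only dot-atom@hostname addresses. */
final class SinglePassScanner {
    /** The two pieces collected during the scan. */
    static final class Parts {
        final String localPart;
        final String domain;

        Parts(String localPart, String domain) {
            this.localPart = localPart;
            this.domain = domain;
        }
    }

    // The non-alphanumeric "atext" characters from RFC 5322, section 3.2.3.
    private static final String ATEXT_SYMBOLS = "!#$%&'*+-/=?^_`{|}~";

    private SinglePassScanner() {}

    /** Walks the input once, returning empty as soon as something invalid appears. */
    static Optional<Parts> scan(String input) {
        StringBuilder local = new StringBuilder();
        StringBuilder domain = new StringBuilder();
        boolean seenAt = false;
        char prev = 0; // previous character; 0 before the first one
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (!seenAt) {
                if (c == '@') {
                    // The local part may not be empty or end with a dot.
                    if (local.length() == 0 || prev == '.') return Optional.empty();
                    seenAt = true;
                } else if (c == '.') {
                    // No leading dot and no consecutive dots.
                    if (local.length() == 0 || prev == '.') return Optional.empty();
                    local.append(c);
                } else if (isAtext(c)) {
                    local.append(c);
                } else {
                    return Optional.empty();
                }
            } else if (c == '.') {
                // A label may not be empty or end with a hyphen.
                if (domain.length() == 0 || prev == '.' || prev == '-') return Optional.empty();
                domain.append(c);
            } else if (c == '-') {
                // A label may not start with a hyphen.
                if (domain.length() == 0 || prev == '.') return Optional.empty();
                domain.append(c);
            } else if (isAsciiLetterOrDigit(c)) {
                domain.append(c);
            } else {
                return Optional.empty();
            }
            prev = c;
        }
        // The address needs an '@' and a domain that does not end with '.' or '-'.
        if (!seenAt || domain.length() == 0 || prev == '.' || prev == '-') {
            return Optional.empty();
        }
        return Optional.of(new Parts(local.toString(), domain.toString()));
    }

    private static boolean isAsciiLetterOrDigit(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    private static boolean isAtext(char c) {
        return isAsciiLetterOrDigit(c) || ATEXT_SYMBOLS.indexOf(c) >= 0;
    }
}
```

Because every rule is a plain branch on the current and previous character, each rejection can be stepped through in a debugger, which is the readability advantage over a large regex that this thread points to.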
Ah yeah, that looks to be a new (and welcome) restriction with the later RFCs. AFAIK 2822 only restricts line length (e.g. for email headers) so you could argue that, in practice, email addresses had to be under ~990 characters, but we didn't implement that, AFAIK.
(EDIT: see following comment first.) I don't think they are allowed in local-part, though, even accounting for the obs-local-part? I.e. I think you have to do "blah(comment)blah"@blah.com. You can do (comment)blah@blah.com or (comment)blah(comment)@blah.com, but I think if local-part itself has parens in it it has to be quoted or escaped?
(...just noting that all those examples have comments outside the local-part.)
Yeah the regex is definitely a grind. +1 for a lexer. Glad someone finally did it. :-)
Yeah, given that 99.999999% of people have no earthly idea the crazy stuff that is allowed by email specs I think documentation is always a good idea. :-)
Oops, just noticing that your demo emails don't have comments in the local-part! Sorry.
@bbottema Could this line be why "John Smith" matched? My Java regex chops are long gone, but does that line allow for matching mailboxName with an empty uniqueAddrSpec? I can't otherwise see how "John Smith" would validate as an address, since afaict we always require (correctly) at least the @ to be present.
I'd have to do some debugging to verify that. Maybe I'll get some time to spend on it this week.
Sorry, I actually just took a further look on "John Smith" and it looks like a problem with how my website displays results. The actual address being tested is Sorry for the false alarm on that one! |
Oh, good. I was wondering why javax was validating it also. :-) Looks like you have a great library developing. There have been a few alternative libraries developed over the years but this is the first one that really seems to go the distance on the RFCs and which I would consider using instead of our own. Nice work. @bbottema we might need to finally remove the "The world's only more-or-less-2822-compliant Java-based email address extractor / verifier" tagline. :-) And maybe we could add a "see also" link or something. |
I would even consider archiving this project, as a lexer is a far better way of doing things (including documentation and debugging). By a landslide. Also, this particular library not only seems to be more correct than our own validations, but is more up to date with newer RFCs as well. I'd say jmail supersedes this library on all counts. I would like to do a performance comparison though, but I suspect a lexer to be a lot faster. I think the artifact footprint is about the same, both having no 3rd party dependencies.
Thank you both for the kind words! In terms of performance, JMail is on average almost 3x faster than
To calculate those, I run validate on each email address 100 times, averaging all of the times together.

    // test 100 times to get an avg
    int repetitions = 100;
    List<Long> times = new ArrayList<>(repetitions);
    for (int i = 0; i < repetitions; i++) {
        long start = System.nanoTime();
        predicate.test(email); // this runs the validate method
        long end = System.nanoTime();
        times.add(end - start);
    }
    double avg = times.stream().mapToDouble(n -> n).average().orElse(0.0);
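For anyone who wants to repeat the comparison, the snippet above can be wrapped into a self-contained helper. This is a sketch under assumptions, not the code behind the published numbers: `ValidationTimer` and `averageNanos` are hypothetical names, and the warm-up loop is an addition of mine so that JIT compilation is less likely to dominate the first measurements.

```java
import java.util.function.Predicate;

/** Hypothetical timing helper; a rough measurement, not a rigorous benchmark. */
final class ValidationTimer {
    // Written on every call so the validator's result is never unused.
    static volatile boolean sink;

    private ValidationTimer() {}

    /**
     * Runs the validator `warmups` times untimed, then `repetitions` times timed,
     * and returns the mean duration of the timed calls in nanoseconds.
     */
    static double averageNanos(Predicate<String> validator, String email,
                               int warmups, int repetitions) {
        if (repetitions <= 0) {
            throw new IllegalArgumentException("repetitions must be positive");
        }
        // Assumption: untimed warm-up calls give the JIT a chance to compile the hot path.
        for (int i = 0; i < warmups; i++) {
            sink = validator.test(email);
        }
        long total = 0;
        for (int i = 0; i < repetitions; i++) {
            long start = System.nanoTime();
            sink = validator.test(email);
            total += System.nanoTime() - start;
        }
        return (double) total / repetitions;
    }
}
```

Any `Predicate<String>` can be passed in, for example a lambda that calls a library's validate method. Timings taken with `System.nanoTime()` around a single call are noisy, so a harness such as JMH is probably the better choice when the exact ratio matters.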
Quick update that I published version 1.2.0 of JMail which fixes issues that came up during this discussion - emails with quoted identifiers are now correctly handled, as well as empty quoted strings in the local-part.
Hello! I wrote my own Java email address validation library, JMail, and I wanted to compare it to other implementations (including this one) to see how it stood up.
I tested a wide variety of email addresses using this library with the following line of code:
During this comparison I found some correctness issues with this library, as you can see on the table at this website: https://www.rohannagar.com/jmail/
For example, "Joe Smith" is considered valid with this library. Similarly, domain parts that start or end with the "-" character are considered valid when they should not be.
I wanted to bring this to your attention so you could potentially make any fixes or improvements, or maybe you can point out something wrong in my interpretation of the RFCs 🙂