Normalization differences between IDNA::Native and IDNA::Pure #408
Comments
These should definitely be made consistent. Since we likely can't easily influence the behavior of |
@sporkmonger @dentarg gentle ping. Have you given any more thought to this? If you're interested in a PR to resolve the discrepancies we found between |
@brasic I interpret the answer from @sporkmonger that a PR resolving this would be accepted! I agree with it being a major version bump when released. |
irb(main):004:0> s1 = "https://l♥️h.ws"
=> "https://l♥️h.ws"
irb(main):005:0> s2 = "https://l♥h.ws"
=> "https://l♥h.ws"
irb(main):006:0> Addressable::URI.parse(s1).normalize
=> #<Addressable::URI:0x243d8 URI:https://xn--lh-t0xz926h.ws/>
irb(main):007:0> Addressable::URI.parse(s2).normalize
=> #<Addressable::URI:0x25b5c URI:https://xn--lh-t0x.ws/>
irb(main):008:0> s1.codepoints
=> [104, 116, 116, 112, 115, 58, 47, 47, 108, 9829, 65039, 104, 46, 119, 115]
irb(main):009:0> s2.codepoints
=> [104, 116, 116, 112, 115, 58, 47, 47, 108, 9829, 104, 46, 119, 115] These two strings may look alike depending on your OS/browser, but as we can see there's a difference: But in browsers both URLs load https://xn--lh-t0x.ws/ (the modifier is ignored), so I tried with IDNA::Native too ( irb(main):004:0> Addressable::URI.parse(s1).normalize
=> #<Addressable::URI:0x6130 URI:https://xn--lh-t0x.ws/ This behavior is apparently indicated in the official IDNA mapping table: https://www.unicode.org/Public/idna/15.0.0/IdnaMappingTable.txt at this line:
So from what I read above and my own analysis, I suppose we should fix this in IDNA::Pure, and a PR would be appreciated? Thanks for your help; if you could just confirm the best place to do it, I'll write the PR. PS: 🤔 would it make sense to use the full, official IDNA mapping file in this library? It looks easy to parse, would be very easy to update when new versions are released (latest is v15.1.0), and would greatly help reduce the gap with the Native implementation. I'm not saying we should do it, nor that I want to do it, just asking in case. If you think it is a good idea, I might give it a try too (separately). |
Nice spelunking. That does sound like the solution to me.
I have no idea, but yes, unless @sporkmonger has some old scripts lying around I very much think we need to do it manually. :) It is pretty cool that we have been using this file with serialised data for 10+ years 😄 Here's from when it was still in Ruby (added in 3db3329) (then it was in YAML for a short while before being marshalled) addressable/lib/addressable/idna/pure.rb Lines 316 to 318 in 25c04c9
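A minimal sketch of the "parse the official mapping file" idea, for readers who want to see what it could look like. It assumes the table's `code points ; status ; mapping # comment` layout; the three sample lines are written in that format for illustration and should be checked against the real IdnaMappingTable.txt. The names `parse_mapping_table` and `apply_mapping` are hypothetical and are not part of Addressable. Only the `ignored` and `mapped` statuses are handled, and the linear lookup is for clarity, not speed.

```ruby
# Sample lines in the IdnaMappingTable.txt layout (illustrative only;
# verify against the real file before relying on them).
SAMPLE_TABLE = <<~TABLE
  0041          ; mapped     ; 0061   # LATIN CAPITAL LETTER A
  00AD          ; ignored             # SOFT HYPHEN
  FE00..FE0F    ; ignored             # VARIATION SELECTOR-1..VARIATION SELECTOR-16
TABLE

# Hypothetical helper: turn table text into an array of entries.
def parse_mapping_table(text)
  text.each_line.filter_map do |line|
    data = line.sub(/#.*/, "").strip          # drop the trailing comment
    next if data.empty?

    range, status, mapping = data.split(";").map(&:strip)
    first, last = range.split("..").map { |hex| hex.to_i(16) }
    {
      range: first..(last || first),          # single code point or a range
      status: status,
      mapping: mapping.to_s.split.map { |hex| hex.to_i(16) }
    }
  end
end

# Hypothetical helper: apply "ignored" and "mapped" entries to a string.
# Any other status (or no entry at all) leaves the code point untouched.
def apply_mapping(string, table)
  string.each_codepoint.flat_map do |cp|
    entry = table.find { |e| e[:range].cover?(cp) }
    case entry && entry[:status]
    when "ignored" then []
    when "mapped"  then entry[:mapping]
    else [cp]
    end
  end.pack("U*")
end

table = parse_mapping_table(SAMPLE_TABLE)
with_selector = "l\u2665\uFE0Fh.ws"           # heart followed by U+FE0F
mapped = apply_mapping(with_selector, table)
```

With the variation selector range marked as ignored, `mapped` has the same code points as the string without the selector, which is the behaviour the browsers and IDNA::Native show above.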
I think that would be worthwhile to try. |
Your comment above prompted me to look around some ("isn't someone else doing this?") and I found some libraries I've looked at in the past, but had forgotten about: https://github.com/janlelis/unicode-x#unicodex-15- I'm not saying that addressable should use a third-party for this, but I think it is interesting to peek elsewhere. The author, Jan Lelis, seems to have been doing Unicode things for a great while. I also found his site https://character.construction/emoji-vs-text, which explains emoji and variation selectors. Thought I should share that (for other readers and my future self). |
I suspect we would take a performance hit (a bit slower to load addressable, gem takes up more storage), but I wonder if it would matter these days. |
@dentarg thanks for your input 🙇♂️ So I started adding the spec for my case and then looked at the code to find where to.. OH MY GOD WHAT THE HELL 😱 Anyway so after spending a couple hours reading the same code and the wiki page over and over again, I kind of understand how it works now, but then I noticed there's actually no place in this code where characters can be skipped, and the So I am down the rabbit hole at this point 🐇 > SimpleIDN.to_ascii("l♥️h.ws")
=> "xn--lh-t0x.ws" It's using a compiled dependency on So at this point I am thinking: A. I see a lot of complexity around Unicode normalization and downcasing which appears to be properly supported by native Ruby now. I understand it wasn't the case before, which may explain the history. Shall we try to use the Ruby version now? It'll probably be better now and, more importantly, will keep updating in future versions, as we can see in https://idiosyncratic-ruby.com/73-unicode-version-mapping.html. This is an unrelated problem, but as I dived into it, better to list it now while it's in my head. FTR I had a look at the special case reported by @brasic above with
I am fine with all 3 options, but I don't want to spend too much time on this if the PR is never going to be merged, so I would like your opinion on which one you would consider. And also if you have other suggestions, of course. Sorry for the long message, I knew I shouldn't have looked into this 🤣 but now that I'm in it, I might as well try to help more users. |
Thanks for going down the 🐰 hole @jarthod 😄 . I have not been able to do so myself (but even this short reply took some time to compose, always another thing to look into...) but thought I should reply here sooner rather than later. There's a lot going on here, and I don't have all (or any?) answers. To be honest, I'm not familiar with large parts of the Addressable code base. It is a work of others before me. I'm mainly here tending the garden, or something... :-) I do have opinions though. To keep it short, I like alternatives 1 and 3 more than 2. Looks like What are we trying to solve though? Backing up a bit, I'm wondering about what @sporkmonger wrote at #408 (comment)
Is that possible? Looks like "native" is IDNA 2003 and "pure" is IDNA 2008? #247 (comment) (I'm basing this off an example I just did). I wonder what the intention of "pure" is, is it IDNA 2008? Though I guess those questions don't have that much to do with the things pointed out here. I think it would be great to find a way for "pure" to do the right thing in regards to the emoji variation selector. Re: NUL that @brasic pointed out, wouldn't it be interesting to understand why libidn normalizes differently? irb(main):015:0> IDN::Stringprep.nfkc_normalize(".\u0000.")
=> "." irb(main):001:0> ".\u0000.".unicode_normalize
=> ".\u0000."
irb(main):002:0> ".\u0000.".unicode_normalize :nfc
=> ".\u0000."
irb(main):003:0> ".\u0000.".unicode_normalize :nfd
=> ".\u0000."
irb(main):004:0> ".\u0000.".unicode_normalize :nfkc
=> ".\u0000."
irb(main):005:0> ".\u0000.".unicode_normalize :nfkd
=> ".\u0000." |
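The irb session above can be reproduced as a small script. This is a sketch using only Ruby's built-in `String#unicode_normalize`; the `cafe` string is an illustrative addition to show that normalization still happens around the NUL, and the libidn side is left out because it needs the compiled extension.

```ruby
# String with a NUL code point between two dots, as in the irb session.
input = ".\u0000."

# Run all four normalization forms supported by String#unicode_normalize.
FORMS = %i[nfc nfd nfkc nfkd].freeze
results = FORMS.to_h { |form| [form, input.unicode_normalize(form)] }

# Illustrative check that normalization itself is active:
# "e" + COMBINING ACUTE ACCENT composes to a single code point under NFC.
composed = "cafe\u0301".unicode_normalize(:nfc)

results.each { |form, value| puts "#{form}: #{value.codepoints.inspect}" }
puts "composed: #{composed.codepoints.inspect}"
```

Every form keeps the NUL in place, unlike the `IDN::Stringprep.nfkc_normalize` result shown above.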
I really shouldn't have opened this Pandora's box 😅 Sooooo, as I've spent the weekend reading the UTS#46 specification, I'll try to summarise the situation.
AFAICS the expectation is that while all registrars are upgrading to IDNA2008 and only allow valid hostnames (=approximately forever), browsers and web clients in general are encouraged to use IDNA2008+UTS#46 to widen support. So basically unless you're running a registrar, IDNA2008+UTS#46 is the target. That's why As you said, The "pure" implementation is IDNA2008iiiisssshhhhh, but not compliant of course. As we can see in my example with the emoji modifier, if we compare that to the official Unicode test website): So overall what we are trying to solve here is to go from IDNA2003 and IDNA2008ish to both implementations being IDNA2008+UTS#46 compliant. For native it'll be through libidn2 (#247), and for pure we'll have to rewrite some of it (or bring in a dependency). I do have some good news though: the Unicode team provides some awesome conformance test files with thousands of input strings and the desired output for IDNA2008+UTS#46, for every version of Unicode, example: https://www.unicode.org/Public/idna/15.0.0/IdnaTestV2.txt This means we should be able to easily add a very extensive test suite to make sure our pure implementation is (and stays) compliant. I'll create another ticket to discuss this (edit: #491) actually because I kind of hijacked this one thinking it was the same problem, but it's not really (see next part) Now getting back to the initial
Ruby's implementation is more correct in this regard, and considering it's handling much more recent versions of Unicode, IMO we should switch to using Ruby's normalize instead of this one. Or at least use it for the other parts of the URL (it's the path which gets broken for @brasic, which doesn't even have anything to do with IDNA), and only use native I believe this is also likely to be much faster and more complete than the IDNA::Pure normalize implementation. I can make a first separate PR for this maybe to eliminate one problem, while we discuss the others. |
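To make the conformance-file idea concrete, here is a rough sketch of reading lines in the IdnaTestV2.txt layout. It assumes the semicolon-separated columns `source; toUnicode; toUnicodeStatus; toAsciiN; ...` and the convention that a blank `toUnicode` means "same as source" and a blank `toAsciiN` means "same as toUnicode"; both assumptions should be confirmed against the header of the real file. The two sample lines are illustrative, `IdnaTestCase` and `parse_idna_test_line` are hypothetical names, and the sketch ignores the status columns and the file's `\uXXXX` escapes.

```ruby
# Hypothetical container for the columns this sketch cares about.
IdnaTestCase = Struct.new(:source, :to_unicode, :to_ascii_n, keyword_init: true)

# Hypothetical helper: parse one line of the test file, or return nil
# for blank and comment-only lines.
def parse_idna_test_line(line)
  data = line.sub(/#.*/, "").strip            # drop the trailing comment
  return nil if data.empty?

  source, to_unicode, _status, to_ascii_n = data.split(";", -1).map(&:strip)
  # Assumed convention: blank columns inherit from the column to their left.
  to_unicode = source if to_unicode.nil? || to_unicode.empty?
  to_ascii_n = to_unicode if to_ascii_n.nil? || to_ascii_n.empty?

  IdnaTestCase.new(source: source, to_unicode: to_unicode, to_ascii_n: to_ascii_n)
end

# Illustrative lines in the file's layout (not copied from a specific version).
sample_lines = [
  "# comment-only line",
  "fass.de; ; ; ; ; ;  # fass.de",
  "Faß.de; faß.de; ; xn--fa-hia.de; ; fass.de;  # faß.de"
]

cases = sample_lines.filter_map { |line| parse_idna_test_line(line) }
cases.each { |c| puts "#{c.source} -> #{c.to_ascii_n}" }
```

A spec could then loop over `cases` and compare each `to_ascii_n` with the output of the implementation under test, skipping entries whose status columns report errors.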
Thanks for really going deep on these topics @jarthod 🙏 Re: @brasic are you able to share anything about what you do at GitHub today? Still using Addressable? Still using pure or did you make the switch to native and libidn? Using something else entirely? |
Re: older comments from me and @brasic above about this change being a major version bump
and my reply
Me and @jarthod touched on that in the PR at #492 (comment) and concluded that this is a bug fix, which means it will go out in the next patch version |
Hello and thanks for your work building and maintaining this useful library!
We use Addressable and IDNA::Pure at GitHub for a number of URL parsing and generating tasks. The pure ruby IDN implementation is a bottleneck in some areas (see #407) so we are currently evaluating a switch to libidn via IDNA::Native. Our test suite found a few interesting differences between the two implementations when it comes to path normalization of percent-encoded NUL bytes. Here's an example:
The behavior change is ultimately due to the following lower-level difference:
Unfortunately in our testing it seems browsers are split on which is the right way to deal with NUL bytes. RFC3986 has a discussion of %00 but leaves it up to the application (emphasis mine):
Are you interested in harmonizing this difference in normalization between the two IDNA backends and which do you think is the appropriate behavior?