Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix userinfo in case of some special characters #144

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

kastakhov
Copy link

Fix for #143

}
if ($str =~ m,^((?:$URI::scheme_re:)?)//(.*:.*@)?([^/?\#]*)(.*)$,os) {
my $scheme = $1;
my $ui = $2 || '';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we have a more descriptive name for this variable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this variable was not your creation. :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing, I have renamed to $userinfo.
I decided to rename in whole file as it might be a little confusing when we're using $ui or $userinfo.

lib/URI.pm Outdated
@@ -123,7 +123,7 @@ sub _fix_uric_escape_for_host_part {
return;
}

if ($_[0] =~ m{^((?:$URI::scheme_re:)?)//([^/?\#]+)(.*)$}os) {
if ($_[0] =~ m{^((?:$URI::scheme_re:)?)//((.*:.*@)?[^/?\#]+)(.*)$}os) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a discussion with @genio on IRC:

It would be nice to avoid (.*@) or (.*:.*@) as it's greedy and could be prone to bugs.

We could take inspiration from https://metacpan.org/dist/Mojolicious/source/lib/Mojo/URL.pm#L52-64

After gathering the host and port part, ^([^@]+@) should be fine.

Copy link
Author

@kastakhov kastakhov Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, but what if user wanna use '@' in password, using [^@]+@ will cause that the everything before first '@' symbol will be matched and password parsed incorrectly.

like the similar approach is already used in LWP/Protocol/http.pm.
And I already mentioned what issue it might cause in my PR.

For my standpoint (.*:.*@) is pretty lazy, but okay, I agree that it could be better.

@oalders, what you think about ([^:]+:.*@) for matching user info?

This will match everything except : as username cannot contain colon, then match : as separator, match everything after first colon as password or nothing as according to rfc3986#section-3.2.1 it might be empty, and finally match @ as second separator.

Copy link
Author

@kastakhov kastakhov Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, the regex which Mojo/URL.pm#L52-64 is used also cannot parse username/password if it contains any of /#? symbols :)
I know, it's really rarely usecase, but unfortunately I faced with it already two times :d

For instance, this regex with username and password which contain all special characters in unescaped form.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, but what if user wanna use '@' in password, using [^@]+@ will cause that the everything before first '@' symbol will be matched and password parsed incorrectly.

Then it should be URL-encoded as %40. I'm not sure if there's any way around that. Something has to serve as a stopping point. If the RE is too greedy, it might end up stopping with a @ inside of the path, like http://host/path/directory@foobar/file.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the RE is too greedy, it might end up stopping with a @ inside of the path, like http://host/path/directory@foobar/file.

Hm.
Now I see, it'll match this path incorrectly.

Copy link
Author

@kastakhov kastakhov Aug 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, actually I can revert this change as the most important part is in lib/URI/_server.pm#14.

There are another issue, if password contain multiple '@' symbols before any of '/#?' symbol URI parsing fails again :d

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are another issue, if password contain multiple '@' symbols before any of '/#?' symbol URI parsing fails again :d

If the password was inserted directly into the URL that way without percent-encoding, that's not really the URI module's fault. The password has to be encoded properly first. That's the point of escaping characters.

@oalders
Copy link
Member

oalders commented Aug 20, 2024

Thanks very much for this @kastakhov. If we could tweak the regex in both places as suggested, that would be helpful.

@kastakhov
Copy link
Author

Hi, @oalders, @genio, any another suggestions for these changes or now it looks good for you?

@SineSwiper
Copy link
Contributor

SineSwiper commented Aug 26, 2024

This is very adjacent to this issue, and feel free to tell me to open a different issue/PR, but...

The user/password methods allow bad data to pollute the URL (lines 45 and the similar line 22), because it is only doing a very light set of manual URL-escaping. It should, at a minimum escape potential @ characters in the password ($new =~ s/@/%40/g), and at a maximum, just call uri_escape for both user/password sets.

@oalders
Copy link
Member

oalders commented Aug 26, 2024

Is there a good reason not to just call uri_escape?

@genio
Copy link
Member

genio commented Aug 26, 2024

I still don't think this is a good idea, as you're supposed to URI-escape things. Here's an example with both Mojo and URI:

#!/usr/bin/env perl

use v5.36;
use strict;
use warnings;

use Mojo::URL ();
use Mojo::Util ();
use URI ();
use URI::Escape ();

my $unescaped_u = 'u1!"#$%&\'()*+,-./;<=>?@[\]^_`{|}~';
my $unescaped_p = 'p1!"#$%&\'()*+,-./;<=>?@[\]^_`{|}~';

{
    # mojolicious way:
    my $username = Mojo::Util::url_escape($unescaped_u);
    my $password = Mojo::Util::url_escape($unescaped_p);
    my $foo = "http://${username}:${password}\@example.com/pub/a/2001/08/27/bjornstad.html";
    # say "Mojo: ", $foo;
    my $url = Mojo::URL->new($foo);
    # say "Mojo: ", $url->to_string();
    say "Mojo: ", $url->userinfo();
}

{
    # URI way
    my $username = URI::Escape::uri_escape($unescaped_u);
    my $password = URI::Escape::uri_escape($unescaped_p);
    my $foo = "http://${username}:${password}\@example.com/pub/a/2001/08/27/bjornstad.html";
    # say "URI:  ", $foo;
    my $url = URI->new($foo);
    # say "URI:  ", $url->as_string();
    say "URI:  ", URI::Escape::uri_unescape($url->userinfo());
}

Yielding:

% perl url.pl
Mojo: u1!"#$%&'()*+,-./;<=>?@[\]^_`{|}~:p1!"#$%&'()*+,-./;<=>?@[\]^_`{|}~
URI:  u1!"#$%&'()*+,-./;<=>?@[\]^_`{|}~:p1!"#$%&'()*+,-./;<=>?@[\]^_`{|}~

@SineSwiper
Copy link
Contributor

While I agree everybody should URI-escape the UN/PWs, it should still try to safely parse as much as it can. IMO, [^:]+:[^@]+@ should work to capture userinfo without escaping out to parts that it shouldn't.

Also, I'll just create a different issue for the user/password set problem.

@SineSwiper
Copy link
Contributor

Well, disregard my user/password set problem. I realized that userinfo is running through URI::Escape::escape_char($1) and everything works correctly, as long as it didn't start off confused (ie: URI->new with reserved characters in the password).

@kastakhov
Copy link
Author

kastakhov commented Aug 26, 2024

While I agree everybody should URI-escape the UN/PWs, it should still try to safely parse as much as it can. IMO, [^:]+:[^@]+@ should work to capture userinfo without escaping out to parts that it shouldn't.

Also, I'll just create a different issue for the user/password set problem.

Just in case, originally I faced with this issue here..

make _uric_escape regex more specific for userinfo part
update UT
@kastakhov
Copy link
Author

Now the only one issue still here.... when password contain '@' unescaped symbol before '/#?' symbols, user info part matches incorrectly. 🙃

@genio
Copy link
Member

genio commented Aug 26, 2024

The seemingly relevant RFC pieces are linked here:
https://datatracker.ietf.org/doc/html/rfc3986#section-3.2.1
https://datatracker.ietf.org/doc/html/rfc3986#section-7.3
https://datatracker.ietf.org/doc/html/rfc3986#section-7.5

I think my cursory reading of these and understanding thus far leads me to believe we should leave things as-is and maybe provide documentation for how to provide complex authentication using URI::Escape::uri_escape. I believe making changes to the behavior here could result in unexpected double percent encoding for some folks who are already doing it the way in the sample script above.

@SineSwiper
Copy link
Contributor

SineSwiper commented Aug 27, 2024

I found Section 2.2 earlier, but didn't see where those declarations were used. Good find.

So, this means we should stay away from unnecessarily parsing gen-delims in the userinfo section. At the very least, .* is way too broad.

Now the only one issue still here.... when password contain '@' unescaped symbol before '/#?' symbols, user info part matches incorrectly.

Again, a delimiter has to exist unescaped. There's no such thing as a parser algorithm that can take in every ASCII or Unicode unescaped character as a sub-field of the whole string. At some point, we have to force the user to escape their data before it gets dropped into a URL.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants