Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize URL regular expression matching rules #391

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Optimize URL regular expression matching rules #391

wants to merge 1 commit into from

Conversation

amengGT
Copy link

@amengGT amengGT commented Dec 22, 2020

Solve the problem that URL recognition is not accurate enough.

@willbelr
Copy link
Contributor

willbelr commented Jan 4, 2021

Instead of reinventing the wheel, perhaps we could use Django regex?

        # This URL validator regular expression belongs to Django Software
        # Copyright (c) Django Software Foundation and individual contributors
        # Please read https://github.com/django/django/blob/master/LICENSE
        self.regex = re.compile(
        r"^(?:http|ftp)s?://"  # http:// or https://
        r"(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}(?<!-)\.?)|"  # domain
        r"localhost|"  # localhost
        r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|"  # ipv4
        r"\[?[A-F0-9]*:[A-F0-9:]+\]?)"  # ipv6
        r"(?::\d+)?"  # optional port
        r"(?:/?|[/?]\S+)\Z", re.IGNORECASE)

@yan12125
Copy link
Member

The Django one looks awesome. Could you share which Django source file includes that?

@willbelr
Copy link
Contributor

The Django one looks awesome. Could you share which Django source file includes that?

https://github.com/django/django/blob/stable/1.7.x/django/core/validators.py#L68

This is from an earlier version, they made some refactoring after v1.7.x

@yan12125
Copy link
Member

Thanks for the pointer. Indeed the URLValidator class in Django 3.x looks quite complicated.

Looking into Django's regex, I noticed that it requires the protocol part, while the current regex allows "URLs" like www.google.com. I'm fine with using the version from Django 1.7, but the protocol part should be made optional.

@willbelr
Copy link
Contributor

It seems like we can get away by making the protocol part optional. Ideally, this would require an unittest with proper url test cases. Also it does not support foreign addresses, unfortunately..

import re
import unittest

url_validator = re.compile(
r"^((?:http|ftp)s?://)?"  # http:// or https://
r"(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}(?<!-)\.?)|"  # domain
r"localhost|"  # localhost
r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|"  # ipv4
r"\[?[A-F0-9]*:[A-F0-9:]+\]?)"  # ipv6
r"(?::\d+)?"  # optional port
r"(?:/?|[/?]\S+)\Z", re.IGNORECASE)


class TestDict(unittest.TestCase):
    def test_positives(self):
        positives = (
            "http://google.com",
            "google.com",
        )
        for url in positives:
            test = True if re.findall(url_validator, url) else print(f" {url}")
            self.assertTrue(test)

    def test_negatives(self):
        negatives = (
            "google. com",
        )
        for url in negatives:
            self.assertFalse(re.findall(url_validator, url))


if __name__ == '__main__':
    unittest.main()

@yan12125
Copy link
Member

foreign addresses

Did you mean URLs with non-ASCII characters? That is an issue as the current version seems to support that.

this would require an unittest with proper url test cases

Good idea! Just that it's better to write tests in Qt. I heard there are subtle differences among regular expression implementations.

@willbelr
Copy link
Contributor

willbelr commented Feb 11, 2021

Did you mean URLs with non-ASCII characters? That is an issue as the current version seems to support that.

Yes, I don't think those would match

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants