Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check dns concurrently to speed up lookup #91

Open
nicolasassi opened this issue Jun 16, 2021 · 8 comments
Open

check dns concurrently to speed up lookup #91

nicolasassi opened this issue Jun 16, 2021 · 8 comments

Comments

@nicolasassi
Copy link

Is there any room to check dns concurrently as it might be a time consuming task? I've found some hacky ways to do that, but maybe it could be a flag on find_urls.
I'm aware this would increase the complexity of this function a lot.

@lipoja
Copy link
Owner

lipoja commented Jun 16, 2021

I would like to invite @jayvdb to this conversation, since he is the one who implemented DNS checking.

@nicolasassi
Copy link
Author

nicolasassi commented Jun 16, 2021

for reference, this is the hacky way I was refering to:

from urlextract import URLExtract
from app.infrastructure.analyser.finder import Finder
from pebble import concurrent
from concurrent.futures import TimeoutError


class Hyperlinks(Finder):

    def __init__(self):
        self.extractor = URLExtract()

    @concurrent.process(timeout=2)
    def _check_dns(self, text: str):
        return self.extractor.has_urls(text, check_dns=True)

    def find(self, text: str) -> list:
        extractor = URLExtract()
        urls = extractor.find_urls(text, only_unique=True, check_dns=False)
        processes = list()
        for indx, url in enumerate(urls):
            processes.append((indx, self._check_dns(url)))
        for process in processes:
            try:
                if not process[1].result():
                    urls.pop(process[0])
            except TimeoutError:
                urls.pop(process[0])
        return urls

f = Hyperlinks()
f.find("https://github.com/lipoja/URLExtract/issues/91", "https://github.com/lipoja/URLExtract/issues/90")

@jayvdb
Copy link
Contributor

jayvdb commented Jun 16, 2021

Sounds like a great idea, and should complement caching nicely. Also pebble seems like a good choice.

@nicolasassi
Copy link
Author

I was studying the code to find a way to achive concurrency in the dns lookup. But I kind got stuck because I seems that there's a cursor in _complete_url which depends on the result of each interation before going to a new possible URL. So it would need to be a synchronous process. What do you guys think?

@lipoja
Copy link
Owner

lipoja commented Jun 17, 2021

I believe synchronous process for checking domain is already in place. That is the current behavior, right?

I am thinking about keep checking DNS the way it is now for gen_urls(). I mean: when user calls this directly with check_dns enabled. We do not modify this part and every domain is check right after it is discovered (completed).

We could modify find_urls(). When user call this method it would collect all URLs first and do the concurrent DNS check as a last step of this method (just before return). At that point we should have all URLs ready for this kind of check.

@lipoja
Copy link
Owner

lipoja commented Jun 17, 2021

Second thing is that this brings complexity to user settings. We should allow users to set maximum number of concurrent processes/threads. Probably also set of timeout value. And maybe other things that I am not aware of right now.

@nicolasassi
Copy link
Author

Great! I was thinking of adding concurrency in the gen_urls() and give user option from the functions above it.
Achieve concurrency in the way you metioned would be a lot easier. Gonna work on that.

@lipoja
Copy link
Owner

lipoja commented Jun 18, 2021

OK, could you also think about how to do it so we have the code not tangled. For example: If we can separate all DNS checking parts to its own class/module? This is just and idea I did not look to the code how tricky it will be.
Also it would be nice to give user a method that he/she can call when they wants. For example when they use the gen_urls() or they do some post process filtering of URLs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants