Replies: 6 comments
-
This is really cool. I appreciate empirical approaches to decision making, so thanks for putting this together. I honestly haven't looked into the other projects' methods for query submission (network i/o), but I'd assume they're very similar performance-wise (besides the asyncio stuff here). I'd bet that the majority of the slowness has to do with the text "parsing". For instance, asyncwhois uses But with all that said, I'll definitely see if I can dive a little deeper into your benchmark script, try to identify what's causing the majority of the slowness, and get back to you.
-
I've retried, this time with asyncwhois modified to use async by default, yet it is still slower than whoisdomain. The results are a bit different and a bit faster, but the weird thing is that they differ every time I rerun it:
-
The main reason you're going to get different speeds each time is network i/o. Either the whois server takes slightly longer to respond because it's handling other requests, your IP is rate-limited by the server, or some other network-related jitter skews your results. (Sometimes a country's whois server will go completely offline without warning for maintenance or some other government event.) All of these libraries have to deal with that.

The only advantage asyncwhois gives you is asyncio support: it can take advantage of cooperative multitasking, signal to the rest of your program that it's about to do some i/o, pause its own execution, and let other parts of your program run in the meantime. However, your script performs each query sequentially, so there's really not much difference between the asyncio and normal methods here.

Given this, your benchmark script is still solid; you'll either need to run it multiple times (like 10-100) to get more robust, averaged results, or shift your focus to the memory usage, speed, and quality of each library's parsing.

This library decouples the "query" from the "parsing". That is, the logic for doing the network stuff and the logic for parsing the big text blob from the server are separated. Check out
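
To make the sequential-versus-concurrent point concrete, here is a rough sketch in plain asyncio. It does not call asyncwhois or any real whois server: `fake_whois_query` is a hypothetical stand-in that just sleeps to imitate network latency, and the domain list and repeat count are arbitrary. It also averages several runs, as suggested above, instead of trusting a single timing.

```python
import asyncio
import statistics
import time

DOMAINS = ["example.com", "example.org", "example.net", "example.edu"]


async def fake_whois_query(domain: str, latency: float = 0.05) -> str:
    """Hypothetical stand-in for a whois lookup: sleeps instead of using a socket."""
    await asyncio.sleep(latency)
    return f"Domain Name: {domain.upper()}"


async def run_sequential(domains):
    # One query at a time: total time is roughly the sum of all latencies.
    return [await fake_whois_query(d) for d in domains]


async def run_concurrent(domains):
    # All queries in flight at once: total time is roughly the slowest query.
    return await asyncio.gather(*(fake_whois_query(d) for d in domains))


def benchmark(runner, domains, repeats=5):
    """Time `runner` several times and return (mean, stdev) in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        asyncio.run(runner(domains))
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)


if __name__ == "__main__":
    for name, runner in (("sequential", run_sequential), ("concurrent", run_concurrent)):
        mean, stdev = benchmark(runner, DOMAINS)
        print(f"{name}: mean={mean:.3f}s stdev={stdev:.3f}s")
```

With real servers the spread between runs would be much larger than with a fixed sleep, which is exactly why a mean and standard deviation over many runs say more than one number.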
-
Hello, thanks for your serious and solid answer 👍 Yes, my script is too limited to support firm conclusions, but there are some trends. Plus, the differences in the results themselves are surprising. One mitigation I could suggest for asyncwhois is to add the ability (as an option, like RDAP) to use the whois Unix tool instead of Python networking, for several reasons: native caching, better speed, and less chance of being blocked by fingerprinting since it's seen as a legitimate tool. Thanks for your work!
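
For illustration only, shelling out to the whois tool could look roughly like the sketch below. This is not an existing asyncwhois feature: `build_whois_command` and `query_with_whois_cli` are hypothetical names, the sketch assumes a `whois` binary on the PATH that accepts `-h` to select a server, and it does nothing to demonstrate the caching or fingerprinting claims.

```python
import subprocess


def build_whois_command(domain, server=None):
    """Build the argument list for the system whois client (hypothetical helper)."""
    if domain.startswith("-"):
        # Avoid a "domain" being interpreted as a command-line option.
        raise ValueError("domain must not start with '-'")
    cmd = ["whois"]
    if server:
        cmd += ["-h", server]  # assumes the client supports -h for the server host
    cmd.append(domain)
    return cmd


def query_with_whois_cli(domain, server=None, timeout=10.0, runner=subprocess.run):
    """Return the raw text printed by the whois tool; parsing stays a separate step."""
    completed = runner(
        build_whois_command(domain, server),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,  # some whois clients exit non-zero even when they print data
    )
    return completed.stdout
```

Calling `query_with_whois_cli("example.com")` would return the raw response text, which could then be handed to whatever parser is being benchmarked. The `runner` parameter exists only so the function can be exercised without a real binary or network access.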
-
@baderdean I thought this might interest you. I used an LLM to parse WHOIS data and it does pretty well. It has some drawbacks, like speed, but overall LLMs seem like great candidates for simplifying regex-related problems like WHOIS parsing. source: https://github.com/pogzyb/port43?tab=readme-ov-file#basic-example-whois
-
Thanks for the study! IMHO, an LLM is more useful here for writing the parsing code or regexes than for parsing on the fly, for performance and cost reasons.
-
Parsing WHOIS data is hard, especially because the format differs depending on the TLD. I have a specific issue with the registrant value, so I decided to test multiple Python WHOIS libraries (pythonwhoisalt, asyncwhois, whoisit, whoisdomain) against the registrant field using Google's domain dataset, and to check their speed too. Initially I was using whoisdomain, so here is the initial post on its GitHub: mboot-github/WhoisDomain#21
Here is the script I wrote: https://gist.github.com/baderdean/cc4643ecd95d3ccde31dee80ebdbea28
asyncwhois was the best in terms of quality, yet slower than whoisdomain by far. Could someone reproduce this to tell whether it's specific to my case or generic?
And here are the results:
PS: I've created similar issues in other projects as well.