Waldo is a proxy server that routes web traffic through other proxy servers. It's essentially a meta-proxy: it picks the best upstream proxy for each request so that your traffic doesn't get blocked.
Crawling a single website at scale is difficult. Most sites will block you before long, so you end up writing logic to pull a list of available proxy servers, handle connection pooling across those proxies, and track which proxies are still alive and which have stopped responding to your requests.
I found myself constantly re-writing this code in various projects to manage outbound proxying. This process, while necessary, got a little bit tedious, so I decided to factor out the proxying logic into a separate proxy server to handle the load balancing.
Waldo is written with Tornado, a highly scalable Python web framework. I've been able to handle ~1,000 concurrent connections with Waldo, and I suspect it can handle significantly more than that.
With a sufficiently large proxy list, keeping track of proxies becomes difficult. Proxies often die, or need to be put in a "cool off" box so that they don't get burnt out from too much traffic. Waldo handles all of this for you.
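To make the idea concrete, here is a minimal sketch of failure tracking with a cool-off window. This is illustrative only, not Waldo's actual internals: the class and constant names are made up, and Waldo's real implementation may differ (for example, by persisting state in Redis).

```python
import time

# How long a failing proxy sits in the "cool off" box (assumed value).
COOL_OFF_SECONDS = 300

class ProxyPool:
    """Illustrative pool that sidelines proxies after a failure."""

    def __init__(self, proxies):
        self.proxies = set(proxies)
        self.cooling = {}  # proxy -> timestamp when it may be used again

    def mark_failed(self, proxy):
        # Put the proxy in the cool-off box instead of dropping it outright.
        self.cooling[proxy] = time.time() + COOL_OFF_SECONDS

    def available(self):
        # Only proxies whose cool-off window has expired are usable.
        now = time.time()
        return [p for p in self.proxies if self.cooling.get(p, 0) <= now]

pool = ProxyPool(["10.0.0.1:8080", "10.0.0.2:3128"])
pool.mark_failed("10.0.0.1:8080")
print(pool.available())  # only the healthy proxy remains
```

A real pool would also retire proxies that keep failing after several cool-off cycles.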
Waldo implements the standard HTTP proxy spec, so just connect to it like you would any other proxy server, and it'll handle the rest for you.
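Because Waldo speaks the standard HTTP proxy protocol, any client's ordinary proxy support works. For example, with Python's standard library (assuming Waldo is running on its default `localhost:1234`):

```python
import urllib.request

# Point the client's proxy support at Waldo; Waldo then picks an
# upstream proxy for each request.
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": "http://localhost:1234"})
)
# html = opener.open("http://example.com").read()  # routed through Waldo
```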
When crawling a large website, you'll often find yourself stitching together various proxy server lists. Waldo has the concept of a Finder, which is basically a class that pulls in a list of proxy servers for you.
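The shape of a Finder might look something like the sketch below. The class and method names here are hypothetical, chosen to illustrate the idea; Waldo's actual Finder interface may differ.

```python
class Finder:
    """Hypothetical base class: something that yields proxy servers."""

    def find(self):
        """Return a list of 'host:port' proxy strings."""
        raise NotImplementedError

class StaticListFinder(Finder):
    """Serves proxies from an in-memory list. A real Finder might
    instead scrape a public proxy-list page or query an API."""

    def __init__(self, proxies):
        self.proxies = proxies

    def find(self):
        return list(self.proxies)

finder = StaticListFinder(["10.0.0.1:8080", "10.0.0.2:3128"])
print(finder.find())
```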
First, make sure Redis is installed. Then, install the Python dependencies:
pip install -r requirements.txt
To run the server:
$ python server.py --port=1234
By default, Waldo listens on port 1234 on all network interfaces.
To make sure it's working, try this:
$ curl -XGET http://omarish.com -x http://localhost:1234
To run the accompanying monitoring page, run the monitoring server:
$ python monitor.py
The monitoring page by default listens on port 1235.
I've been using a benchmarking utility in benchmark.py to simulate heavy request loads. Additionally, Apache Bench and Siege have been very helpful.
To run the benchmarking script:
$ python benchmark.py