The web is a wealth of information, not all of it is easily accesible in a "data format" like RSS or an API. Web scrapers can turn inaccessible information into actionable inputs.
In this repo I'll try to provide some examples of various strategies for web scraping. I'll also document the various tools and packages I find useful along with various other tips. This repo is primarily concerned with web scraping using Node.js.
First tip: Always curl
the target site to understand what comes back, the -O
and -L
flags are your friends.
Web scraping according to Wikipedia is:
a computer software technique of extracting information from websites.
I've started writing scrapers anytime I run into a site that meets any of the following criteria:
- The site does not provide a RSS for content that should have a RSS feed (looking at you Youtube!)
- The site is selling tickets or having an auction and I'd like to know immediately whenever the status changes.
- The site is so damn awful to look at and covered with so many ads (seriously, I made billions working from home, find out how!!!) that the content is impossible to get to.
If you followed the first tip and curl
ed the target site and saw the information you were after, then you're almost done! All that's left is to use request + cheerio to automate the retrieval and parsing of the information for you. Request is the HTTP client for Node. It has a ton of options, but we will only need the basics to scrape. Cheerio is the best Node module I've ever come across. It's awesome. Cheerio provides a "jQuery-like" experience with any DOM string. Just load it up and you can start doing stuff like this:
//html will probably come from a `request` response
//it could also come from a file, `fs.readFileSync('dump.html').toString();`
const $ = cheerio.load(html);
$('.products').each((i, e) => {
const text = $(e).text();
if (text === 'Xbox One') {
this.sendAlert();
}
});
Pretty great eh? Looks like just like typical jQuery stuff. Parse and filter to your heart's content. Did I mention cheerio is insanely fast too? It is.
Scraping is all easy and painless with sites that work without JS (like the scenario above βοΈ). This is not the case for sites behind Javascript authentication or sites that are Javascript applications. Thankfully we have headless Chrome. Headless Chrome let's you run an instance of Chrome (headlessly). One approach I've used with success is to limit the amount of work I do inside headless. I'll navigate to the page where my info is and then dump the entire page's worth of HTML to a file and pass it all off to cheerio which handles traversing and parsing that markup much better than headless Chrome. I've also used it to pull the relevant authentication information I need to then make API requests via normal axios
or curl
commands.
Wordpress runs half the web (or that's how it seems). This is great news for scrapers since most WP installs have several RSS feeds to pull data from. So if you come across a site and you figure out it's a Wordpress install, check if any of the various built-in feeds work.
For parsing a RSS feed, saxjs is the best module to use.
So you've scraped the site and you have the information, now what? Well now you can send yourself a SMS (Twilio is awesome and very cheap) or an email (Sendgrid has a great free plan) with the relevant info. Or save the info in a database (e.g. SQLite). It's up to you at this point.
Now we need to setup how often our scraper will run and where we will host it.
Scrapers are typically very light on resource usage, I've paired several on the same VPS as my blog without a single issue. If you don't have a server to host a scraper, I'd recommend spending a few bucks on a "Lowend" box. Even 128MB
of RAM is plenty.
I've found using Ansible to be nice for deploying and setting up scrapers. It makes pushing updates (HTML
changes pretty often) easy and painless.
I use cron to run all my scrapers at set intervals.
Important please be courteous of all websites and throttle your scraper to only hit a site a few times per hour. Not only is it the respectful thing to do, but it'll also prevent you from getting your IP banned. The most often I have a scraper running at the moment is once a hour.
Speaking of cron, Ansible has a great cron module.
Since this entire cookbook is based around using Node for scraping the following packages are all Node related. Two tools I haven't mentioned yet that deserve a mention are loadsh and asyncJS. Lodash provides some great utility methods to work with page data. AsyncJS provides a way to speed up and organize your code if you find yourself sending multiple request
s at once or doing other asynchronous programming.
- request - HTTP client
- cheerio - DOM parser
- headless Chrome - "headless" Chrome
- lodash - utility library
- async - tools for dealing with asychronous code
- saxjs - XML (rss/feed) parser
- cron - schedule jobs
- sqlite - database of choice, very fast and low resource usage
- noodle - a Node.js server and module for querying and scraping data from web documents.
- W3 RSS Validator - RSS validator
- RSS Validator - RSS validator
- User agent strings for bots - Wikipedia guidelines for bots
I plan to keep this cookbook up to date with new tools and methods as I find them. I also plan to include some additional "sample projects" as I find time to commit them. For now, you can reference my ESPN Scraper + RSS feed repo as an example of a request
+ cheerio
+ sqlite
project.