Skip to content

UWCSESecurityLab/adscraper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Adscraper: A Web Crawler for Measuring Online Ad Content

About Adscraper

Adscraper is an open source research tool for automatically scraping the content of ads on the web. Given a list of URLs, Adscraper visits each URL in a Chromium browser, and can collect the following data about the ads that appear of the page:

  • Screenshots of ads
  • Ad URLs
  • Ad landing pages
  • Third-party tracking requests

The core Adscraper crawler is a Node.js script, powered by Puppeteer, a browser automation library for the Chromium browser. You can run a small number of crawlers using this script directly. For bigger experiments, you can run many parallel crawler instances, distributed across multiple workers, using the crawl-cluster tool, which runs Adscraper as a Kubernetes Job workload.

Research using Adscraper

Adscraper has been used to conduct research measuring and auditing the online ads ecosystem. You can read about some of the projects that used Adscraper below:

Citations

If you used Adscraper in your research project, please cite the repository using the following BibTeX:

@software{Zeng_adscraper,
  author = {Eric Zeng},
  license = {MIT},
  title = {Adscraper: A Web Crawler for Measuring Online Ad Content},
  url = {https://github.com/UWCSESecurityLab/adscraper},
  version = {1.0.0},
  date = {YYYY-MM-DD}
}

Warning: Research Code!

Adscraper is a research tool, and may contain bugs! If you are running into issues with the code or documentation, please let us know by filing an issue or asking a question in the discussions. I will also accept pull requests for fixing bugs, doc bugs, or making the project more generally usable and configurable.

Usage (basic crawls)

For detailed instructions on how to set up Adscraper, please read crawler/README.md.

Pre-requisites

To run Adscraper, you must have the following software installed:

  • Node.js
  • PostgreSQL

Installation

First, clone the project, install dependencies, and build the project:

git clone https://github.com/UWCSESecurityLab/adscraper.git
cd adscraper/crawler
npm install
npm run build

Then, create tables in the Postgres database to store the metadata from the crawls.

cd ../..
psql -U <YOUR_POSTGRES_USERNAME> -f ./adscraper.sql

Lastly, create a JSON file named pg_conf.json containing the authentication credentials for your Postgres database.

{
  "host": "localhost",
  "port": 5432,
  "database": "adscraper",
  "user": "<your postgres username>",
  "password": "<your postgres password>"
}

Creating a crawl list

Next, create a crawl list - the URLs that the crawler will visit. The format is a text file containing one URL per line. For example, a crawl list named crawl_list.txt might look like:

https://www.nytimes.com/
https://www.cnn.com/
https://www.espn.com/
https://www.stackoverflow.com/

Running crawls

From the crawler/ directory, run the ``crawler-cli'' script to start the crawl. This script will scrape the content of the ads on the pages in the crawl list, and click on the ads to get the Ad URL, but block the ads from opening.

node gen/crawler-cli.js \
    --name my_crawl_name \
    --output_dir /path/to/your/output/dir \
    --crawl_list /path/to/your/crawl_list.txt \
    --pg_conf_file /path/to/your/pg_conf.json \
    --scrape_ads \
    --click_ads=clickAndBlockLoad

The data will be stored in two places:

  1. Crawl metadata is stored in the Postgres database
    • e.g.. for each ad, the ad URL, the page the ad appeared on, when the ad was crawled
  2. The screenshots of ads and HTML content of pages is stored in the directory output_dir.
    • The location of these files are specified in the metadata for each ad and page, in the columns ad.screenshot, page.screenshot, page.html, etc.

Detailed instructions and advanced usage

For detailed instructions on how to set up Adscraper, and examples of different types of crawls you can run to answer different research questions, please read crawler/README.md.

Usage (distributed crawls)

Do you need to run tens, or even hundreds of crawls with different browser profiles? Or do you need to parallelize crawls over thousands of URLs? The crawl-cluster tool is a Kubernetes-based solution for deploying Adscraper crawl jobs in parallel across multiple machines.

crawl-cluster is a script that takes a JSON crawl specification file as input, and automatically generates and launches a Kubernetes Job, which automatically deploys Adscraper crawler instances to a Kubernetes cluster.

Pre-requisites

To run a Adscraper cluster, you must run the following services:

  • Kubernetes on each node (Recommended distribution: k3s)
  • A PostgreSQL database server, set up as described in the basic crawl instructions
  • A distributed file system or server (e.g. NFS, SMB/CIFS)

Creating input files

Distributed crawls are configured using a JSON file, that specifies the crawler options, as well as the profiles and URLs to crawl.

For example, let's say you wanted to crawl ads shown to two hypothetical browsing profiles: one for a user interested in sports and another for a user interested in cooking.

First, create the crawl lists for each profile:

sports_crawl_list.txt:

https://www.espn.com
https://www.nba.com
https://www.mlb.com

cooking_crawl_list.txt:

https://www.seriouseats.com
https://www.foodnetwork.com
https://www.allrecipes.com

Then, you can create a job specification, that specifies the crawler behavior, and which profiles and crawl lists to use:

example-job.json:

{
  "jobName": "example-crawl",
  "dataDir": "/home/pptruser/data",
  "maxWorkers": 2,
  "profileOptions": {
    "useExistingProfile": false,
    "writeProfileAfterCrawl": true
  },
  "crawlOptions": {
    "shuffleCrawlList": false,
    "findAndCrawlPageWithAds": 0,
    "findAndCrawlArticlePage": false
  },
  "scrapeOptions": {
    "scrapeSite": false,
    "scrapeAds": true,
    "clickAds": "clickAndBlockLoad",
    "captureThirdPartyRequests": true
  },
  "profileCrawlLists": [
    {
      "crawlName": "profile_crawl_sports",
      "crawlListFile": "/home/pptruser/data/inputs/example-job/sports_crawl_list.txt",
      "crawlListHasReferrerAds": false,
      "profileDir": "/home/pptruser/data/profiles/sports_profile"
    },
    {
      "crawlName": "profile_crawl_cooking",
      "crawlListFile": "/home/pptruser/data/inputs/example-job/cooking_crawl_list.txt",
      "crawlListHasReferrerAds": false,
      "profileDir": "/home/pptruser/data/profiles/cooking_profile"
    },
  ]
}

Place these input files in a folder on the distributed file system, so that they can be read by the Kubernetes workers.

Running a crawl

To start the job, run the runIndexedJob.js script:

cd adscraper/crawl-cluster/cli
npm install
npm run build

node gen/runIndexedJob.js -j /path/to/your/example-job.json -p /path/to/your/pg_conf.json

To monitor the progress of the job, you can use the kubectl command to view the status of the crawl worker containers:

# To view overall job progress
kubectl describe job <job-name>

# To view statuses of each crawl instance
kubectl get pods -o wide -l job-name=<job-name>

# View active crawl instances
kubectl get pods -o wide --field-selector status.phase=Running

# To view the logs of a specific crawler (for debugging)
kubectl logs <pod-name>

Like in the basic crawl, the data is stored in two places:

  1. Crawl metadata is stored in the PostgreSQL database
  2. The screenshots of ads and HTML content of pages is stored in the directory dataDir, which is a location in the distributed file system.

Detailed instructions and advanced usage

For full instructions on setting up the cluster and running crawls refer to the documentation in crawl-cluster/README.md.

Viewing and analyzing data

Though there is no built-in tool for analyzing crawl data, you can use SQL queries to export the data from the Postgres database, and use your favorite data analysis tool, like Pandas or R, to analyze the data.

For example, for the example crawl above, you can run the following commands in PSQL to export CSVs containing the metadata for the ads and their parent pages:

\copy(SELECT page.id as page_id, crawl_id, url, original_url FROM page JOIN crawl ON page.crawl_id = crawl.id WHERE crawl.name = 'my_crawl_name') to 'ads.csv' csv header;

\copy(SELECT ad.id as ad_id, crawl_id, parent_page, url as ad_url, screenshot FROM ad JOIN crawl ON ad.crawl_id = crawl.id WHERE crawl.name = 'my_crawl_name') to 'pages.csv' csv header;

Then, in pandas, you can read and analyze the metadata yourself:

import pandas as pd

# Read CSVs
ads = pd.read_csv('ads.csv')
pages = pd.read_csv('pages.csv')

# Merge ad and page tables
df = pd.merge(ads, pages, left_on='parent_page', right_on='page_id')

# Count ads per parent page
print(df['url'].value_counts())

# Count most popular ad URL domains
import urllib.parse
print(df['ad_url'] \
  .apply(lambda x: urllib.parse.urlparse(x).netloc) \
  .value_counts())

To answer more complex research questions about the content of ads, you will likely need to label the ads. This is beyond the scope of this project, but in past research projects, we've used tools and methods like:

  • Manually labeling ad screenshots and landing pages using Label Studio
  • Using OCR to extract text from ad screenshots, and using NLP tools like text classifiers, topic models, and LLMs to identify topics
  • Scraping the landing pages of ads, and using NLP tools to identify topics

About

A web crawler for scraping online ad content

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages