A repository to crawl the bids, advertisers, and ads on websites.
Follow the steps below to perform normal crawls:

1. Update Google Chrome to the latest version.

2. Clone the repository.

3. Install conda (miniconda) for your system from: https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html

4. Create a conda environment and install all the required dependencies into it using the `.yml` file:

   ```
   conda env create -f environment.yml
   ```

   Commands to activate and deactivate the environment:

   ```
   conda activate ad-crawler-env
   conda deactivate
   ```
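   Optionally, sanity-check the new environment (a quick verification, not part of the original steps; `ad-crawler-env` is the environment name used above):

   ```
   conda env list                 # the new environment should appear in this list
   conda activate ad-crawler-env
   python --version               # the interpreter should now come from the environment
   ```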
5. Add XVNC and Java support:

   ```
   sudo apt update
   sudo apt install -y tigervnc-standalone-server default-jre wget
   ```
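   To confirm the packages installed correctly (an optional check, not part of the original steps):

   ```
   java -version    # should print the default JRE version
   Xvnc -version    # should print the TigerVNC server version
   ```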
6. Create a directory named `consent-extension` inside the cloned project directory and `cd` into it. Next, clone the Consent-O-Matic repository inside `consent-extension`:

   ```
   mkdir consent-extension
   cd consent-extension
   git clone https://github.com/cavi-au/Consent-O-Matic.git
   ```
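   The configuration file edited in the next step should now be present; you can confirm this with:

   ```
   ls Consent-O-Matic/Extension/GDPRConfig.js
   ```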
7. Set the necessary consents by modifying the values of the variables `D`, `A`, `B`, `E`, `F`, and `X` to either `true` or `false` (default). The meanings of these variables are explained here. Set the values of these variables in the dictionary object `GDPRConfig.defaultValues` present in the following file: `/consent-extension/Consent-O-Matic/Extension/GDPRConfig.js`
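   To jump straight to the object that needs editing (a convenience command, not part of the original steps):

   ```
   grep -n "defaultValues" consent-extension/Consent-O-Matic/Extension/GDPRConfig.js
   ```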
8. Run the crawler by providing the relevant arguments:

   ```
   python3 ad-crawler.py --profile="<profile-name>" --proxyport=<proxy-port> --chromedatadir="<user-profile-dir>"
   ```

   Here, `profile-name` is the output directory name for the current set of crawls. It could be the kind of persona being crawled, for instance `Trained-TV` or `Untrained-TV`. `proxy-port` is any available port on your system that can be used by BrowserMob Proxy for capturing HARs. `user-profile-dir` is the path to Google Chrome's user data directory, which stores all the stateful information about the current persona being used, such as cookies and browsing history. You should first create a blank user data directory, use it to log in to the TV account through their website, and then save the logged-in profile in this user data directory. The directory containing this logged-in information can be reused here in the future when crawling ads at the end of different stages of experimentation. To find the default user data directory of Chrome on your system, enter `chrome://version/` in the Chrome browser search bar and look for `Profile Path:`.
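   As a concrete example, the following prepares a fresh profile and starts a crawl (an illustrative sketch; the profile directory path is a placeholder, and the port `8022` is reused from the Docker example below):

   ```
   # Create a blank Chrome user data directory, then log in to the TV account manually
   mkdir -p "$HOME/chrome-profiles/Trained-TV"
   google-chrome --user-data-dir="$HOME/chrome-profiles/Trained-TV"

   # Launch the crawler against the logged-in profile
   python3 ad-crawler.py --profile="Trained-TV" --proxyport=8022 --chromedatadir="$HOME/chrome-profiles/Trained-TV"
   ```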
Follow the steps below to perform crawls using Docker (steps 1 & 2 are highlighted for Ubuntu/Linux, but can be performed on other OSes as well):
1. Check the OS requirements and uninstall any previous Docker versions (if any): https://docs.docker.com/engine/install/ubuntu/

2. Follow the manual installation method or any other method from the above page to install Docker on your local system.
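   To verify the installation, Docker's standard smoke test can be used (optional, not part of the original steps):

   ```
   sudo docker run hello-world
   ```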
3. Set the global variable `DOCKER` to `True` in the `ad-crawler.py` file.
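   If you prefer doing this from the command line, something along these lines would work, assuming the file contains a literal `DOCKER = False` assignment (check with `grep` first, since the exact formatting in the file may differ):

   ```
   grep -n "DOCKER" ad-crawler.py
   sed -i 's/^DOCKER = False$/DOCKER = True/' ad-crawler.py
   ```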
4. Follow steps 1, 2, 5, 6, and 7 from the above normal crawling steps.
5. Build the Docker image using the command:

   ```
   docker build -t <docker-image> .
   ```

   Example:

   ```
   docker build -t ad-crawler .
   ```
6. Run the Docker container:

   ```
   docker run -d -e PYTHONUNBUFFERED=1 -v <ad-crawler-dir>:/root -v <user-profile-dir>:/profile -p <random-unused-port>:<rfbport> --shm-size=10g <docker-image> python3.11 ad-crawler.py -p "<profile-name>" -px <proxy-port> -c "/profile" -mp "/root"
   ```

   Example:

   ```
   docker run -d -e PYTHONUNBUFFERED=1 -v $(pwd):/root -v /home/yvekaria/.config/google-chrome/Test:/profile -p 20000:1212 --shm-size=10g ad-crawler python3.11 ad-crawler.py -p "Test" -px 8022 -c "/home/yvekaria/.config/google-chrome/Test" -mp "/root"
   ```

   Here, `rfbport` is also a random available port whose value should match the value used in `ad-crawler.py`.
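   Because the host's `<random-unused-port>` is mapped to the container's `rfbport`, the crawl can be observed live over VNC, assuming a viewer such as TigerVNC's `vncviewer` is installed on the host (this viewer command is an illustration, not part of the original steps):

   ```
   vncviewer localhost::20000    # 20000 is the host port from the example above
   ```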
7. The flag `-d` in step 6 makes the Docker container run in a mode detached from the terminal. To prevent that, remove `-d`.
8. To monitor the running Docker container, use the following commands:

   - To check status: `docker container ls -a | grep <docker-image>`
   - To check logs: `docker container logs -f <container-id>`
   - To delete a Docker container: `docker rm -f <container-id>`
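   Once all crawls are finished, the built image itself can also be removed (an optional cleanup step, not in the original instructions):

   ```
   docker rmi <docker-image>
   ```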
Please contact yvekaria@ucdavis.edu in case of any questions.