A repository to crawl the bids, advertisers, and ads on websites.
Follow the steps below to perform normal crawls:

1. Update Google Chrome to the latest version.

2. Clone the repository.

3. Install conda (miniconda) for your system from: https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html

4. Create a conda environment and install all the required dependencies into it using the `.yml` file:

   ```
   conda env create -f environment.yml
   ```

   Commands to activate and deactivate the environment:

   ```
   conda activate ad-crawler-env
   conda deactivate
   ```
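   Optionally, sanity-check the new environment (a quick verification, not part of the original steps; `ad-crawler-env` is the environment name used above):

   ```
   conda env list                 # the new environment should appear in this list
   conda activate ad-crawler-env
   python --version               # the interpreter should now come from the environment
   ```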
5. Add XVNC and Java support:

   ```
   sudo apt update
   sudo apt install -y tigervnc-standalone-server default-jre wget
   ```
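   To confirm the packages installed correctly (an optional check, not part of the original steps):

   ```
   java -version    # should print the default JRE version
   Xvnc -version    # should print the TigerVNC server version
   ```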
6. Create a directory named `consent-extension` inside the cloned project directory and `cd` into it. Next, clone the Consent-O-Matic repository inside `consent-extension`:

   ```
   mkdir consent-extension
   cd consent-extension
   git clone https://github.com/cavi-au/Consent-O-Matic.git
   ```
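   The configuration file edited in the next step should now be present; you can confirm this with:

   ```
   ls Consent-O-Matic/Extension/GDPRConfig.js
   ```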
7. Set the necessary consents by modifying the values of the variables `D`, `A`, `B`, `E`, `F`, and `X` to either `true` or `false` (default). The meanings of these variables are explained here. Set the values of these variables in the dictionary object `GDPRConfig.defaultValues` present in the following file: `/consent-extension/Consent-O-Matic/Extension/GDPRConfig.js`
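   To jump straight to the object that needs editing (a convenience command, not part of the original steps):

   ```
   grep -n "defaultValues" consent-extension/Consent-O-Matic/Extension/GDPRConfig.js
   ```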
8. Run the crawler by providing the relevant arguments:

   ```
   python3 ad-crawler.py --profile="<profile-name>" --proxyport=<proxy-port> --chromedatadir="<user-profile-dir>"
   ```

   Here, `profile-name` is the output directory name for the current set of crawls. It could be the kind of persona being crawled, for instance `Trained-TV` or `Untrained-TV`. `proxy-port` is any available port on your system that can be used by BrowserMob Proxy for capturing HARs. `user-profile-dir` is the path to Google Chrome's user data directory, which stores all the stateful information about the current persona being used, such as cookies and browsing history. You should first create a blank user data directory, use it to log in to the TV account through their website, and then save the logged-in profile in this user data directory. The directory containing this logged-in information can be reused here in the future when crawling ads at the end of different stages of experimentation. To find the default user data directory of Chrome on your system, enter `chrome://version/` in the Chrome browser search bar and look for `Profile Path:`.
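   As a concrete example, the following prepares a fresh profile and starts a crawl (an illustrative sketch; the profile directory path is a placeholder, and the port `8022` is reused from the Docker example below):

   ```
   # Create a blank Chrome user data directory, then log in to the TV account manually
   mkdir -p "$HOME/chrome-profiles/Trained-TV"
   google-chrome --user-data-dir="$HOME/chrome-profiles/Trained-TV"

   # Launch the crawler against the logged-in profile
   python3 ad-crawler.py --profile="Trained-TV" --proxyport=8022 --chromedatadir="$HOME/chrome-profiles/Trained-TV"
   ```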
Follow the steps below to perform crawls using Docker (steps 1 & 2 are highlighted for Ubuntu/Linux, but can be performed on other OSes as well):
1. Check the OS requirements and uninstall any previous Docker versions (if any): https://docs.docker.com/engine/install/ubuntu/

2. Follow the manual installation method or any other method from the above page to install Docker on your local system.
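   To verify the installation, Docker's standard smoke test can be used (optional, not part of the original steps):

   ```
   sudo docker run hello-world
   ```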
3. Set the global variable `DOCKER` to `True` in the `ad-crawler.py` file.
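   If you prefer doing this from the command line, something along these lines would work, assuming the file contains a literal `DOCKER = False` assignment (check with `grep` first, since the exact formatting in the file may differ):

   ```
   grep -n "DOCKER" ad-crawler.py
   sed -i 's/^DOCKER = False$/DOCKER = True/' ad-crawler.py
   ```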
4. Follow steps 1, 2, 5, 6, and 7 from the above normal crawling steps.
5. Build the Docker image using the command:

   ```
   docker build -t <docker-image> .
   ```

   Example:

   ```
   docker build -t ad-crawler .
   ```
6. Run the Docker container:

   ```
   docker run -d -e PYTHONUNBUFFERED=1 -v <ad-crawler-dir>:/root -v <user-profile-dir>:/profile -p <random-unused-port>:<rfbport> --shm-size=10g <docker-image> python3.11 ad-crawler.py -p "<profile-name>" -px <proxy-port> -c "/profile" -mp "/root"
   ```

   Example:

   ```
   docker run -d -e PYTHONUNBUFFERED=1 -v $(pwd):/root -v /home/yvekaria/.config/google-chrome/Test:/profile -p 20000:1212 --shm-size=10g ad-crawler python3.11 ad-crawler.py -p "Test" -px 8022 -c "/home/yvekaria/.config/google-chrome/Test" -mp "/root"
   ```

   Here, `rfbport` is also a random available port whose value should match the value used in `ad-crawler.py`.
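   Because the host's `<random-unused-port>` is mapped to the container's `rfbport`, the crawl can be observed live over VNC, assuming a viewer such as TigerVNC's `vncviewer` is installed on the host (this viewer command is an illustration, not part of the original steps):

   ```
   vncviewer localhost::20000    # 20000 is the host port from the example above
   ```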
7. The flag `-d` in step 6 makes the Docker container run in a mode detached from the terminal. To prevent that, remove `-d`.
8. To monitor the running Docker container, use the following commands:

   - To check status: `docker container ls -a | grep <docker-image>`
   - To check logs: `docker container logs -f <container-id>`
   - To delete a Docker container: `docker rm -f <container-id>`
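   Once all crawls are finished, the built image itself can also be removed (an optional cleanup step, not in the original instructions):

   ```
   docker rmi <docker-image>
   ```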
Please contact yvekaria@ucdavis.edu in case of any questions.