
web-crawler

A web crawler that searches for URLs. Currently it serves no purpose other than collecting the different URLs it finds across the web.

Getting Started

Prerequisites

  • Python >= 3
  • pipenv

OR

  • Docker Compose

Installing

Navigate to the project root and run the following command:

ROOT_URL=insert_root_url_here docker-compose up --build --scale web-crawler=2

This starts the web-crawler-scheduler and two web-crawler instances. The scale value can be anything, but be careful not to overload the target server.
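For example, a hypothetical run targeting https://example.com with four crawler instances (both the URL and the scale value here are placeholders) would look like:

ROOT_URL=https://example.com docker-compose up --build --scale web-crawler=4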

OR

Open two separate terminals, one in the web-crawler-scheduler directory and one in the web-crawler directory. In both, first run the following command:

pipenv install

Then, in the web-crawler-scheduler directory, run

pipenv run python main.py insert_root_url_here

and in the web-crawler directory run

pipenv run python main.py

All URLs that the crawler finds will be stored in /web-crawler-scheduler/data/data.txt
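If you want a quick look at what has been collected, a minimal sketch like the one below works, assuming data.txt stores one URL per line (the file format is an assumption, not something this README specifies):

    # Summarize collected URLs by domain, assuming one URL per line in data.txt.
    from collections import Counter
    from urllib.parse import urlparse

    with open("web-crawler-scheduler/data/data.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    # Count how many collected URLs belong to each domain.
    domains = Counter(urlparse(url).netloc for url in urls)
    for domain, count in domains.most_common(10):
        print(f"{domain}: {count}")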

Disclaimer

Please note that this project was created just to practise network programming with Python. If you choose to test this app, be sure not to overload the servers you are targeting: don't start too many crawlers at once, and don't remove the time.sleep(1) call that slows down the loop in the WebCrawler.py file. I'm not liable for any misuse of this application.
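To make the rate limiting concrete, here is an illustrative sketch of the pattern the disclaimer refers to; it is not the repository's actual WebCrawler.py code, just the general shape of a fetch loop throttled with time.sleep(1):

    # Illustrative only: a fetch loop throttled to at most one request
    # per second; not the actual WebCrawler.py implementation.
    import time
    import urllib.request

    def crawl(urls):
        for url in urls:
            try:
                with urllib.request.urlopen(url, timeout=10) as response:
                    page = response.read()
                # ... extract new URLs from `page` here ...
            except OSError:
                continue
            # Pause between requests so the target server is not overloaded.
            time.sleep(1)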
