
A Customized Keyword Web Crawler based on PageRank

Structure of Files

  • Output Logs: Two queries were run with two different spiders, a BFS spider and a PageRank spider, so there are four output files. Please note that, to keep the results well formatted, they are written to Excel files (a rough sketch of writing such a file appears after this list).

    • Each file is named in this format: <Query>_<Spider>_Spider_Results.xls
  • 1 Python File: This file contains all the source code for this project; it is around 470 lines, including comments.

  • 2 Bloom Filter Files: These two files are used to check whether a page's content has already been crawled, using the Bloom filter method (a minimal sketch of this check also appears after this list). They are output files produced by the program, so please do not reuse them when you re-run the program; otherwise the crawler may treat content that was crawled last time as already seen.

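As a rough illustration only: the README does not say which library writes the Excel files, so the sketch below assumes xlwt (a common .xls writer for Python 2); the save_results helper and its row format are hypothetical.

```python
# Hypothetical sketch: write crawl results to <Query>_<Spider>_Spider_Results.xls.
# Assumes the xlwt package; the actual project may use a different Excel writer.
import xlwt

def save_results(query, spider_name, rows):
    # rows: list of (url, score) pairs collected by a spider (illustrative format)
    book = xlwt.Workbook()
    sheet = book.add_sheet("Results")
    sheet.write(0, 0, "URL")
    sheet.write(0, 1, "Score")
    for i, (url, score) in enumerate(rows):
        sheet.write(i + 1, 0, url)
        sheet.write(i + 1, 1, score)
    book.save("%s_%s_Spider_Results.xls" % (query, spider_name))

save_results("knuckle sandwich", "BFS", [("http://example.com", 0.5)])
```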

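And a minimal sketch of the duplicate-content check, assuming pybloomfiltermmap's on-disk filter; the capacity, error rate, file name, and MD5 fingerprinting here are illustrative choices, not the project's exact code.

```python
# Illustrative duplicate check backed by a Bloom filter file on disk.
# Because the filter persists between runs, stale filter files from an old run
# make new pages look already crawled, which is why the warning above applies.
import hashlib
from pybloomfilter import BloomFilter

seen = BloomFilter(1000000, 0.01, "content.bloom")  # capacity, error rate, file (assumed values)

def is_duplicate(page_text):
    fingerprint = hashlib.md5(page_text.encode("utf-8")).hexdigest()
    if fingerprint in seen:
        return True   # probably seen before (Bloom filters allow rare false positives)
    seen.add(fingerprint)
    return False
```
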
How to Compile

  1. Language Version: Python 2.7.14

  2. Dependencies: There are three dependencies in total: BeautifulSoup4, Google, and BloomFilter. If you are not sure whether you have installed these three Python modules, run the following three commands to install them (a quick import check appears after this list).

    • pip install beautifulsoup4
    • pip install google
    • pip install pybloomfiltermmap
  3. Two ways to run this program:

    • Type "python web_crawler.py" in the terminal
    • Using another Python IDE (such as PyCharm) also works
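
To quickly verify that the three dependencies are installed, the imports below should succeed. The module names are my best guess at what these packages expose; in particular, the pip package google has shipped its search function as either google or googlesearch depending on the version.

```python
# Quick sanity check that the three dependencies are importable.
from bs4 import BeautifulSoup            # from beautifulsoup4
from pybloomfilter import BloomFilter    # from pybloomfiltermmap

try:
    from google import search            # older releases of the google package
except ImportError:
    from googlesearch import search      # newer releases use this module name

print("All dependencies look installed.")
```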

Before Runtime

  1. Simply customize the parameters you want in the def main() function (a sketch of a complete main() appears after this list)
    Initialize a Crawler instance, such as:
    • web_crawler = Crawler("knuckle sandwich", 10)
      The first parameter is the term you want to query; the second is the number of start pages you want to get from the Google API
    • Customize your own spider
      • Customize the BFS spider, such as: web_crawler.bfs_spider(8, 1000)
        • The first parameter is the maximum number of links to extract from a single page
        • The second parameter is the maximum number of pages to crawl in this run
      • Customize the PageRank spider: web_crawler.page_rank_spider(10, 10, 20)
        • The first parameter is the maximum number of links to extract from a single page
        • The second parameter is the timer value that controls how often the PageRank values are updated
        • The third parameter is the maximum number of pages to crawl in this run
      • Please note: do not set the timer value too small, since the PageRank values will be updated too frequently, and do not set it too big, since the spider stops its jobs once the priority queue is empty
  2. Then run the program from the terminal or a Python IDE!
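
Putting the pieces together, a main() inside web_crawler.py might look like the sketch below. The Crawler constructor and the two spider calls are taken directly from the examples above; running both spiders in a single main() is my assumption, not verified against the source.

```python
def main():
    # Query "knuckle sandwich", seeding the crawl with 10 start pages from the Google API
    web_crawler = Crawler("knuckle sandwich", 10)

    # BFS spider: at most 8 links extracted per page, stop after 1000 pages
    web_crawler.bfs_spider(8, 1000)

    # PageRank spider: at most 10 links per page, a timer value of 10 for the periodic
    # PageRank updates, and a limit of 20 crawled pages. Keep the timer moderate:
    # too small updates PageRank too often, too large risks the priority queue
    # draining and stopping the spider early.
    web_crawler.page_rank_spider(10, 10, 20)

if __name__ == "__main__":
    main()
```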
