Python-Web-Scraping

A web scraper built in Python that iteratively scrapes all of the local paths of a specified base URL that match a regular expression, up to a specified maximum depth. While scraping, the program saves the scraped links and every word found in the HTML, with a count of how many times each word occurs. (scrape.py)
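
The depth-limited crawl described above can be sketched roughly as follows. This is a minimal illustration and not the code in scrape.py: the names `crawl`, `LinkParser`, `fetch`, `fake_fetch` and `PAGES` are hypothetical, the pages are an in-memory stand-in for real HTTP responses, and it assumes that "depth" means the number of link follows from the base URL.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(base_url, path_pattern, max_depth, fetch):
    """Iteratively collect same-domain links whose path matches path_pattern.

    fetch is any callable that maps a URL to its HTML text (an assumption
    made here so the sketch runs without network access).
    """
    domain = urlparse(base_url).netloc
    regex = re.compile(path_pattern)
    seen = {base_url}
    queue = deque([(base_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.hrefs:
            link = urljoin(url, href)  # resolve local paths against the page
            parts = urlparse(link)
            if parts.netloc == domain and regex.search(parts.path) and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return sorted(seen)


# In-memory pages standing in for a real site (illustrative data only).
PAGES = {
    "https://example.com/": '<a href="/blog/a">A</a> <a href="/about">About</a>',
    "https://example.com/blog/a": '<a href="/blog/b">B</a> <a href="https://other.org/blog/x">X</a>',
    "https://example.com/blog/b": '<a href="/blog/c">C</a>',
}


def fake_fetch(url):
    return PAGES.get(url, "")


links = crawl("https://example.com/", r"^/blog/", max_depth=2, fetch=fake_fetch)
print(links)
```

With `max_depth=2`, the sketch follows links from the base page and from pages one follow away, so `/blog/c` (three follows away) and the off-domain link are never collected.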

Some instructions:

Specify the base URL of interest in the main_with_depth() function.
Indicate the max_depth in the main_with_depth() function. It limits the number of follows the scraper will take from the initial link, within the domain.
Include the domain of interest in the saved_domains dictionary, specifying the HTML tag and class (the "content filters") used to select the content to scrape, together with the regular expression that the local paths to be scraped must match.
Specify the language of the stop words package (e.g. en, es) and add your own stop words (the my_stop_words list), which are omitted when saving the text. Both are set in the clean_up_word() function.
Check the result files in the “csv” folder for the scraped content (scraped links; words and their counts).
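
As a rough illustration of the configuration and word-counting steps above, the sketch below shows one possible shape for a saved_domains entry, a stop-word filter, and a CSV export. The dictionary keys (`tag`, `class`, `regex`), the placeholder `language_stop_words` set, the module-level placement of the stop-word lists, and the helpers `count_words` and `write_counts` are all assumptions made for this example; the actual layout in scrape.py may differ.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical shape of one saved_domains entry: the content filters
# (HTML tag and class) plus the regex that local paths must match.
saved_domains = {
    "example.com": {
        "tag": "div",
        "class": "post-content",
        "regex": r"^/blog/",
    },
}

# Placeholder for the words a stop words package would supply for "en".
language_stop_words = {"the", "a", "and", "of"}
# Your own additional stop words.
my_stop_words = ["cookie", "subscribe"]


def clean_up_word(word):
    """Lowercase a word, strip non-letter characters, and drop stop words."""
    cleaned = re.sub(r"[^a-záéíóúüñ]", "", word.lower())
    if not cleaned or cleaned in language_stop_words or cleaned in my_stop_words:
        return None
    return cleaned


def count_words(text):
    """Count every cleaned word that survives the stop-word filter."""
    counts = Counter()
    for raw in text.split():
        word = clean_up_word(raw)
        if word:
            counts[word] += 1
    return counts


def write_counts(counts, handle):
    """Write word/count rows, most frequent first, to an open file handle."""
    writer = csv.writer(handle)
    writer.writerow(["word", "count"])
    for word, count in counts.most_common():
        writer.writerow([word, count])


text = "The scraper saves the words, and the scraper counts words. Subscribe!"
counts = count_words(text)

buffer = io.StringIO()  # stands in for a file inside the "csv" folder
write_counts(counts, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, pass a handle opened with `open(path, "w", newline="")`, as the csv module documentation recommends.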

jpablogomezb/Python-Web-Scraping

Folders and files

Latest commit

History

Repository files navigation

Python-Web-Scraping

Some instructions:

Python packages required:

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages