WebScrapingTest

Test Repository to learn web scraping

About:

This repository was made to learn simple web scraping, specifically parsing a page and some of its related pages. The BBC was chosen because it is open to everyone and easy to access.

The information from this project was also used in another project.

Flow of Program:

The flow is quite simple: to start, add some URLs to url.txt. The program then follows each link and scrapes any plain text stored in p tags.

This example focuses on BBC Bitesize, as the scraper is needed for a project of the UoB Bristol Interaction Group.

Key tip: the ignored buzzwords are kept inline in the code as a list, so to ignore more of them, add them to that list. For example, the signup class is ignored on the BBC Bitesize pages.

Flow Explained:

The program runs through its current url.txt file and uses the URLs in it as the starting point for any scraping.
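As a minimal sketch of this starting step, the snippet below reads the starting URLs from a file. The function name load_start_urls and the one-URL-per-line layout of url.txt are assumptions made for illustration; the actual scrape.py may read the file differently.

```python
from pathlib import Path


def load_start_urls(path="url.txt"):
    """Return the non-empty lines of the URL file, in file order.

    Assumes one URL per line (an assumption, not confirmed by the source).
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Strip surrounding whitespace and drop blank lines.
    return [line.strip() for line in lines if line.strip()]
```

Keeping the result as an ordered list matters later on, because related pages are inserted at specific positions relative to their main page.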

It then grabs all the p tags on the page, ignoring certain classes (which are specified in the source code). Once these tags have been grabbed, it saves all of their text in an outputText variable that belongs to that page.
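The paragraph step can be sketched roughly as follows. To stay dependency-free, this sketch swaps in the standard library's html.parser in place of BeautifulSoup, which is the library the project actually tests. The names ParagraphCollector, get_paragraph_text and IGNORED_CLASSES are hypothetical, and it assumes the ignored class sits directly on the p tag.

```python
from html.parser import HTMLParser

# Hypothetical ignore list; the real list lives inline in scrape.py.
IGNORED_CLASSES = {"signup"}


class ParagraphCollector(HTMLParser):
    """Collects the text of every <p> whose classes are not ignored."""

    def __init__(self, ignored_classes):
        super().__init__()
        self.ignored = set(ignored_classes)
        self.paragraphs = []
        self._buffer = None  # None while outside a kept <p>

    def handle_starttag(self, tag, attrs):
        if tag != "p":
            return
        self._flush()  # also closes a previous <p> that was never ended
        classes = set((dict(attrs).get("class") or "").split())
        self._buffer = None if classes & self.ignored else []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._buffer is not None:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
        self._buffer = None


def get_paragraph_text(html, ignored_classes=IGNORED_CLASSES):
    collector = ParagraphCollector(ignored_classes)
    collector.feed(html)
    collector.close()
    collector._flush()  # keep a trailing <p> that was never closed
    return collector.paragraphs
```

For example, calling get_paragraph_text on a page's HTML returns a list of paragraph strings, which could then be joined into that page's output text.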

The next step is to grab related links (to proceed onto related topics). This is quite simple: looking at the webpage, we find the div other-guides__link and use it as the source of the related guides. From there we grab every a href link and save its value into a list, but only if it is not already present.
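A possible sketch of the related-links step is shown below, again using the standard library's html.parser as a stand-in for BeautifulSoup. The names RelatedLinkCollector and add_related_links are hypothetical. The sketch assumes other-guides__link is a class name, and it accepts that class either on a container element or on the link itself, since the source does not say which.

```python
from html.parser import HTMLParser


class RelatedLinkCollector(HTMLParser):
    """Collects href values of <a> tags inside the related-guides element."""

    def __init__(self, container_class="other-guides__link"):
        super().__init__()
        self.container_class = container_class
        self.links = []
        self._container_tag = None  # tag name of the element we are inside
        self._depth = 0             # nesting depth of that same tag name

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._container_tag is None:
            classes = (attrs.get("class") or "").split()
            if self.container_class in classes:
                self._container_tag = tag
                self._depth = 1
        elif tag == self._container_tag:
            self._depth += 1
        if self._container_tag is not None and tag == "a":
            href = attrs.get("href")
            # Save the link only if it is not already present.
            if href and href not in self.links:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == self._container_tag:
            self._depth -= 1
            if self._depth == 0:
                self._container_tag = None


def add_related_links(html, links):
    """Append newly found related links to links, skipping duplicates."""
    collector = RelatedLinkCollector()
    collector.feed(html)
    collector.close()
    for href in collector.links:
        if href not in links:
            links.append(href)
    return links
```

A list with a membership check is used here, as opposed to a set, because the order of the links is significant for the later steps.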

The third step is slightly more complicated. It uses the links with the pagination__item__inner class, which point to the in-depth pages of the URL, such as /revision/../1, /revision/../2, etc. These pages need to stay in order under the main page revision/../ so that the topics remain associated with each other.

To keep this order, a sublist is used, and the sub-page links of /revision/../ are placed right after the current "mainlink". This way the structure is preserved.
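The ordering idea can be illustrated with a small list operation. This is only a sketch: the function name insert_sub_pages and the example paths are hypothetical, and the real code may build its sublist differently.

```python
def insert_sub_pages(links, main_link, sub_pages):
    """Insert sub_pages directly after main_link, keeping their order.

    Pages that are already in links are skipped, so nothing is duplicated.
    """
    position = links.index(main_link) + 1
    new_pages = []
    for page in sub_pages:
        if page not in links and page not in new_pages:
            new_pages.append(page)
    # Slice assignment splices the whole sublist in at one position.
    links[position:position] = new_pages
    return links
```

For instance, with links equal to ["/topic-a", "/topic-b"], inserting the sub-pages ["/topic-a/2", "/topic-a/3"] after "/topic-a" leaves "/topic-b" at the end, so each topic's pages stay together.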

Documentation:

Main referral link to Webscraping with python is here
Main library being tested is this
Simple guide on BeautifulSoup
This is probably the best : TRY THIS GUIDE

Running instructions

Run the program with Python 3, either by using run.sh or by typing python scrape.py > output.txt

Bugs / Extensions:

Extension: Add support for ul and li items, which are currently ignored. This is not implemented yet because it needs a way to keep track of which <p> tag each ul item follows, so that the points do not lose their context. A possible solution would be to remove the getAllPTags system and instead parse through the page with a state that keeps track of where the ul / li tags belong.
Extension: For each image, the BBC provides an accessible description that should be scraped as well, again to keep the context of the content.
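One way the stateful parse proposed in the first extension might look is sketched below. It is not the repository's implementation: the names OrderedTextCollector and collect_ordered_text are hypothetical, the standard library's html.parser is used as a stand-in for BeautifulSoup, and paragraphs nested inside list items are not handled specially.

```python
from html.parser import HTMLParser


class OrderedTextCollector(HTMLParser):
    """Walks the page in order and attaches <li> text to the preceding <p>."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # each block: {"text": str, "items": [str]}
        self._current = None  # "p" or "li" while inside one, else None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "li"):
            self._flush()
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._flush()

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def _flush(self):
        text = " ".join("".join(self._buffer).split())
        if text and self._current == "p":
            self.blocks.append({"text": text, "items": []})
        elif text and self._current == "li":
            if not self.blocks:
                # A list with no paragraph before it gets an empty parent.
                self.blocks.append({"text": "", "items": []})
            self.blocks[-1]["items"].append(text)
        self._buffer = []
        self._current = None


def collect_ordered_text(html):
    collector = OrderedTextCollector()
    collector.feed(html)
    collector.close()
    collector._flush()  # keep trailing text from an unclosed tag
    return collector.blocks
```

Because each list item is stored under the paragraph that came before it, the bullet points keep their context when the text is written out.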

sunnyMiglani/WebScrapingTest
