
Scraping the Web

Bradley Aaron Kohler edited this page Apr 30, 2020 · 11 revisions

Scraping the Web Using the Naive Method

The Naive Method of scraping the web locates content by a static tag and its static attributes (key-value pairs).

Using BeautifulSoup4, we can scrape the text enclosed by the following HTML tag:

<div class="location"> Some text in here... </div>
# s is the BeautifulSoup object for the parsed page
s.find('div', attrs={'class': 'location'}).text.strip()

Here the static tag is 'div', and the static attribute key-value pair is 'class': 'location'.

The advantage of the Naive Method is that it is highly accurate: it matches exactly the tag and attribute specified. The disadvantage is that it must be maintained continually, because the web page's HTML format may change over time and break the match.
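
The following is a minimal, self-contained sketch of the Naive Method and of the maintenance problem described above. The sample HTML strings and the variable names `html`, `changed_html`, `location`, and `stale_match` are assumptions made for illustration; in practice the HTML would come from a downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; a real page would come from an HTTP response.
html = '<html><body><div class="location"> Some text in here... </div></body></html>'

# `s` mirrors the variable name used above; "html.parser" is Python's built-in parser.
s = BeautifulSoup(html, "html.parser")

# Static tag 'div' plus the static attribute pair 'class': 'location'.
match = s.find("div", attrs={"class": "location"})

# find() returns None when nothing matches, so guard before reading .text.
location = match.text.strip() if match is not None else None
print(location)

# Hypothetical redesign of the same page: the class name changed.
changed_html = '<div class="loc"> Some text in here... </div>'
stale_match = BeautifulSoup(changed_html, "html.parser").find(
    "div", attrs={"class": "location"}
)
print(stale_match)  # the old selector no longer finds anything
```

The second lookup shows why the selector needs upkeep: once the class name changes, the exact match silently returns nothing until the attribute value is updated by hand.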

Scraping the Web Using Cosine Similarity

The Cosine Similarity Method allows us to look for a more abstract tag or attribute, one that resembles the search terms rather than matching them exactly.

Using scikit-learn, we can construct a cosine similarity matrix.

  1. Extract the attributes of all matching tags into a list using the find_all method from BeautifulSoup4.
tags = [a.attrs for a in s.find_all(self.html_tags)]
  2. Determine the keywords to search for in the key-value pairs.
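
The two steps above can be sketched as follows, together with one possible way to finish the comparison. Everything after step 2 is an assumption, since the page does not show how the similarity matrix is built: this sketch turns each attribute dictionary into a short string, counts words with scikit-learn's `CountVectorizer`, and scores them with `cosine_similarity`. The sample HTML, the `html_tags` list (standing in for `self.html_tags`), the helper `attrs_to_text`, and the keyword "location" are all hypothetical.

```python
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical page in which no class is exactly "location".
html = """
<div class="job-location-text">Denver, CO</div>
<div class="company-name">Acme</div>
<span id="salary">$50k</span>
"""
s = BeautifulSoup(html, "html.parser")
html_tags = ["div", "span"]  # assumption: stands in for self.html_tags

# Step 1: extract the attributes of every matching tag into a list.
elements = s.find_all(html_tags)
tags = [a.attrs for a in elements]

# Step 2: choose the keywords to look for in the key-value pairs.
keywords = ["location"]


def attrs_to_text(attrs):
    """Flatten one attribute dict into a space-separated string."""
    parts = []
    for key, value in attrs.items():
        # Multi-valued attributes such as class are returned as lists.
        values = value if isinstance(value, list) else [value]
        parts.append(key)
        parts.extend(values)
    return " ".join(parts)


# Assumed step: vectorize attribute strings and keywords with word counts.
documents = [attrs_to_text(a) for a in tags]
vectorizer = CountVectorizer()
tag_vectors = vectorizer.fit_transform(documents)
keyword_vectors = vectorizer.transform(keywords)

# Rows are keywords, columns are tags.
similarity = cosine_similarity(keyword_vectors, tag_vectors)

# Take the tag whose attributes are most similar to the first keyword.
best = int(similarity[0].argmax())
best_text = elements[best].text.strip()
print(documents[best], best_text)
```

Because the default tokenizer splits "job-location-text" into separate words, the keyword still scores above zero against that tag even though the class name is not an exact match, which is the kind of looser lookup this method is after. A different vectorizer, such as TF-IDF or character n-grams, could be swapped in without changing the rest of the sketch.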