Scraping the Web
Bradley Aaron Kohler edited this page Apr 30, 2020 · 11 revisions
The Naive Method of scraping the web relies on a static tag and static attributes (key and value pairs).
Using BeautifulSoup4, we can extract the text enclosed by the following tag (here s is assumed to be the parsed BeautifulSoup object for the page):
<div class="location"> Some text in here... </div>
s.find('div', attrs={'class': 'location'}).text.strip()
Here the static tag is 'div', and the static attribute key and value pair is 'class': 'location'.
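For reference, a minimal runnable sketch of the Naive Method is given here. The HTML string is a hypothetical stand-in for a downloaded page, and the choice of Python's built-in "html.parser" is an assumption; the page does not say which parser is used.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
html = '<html><body><div class="location"> Some text in here... </div></body></html>'

# Parse with Python's built-in parser (an assumption; any bs4 parser would do).
s = BeautifulSoup(html, "html.parser")

# Match on the static tag name and the static attribute key/value pair,
# then strip the surrounding whitespace from the enclosed text.
location = s.find("div", attrs={"class": "location"}).text.strip()
print(location)
```

Note that find returns only the first matching element; find_all would return every match.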
The advantage of the Naive Method is that it is very accurate: it returns only elements whose tag and attributes match exactly. The disadvantage is that it must be maintained continually, because the web page's HTML structure may change over time and break the selector.
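The maintenance problem can be illustrated with a small sketch. The helper extract_location, both HTML snippets, and the renamed class "job-location" are hypothetical and exist only to show what happens when a site changes its markup.

```python
from bs4 import BeautifulSoup

def extract_location(html):
    """Return the stripped text of <div class="location">, or None if absent."""
    s = BeautifulSoup(html, "html.parser")
    node = s.find("div", attrs={"class": "location"})
    # find returns None when nothing matches, so guard before reading .text.
    return node.text.strip() if node is not None else None

# Hypothetical markup before and after a site redesign.
old_page = '<div class="location"> Some text in here... </div>'
new_page = '<div class="job-location"> Some text in here... </div>'

old_result = extract_location(old_page)  # selector still matches
new_result = extract_location(new_page)  # selector silently stops matching
print(old_result, new_result)
```

Without the None check, the unguarded call .text on the failed lookup would raise an AttributeError, which is typically how a stale selector first shows up.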
The Cosine Similarity Method allows us to look for a more abstract tag or attribute.
Using scikit-learn, we can construct a cosine similarity matrix that scores how closely the attributes of each tag match a set of key words.
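To make the idea of a cosine similarity matrix concrete, here is a minimal sketch using scikit-learn. It is not taken from this project's code: the attribute strings, the key word, and the choice of CountVectorizer (simple word counts) as the vectorizer are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical attribute key/value pairs, flattened to text (one string per tag).
attribute_text = ["class location", "class job location", "id salary"]
# Hypothetical key words to search for.
keywords = ["location"]

# Build a shared vocabulary from the attribute text (assumed vectorizer choice),
# then project the key words into the same vector space.
vectorizer = CountVectorizer()
attr_vectors = vectorizer.fit_transform(attribute_text)
keyword_vectors = vectorizer.transform(keywords)

# Rows are key words, columns are tags; each entry is a cosine similarity.
similarity = cosine_similarity(keyword_vectors, attr_vectors)

# Index of the tag whose attributes are most similar to the first key word.
best = int(similarity[0].argmax())
print(similarity.shape, best)
```

A tag whose attributes share no words with the key words scores 0, while a tag whose attributes consist only of the key words scores 1, so a threshold on this matrix can stand in for the exact match used by the Naive Method.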
- Extract the attributes of all tags of interest into a list using the find_all method from BeautifulSoup4.
tags = [a.attrs for a in s.find_all(self.html_tags)]
- Determine the key words to search for in the key and value pairs.