RyanSamman/BlogWordCloud

A couple of Python scripts to scrape the blogs of CPIT221 and visualize their data

Table of Contents

  • What is this?
  • How does it work?
  • Dependencies
  • Scraping the Data
  • Processing Data
  • Visualizing the Data

What is this?

As part of our CPIT221 course, we write a weekly blog. In our previous week's blog, we wrote about our experiences in a group discussion, where the members were chosen at random.

As someone who likes data, I thought it was a no-brainer to scrape and visualize those blogs. This is the end result:

Word Cloud Scraped Image

In case it isn't obvious, the size of each word corresponds to how frequently it appears across all the blogs.

How does it work?

I split the code for this project into three distinct parts:

  • Scraping Data
  • Processing Data
  • Visualizing Data

Of course, to achieve this, we will need to install some dependencies.

Dependencies

To view and run the code, you will need to use a Jupyter Notebook. Alternatively, read this and run the .py files.

Visual Studio Code has Jupyter Notebook support built into its Python extension, which is what I used for this project.

Additionally, you will need to install:

  • selenium
  • wordcloud
  • numpy
  • Pillow
  • matplotlib
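
Assuming you use pip, they can all be installed in one go:

pip install selenium wordcloud numpy Pillow matplotlib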

Scraping the Data

This was achieved by using Selenium's WebDriver for Python. Although its intended use is integration testing, we can also use it to automate browser actions and effectively scrape data from the browser.

Previously, I used Python's interactive shell for web scraping and browser automation. However, for this project I decided to finally use Jupyter Notebooks, which offer autocompletion and let you tinker with the code in real time. The main draw of Jupyter Notebooks here is the ability to run code in blocks called 'cells', which is useful in the tedious process of scraping data.

Initialize the Browser Driver

from selenium import webdriver
driver = webdriver.Chrome()

Opening a site

driver.get("https://website.web.edu.sa")

Then, we will manually log in and move to the weekly writing blogs.
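
If you run this as a plain script rather than cell by cell, a minimal way to hold the script while you log in by hand (this pause is my own addition) is a blocking input() call:

# Block until you have logged in and reached the weekly blogs page manually
input("Log in in the browser window, navigate to the blogs, then press Enter...")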

Retrieving Blog Links

Once we are on the right page, we can scrape all the blog links with the code below.

def getBlogs():
    # Get List Element containing all the blog links
    blogList = driver.find_element_by_xpath("/html/body/div[5]/div[2]/div/div/div/div/div[3]/div/div[2]/div[4]/div/ul")

    # Retrieve children of <ul> element
    return blogList.find_elements_by_class_name("user")

def getLinks(blgs):
    # Retrieve anchor links in every list element
    return [e.find_element_by_tag_name("a").get_property("href") for e in blgs]

blogs = getBlogs()
blogLinks = getLinks(blogs)

Retrieving Text Data

For every URL, we will scrape the text of each blog as follows:

from time import sleep

def loadPageAndScrapeBlog(url):
    sleep(2) # Sleep while the previous page loads
    driver.get(url) # Redirect to the blog URL
    e = driver.find_elements_by_class_name("entryText") # Select the elements containing the blog text
    return e[0].text # Obtain the raw text from the latest blog element

blogTexts = [loadPageAndScrapeBlog(url) for url in blogLinks] # Repeat for every URL
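
As an aside, a fixed sleep(2) can be flaky on slow connections. Here is a sketch of the same function using Selenium's explicit waits instead; this variant (and the name loadPageAndScrapeBlogSafely) is my own suggestion, not the original code:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def loadPageAndScrapeBlogSafely(url):
    driver.get(url)
    # Wait up to 10 seconds for the blog text to appear instead of sleeping blindly
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "entryText"))
    )
    return element.text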

We now have all the text we need to start processing the data.

However, to avoid losing the scraped text, I saved it as a JSON file:

import json

# File Context Manager, closes the file pointer automatically
with open("rawBlogText.json", 'w') as file:
    json.dump(blogTexts, file, indent=2) # Save data as JSON

Processing Data

Although we could go straight into visualizing the data, there are some inconsistencies we can filter out first, such as spacing, capitalization, and newline characters (\n).

Loading the Data

The first step here is to load the text array we saved last time and join it into one big string.

import json
import re # Regular Expressions Library
from collections import Counter

# Retrieve data
with open("rawBlogText.json", "r") as f: 
    data = json.load(f) 

# Concatenate Data
textData = " ".join(data)

Removing newlines

textData = textData.replace("\n", " ") # Simply replace the newlines with a " "

Selecting all the words present

To achieve this, I wrote a regular expression to match strings that follow a certain pattern. Here, it looks for any case-insensitive sequence of characters made up of the letters A to Z or an apostrophe (').

words = re.findall(r"[a-z\']+", textData, flags=re.IGNORECASE) # Get individual words

This will give us an array of all the words present in the string.

['hello', 'this', 'IS', 'SOME', 'tEsT', 'Data', "isn't", 'it', 'Yes', 'it', 'is',
'Some', 'soMe', 'repeTitive', 'wOrds', 'worDs']

However, as you can see, some repeated words differ only in casing, so we need to convert them all to lowercase before counting.

words = [w.lower() for w in words]

Result:

['hello',  'this', 'is', 'some', 'test', 'data', "isn't", 'it', 'yes', 'it', 'is',
 'some', 'some', 'repetitive', 'words', 'words']

Getting the Frequency of Each Word

You could write a complex algorithm to efficiently count the repetitions in an array. However, this is Python, where there is a library for everything: the collections package provides a Counter class which we can use for this purpose.

# Get a sorted dictionary of the top 1000 words from an array
sortedWords = dict(Counter(words).most_common(1000)) 

This is the final form our data will take, and we can now easily visualize it in the next section.

{
  "some": 3, "is": 2, "it": 2, "words": 2, "hello": 1, "this": 1, "test": 1, "data": 1, "isn't": 1, "yes": 1, "repetitive": 1
}

However, you will need to manually remove the meaningless words (stop words), such as 'it', 'and', 'a', 'the', and so on.
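
As a rough sketch of that filtering, assuming a hand-picked stop word set (the set below is my own example, not the exact list used for the final image):

# Example stop word set; extend it with whatever noise shows up in your data
stopWords = {"it", "and", "a", "the", "is", "to", "of", "in", "we", "our"}

# Keep only the meaningful words
sortedWords = {word: count for word, count in sortedWords.items() if word not in stopWords}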

Saving the Data

As with the previous section, we will save the data as a JSON file.

with open('processedData.json', "w") as file:
    json.dump(sortedWords, file, indent=2)

Visualizing the Data

To visualize the data, I used the wordcloud library.

Most of this section is straight out of the Wordcloud Documentation, but I'll do my best to explain how the code works.
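
For reference, the snippets in this section assume the following imports, which are not shown in the original code:

import json
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator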

Load the processed data

# Load Word/Frequency Data
with open("processedData.json", "r") as file:
    wordAndFrequencyData = json.load(file)

Load the target image

From reading the documentation, I realized that you can superimpose the text onto an existing picture, called a mask. Therefore, I chose the FCIT logo for this purpose.

Original Image

# Load Image which text will be superimposed onto
imageMask = np.array(Image.open("logo.jpg"))

# Generate colors from Image
imageColors = ImageColorGenerator(imageMask)

Creating the WordCloud Object

We will then create a WordCloud object, and:

  • Set the scale to 5 (this number configures the resolution of the image)
  • Change the background to white
  • Set the mask to the loaded image

# Create WordCloud object with the image mask
wordCloudObject = WordCloud(scale=5, background_color="white", mask=imageMask)

We will then pass in the word-frequency data, which is already in the form needed by the word cloud.

# Pass in Word: Frequency to be displayed
wordCloudObject.generate_from_frequencies(wordAndFrequencyData)

Then, we will pass the WordCloud object into matplotlib and display the chart.

Word Cloud Example

# Remove Axis markings
plt.axis("off")

# Display as Matplotlib Chart
plt.imshow(wordCloudObject.recolor(color_func=imageColors), interpolation="bilinear")
plt.show()
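
If you would rather save the image to disk than display it, the wordcloud library can also write the (recolored) result directly; the filename here is just an example:

# Save the recolored word cloud straight to an image file
wordCloudObject.recolor(color_func=imageColors).to_file("wordCloudOutput.png")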
