Sentiment Analysis of Blog Authorship Corpus Data

This project applies text analytics techniques to analyze and summarize a large corpus of blog posts.

The first step was to load the dataset into a data frame, remove null and duplicate values, and inspect randomly sampled rows to understand the structure and content of the data.
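The loading step can be sketched with pandas as below. This is only an illustration: the inline sample, its column names (`id`, `gender`, `age`, `topic`, `text`) and the helper `load_posts` are assumptions, not the project's actual file or code.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the corpus file; the column names
# are assumptions made for this sketch.
SAMPLE_CSV = """id,gender,age,topic,text
1,male,25,Technology,I love my new laptop
2,female,33,Arts,The gallery opening was dull
2,female,33,Arts,The gallery opening was dull
3,male,17,Student,
"""


def load_posts(source):
    """Read the corpus into a data frame and drop null and duplicate rows."""
    df = pd.read_csv(source)
    # Rows with any missing value are dropped first, then exact duplicates.
    return df.dropna().drop_duplicates().reset_index(drop=True)


posts = load_posts(io.StringIO(SAMPLE_CSV))
print(posts.shape)
# Inspect a few random rows; random_state makes the sample repeatable.
print(posts.sample(n=2, random_state=0))
```

With a real file, `load_posts` would be called with the file path instead of the in-memory sample.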

Next, the data was cleaned: unimportant columns were dropped, numbers, symbols and extra spaces were removed, and the text was restricted to upper- and lowercase letters. Stop words, non-English words, misspelled words and chat acronyms were then removed to reduce noise in the text.
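A minimal sketch of part of this cleaning, using only the standard library, is shown below. The stop-word and acronym sets are tiny hypothetical lists, and lowercasing is an assumption added here so that tokens match those lists. Removing non-English and misspelled words needs a dictionary lookup and is left out.

```python
import re

# Tiny illustrative lists (assumptions); a real run would use full
# stop-word and chat-acronym lists.
STOP_WORDS = {"a", "an", "the", "is", "was", "my", "i", "to", "and", "of"}
CHAT_ACRONYMS = {"lol", "omg", "brb", "btw", "idk"}


def clean_text(text):
    """Keep letters only, collapse spaces, drop stop words and acronyms."""
    # Replace every character that is not a letter or whitespace with a space.
    letters_only = re.sub(r"[^A-Za-z\s]", " ", text)
    # split() with no arguments also collapses runs of whitespace.
    tokens = letters_only.lower().split()
    kept = [t for t in tokens if t not in STOP_WORDS and t not in CHAT_ACRONYMS]
    return " ".join(kept)


print(clean_text("OMG!!! I bought 2 new   laptops, lol :)"))
# prints: bought new laptops
```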

Sentiment was then analyzed by computing polarity and subjectivity values across demographic groups and visualizing their distributions.
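The idea of averaging polarity and subjectivity per demographic group can be illustrated with a toy lexicon. The lexicon, its scores and the helpers `score` and `mean_scores_by_group` are all hypothetical; the project most likely relied on a sentiment library rather than a hand-written lexicon like this one.

```python
from collections import defaultdict

# Hypothetical miniature lexicon:
# word -> (polarity in [-1, 1], subjectivity in [0, 1]).
LEXICON = {
    "love": (0.5, 0.6),
    "great": (0.8, 0.75),
    "dull": (-0.3, 0.5),
    "awful": (-1.0, 1.0),
}


def score(text):
    """Average the lexicon scores of known words; (0.0, 0.0) if none match."""
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    if not hits:
        return 0.0, 0.0
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return polarity, subjectivity


def mean_scores_by_group(rows):
    """rows: iterable of (group, text) pairs.

    Returns {group: (mean polarity, mean subjectivity)}.
    """
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for group, text in rows:
        p, s = score(text)
        totals[group][0] += p
        totals[group][1] += s
        totals[group][2] += 1
    return {g: (p / n, s / n) for g, (p, s, n) in totals.items()}


rows = [
    ("male", "love great laptop"),
    ("female", "dull gallery"),
    ("female", "awful weather"),
]
print(mean_scores_by_group(rows))
```

The per-group means returned here are the kind of values whose distribution would then be plotted.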

Finally, trends in word usage across demographic groups and blog categories were identified using word clouds, giving an overview of the corpus and its main themes.
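A word cloud sizes each word by how often it occurs, so the underlying computation is a per-group word count. The sketch below shows that count with the standard library; the helper `top_words_by_group` and the sample rows are hypothetical, and the text is assumed to be cleaned already.

```python
from collections import Counter, defaultdict


def top_words_by_group(rows, n=3):
    """rows: iterable of (group, cleaned_text) pairs.

    Returns {group: [(word, count), ...]} with the n most frequent words.
    """
    counts = defaultdict(Counter)
    for group, text in rows:
        counts[group].update(text.split())
    return {g: c.most_common(n) for g, c in counts.items()}


rows = [
    ("Technology", "laptop code laptop"),
    ("Arts", "paint gallery paint paint"),
]
for group, words in top_words_by_group(rows).items():
    print(group, words)
```

Frequencies like these could be passed to a plotting tool. For example, the `wordcloud` package accepts a word-to-count mapping through `WordCloud.generate_from_frequencies`, though which library the project used is not stated.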