Drug Sentiment Analysis

Problem Statement

This is a typical NLP task: predict each user's sentiment from the text of their review.

Data Description

The dataset provides patient reviews of specific drugs, the related conditions, and a 10-star patient rating reflecting overall patient satisfaction.

Attribute Information:

drugName (categorical): name of drug
condition (categorical): name of condition
review (text): patient review
rating (numerical): 10-star patient rating
date (date): date of review entry
usefulCount (numerical): number of users who found review useful

Exploratory Data Analysis

Our EDA surfaced the following facts about the data:

The top conditions were pain, birth control and high blood pressure.
Most of the conditions have multiple drugs.
A single drug can be used for multiple conditions.
The number of reviews has increased in the last 3 years.
As the number of reviews has increased, the average yearly rating has decreased.
Ratings are extreme: users tend to rate only when they find the drug very effective or when it causes side effects.
Most usefulCount values lie between 0 and 200.
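
The kind of aggregation behind these findings can be sketched with pandas. This is only an illustration on invented rows: the column names come from the attribute list above, but the values, the variable names and the choice of aggregations are assumptions, not the notebook's actual code.

```python
import pandas as pd

# Toy rows using the dataset's column names; all values are invented.
df = pd.DataFrame({
    "drugName": ["A", "B", "A", "C", "B", "D"],
    "condition": ["Pain", "Pain", "Birth Control", "Birth Control",
                  "High Blood Pressure", "Pain"],
    "rating": [10, 1, 9, 2, 10, 8],
    "date": pd.to_datetime(["2015-03-01", "2016-07-15", "2016-01-20",
                            "2017-05-05", "2017-11-30", "2017-12-01"]),
    "usefulCount": [12, 0, 45, 3, 150, 7],
})

# Conditions ordered by how many reviews mention them.
top_conditions = df["condition"].value_counts()

# Distinct drugs per condition, and distinct conditions per drug.
drugs_per_condition = df.groupby("condition")["drugName"].nunique()
conditions_per_drug = df.groupby("drugName")["condition"].nunique()

# Review volume and mean rating for each calendar year.
yearly = df.groupby(df["date"].dt.year)["rating"].agg(["count", "mean"])
```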

Data Pre-Processing

Check for null values and delete them if their number is negligible.
Check for duplicate records and remove them.
Check for noise, such as special characters or meaningless data, and remove it.
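
A minimal pandas sketch of the null and duplicate checks, on an invented frame. The `NEGLIGIBLE` cut-off is a hypothetical value chosen only so that the toy data passes; the README does not say what share of nulls counts as negligible.

```python
import pandas as pd

# Invented rows; row 1 has a missing condition, rows 0 and 2 are duplicates.
df = pd.DataFrame({
    "condition": ["Pain", None, "Pain", "Acne"],
    "review": ["Works well", "No effect", "Works well", "Cleared up quickly"],
    "rating": [9, 3, 9, 8],
})

# Share of rows that contain at least one null value.
null_share = df.isnull().any(axis=1).mean()

# Hypothetical cut-off: drop null rows only when they are a small share.
NEGLIGIBLE = 0.3
if null_share <= NEGLIGIBLE:
    df = df.dropna()

# Remove exact duplicate records and renumber the remaining rows.
df = df.drop_duplicates().reset_index(drop=True)
```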

Pre-Processing Reviews

Remove HTML tags
- We use BeautifulSoup from the bs4 module to remove HTML tags. Tags matching the pattern "64</span>..." have already been removed; get_text() strips any that remain.
Remove Stop Words
- Remove stop words such as "a", "the", "I", etc.
Remove symbols and special characters
- Remove special characters such as '#', '&', '@', etc. from the reviews.
Tokenize
- Split each review into word tokens on whitespace, e.g. "I might come" --> "I", "might", "come"
Stemming
- Remove suffixes from the words to get their root form, e.g. "Wording" --> "Word"
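
The steps above can be sketched as one function. This is an illustration under several assumptions: the stop-word set is a tiny hand-written sample, the Porter stemmer from NLTK stands in for whichever stemmer the notebook uses, `preprocess_review` is a hypothetical name, and tokenization is done before stop-word removal so that stop words can be matched token by token.

```python
import re

from bs4 import BeautifulSoup
from nltk.stem import PorterStemmer

# Small illustrative stop-word set; a real run would use a fuller list.
STOP_WORDS = {"a", "an", "the", "i", "it", "is", "was", "and", "to", "of"}

stemmer = PorterStemmer()


def preprocess_review(raw):
    """Return a list of stemmed tokens for one raw review string."""
    # 1. Strip HTML tags, keeping only the text content.
    text = BeautifulSoup(raw, "html.parser").get_text()
    # 2. Replace symbols and special characters with spaces.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # 3. Lowercase and tokenize on whitespace.
    tokens = text.lower().split()
    # 4. Drop stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 5. Reduce each remaining token to its stem.
    return [stemmer.stem(t) for t in tokens]
```

The cleaned tokens can be joined back into a string with `" ".join(...)` before vectorization.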

TfidfVectorizer (Term frequency - Inverse document frequency)

TF - Term Frequency :-
How often a term t occurs in a document d.
TF = (Number of occurrences of a word in a document) / (Number of words in that document)
IDF - Inverse Document Frequency :-
How rare a term t is across all documents.
IDF = log(Number of documents / Number of documents containing the word)
TF-IDF = TF * IDF
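
The formulas can be checked by hand with a few lines of plain Python. The three tokenized documents and the function names are invented for illustration, and the sketch assumes every queried term occurs in at least one document.

```python
import math

# Three tiny tokenized "documents" (invented for illustration).
docs = [
    ["drug", "worked", "well"],
    ["drug", "caused", "headache"],
    ["no", "side", "effects", "worked", "well"],
]


def tf(term, doc):
    # Occurrences of the term divided by the number of words in the document.
    return doc.count(term) / len(doc)


def idf(term, docs):
    # log(number of documents / number of documents containing the term).
    # Assumes the term appears in at least one document.
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


# "headache" appears in one document, "drug" in two, so within the second
# document "headache" receives the higher weight.
headache_weight = tf_idf("headache", docs[1], docs)
drug_weight = tf_idf("drug", docs[1], docs)
```

Note that scikit-learn's TfidfVectorizer, with its default settings, smooths the IDF term and L2-normalizes each document vector, so its values differ from this plain formula.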

Model Building

Split dependent and independent features.
Split the data into train and test in 67 : 33 ratio.
Fit MultinomialNB and RandomForestClassifier on training set, predict and check accuracy.
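
A minimal sketch of these steps with scikit-learn, on invented reviews. The binary labels, the `random_state` values and the stratified split are assumptions for the toy example; the README does not state how sentiment labels are derived from ratings. The vectorizer is fitted on the training split only, which is a choice made here to keep test data out of the vocabulary.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Invented toy reviews with binary labels (1 = positive, 0 = negative).
reviews = [
    "worked great no side effects", "terrible headache and nausea",
    "pain gone within a week", "made my symptoms worse",
    "very effective for my condition", "caused severe dizziness",
    "highly recommend this drug", "stopped taking it after two days",
    "best medication i have tried", "awful rash and no relief",
    "really helped my blood pressure", "did nothing for the pain",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# 67 : 33 train/test split on the raw text.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.33, random_state=42, stratify=labels
)

# Learn the vocabulary and IDF weights from the training reviews only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

models = {
    "MultinomialNB": MultinomialNB(),
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
}

# Fit each model, predict on the held-out reviews and record the accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
```

On a dataset this small the scores are not meaningful; the sketch only shows the shape of the workflow.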

Conclusion

After using TfidfVectorizer to transform the reviews into vectors and fitting MultinomialNB and RandomForestClassifier, we see that RandomForestClassifier outperforms MultinomialNB. RandomForestClassifier achieved an accuracy of 89.7% without any parameter tuning; tuning the classifier's parameters may improve the accuracy further.
