I wrangled and analyzed the WeRateDogs (@dog_rates) Twitter data
I gathered, assessed, cleaned, and visualized the twitter_archive dataset, the image_predictions dataset, and the JSON data file obtained by querying the Twitter API. My data wrangling process began with gathering all three datasets used in the project.
I downloaded the WeRateDogs Twitter archive directly from the classroom and read it into a pandas DataFrame. I downloaded the image predictions dataset from the URL provided, using the requests and os libraries.
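A minimal sketch of the download step, assuming the file name can be taken from the last segment of the URL. The function names and the example URL are hypothetical; the actual URL is the one provided in the classroom.

```python
import os

import requests


def local_path_for(url, folder="."):
    """Build the local file path from the last segment of the URL."""
    filename = url.rstrip("/").split("/")[-1]
    return os.path.join(folder, filename)


def download_file(url, folder="."):
    """Download `url` into `folder` and return the path of the saved file."""
    os.makedirs(folder, exist_ok=True)
    path = local_path_for(url, folder)
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on HTTP errors
    with open(path, "wb") as fh:
        fh.write(response.content)
    return path


# Hypothetical usage (placeholder URL, not the real one):
# path = download_file("https://example.com/data/image_predictions.tsv", "data")
# image_predictions = pandas.read_csv(path, sep="\t")
```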
I queried each tweet's retweet count and favourite count using the Tweepy library and stored the data in tweet_json.txt. I then read tweet_json.txt line by line into a pandas DataFrame with the columns tweet_id, favourite count, and retweet count.
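The line-by-line read can be sketched as follows. It assumes each line of tweet_json.txt holds one tweet as a JSON object with the keys `id`, `favorite_count`, and `retweet_count` (the spelling used in the Twitter API response); the function name is illustrative.

```python
import json

import pandas as pd


def read_tweet_json(path):
    """Read one JSON tweet per line and keep only the id and the two counts."""
    rows = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)
            rows.append(
                {
                    "tweet_id": str(tweet["id"]),
                    "favorite_count": tweet["favorite_count"],
                    "retweet_count": tweet["retweet_count"],
                }
            )
    return pd.DataFrame(rows, columns=["tweet_id", "favorite_count", "retweet_count"])
```

Storing tweet_id as a string at this stage avoids a later type conversion before merging.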
For the assessment, I examined the data both visually and programmatically and observed the following issues:
- Some dog names are not actual names
- The full HTML link in the source column should be replaced with the actual source
- Remove retweets by dropping rows with values in the retweeted_status_id column
- Drop the following columns: in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, text
- After splitting the timestamp column, drop the timestamp column
- Drop the pupper, doggo, puppo, and floofer columns after merging them into one "dog_stage" column
- Drop the expanded_urls column
- The timestamp column should be split into date and time columns, and the dtype of the date column should be changed to datetime
- The doggo, floofer, pupper, and puppo columns should be merged into one column (dog_stage)
- Some predicted breeds are not actual dog breeds. Remove predictions whose p1_dog, p2_dog, or p3_dog value is False, as they are not dogs of any breed
- For each row, find the maximum p_conf value and the corresponding p and p_dog values
- Change the data type of the tweet_id column in all three datasets to string before merging
- Merge all three datasets
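Several of the cleaning steps above can be sketched as follows. This is one possible implementation, not necessarily the exact code used. It assumes that each stage column holds its own name when set, that tweets with more than one stage are joined with a comma, and that the "best" prediction is the most confident one flagged as a dog. The function names and the output columns `breed` and `confidence` are illustrative.

```python
import re

import numpy as np
import pandas as pd

STAGE_COLS = ["doggo", "floofer", "pupper", "puppo"]


def extract_source(html):
    """Return the link text of '<a ...>text</a>', or the input unchanged."""
    match = re.search(r">([^<]+)<", html)
    return match.group(1) if match else html


def combine_stages(row):
    """Join every stage set in the row; NaN when no stage is given."""
    # Assumption: a stage column holds its own name (e.g. "pupper") when set.
    found = [col for col in STAGE_COLS if row[col] == col]
    return ", ".join(found) if found else np.nan


def clean_archive(archive):
    """Apply the retweet, source, dog_stage, and timestamp fixes to a copy."""
    # Keep original tweets only: retweets have a retweeted_status_id.
    df = archive[archive["retweeted_status_id"].isna()].copy()
    df["source"] = df["source"].apply(extract_source)
    df["dog_stage"] = df.apply(combine_stages, axis=1)
    # Split the timestamp into a datetime date column and a time column.
    stamp = pd.to_datetime(df["timestamp"], utc=True)
    df["date"] = stamp.dt.normalize()
    df["time"] = stamp.dt.time
    return df.drop(columns=STAGE_COLS + ["timestamp"])


def best_dog_prediction(row):
    """Pick the most confident prediction among p1..p3 that is flagged as a dog."""
    candidates = [
        (row[f"p{i}_conf"], row[f"p{i}"]) for i in (1, 2, 3) if row[f"p{i}_dog"]
    ]
    if not candidates:
        return pd.Series({"breed": np.nan, "confidence": np.nan})
    confidence, breed = max(candidates)
    return pd.Series({"breed": breed, "confidence": confidence})
```

The prediction helper is applied row by row, for example `image_predictions.apply(best_dog_prediction, axis=1)`, which returns a DataFrame with one `breed` and one `confidence` column.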
All of the issues observed were addressed during cleaning. After cleaning, all three datasets were merged. I made the following insights:
- The most common source of tweets was Twitter for iPhone, and the least common source was TweetDeck.
- The most common dog name is Cooper.
- A dog named 'Stephan' of the Chihuahua breed had the highest number of likes and retweets.

I also made the following visualizations:
- distribution of the top 20 dog breeds
- top 20 most common dog names
- distribution of the 20 least common dog breeds
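Counts such as the top 20 and the 20 least common breeds can be derived with `value_counts`. The helper below is a small sketch; its name and the `least` flag are illustrative, and the `breed` and `name` columns in the usage lines are assumed to exist in the merged dataset.

```python
import pandas as pd


def top_counts(series, n=20, least=False):
    """Return the n most (or least) frequent values with their counts."""
    counts = series.dropna().value_counts()  # sorted from most to least frequent
    if least:
        return counts.sort_values(ascending=True).head(n)
    return counts.head(n)


# Hypothetical usage on the merged dataset:
# top_counts(master["breed"]).plot(kind="barh")   # plotting requires matplotlib
# top_counts(master["name"])
# top_counts(master["breed"], least=True)
```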
After assessing my data, I made a copy of each of the three datasets before cleaning. I successfully cleaned all issues identified during the assessment. After cleaning, I saved the gathered and combined dataset into master_twitter_archive.csv.
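The copy, merge, and save steps can be sketched as follows. The inner join on tweet_id is an assumption (it keeps only tweets present in all three datasets), and the function name is illustrative.

```python
import pandas as pd


def merge_and_save(archive, predictions, tweet_counts, path="master_twitter_archive.csv"):
    """Merge copies of the three datasets on tweet_id and write the result to CSV."""
    frames = []
    for df in (archive, predictions, tweet_counts):
        df = df.copy()  # leave the original DataFrames untouched
        df["tweet_id"] = df["tweet_id"].astype(str)  # same key type in every dataset
        frames.append(df)
    # Assumption: inner joins, so only tweets found in all three datasets remain.
    master = frames[0].merge(frames[1], on="tweet_id", how="inner")
    master = master.merge(frames[2], on="tweet_id", how="inner")
    master.to_csv(path, index=False)
    return master
```

When the file is read back, passing `dtype={"tweet_id": str}` to `pd.read_csv` keeps the identifier as a string.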
Thereafter, I generated some insights and made some visualizations.