Commit eff6e52, 1 parent 00cdcd5. Showing 8 changed files with 349 additions and 0 deletions.
@@ -0,0 +1,76 @@
## **WhatsApp Chat Analyzer**

### 🎯 **Goal**

The main goal of this project is to analyze WhatsApp chat data to derive meaningful insights, such as user engagement, emoji usage, and response patterns. It aims to provide interactive visualizations for exploring user activity within a chat.

### 🧵 **Dataset**

The dataset is provided by the user in the form of a `.txt` or `.zip` file, which contains WhatsApp chat data exported from the messaging app.
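
For reference, a typical export contains one message per line. The exact timestamp format varies by locale and app version; the two styles below are illustrative samples, not lines from a real chat:

```text
01/01/24, 4:05 pm - Alice: Happy New Year! 🎉
01/01/24, 16:07 - Bob: Same to you!
```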

### 🧾 **Description**

This project allows users to upload WhatsApp chat files, which are then processed to extract useful information, including message counts, emoji usage, and response times. The data is visualized using interactive charts and plots, enabling easy exploration of user behavior in the chat.

### 🧮 **What I Have Done**

1. Implemented a file upload feature to allow users to upload WhatsApp chat files (`.txt` or `.zip`).
2. Processed the uploaded chat data to extract relevant information using regular expressions (see the sketch after this list).
3. Extracted details such as date, time, author, message content, and emojis.
4. Analyzed message data to calculate statistics like messages sent, average response time, word count, and emoji usage.
5. Visualized the extracted information using various charts such as pie charts, bar charts, and polar plots.
6. Generated a word cloud of the most frequently used words in the chat.
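
As a minimal sketch of steps 2 and 3, using the same 24-hour pattern the app relies on (`Bob` and the message text are made-up sample values):

```python
import re

# 24-hour export format: "dd/mm/yy, HH:MM - Author: Message"
pattern = r"(\d+/\d+/\d+), (\d+:\d+) - ([^:]*):(.*)"

line = "01/01/24, 16:07 - Bob: Same to you!"
match = re.match(pattern, line)
if match:
    date, time, author, message = match.groups()
    print(date, time, author.strip(), message.strip())
    # -> 01/01/24 16:07 Bob Same to you!
```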

### 🚀 **Models Implemented**

No machine learning models were used in this project. Instead, the focus is on text processing, data analysis, and visualization of chat data.

### 📚 **Libraries Needed**

- `streamlit` - For creating the web application.
- `pandas` - For handling and processing data.
- `plotly.express` and `plotly.graph_objs` - For creating interactive visualizations.
- `matplotlib.pyplot` and `seaborn` - For additional plotting and visualization.
- `wordcloud` - For generating a word cloud.
- `nltk` - For stopwords and text processing.
- `emojis` - For extracting emojis from messages.
- `collections.Counter` - For counting occurrences of emojis.
- `numpy` - For numerical operations.
- `requests` - For downloading custom stopwords.
- `re` - For regular expressions.
- `zipfile` and `io` - For handling file uploads in `.zip` format.
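
The third-party packages can be installed with `pip install streamlit pandas plotly matplotlib seaborn wordcloud nltk emojis numpy requests` (assuming the PyPI names match the import names, which holds for these libraries); `re`, `zipfile`, `io`, and `collections` ship with the Python standard library. The app is then launched with `streamlit run <script>.py`, substituting the actual script name in this repository.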

### 📊 **Exploratory Data Analysis Results**

Below are some visualizations derived from the chat data:

1. **Basic Information**
   ![Sample Messages](images/sample_messages.png)

2. **Emoji Distribution**
   ![Emoji Distribution Pie Chart](images/emoji_distribution.png)

3. **Emoji Usage by Author**
   ![Emoji Usage Bar Chart](images/emoji_usage_author.png)

4. **Top 10 Days With Most Messages**
   ![Messages Bar Chart](images/top_days_messages.png)

5. **Message Distribution by Day**
   ![Messages Polar Chart](images/message_distribution.png)

6. **Word Cloud**
   ![Word Cloud](images/word_cloud.png)

### 📈 **Performance of the Models based on the Accuracy Scores**

No machine learning models were used in this project. It is purely focused on data analysis and visualization of WhatsApp chat data.

### 📢 **Conclusion**

This project provides an interactive analysis of WhatsApp chat data, including statistics on user engagement, emoji usage, and response times. Users can explore individual and group message patterns, and the visualizations make it easy to understand user behavior. This type of analysis can be helpful for studying group dynamics or communication patterns.

### ✒️ **Your Signature**

Created by Aritro Saha -
[Website](https://aritro.tech/) | [GitHub](https://github.com/halcyon-past) | [LinkedIn](https://www.linkedin.com/in/aritro-saha/)
[6 binary image files, not displayed in the diff]
@@ -0,0 +1,273 @@
import streamlit as st
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
import emojis
from collections import Counter
import plotly.graph_objs as go
import numpy as np
import requests
import re
import zipfile
import io

# Function to handle date conversion
def handle_date(date_str):
    """
    Converts a date string to datetime or timedelta.

    Args:
        date_str (str): Date string to convert

    Returns:
        datetime or timedelta: Converted date/time object
    """
    try:
        return pd.to_datetime(date_str)
    except ValueError:
        # Handle incompatible format (e.g., "0 days 00:04:00")
        time_part = date_str.split()[2]
        return pd.to_timedelta(time_part)
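
# Example (assumed behavior): handle_date("0 days 00:04:00") raises inside
# pd.to_datetime and falls back to returning Timedelta('0 days 00:04:00').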

# Function to convert 24-hour time to 12-hour format
def twelve_hr_convert(time_str):
    """
    Converts 24-hour time format to 12-hour format.

    Args:
        time_str (str): Time string in 24-hour format

    Returns:
        tuple: (time_string, am/pm indicator)
    """
    hour, minute = map(int, time_str.split(":"))
    am_pm = "am" if hour < 12 else "pm"
    hour = hour % 12
    hour = 12 if hour == 0 else hour
    return f"{hour:02d}:{minute:02d}", am_pm
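
# Example: twelve_hr_convert("16:05") returns ("04:05", "pm");
# twelve_hr_convert("00:30") returns ("12:30", "am").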

# Function to process chat file
@st.cache_data
def process_chat_file(file_contents):
    """
    Processes the chat file contents and extracts relevant information.

    Args:
        file_contents (str): Contents of the chat file

    Returns:
        tuple: (full_df, message_df, emoji_df, emoji_author_df)
    """
    # Regular expressions for different date/time formats
    pattern1 = r"(\d+\/\d+\/\d+), (\d+:\d+\s(am|pm)) \- ([^\:]*):(.*)"
    pattern2 = r"(\d+\/\d+\/\d+), (\d+:\d+) \- ([^\:]*):(.*)"
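    # Illustrative lines each pattern matches (sample values, not real data):
    #   pattern1: "01/01/24, 4:05 pm - Alice: Happy New Year!"
    #   pattern2: "01/01/24, 16:07 - Bob: Same to you!"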
    data = []

    # Process each line of the file
    for line in file_contents.split("\n"):
        match = re.match(pattern1, line) or re.match(pattern2, line)
        if match:
            groups = match.groups()
            date, time = groups[0], groups[1]

            if len(groups) == 5:  # Pattern 1 (the nested (am|pm) adds a fifth group)
                ampm, author, message = groups[2], groups[3], groups[4]
                # Strip the am/pm suffix, whether separated by a narrow
                # no-break space (\u202f, as in real exports) or a plain space
                time = re.sub(r"\s(am|pm)$", "", time)
            else:  # Pattern 2
                time, ampm = twelve_hr_convert(time)
                author, message = groups[2], groups[3]

            data.append({
                "Date": date,
                "Time": time,
                "AM/PM": ampm,
                "Author": author.strip(),
                "Message": message.strip()
            })
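
    # Each matched line yields a dict like (sample values, not real data):
    #   {"Date": "01/01/24", "Time": "4:05", "AM/PM": "pm",
    #    "Author": "Alice", "Message": "Happy New Year!"}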

    # Create DataFrame
    df = pd.DataFrame(data)

    # Extract emojis from messages
    df["Emoji"] = df["Message"].apply(lambda text: emojis.get(str(text)))
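    # Note: emojis.get() returns the set of distinct emojis in a message,
    # so an emoji repeated within one message is counted once.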

    # Remove media messages and null messages (copy so the derived
    # columns below do not trigger SettingWithCopyWarning)
    message_df = df[~df["Message"].isin(['<Media omitted>', 'null'])].copy()

    # Convert date and time columns
    message_df["Date"] = pd.to_datetime(message_df.Date)
    message_df["Time"] = pd.to_datetime(message_df.Time).dt.strftime('%H:%M')

    # Calculate letter and word counts
    message_df['Letter_Count'] = message_df['Message'].apply(lambda s: len(str(s)))
    message_df['Word_count'] = message_df['Message'].apply(lambda s: len(str(s).split(' ')))

    # Count emoji usage per author, iterating row-wise so each message's
    # emojis are credited to the author who sent them
    emoji_author_counts = {author: Counter() for author in message_df.Author.unique()}
    for author, message_emojis in zip(message_df.Author, message_df.Emoji):
        for em in message_emojis:
            emoji_author_counts[author][em] += 1

    # Overall emoji frequencies across the whole chat
    total_emojis_list = [em for message_emojis in message_df.Emoji for em in message_emojis]
    emoji_author_df = pd.DataFrame.from_dict(emoji_author_counts, orient='index').fillna(0)
    emoji_df = pd.DataFrame(sorted(Counter(total_emojis_list).items(), key=lambda x: x[1], reverse=True),
                            columns=["emoji", "count"])

    # Combine date and time into DateTime column
    message_df['DateTime'] = pd.to_datetime(message_df['Date'].dt.strftime('%Y-%m-%d') + ' ' +
                                            message_df['Time'] + ' ' + message_df['AM/PM'], format='mixed')
    message_df = message_df.sort_values(by='DateTime')

    # Response time: seconds elapsed since the same author's previous message
    message_df['Response Time'] = np.nan
    last_message_time = {}
    for index, row in message_df.iterrows():
        author, current_time = row['Author'], row['DateTime']
        if author in last_message_time:
            message_df.at[index, 'Response Time'] = (current_time - last_message_time[author]).total_seconds()
        last_message_time[author] = current_time

    return df, message_df, emoji_df, emoji_author_df

# Function to extract text file from zip archive
def extract_text_file(uploaded_file):
    """
    Extracts a text file from a zip archive.

    Args:
        uploaded_file (UploadedFile): The uploaded zip file

    Returns:
        str: Contents of the extracted text file or an error message
    """
    try:
        zip_file = zipfile.ZipFile(io.BytesIO(uploaded_file.read()))
        text_file_name = next((name for name in zip_file.namelist() if name.endswith(".txt")), None)
        if text_file_name:
            return zip_file.read(text_file_name).decode("utf-8")
        else:
            return "No text file found in the zip archive."
    except Exception as e:
        return f"Error occurred: {str(e)}"

# Main Streamlit app
def main():
    """
    Main function to run the Streamlit app.
    """
    st.set_page_config("WhatsApp Chat Analyzer", page_icon="📲", layout="centered")
    st.title("Chat Data Visualization")

    # File upload section (restrict the picker to the supported types)
    uploaded_file = st.file_uploader("Upload a chat file", type=["txt", "zip"])

    if uploaded_file is not None:
        file_extension = uploaded_file.name.split(".")[-1]
        if file_extension == "txt":
            file_contents = uploaded_file.read().decode("utf-8")
        elif file_extension == "zip":
            file_contents = extract_text_file(uploaded_file)
        else:
            st.error("Please upload a .txt or .zip file")
            return

        # Process chat data
        df, message_df, emoji_df, emoji_author_df = process_chat_file(file_contents)

        # Display basic information
        st.header("Basic Information (First 20 Conversations)")
        st.write(df.head(20))

        # Display author stats (use message_df, which carries the derived
        # Word_count, Emoji, and Response Time columns)
        st.header("Author Stats")
        for author in message_df['Author'].unique():
            req_df = message_df[message_df["Author"] == author]
            st.subheader(f'Stats of {author}:')
            st.write(f'Messages sent: {req_df.shape[0]}')
            words_per_message = np.sum(req_df['Word_count']) / req_df.shape[0]
            st.write(f"Words per message: {words_per_message:.2f}")
            emoji_count = sum(req_df['Emoji'].str.len())
            st.write(f'Emojis sent: {emoji_count}')
            avg_response_time = round(req_df['Response Time'].mean(), 2)
            st.write(f'Average Response Time: {avg_response_time} seconds')
            st.write('-----------------------------------')
||
# Display emoji distribution | ||
st.header("Emoji Distribution") | ||
fig = px.pie(emoji_df, values='count', names='emoji', title='Emoji Distribution') | ||
fig.update_traces(textposition='inside', textinfo='percent+label') | ||
st.plotly_chart(fig) | ||
|
||
# Display emoji usage by author | ||
st.header("Emoji Usage by Author") | ||
fig = px.bar(emoji_author_df, x=emoji_author_df.index, y=emoji_author_df.columns, | ||
title="Emoji Usage by Author", barmode='stack') | ||
fig.update_layout(xaxis_title="Authors", yaxis_title="Count", legend_title="Emojis") | ||
st.plotly_chart(fig) | ||
|
||
# Display top 10 days with most messages | ||
st.header("Top 10 Days With Most Messages") | ||
messages_per_day = df.groupby(df['DateTime'].dt.date).size().reset_index(name='Messages') | ||
top_days = messages_per_day.sort_values(by='Messages', ascending=False).head(10) | ||
fig = go.Figure(data=[go.Bar( | ||
x=top_days['DateTime'], | ||
y=top_days['Messages'], | ||
marker=dict(color='rgba(58, 71, 80, 0.6)', line=dict(color='rgba(58, 71, 80, 1.0)', width=1.5)), | ||
text=top_days['Messages'] | ||
)]) | ||
fig.update_layout( | ||
title='Top 10 Days with Most Messages', | ||
xaxis=dict(title='Date', tickfont=dict(size=14, color='rgb(107, 107, 107)')), | ||
yaxis=dict(title='Number of Messages', titlefont=dict(size=16, color='rgb(107, 107, 107)')), | ||
bargap=0.1, | ||
bargroupgap=0.1, | ||
paper_bgcolor='rgb(233, 233, 233)', | ||
plot_bgcolor='rgb(233, 233, 233)', | ||
) | ||
st.plotly_chart(fig) | ||
|
||
# Display message distribution by day | ||
st.header("Message Distribution by Day") | ||
day_df = message_df[["Message", "DateTime"]].copy() | ||
day_df['day_of_week'] = day_df['DateTime'].dt.day_name() | ||
day_df["message_count"] = 1 | ||
day_counts = day_df.groupby("day_of_week").sum().reset_index() | ||
fig = px.line_polar(day_counts, r='message_count', theta='day_of_week', line_close=True) | ||
fig.update_traces(fill='toself') | ||
fig.update_layout(polar=dict(radialaxis=dict(visible=True)), showlegend=False) | ||
st.plotly_chart(fig) | ||
|
||
# Display word cloud | ||
st.header("Word Cloud") | ||
text = " ".join(str(review) for review in message_df.Message) | ||
nltk.download('stopwords', quiet=True) | ||
stopwords = set(nltk.corpus.stopwords.words('english')) | ||
# Add custom stopwords here | ||
stopwords_list = requests.get("https://gist.githubusercontent.com/rg089/35e00abf8941d72d419224cfd5b5925d/raw/12d899b70156fd0041fa9778d657330b024b959c/stopwords.txt").content | ||
stopwords.update(list(set(stopwords_list.decode().splitlines()))) | ||
#Create the Word Cloud | ||
wordcloud = WordCloud(width=800, height=400, random_state=1, background_color='white', colormap='Set2', collocations=False, stopwords=stopwords).generate(text) | ||
fig, ax = plt.subplots(figsize=(10, 5)) | ||
ax.imshow(wordcloud, interpolation='bilinear') | ||
ax.axis("off") | ||
st.pyplot(fig) | ||
|
||
# Display Creator Details | ||
html_temp = """ | ||
<div style="text-align: center; font-size: 14px; padding: 5px;"> | ||
Created by Aritro Saha - | ||
<a href="https://aritro.tech/">Website</a>, | ||
<a href="https://github.com/halcyon-past">GitHub</a>, | ||
<a href="https://www.linkedin.com/in/aritro-saha/">LinkedIn</a> | ||
</div> | ||
""" | ||
st.markdown(html_temp, unsafe_allow_html=True) | ||
|
||

# Driver code
if __name__ == "__main__":
    main()