
A data analysis project about Turkish politicians based on the data from Turkish Twitter and a machine learning model that I wrote.


C4MCI/Turkish-twitter-politician-analysis


In this project, I wrote a machine learning model that determines whether tweets about politicians are positive or negative. Then, I used the model to estimate which politicians are liked more by the Turkish people. Some of the data is interesting, so I recommend at least checking out the graphs. You can find the dataset I used here. I filtered it and created a new dataset to use in this project.

Preprocessing Our Data

I will be using the Zemberek library to process our data. Zemberek is a natural language processing (NLP) tool for the Turkish language. The steps that I will be using are:

  • Sentence normalization
  • Removing punctuation, mentions, digits, and emojis
  • Tokenization
  • Removing stop words
  • Lemmatization
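# normalizer, JString, tokenize, lemmatize and stop_words are set up elsewhere (Zemberek); string is the standard library module
# Sentence normalization
data["text"] = data["text"].apply(lambda x: str(normalizer.normalize(JString(x))))
# Remove punctuation characters, then digits
data["text"] = data["text"].apply(lambda x: "".join([i for i in x if i not in string.punctuation]))
data["text"] = data["text"].apply(lambda x: "".join([i for i in x if not i.isdigit()]))
# Tokenize, drop stop words, and lemmatize each remaining token
data["text"] = data["text"].apply(lambda x: tokenize(x))
data["text"] = data["text"].apply(lambda x: [i for i in x if i not in stop_words])
data["text"] = data["text"].apply(lambda x: [lemmatize(i) for i in x])
# Join the tokens back into a single string for the vectorizer
data["text"] = data["text"].apply(lambda x: " ".join(x))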
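The step list mentions mentions and emojis, which the snippet above does not show being removed. A minimal sketch of how that could be done with regular expressions follows. The function name `strip_noise` and the emoji ranges are assumptions for illustration (the ranges are rough, not an exhaustive emoji list), and this is not necessarily how the project handles it. Mentions are removed before punctuation, because `@` is itself in `string.punctuation` and stripping it first would leave the username behind.

```python
import re
import string

# "@username" style mentions
MENTION_RE = re.compile(r"@\w+")
# Rough emoji/symbol ranges; an assumption, not an exhaustive list
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")


def strip_noise(text: str) -> str:
    """Remove mentions, emojis, punctuation and digits from a tweet."""
    text = MENTION_RE.sub(" ", text)  # must run before punctuation removal
    text = EMOJI_RE.sub(" ", text)
    text = "".join(
        ch for ch in text
        if ch not in string.punctuation and not ch.isdigit()
    )
    # Collapse the whitespace left behind by the removals
    return " ".join(text.split())


cleaned = strip_noise("@ali Bugün 3 kez güldüm!!! 😀")
```

It could be applied the same way as the other steps, for example `data["text"].apply(strip_noise)`.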

Visualizing the Data

I will be using the matplotlib library in Python to visualize our data.

Negative - Positive Balance

[Figure: negative vs. positive label balance]

We can see from the graph that we have a nice balance in our data.

Most Used Words

[Figure: most used words]

Seems pretty reasonable. Nothing interesting here.

Most Used Word Combinations (Bigram)

[Figure: most used bigrams, lemmatized]

Now that is interesting. The most used 2-word combinations are "orospu çocuk" and "am koy". Since we lemmatized the data, it shows the lemmatized version of very common bad words. What if we don't lemmatize?

[Figure: most used bigrams, without lemmatization]

All right. That is pretty dark. I guess we can conclude that the Turkish Twitter community is pretty toxic.
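For reference, counting the most frequent word combinations can be sketched with the standard library alone. This is an illustration, not the code behind the graphs; the helper name `top_ngrams` is hypothetical, and it assumes each text is already a space-separated string of tokens, as produced by the preprocessing step.

```python
from collections import Counter


def top_ngrams(texts, n=2, k=10):
    """Return the k most common n-grams across all texts."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        # zip over n shifted copies of the token list yields consecutive
        # n-token windows; n-grams never cross tweet boundaries
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)


sample = ["a b c", "a b d"]
top_bigram = top_ngrams(sample, n=2, k=1)
```

Setting `n=3` gives trigram counts with the same function.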

Most Used Word Combinations (Trigram)

[Figure: most used trigrams]

Even though I can pretty much predict the results now, I wanted to see the trigram graph too. At this point, it is not really interesting.

Building Our Model

So we have an idea about the data we will use, thanks to the visualization. We can start building our machine learning model.

Vectorizing texts

Now that we have processed our text, we need to represent it with numbers so that our machine learning model can use it. I will be using the TF-IDF vectorizer (TfidfVectorizer) from sklearn.

from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer()
text_counts = tf.fit_transform(data["text"])
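To make the numbers behind TF-IDF less opaque, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings, as far as the scikit-learn documentation describes it: a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The function name `tfidf` is hypothetical, and the whitespace tokenization is a simplification (the real vectorizer also lowercases and drops single-character tokens by default).

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: c * idf[t] for t, c in tf.items()}
        # L2-normalize so every document vector has unit length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


rows = tfidf(["iyi bir gün", "kötü bir gün"])
```

Terms that appear in only one document ("iyi", "kötü") end up with a higher weight than terms shared by both ("bir", "gün"), which is the point of the IDF factor.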

Choosing the best model

We are trying to build a classification model, so there are several algorithms that we can use. I have picked three algorithms that I believe will yield the best results: the Naive Bayes Classifier, Logistic Regression, and the XGBoost Classifier. Let's try them and see which one works best with our data.

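from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# X_train, y_train and y_train_xgb come from a train/test split that is not shown here
clf = MultinomialNB()
clf.fit(X_train, y_train)

# A high max_iter gives the solver room to converge on sparse TF-IDF features
lr = LogisticRegression(max_iter=10000)
lr.fit(X_train, y_train)

xgb = XGBClassifier()
xgb.fit(X_train, y_train_xgb)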
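The separate `y_train_xgb` presumably holds integer-encoded labels, since recent versions of XGBoost's scikit-learn wrapper expect classes numbered from 0. That is an assumption; the encoding step is not shown here. A minimal sketch of such an encoding, plus the accuracy metric used to compare the models, could look like this (both helper names are hypothetical; scikit-learn's `LabelEncoder` and `accuracy_score` do the same job).

```python
def encode_labels(labels):
    """Map class labels to integers 0..n_classes-1 (sorted order)."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    return [index[label] for label in labels], classes


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


encoded, classes = encode_labels(["positive", "negative", "positive"])
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

The returned `classes` list makes it possible to map integer predictions back to the original label names.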

[Figure: accuracy scores of the three models]

It seems like Logistic Regression works best with our data, with a 92.5% accuracy score. I will admit that 92.5% accuracy is not great, but given that our dataset is very limited, I will call it acceptable. Since we will be using the Logistic Regression model in our analysis, let's save it so we don't have to process all this data again.

with open(lr_model_filename, "wb") as f:  # the with block closes the file after writing
    pickle.dump(lr, f)
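One thing to keep in mind when reloading: the model can only score text that was transformed by the same fitted vectorizer, so it is worth saving both together. The sketch below shows a save and load round trip. The helper names, the file name, and the stand-in objects are assumptions for illustration; in the project the values would be the fitted `lr` and `tf` objects.

```python
import pickle
import tempfile
from pathlib import Path


def save_artifacts(path, model, vectorizer):
    """Pickle the model and its vectorizer into a single file."""
    with open(path, "wb") as f:
        pickle.dump({"model": model, "vectorizer": vectorizer}, f)


def load_artifacts(path):
    """Load the (model, vectorizer) pair written by save_artifacts."""
    with open(path, "rb") as f:
        bundle = pickle.load(f)
    return bundle["model"], bundle["vectorizer"]


# Round trip with stand-in objects instead of real fitted estimators
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "lr_bundle.pkl"  # hypothetical file name
    save_artifacts(path, {"name": "stand-in model"}, {"name": "stand-in vectorizer"})
    model, vectorizer = load_artifacts(path)
```

Pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.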

Analysis About Turkish Politicians

Now that we have a model that can predict fairly accurately whether a tweet is negative or positive, we can fetch some tweets about politicians and use our model to estimate public opinion about each of them. I will be comparing two Turkish politicians: Kemal Kılıçdaroğlu and Mansur Yavaş. For those who don't know them, they are two of the most likely presidential candidates from the opposition party.

Kemal Kılıçdaroğlu

I have fetched some tweets about Kemal Kılıçdaroğlu. Let's run them through our model and see what the result looks like.

    predicted = model.predict(text_matrix)
    freq = nltk.FreqDist(predicted)
    df = pd.DataFrame(freq.items(), columns=["Label", "Frequency"])
    plt.style.use('ggplot')
    plt.pie(df["Frequency"], labels=df["Label"], autopct='%1.1f%%', shadow=True, colors=["lime", "thistle", "tomato"])
    plt.title("Public Opinion About Kemal Kılıçdaroğlu")
    plt.show()
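In the snippet above, the colors are matched to the labels by position, and `FreqDist` (a `Counter` subclass) yields labels in the order they first appear in `predicted`. That means the same label could get a different color from one chart to the next. A small sketch of one way to pin each label to a color follows. The label names and the color pairing are assumptions, since the model's actual class names are not shown here.

```python
from collections import Counter

# Assumed label names and color pairing, for illustration only
LABEL_COLORS = {"positive": "lime", "neutral": "thistle", "negative": "tomato"}


def pie_inputs(predicted):
    """Return (labels, sizes, colors) in a fixed label order."""
    counts = Counter(predicted)
    # Keep only labels that actually occur, in LABEL_COLORS order
    labels = [label for label in LABEL_COLORS if label in counts]
    sizes = [counts[label] for label in labels]
    colors = [LABEL_COLORS[label] for label in labels]
    return labels, sizes, colors


labels, sizes, colors = pie_inputs(["negative", "positive", "positive"])
```

The three lists can then be passed to `plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%')`.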

[Figure: pie chart, "Public Opinion About Kemal Kılıçdaroğlu"]

Mansur Yavaş

I used the same code as above.

[Figure: pie chart of public opinion about Mansur Yavaş]

Distributing Neutrals

In an election situation, you have to vote for one of them. That's why I wanted to see what would happen if there were no neutral opinions.
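How the neutral tweets were distributed is not spelled out here. One simple option, shown below as an assumption rather than as the method actually used, is to split the neutral share between positive and negative in proportion to their existing counts. That is equivalent to dropping the neutral tweets and renormalizing. The function name and label names are hypothetical.

```python
def distribute_neutrals(counts):
    """Return positive/negative shares after spreading neutrals proportionally."""
    pos = counts.get("positive", 0)
    neg = counts.get("negative", 0)
    decided = pos + neg
    if decided == 0:
        raise ValueError("need at least one positive or negative prediction")
    # Proportional redistribution reduces to each side's share of decided tweets
    return {"positive": pos / decided, "negative": neg / decided}


shares = distribute_neutrals({"positive": 50, "neutral": 30, "negative": 20})
```

With 50 positive, 30 neutral and 20 negative tweets, the positive share becomes 50/70 and the negative share 20/70.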

Kemal Kılıçdaroğlu

[Figure: public opinion about Kemal Kılıçdaroğlu with neutral predictions distributed]

Mansur Yavaş

[Figure: public opinion about Mansur Yavaş with neutral predictions distributed]

Last Words

Based on our machine learning model, it seems like Mansur Yavaş is more liked by the Turkish Twitter community. But that does not mean that Kemal Kılıçdaroğlu is disliked. They both have mostly positive results, as you can see in the graphs above. I hope you enjoyed reading this. There are some interesting and some expected results in this analysis. You can always try improving the model and contact me. I would be happy to talk with you.
