Tweeting


About The Project

Clustering tweets using the cosine distance metric and the K-means clustering algorithm.

Introduction

Data redundancy is a significant problem on Twitter. Users tend to generate many similar tweets (e.g., via the Retweet function) about popular topics and events. The result is a huge number of near-duplicate tweets, which makes readers reluctant to spend time reading many tweets about the same topic.
By clustering similar tweets together, we can produce a more concise and organized representation of the raw tweets, so that busy Twitter users only need to read one tweet per cluster.

Project Objectives

  • The aim of this project is to cluster the text tweets and label the clusters,
    so that when a new tweet is added to the corpus it can be labeled easily without
    re-running the full clustering (see the sketch after the keywords below).
  • Keywords

    Text mining / clustering / NLP / tweepy / NLTK / Twitter API
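
The idea, roughly: once K-means has produced cluster centroids, a new tweet can be vectorized with the same fitted TF-IDF model and assigned to the closest centroid by cosine distance, with no need to re-run the clustering. A minimal sketch of that idea (the helper name label_new_tweet is hypothetical; it assumes a fitted vectorizer and the centroids returned by NLTK's KMeansClusterer.means(), as produced in the pipeline sketch under Project Scope):

```python
# Hypothetical helper: assign a new tweet to an existing cluster by
# choosing the centroid with the smallest cosine distance.
from nltk.cluster import cosine_distance

def label_new_tweet(tweet_text, vectorizer, centroids):
    """vectorizer: fitted TF-IDF model; centroids: list of cluster mean vectors."""
    # Assumes the tweet shares at least one term with the fitted vocabulary,
    # so its TF-IDF vector is not all zeros.
    vec = vectorizer.transform([tweet_text]).toarray()[0]
    distances = [cosine_distance(vec, centroid) for centroid in centroids]
    return distances.index(min(distances))  # index of the nearest cluster

# Usage (names taken from the clustering sketch further down):
# cluster_id = label_new_tweet("new tweet text", vectorizer, clusterer.means())
```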

Project Scope

  • Data gathering (streaming tweets)
  • Data processing and wrangling (cleaning the tweet text and applying NLP techniques)
  • Vectorization (numerical representation of the text)
  • Cosine distance from NLTK
  • Apply K-means
  • Label the clusters
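
A minimal sketch of how these steps could fit together, assuming scikit-learn's TfidfVectorizer for the vectorization step (the project text only names TF-IDF, NLTK's cosine distance, and K-means; the cleaning shown here is deliberately simplified):

```python
# Rough sketch of the pipeline: clean the tweets, build TF-IDF vectors,
# then cluster with NLTK's K-means using cosine distance.
import nltk
from nltk.corpus import stopwords
from nltk.cluster import KMeansClusterer, cosine_distance
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords", quiet=True)

raw_tweets = [
    "RT @nasa: Perseverance has landed on Mars!",
    "The rover landed safely on Mars today",
    "Bitcoin price jumps again this morning",
]

# NLP / cleaning step: lowercase, strip punctuation, drop English stop words.
stop_words = set(stopwords.words("english"))

def clean(tweet):
    tokens = (t.strip(".,!?:;") for t in tweet.lower().split())
    return " ".join(t for t in tokens if t.isalpha() and t not in stop_words)

docs = [clean(t) for t in raw_tweets]

# Vectorization: TF-IDF representation of the cleaned tweets.
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(docs).toarray()

# K-means with NLTK's cosine distance.
clusterer = KMeansClusterer(2, cosine_distance, repeats=10,
                            avoid_empty_clusters=True)
labels = clusterer.cluster(vectors, assign_clusters=True)
print(labels)  # e.g. [0, 0, 1]: the two Mars tweets land in the same cluster
```
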
Installation

    1. Get a Twitter API key (this video walks through it):
       https://www.youtube.com/watch?v=vlvtqp44xoQ
    2. Install tweepy:
       !pip install tweepy
    3. Install NLTK:
       !pip install nltk
       (or: conda install -c anaconda nltk)
    4. Download the stopwords corpus through the NLTK downloader: nltk.download('stopwords'), or the graphical downloader opened by nltk.download().
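
With the credentials and packages in place, the data-gathering step can stream tweets. The project does not reproduce its streaming code here, so the following is only a sketch using Tweepy v4's StreamingClient with a Bearer Token; the token value, the "python" filter rule, and the 100-tweet cutoff are placeholders:

```python
# Sketch of the data-gathering step with Tweepy's StreamingClient (Tweepy 4.x).
import tweepy

BEARER_TOKEN = "your-bearer-token"  # placeholder; use your own API credentials

class TweetCollector(tweepy.StreamingClient):
    def __init__(self, bearer_token):
        super().__init__(bearer_token)
        self.tweets = []

    def on_tweet(self, tweet):
        # Store the raw text; cleaning and NLP happen later in the pipeline.
        self.tweets.append(tweet.text)
        if len(self.tweets) >= 100:  # stop after a small sample
            self.disconnect()

collector = TweetCollector(BEARER_TOKEN)
collector.add_rules(tweepy.StreamRule("python"))  # placeholder filter keyword
collector.filter()
```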

Evaluating Results

The K-means algorithm was run with:

  • Data representation: TF-IDF
  • Distance metric: cosine similarity
  • k values from 2 to 6 clusters
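
The section above only lists the configuration; as a rough sketch, the k = 2 to 6 runs could be compared with a cosine silhouette score (the scoring choice is an assumption, not stated in the project):

```python
# Run NLTK K-means with cosine distance for k = 2..6 on the TF-IDF vectors
# and compare the runs with a cosine silhouette score (assumed criterion).
from nltk.cluster import KMeansClusterer, cosine_distance
from sklearn.metrics import silhouette_score

def evaluate_k_range(vectors, k_min=2, k_max=6):
    scores = {}
    for k in range(k_min, k_max + 1):
        clusterer = KMeansClusterer(k, cosine_distance, repeats=10,
                                    avoid_empty_clusters=True)
        labels = clusterer.cluster(vectors, assign_clusters=True)
        scores[k] = silhouette_score(vectors, labels, metric="cosine")
    return scores

# Usage, with `vectors` being the dense TF-IDF matrix from the pipeline sketch:
# for k, score in evaluate_k_range(vectors).items():
#     print(k, round(score, 3))
```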

TRY MY PROJECT CODE ON BINDER
