Description

Twitter users often associate and socialize with other users based on similar interests. The Tweets of these users can be classified using a trained LDA model to automate the discovery of their similarities.

Prerequisites

Python 2.7 is recommended since the pattern library is currently incompatible with most Python 3 versions.

Python 3.6 also works with the pattern library, but it may need to be built from source because most newer Linux distributions do not ship with it pre-installed. The commands to build Python 3.6 from source are provided in the linux_setup_py3.6.sh script.

Installing

Linux

Download:

git clone https://github.com/kethort/twitter_LDA_topic_modeling.git

Run bash script:

./linux_setup_py3.6.sh

The pip requirements are listed in these files:

# for Python 2.7
pip install -r requirements_py2.txt

# for Python 3
pip install -r requirements_py3.txt

Link to the simple-wikipedia dump:

https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2

Mac OS X

The installation is very similar to the Linux installation. Extra install instructions are in osx_setup_py3.6.info. Install the pip requirements with:

pip install -r requirements_py3_OSX.txt

Process

  1. Get user and follower ids by location - twitter_user_grabber.py
  2. Download Tweets for each user - get_community_tweets.py
  3. Create an LDA model from a corpus of documents - create_LDA_model.py
  4. Generate topic probability distributions for Tweet documents - tweets_on_LDA.py
  5. Calculate distances between Tweet documents and graph them - plot_distances.py
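Step 5 compares the topic probability distributions produced in step 4. As a minimal sketch of that idea, the snippet below computes the Jensen-Shannon distance between distributions using only the standard library. The choice of metric, the function names, and the sample distributions are assumptions made for illustration; this README does not state which distance plot_distances.py actually uses.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2), a value in [0, 1]."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    # Mixture distribution: m_i > 0 whenever p_i > 0 or q_i > 0,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(jsd, 0.0))


# Hypothetical topic distributions over three topics for three users.
user_a = [0.70, 0.20, 0.10]
user_b = [0.65, 0.25, 0.10]
user_c = [0.05, 0.15, 0.80]

print("a vs b:", jensen_shannon_distance(user_a, user_b))
print("a vs c:", jensen_shannon_distance(user_a, user_c))
```

Users whose Tweets concentrate on the same topics produce a smaller distance than users whose Tweets concentrate on different topics, which is the property a distance plot relies on.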

Sample Visualizations

Built With

  • Gensim - Package for creating LDA model
  • pyLDAvis - Package for visualizing LDA model
  • Tweepy - Package for interacting with Twitter REST API
  • NLTK - Package for stopword management and tokenization