Skip to content

Latest commit

 

History

History
57 lines (44 loc) · 2.42 KB

README.md

File metadata and controls

57 lines (44 loc) · 2.42 KB

Text Mining Medium Posts

Author: Victor Pontes 2020-06-03

This repository is intended to record the procedures presented in the series of posts about Text Mining on my Medium Blog Post.

Published Articles:

Basic Usage

  • First of all, download the dataset News of Brazilian Newspaper and extract the articles.csv file into the 'data' directory of this repository.

  • Then it is necessary to configure the environment. I strongly recommend the use of virtual environments with conda. Above there is an example creating and activating a conda virtual environment.

	conda create -n text_mining python=3.8
	conda activate text_mining
  • Install the dependencies:
	make deps
  • After setting up the environment, you can run the get_base_files.py python script to generate the text mining models. In this step, text pre-processing, bag of words, topic modeling and clustering are performed.
	python get_base_files.py
  • From now on, if everything went well, the notebooks with the experiments can be run.

  • Directory Structure

├── LICENSE
├── Makefile           <- Makefile with command `make deps`
├── README.md          <- The top-level README for developers using this project.
├── data
├── models             <- Text models predictions.
├── notebooks          <- Jupyter notebooks.
├── reports            <- Generated analysis - CSV files
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment.
│
├── src                <- Source code for use in this project.
    ├── __init__.py    <- Makes src a Python module
    ├── builder.py     <- Functions for all steps of Text Mining 
    │
    ├── Text_Mining    <- Text Mining Library by Victor Pontes.
        │                 
        ├── preprocess.py     <- Module with specialized instructions for preprocess text data.
        ├── stopwords.py      <- List of stop words.
        ├── TopicClustering   <- Module with specialized instructions for matrix factorization, topic modeling and clustering activities.