This repository contains a CLI that retrieves speech data from the Babel API.
- Python3
- Latest pip installed
sudo pip install pipenv # Install pipenv on your system
pipenv install # Install all requirements in a virtual environment
pipenv shell # Enter the virtualenv created in the previous step
speeches.py [OPTIONS] INITIAL_DATE END_DATE
Options:
-s, --stage TEXT Initials from speech stage. For example, PE to 'Pequeno
Expediente'
--help Show this message and exit.
INITIAL_DATE and END_DATE must be in yyyy-mm-dd format.
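To check the two dates before calling the script, a minimal sketch is shown below. The helper names parse_cli_date and validate_range are hypothetical and are not part of speeches.py, and the rule that INITIAL_DATE must not come after END_DATE is an assumption rather than documented behaviour.

```python
from datetime import date, datetime
from typing import Tuple


def parse_cli_date(value: str) -> date:
    """Parse a yyyy-mm-dd string into a date (hypothetical helper)."""
    # strptime raises ValueError for input such as "01/02/2019" or "2019-13-01".
    return datetime.strptime(value, "%Y-%m-%d").date()


def validate_range(initial: str, end: str) -> Tuple[date, date]:
    """Parse both dates and check their order (the ordering rule is an assumption)."""
    start, stop = parse_cli_date(initial), parse_cli_date(end)
    if start > stop:
        raise ValueError("INITIAL_DATE must not be after END_DATE")
    return start, stop
```

Calling validate_range("2019-01-01", "2019-01-31") returns the two parsed dates, which can then be formatted back with isoformat() when building the command line.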
After retrieving and processing all speech data for the given period, the script creates a CSV file called speeches.csv.
After fetching the speeches you need, you can preprocess them. Preprocessing removes all numbers, accents and stopwords (as well as words that appear in more than 90% of the documents or in fewer than 1%) and stems all remaining tokens. To do so, run:
./pre_process.py
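The cleaning steps described for pre_process.py can be illustrated with the standard-library sketch below. It is not the code of pre_process.py: the function names, the tiny stopword list and the exact tokenisation are assumptions made for illustration, and stemming is left out because it needs a third-party stemmer.

```python
import re
import unicodedata
from collections import Counter

# Tiny illustrative stopword list (assumption); the list used by the script is not shown here.
STOPWORDS = {"a", "o", "de", "da", "do", "e", "que", "em", "para"}


def strip_accents(text):
    # Decompose characters (NFKD) and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    # Lowercase, remove accents, keep alphabetic runs only (this drops numbers),
    # then discard stopwords.
    words = re.findall(r"[a-z]+", strip_accents(text.lower()))
    return [w for w in words if w not in STOPWORDS]


def filter_by_document_frequency(docs, max_df=0.9, min_df=0.01):
    # Count in how many documents each token appears at least once.
    df = Counter(tok for doc in docs for tok in set(doc))
    n = len(docs)
    # Keep tokens whose document share lies between min_df and max_df.
    keep = {tok for tok, count in df.items() if min_df <= count / n <= max_df}
    return [[tok for tok in doc if tok in keep] for doc in docs]
```

With these helpers, each speech would first go through tokenize, and the resulting token lists would then be passed together to filter_by_document_frequency.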
This command reads speeches.csv, generated by the previous script, and generates 4 csv files:
- stem.csv - list of all stems used (format: id,stem)
- stemmed-speeches.csv - list of all preprocessed speeches. There are 2 rows per speech: the first is the list of stem ids and the second is the frequency of each of those stems. Both rows start with the speech ID
- metadatas.csv - metadata of all speeches (format: id,author_name,author_party,author_region,date,updated_at,stage)
- full-speeches.csv - list of all speeches without any processing (format: id,original)
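
The two-rows-per-speech layout of stemmed-speeches.csv can be read back into a mapping along the lines of the sketch below. It assumes comma-separated values, that the two rows of a speech are adjacent, and that the n-th frequency belongs to the n-th stem id. The function name read_stemmed_speeches and the sample data are hypothetical.

```python
import csv
import io


def read_stemmed_speeches(handle):
    """Return {speech_id: {stem_id: frequency}} from paired CSV rows."""
    rows = [row for row in csv.reader(handle) if row]
    speeches = {}
    # Rows come in pairs: stem ids first, then the matching frequencies.
    for ids_row, freq_row in zip(rows[0::2], rows[1::2]):
        if ids_row[0] != freq_row[0]:
            raise ValueError(f"row pair mismatch: {ids_row[0]} vs {freq_row[0]}")
        speeches[ids_row[0]] = {
            stem: int(freq) for stem, freq in zip(ids_row[1:], freq_row[1:])
        }
    return speeches


# Hypothetical sample data, not taken from a real run.
sample = io.StringIO("101,3,7,9\n101,2,1,4\n102,7\n102,5\n")
bag = read_stemmed_speeches(sample)
```

For a real file, the same function can be given a handle from open("stemmed-speeches.csv", newline=""), and the stem ids can be resolved to text through stem.csv.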