A website to visualise an NLP project on text generation with GPT-2
This website was developed with Python and Flask to visualise a project on text generation with GPT-2. Released in 2019 and trained on 8 million web pages, GPT-2 generates high-quality short texts from a few words provided as an initial input.
For this project, I decided to fine-tune GPT-2, that is, to make it especially sensitive to a certain vocabulary or style so that it reproduces those features while generating text. To achieve this goal, and to be better able to assess the results, I first looked for a corpus of texts featuring a very colourful, if homogeneous, rhetoric to retrain GPT-2 on. My choice eventually fell on the works of Karl Marx and Friedrich Engels, which I scraped from the website of the Marx Engels Archive. Moreover, to observe the outcome of “Marxist” text generation on a variety of topics, I thought it would be interesting to feed GPT-2 inputs taken from a newspaper. Hence the idea of this peculiar press review, which is updated with the latest news from The Guardian.
The model is fine-tuned by means of aitextgen, a Python library developed by Max Woolf. aitextgen leverages PyTorch to retrain the 124M-parameter version of GPT-2 on the dataset provided by the user.
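As a rough illustration, the fine-tuning step with aitextgen boils down to a few calls. This is a minimal sketch, assuming the preprocessed corpus is available as `marx.txt`; the hyperparameters shown are placeholders, not necessarily those used in the Colab notebook:

```python
from aitextgen import aitextgen

# Download the 124M GPT-2 checkpoint and move it to the GPU if one is available.
ai = aitextgen(tf_gpt2="124M", to_gpu=True)

# Fine-tune on the scraped Marx/Engels corpus. The step counts below are
# illustrative; aitextgen writes pytorch_model.bin and config.json to
# trained_model/ by default.
ai.train(
    "marx.txt",
    num_steps=3000,
    generate_every=1000,
    save_every=1000,
)
```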
The webapp periodically collects articles from The Guardian’s API and uses the language model to generate “Marxist comments” on them. I also implemented some basic sentiment analysis on the generated comments using VADER (Valence Aware Dictionary and sEntiment Reasoner). All these data are eventually stored in a PostgreSQL database.
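In outline, one collection cycle looks roughly like the sketch below. The Guardian endpoint and the VADER calls are real, but the query parameters and the surrounding glue are illustrative assumptions about how the app is wired together, and the final database write is omitted:

```python
import os
import requests
from aitextgen import aitextgen
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Load the fine-tuned model from the trained_model/ folder
# (the pytorch_model.bin and config.json produced by the Colab notebook).
ai = aitextgen(model_folder="trained_model")
analyzer = SentimentIntensityAnalyzer()

# Fetch the latest world-news articles from The Guardian Open Platform.
resp = requests.get(
    "https://content.guardianapis.com/search",
    params={"section": "world", "api-key": os.environ["GUARDIAN_API_KEY"]},
)
articles = resp.json()["response"]["results"]

for article in articles:
    # Use the headline as the prompt for the "Marxist comment".
    comment = ai.generate_one(prompt=article["webTitle"], max_length=100)
    # VADER returns 'neg', 'neu', 'pos' and an aggregate 'compound' score.
    sentiment = analyzer.polarity_scores(comment)
    # ...here the app would store the article, comment and scores in PostgreSQL.
```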
The website also features a function for interacting directly with the model. The Marxist GPT-2 is not always very intelligent, but it is a pretty opinionated one, and it is always fun to talk to it! 😉
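The interactive function works on the same principle; something along these lines (a sketch, with a made-up prompt and illustrative parameters) is enough to ask the fine-tuned model a question from Python:

```python
from aitextgen import aitextgen

# Load the fine-tuned "Marxist" GPT-2 from the trained_model/ folder.
ai = aitextgen(model_folder="trained_model")

# Ask it anything; temperature controls how adventurous the reply gets.
reply = ai.generate_one(
    prompt="What do you think about the stock market?",
    max_length=120,
    temperature=0.9,
)
print(reply)
```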
It is a bit long, but... the wait is worth the pain!
- Clone this repository.
- Go to the folder `data_and_model`:
  - install the requirements with `pip install -r requirements.txt`;
  - run `python scraper_preprocesser.py` to download the dataset on which to fine-tune the GPT-2 model (`marx.txt`). After the process has finished, you should find it in a new subfolder called `training_dataset/preprocessed`.
- To download and fine-tune the GPT-2 model, load the notebook `Text-Generating_GPT-2_Finetuner_on_Colab_GPU.ipynb` into your Google Drive, open it with Google Colaboratory and follow the instructions to create the two files `pytorch_model.bin` and `config.json`.
- Paste these files into the two subfolders `trained_model` to be found in `marxist_press_review/article_collector/` and `marxist_press_review/press_review_app/`.
In order for this webapp to work, you will need to set two environment variables:
- The password for the PostgreSQL database which will be created. Unless you change its name in the `docker-compose.yml` file, this variable must be called `POSTGRES_PASSWORD`;
- The API key for The Guardian Open Platform, which you can generate upon free registration. Unless you change its name in the `docker-compose.yml` file, this variable must be called `GUARDIAN_API_KEY`.
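For reference, a minimal sketch of how these variables can be picked up on the Python side, assuming they are passed through unchanged by `docker-compose.yml` (the connection string below is purely illustrative, not the app's actual configuration):

```python
import os

# Read the credentials injected by docker-compose; fail early if they are missing.
POSTGRES_PASSWORD = os.environ["POSTGRES_PASSWORD"]
GUARDIAN_API_KEY = os.environ["GUARDIAN_API_KEY"]

# An illustrative connection string for a PostgreSQL service named "db".
DATABASE_URL = f"postgresql://postgres:{POSTGRES_PASSWORD}@db:5432/postgres"
```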
- Install Docker, then go into the folder `marxist_press_review`:
  - run `docker-compose build` and wait for Docker to set up everything for you;
  - run `docker-compose up` and wait for the log to confirm that the webapp has started correctly (something like `press_review_app_1 | 2021-08-16 15:38:21,336: INFO: * Running on http://<ANY-ADDRESS-ENDING-BY-:5000/>`);
  - wait a little longer while the software downloads the most recent articles from The Guardian's API into PostgreSQL (the log will confirm that your API key works correctly by printing lines such as `2021-08-16 15:40:12,819: INFO: Successfully connected to https://content.guardianapis.com/search?section=world: scraping...`).
- Open the address `http://localhost:5000` in your browser, and the website should appear.
Have fun talking with Karl Marx!
- Alammar, J. (2019), The Illustrated GPT-2 (Visualizing Transformer Language Models), URL: https://jalammar.github.io/illustrated-gpt2/.
- Radford, A. et al. (2019), “Language Models are Unsupervised Multitask Learners”, OpenAI Blog, URL: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
- Vaswani, A. et al. (2017), “Attention is All You Need”, 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach (CA), URL: https://arxiv.org/abs/1706.03762.
- Increase the size of the training dataset
- Add more documentation
- Host on GCP
- Tests