
Website to visualise an NLP project on text generation. A GPT-2 model, re-trained to mimic the writing style of Karl Marx as closely as possible, comments daily on the latest news from The Guardian!

fra-mari/The-Karl-Marx-Press-Review

The Karl Marx’s Press Review

A website to visualise an NLP project on text generation with GPT-2

made-with-python Maintenance Website cv.lbesson.qc.to MIT license

Concept

This website was developed with Python and Flask to visualise a project on text generation with GPT-2. Trained on 8 million webpages, GPT-2 was released in 2019 and generates high-quality short texts from a few words provided as initial input.

For this project, I decided to fine-tune GPT-2, that is, to make it especially sensitive to a certain vocabulary or style so that it reproduces those features when generating text. To achieve this and to make the results easier to assess, I first looked for a corpus of texts with a very colourful, if homogeneous, rhetoric on which to retrain GPT-2. My choice eventually fell on the works of Karl Marx and Friedrich Engels, which I scraped from the website of the Marx Engels Archive. Moreover, to observe the outcome of “Marxist” text generation on a variety of topics, I thought it would be interesting to feed GPT-2 with inputs from a newspaper. Hence the idea of this peculiar press review, which is updated with the latest news from The Guardian.

Architecture

The model is fine-tuned with aitextgen, a Python library developed by Max Woolf. aitextgen leverages PyTorch to retrain the 124M-parameter version of GPT-2 on a dataset provided by the user.
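As a rough sketch of what fine-tuning and generation with aitextgen might look like (the notebook in this repository may well do it differently): the helper names `finetune` and `generate_comment`, the step counts, the temperature and the `trained_model` output folder are assumptions for illustration. aitextgen is imported inside the functions so that the definitions load even where the library is not installed, and `to_gpu=True` assumes a GPU runtime such as Colab.

```python
def finetune(dataset_path="marx.txt", num_steps=2000, output_dir="trained_model"):
    """Fine-tune the 124M GPT-2 on a plain-text file (hypothetical helper)."""
    from aitextgen import aitextgen  # third-party: pip install aitextgen

    # Fetch the 124M GPT-2 weights and load them as a PyTorch model on the GPU.
    ai = aitextgen(tf_gpt2="124M", to_gpu=True)
    # Training saves pytorch_model.bin and config.json into output_dir.
    ai.train(
        dataset_path,
        num_steps=num_steps,   # assumed value, tune to the dataset size
        batch_size=1,
        save_every=500,
        generate_every=500,    # print a sample every 500 steps
        output_dir=output_dir,
    )


def generate_comment(prompt, model_folder="trained_model", max_length=120):
    """Load a fine-tuned model from disk and continue the given prompt."""
    from aitextgen import aitextgen

    ai = aitextgen(model_folder=model_folder, to_gpu=False)
    texts = ai.generate(
        n=1,
        prompt=prompt,
        max_length=max_length,
        temperature=0.9,       # assumed value
        return_as_list=True,
    )
    return texts[0]
```

With a trained model in place, something like `generate_comment("The price of bread")` would return one generated continuation as a string.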

The webapp periodically collects articles from The Guardian’s API and uses the language model to generate “Marxist comments” based on them. I also implemented some basic sentiment analysis on the generated comments using VADER (Valence Aware Dictionary and sEntiment Reasoner). All of this data is then stored in a PostgreSQL database.
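The collection step could be sketched roughly as follows, using only the standard library. This is not the code of the article collector itself: the function names, the `page-size` value and the choice of fields to keep are assumptions. The `label_sentiment` helper applies the thresholds conventionally used with VADER's compound score (0.05 and -0.05), which may not be the ones used by this webapp.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://content.guardianapis.com/search"


def build_search_url(api_key, section="world", page_size=10):
    """Build the search URL for one section of The Guardian's content API."""
    query = urlencode({"section": section, "page-size": page_size, "api-key": api_key})
    return f"{SEARCH_URL}?{query}"


def extract_articles(payload):
    """Keep only the title and URL of each result in a decoded API response."""
    results = payload.get("response", {}).get("results", [])
    return [{"title": item["webTitle"], "url": item["webUrl"]} for item in results]


def fetch_articles(api_key, section="world"):
    """Download and parse the latest articles (performs a network request)."""
    with urlopen(build_search_url(api_key, section), timeout=10) as response:
        return extract_articles(json.load(response))


def label_sentiment(compound):
    """Map a VADER compound score in [-1, 1] to a coarse label."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"
```

Each article title can then serve as the prompt for the fine-tuned model, and the label of the generated comment can be stored alongside it.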

The website also features a function for interacting directly with the model. The Marxist GPT-2 is not always very intelligent, but it is pretty opinionated, and it is always fun to talk to it! 😉

[Demo gif: it is a bit long, but the wait is worth it!]

To Use This Code Locally

STEP 1: Generate the dataset and the model:

  • Clone this repository
  • Go to the folder data_and_model:
    • install the requirements with pip install -r requirements.txt;
    • run: python scraper_preprocesser.py to download the dataset on which to fine-tune the GPT-2 model (marx.txt). After running the process, you should see it in a new subfolder called training_dataset/preprocessed;
    • To download and fine-tune the GPT-2 model, load the Notebook Text-Generating_GPT-2_Finetuner_on_Colab_GPU.ipynb into your Google Drive, open it with Google Colaboratory and follow the instructions to create the two files pytorch_model.bin and config.json;
    • Copy these files into the two trained_model subfolders found in marxist_press_review/article_collector/ and marxist_press_review/press_review_app/.

STEP 2: Setting the required environment variables

In order for this webapp to work, you will need to set two environment variables:

  1. The password for the PostgreSQL database which will be created. Unless you change its name in the docker-compose.yml file, this variable must be called POSTGRES_PASSWORD;
  2. The API key for The Guardian Open Platform, which you can generate upon free registration via this link. Unless you change its name in the docker-compose.yml file, this variable must be called GUARDIAN_API_KEY.
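Before starting the containers, it may help to check that both variables are actually set. A minimal sketch, assuming the default names listed above; the `load_settings` helper is hypothetical and not part of this repository:

```python
import os

REQUIRED_VARIABLES = ("POSTGRES_PASSWORD", "GUARDIAN_API_KEY")


def load_settings(environ=None):
    """Return the required variables, failing early if any is missing or empty."""
    env = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARIABLES if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARIABLES}
```

Calling `load_settings()` with no argument reads from the current process environment; passing a dictionary instead makes the check easy to try out without touching the real environment.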

STEP 3: Running the webapp with Docker

  • Install Docker, then go into the folder marxist_press_review:
    • run docker-compose build and wait for Docker to set up everything for you;
    • run docker-compose up and wait for the log to confirm that the webapp has correctly started (something like press_review_app_1 | 2021-08-16 15:38:21,336: INFO: * Running on http://<ANY-ADDRESS-ENDING-BY-:5000/>);
    • wait a little longer while the software downloads the most recent articles from The Guardian's API into PostgreSQL (the log will confirm that your API key works correctly by printing lines such as: 2021-08-16 15:40:12,819: INFO: Successfully connected to https://content.guardianapis.com/search?section=world: scraping...);
  • open the address http://localhost:5000 in your browser and the website should appear.
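Instead of watching the log, one could poll the address until the webapp answers. The sketch below is only an illustration: the `wait_for_webapp` helper, the number of attempts and the delay are assumptions, and the `opener` parameter exists purely so the function can be tried without a running server.

```python
import time
from urllib.request import urlopen


def wait_for_webapp(url="http://localhost:5000", attempts=30, delay=2.0, opener=urlopen):
    """Poll the URL until it answers with HTTP 200; return False if it never does."""
    for attempt in range(attempts):
        try:
            with opener(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except OSError:
            # URLError and HTTPError are subclasses of OSError:
            # the server is not reachable (or not healthy) yet.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Run after docker-compose up, `wait_for_webapp()` returns True as soon as the Flask server responds, or False once the attempts are exhausted.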

Have fun talking with Karl Marx!


To Do

  • Increase the size of the training dataset
  • Add more documentation
  • Host on GCP
  • Tests
