
Intro

Here you can find a guide on how to contribute to the D-WISE Tool Suite (DWTS) as a developer.

The DWTS is developed as a client-server application and is deployed via docker-compose. Since backend and frontend development differ considerably, this guide is split into two parts.

The setup described here uses docker compose to run auxiliary software like PostgreSQL, Redis, etc., while the backend and/or frontend run outside of docker in a more traditional development environment.

Preparations

  1. Clone this repository: git clone https://github.com/uhh-lt/dwts.git
  2. Install docker

Backend Guide

Intro and Structure

The backend is written in Python and is organized into directories, each responsible for a different task.

  • backend/src/api -> REST API endpoints consumed by the frontend (or other clients)
  • backend/src/app/core -> core backend logic such as the data model, internal service modules, search and analysis functionality
  • backend/src/app/docprepro -> document preprocessing logic
  • backend/src/test -> unit, integration, and e2e tests
  • backend/src/configs -> configuration files to customize the backend behavior (handle with care!)

Development Setup

Requirements

  • Linux machine, optionally with an Nvidia GPU
  • Docker
  • Conda

Step-by-step instructions

First, set up the services running inside docker containers:

  1. Configure the docker containers. In the docker directory, you will find a .env file to configure various settings (see the example sketch after this list):
    1. Change UID and GID to match the user and group ID of your current user. This prevents permission problems with volumes that map files into the container.
    2. Change COMPOSE_PROJECT_NAME to prevent collisions with other docker compose projects running on the same machine.
    3. Change all values ending in _EXPOSED to configure the ports your services will be available on, taking care to avoid conflicts with other services running on your machine. We will configure the backend later to access these ports.
  2. In backend/src/app/preprocessing/ray_model_worker/config.yaml, you can configure the AI model worker, e.g. whether individual services should run on the CPU (cpu) or GPU (cuda).
  3. Run the script docker/setup-folders.sh to automatically create the directories for storing and caching ML models as well as uploaded documents.
  4. Copy docker-compose.yml to a docker-compose-dev.yml file and comment out the dwts-backend-api container. We won't need it because we will start the backend API ourselves, outside of docker. If you want to run the frontend outside of a container as well (covered later in this document), comment out the dwts-frontend container, too.
  5. Run docker compose -f docker-compose-dev.yml up -d to start all docker containers. Use docker compose -f docker-compose-dev.yml ps to check that all containers are running. The first start will take quite a while, as the containers download some large AI models.
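For illustration only, a development docker/.env might look like the following sketch. UID, GID, COMPOSE_PROJECT_NAME, API_EXPOSED, and CONTENT_SERVER_EXPOSED are the settings referenced in this guide; the remaining keys and all concrete values here are hypothetical, so treat the actual file as authoritative:

# illustrative values only -- adapt to your machine
UID=1000                       # output of `id -u`
GID=1000                       # output of `id -g`
COMPOSE_PROJECT_NAME=dwts_dev  # unique per developer on a shared machine
API_EXPOSED=5500               # host port of the backend API
CONTENT_SERVER_EXPOSED=5501    # host port of the content server
POSTGRES_EXPOSED=5432          # hypothetical key; check the file for all *_EXPOSED entries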

Then, set up the Python backend API:

  1. Set up the mamba dependency resolver to speed up dependency installation:

     conda install -n base conda-libmamba-solver
     conda config --set solver libmamba
    
  2. Install dependencies with conda: conda env create -f backend/environment.yml.

  3. Activate the conda environment you created in the previous step: conda activate <your_env_name>

  4. Tell the backend how to reach the various services running in docker. In backend/.env, set localhost for the _HOST settings of Redis, Elasticsearch etc., and set the _PORT settings to match the corresponding _EXPOSED ports from docker/.env (an illustrative sketch follows the launch.json example below).

  5. Load the backend configuration into your shell using set -o allexport; source backend/.env; set +o allexport.

  6. The backend configuration is located in backend/src/configs/default_localhost_dev.yaml by default. Change the repo.root_directory string to point to the docker/backend_repo folder inside your repository.

  7. In the backend/src folder, run python main.py to start the FastAPI backend. The following launch.json configuration is recommended for VS Code:

{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Python: main.py",
      "type": "python",
      "request": "launch",
      "program": "${workspaceFolder}/backend/src/main.py",
      "console": "integratedTerminal",
      "justMyCode": true,
      "env": {
        "PYTHONPATH": "${workspaceFolder}/backend/src",
        "HUGGINGFACE_HUB_CACHE": "${workspaceFolder}/docker/models_cache",
        "TRANSFORMERS_CACHE": "${workspaceFolder}/docker/models_cache",
        "TORCH_HOME": "${workspaceFolder}/docker/models_cache"
      }
    }
  ]
}
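To illustrate step 4, the relevant backend/.env entries could look like the sketch below. The exact key names may differ in the actual file; the important point is that each _HOST is localhost and each _PORT matches the corresponding _EXPOSED value from docker/.env:

# sketch with assumed key names -- check backend/.env for the real ones
POSTGRES_HOST=localhost
POSTGRES_PORT=5432    # = POSTGRES_EXPOSED in docker/.env
REDIS_HOST=localhost
REDIS_PORT=6379       # = REDIS_EXPOSED in docker/.env
ES_HOST=localhost
ES_PORT=9200          # = ELASTICSEARCH_EXPOSED in docker/.env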

Building for Production With Docker

This section describes deploying the backend API to production.

A single container image is built for the backend API and all celery workers, as they share the same dependencies. The startup command (entrypoint) decides whether the container serves the API or runs as a worker. The different entrypoints can be found at backend/src/*_entrypoint.sh, where every script corresponds to exactly one worker type. The backend source code is copied into the container, as are the config files in backend/src/configs, which can be altered to configure various settings (an illustrative compose fragment follows the build steps below).

  1. Build the docker image: docker build -f Dockerfile -t uhhlt/dwts_backend:debian_dev_latest .
  2. (optional) Push the docker image: docker push uhhlt/dwts_backend:debian_dev_latest
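As an illustration of the single-image approach, a compose file can run the API and a worker from the same image, differing only in their startup command. Service names, script names, and paths below are hypothetical and merely follow the backend/src/*_entrypoint.sh pattern described above:

# illustrative compose fragment -- not the project's actual docker-compose.yml
dwts-backend-api:
  image: uhhlt/dwts_backend:debian_dev_latest
  command: ./backend_api_entrypoint.sh          # serves the FastAPI app
dwts-celery-worker:
  image: uhhlt/dwts_backend:debian_dev_latest   # same image...
  command: ./celery_worker_entrypoint.sh        # ...but runs as a worker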

Data Model

The Data Model (DM) (located at backend/src/app/core/data/) is the core of the DWTS and represents all of its entities. The DM is based on sqlalchemy and pydantic. Although sqlalchemy is (mostly) database agnostic, the intended database for the DWTS DM is PostgreSQL.

The entities (i.e. database tables) are based on sqlalchemy and are located in backend/src/app/core/data/orm -- ORM stands for Object-Relational Mapping and denotes the bridge between python classes/objects and the database. To perform Create-Read-Update-Delete (CRUD) operations, we use the interfaces defined in backend/src/app/core/data/crud. To transfer entities between the frontend and the backend (and often internally within the backend), we make use of DTOs (Data Transfer Objects) based on pydantic, which are located in backend/src/app/core/data/dto.

How to extend the Data Model

In this guide, you will learn all steps necessary to extend the DWTS Data Model.

Please always have a look at implementations of other existing entities for examples and coding/naming conventions!

Step 1: Define the ORM

To define the ORM, first create a new file in backend/src/app/core/data/orm. For examples and further help, have a look at other files defined in this directory and the sqlalchemy documentation.

If you need an ObjectHandle (basically a pointer) for that entity (e.g. to enable Memos for that entity), you also have to extend the ObjectHandleORM accordingly.
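As a minimal sketch, a new ORM file could look like the following. The Bookmark entity, its columns, and the ORMBase import are hypothetical examples; mirror the conventions of the existing files in this directory:

from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import relationship

from app.core.data.orm.orm_base import ORMBase  # assumed base class; see existing ORM files


class BookmarkORM(ORMBase):  # hypothetical example entity
    __tablename__ = "bookmark"

    id = Column(Integer, primary_key=True, index=True)
    title = Column(String, nullable=False, index=True)

    # many-to-one: every bookmark belongs to exactly one project
    project_id = Column(Integer, ForeignKey("project.id"), nullable=False, index=True)
    project = relationship("ProjectORM", back_populates="bookmarks")  # assumes a matching reverse relationship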

Step 2: Register the ORM

To register the ORM, import the class in the SQLService (backend/src/app/core/data/db/sql_service.py). After this step, when you restart the application, sqlalchemy automatically creates a table as defined by the ORM.
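Registration boils down to an import in sql_service.py, which attaches the class to the shared metadata before the tables are created, e.g. for the hypothetical entity from step 1:

# in backend/src/app/core/data/db/sql_service.py (entity name hypothetical)
from app.core.data.orm.bookmark import BookmarkORM  # noqa: F401  (import only; registers the table)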

⚠️ Note that we currently do not support DB migrations, which can lead to problems when extending the DM with an old, i.e., not fresh, database. In a future version, when alembic DB migrations are supported, we will update this guide!

Step 3: Define the DTOs

To define the DTOs, first, create a new file in backend/src/app/core/data/dto with the same filename as the ORM file. In that file, add the DTOs as separate classes that inherit from each other (if possible).
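Continuing the hypothetical Bookmark example, the DTO file could follow this inheritance pattern. Field names are illustrative; orm_mode assumes pydantic v1, so adjust accordingly if the project has moved to pydantic v2 (model_config with from_attributes):

from typing import Optional

from pydantic import BaseModel, Field


class BookmarkBaseDTO(BaseModel):
    # fields shared by all Bookmark DTOs
    title: str = Field(description="Title of the bookmark")


class BookmarkCreate(BookmarkBaseDTO):
    # payload for creating a bookmark
    project_id: int = Field(description="Project the bookmark belongs to")


class BookmarkUpdate(BaseModel):
    # partial update payload; every field is optional
    title: Optional[str] = Field(default=None, description="New title of the bookmark")


class BookmarkRead(BookmarkBaseDTO):
    # returned by the API, including DB-generated fields
    id: int = Field(description="ID of the bookmark")

    class Config:
        orm_mode = True  # build the DTO directly from the ORM object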

Step 4: Implement CRUD

To implement the CRUD object, first, create a new file in backend/src/app/core/data/crud with the same filename as the ORM and DTO file. In that file, create a CRUD class that inherits from CRUDBase with the DTOs from the previous step. This already provides basic CRUD operations. If you need to customize or add operations, implement the respective methods in this class.
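A matching CRUD sketch for the hypothetical Bookmark entity; the generic parameters and the constructor of CRUDBase are assumed from the pattern described above, so check crud_base.py for the exact signature:

from app.core.data.crud.crud_base import CRUDBase  # assumed import path
from app.core.data.dto.bookmark import BookmarkCreate, BookmarkUpdate  # hypothetical DTO module
from app.core.data.orm.bookmark import BookmarkORM  # hypothetical ORM module


class CRUDBookmark(CRUDBase[BookmarkORM, BookmarkCreate, BookmarkUpdate]):
    # create, read, update, and remove are inherited from CRUDBase;
    # only add methods here for custom operations, e.g. project-scoped queries
    pass


crud_bookmark = CRUDBookmark(model=BookmarkORM)  # constructor signature assumed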

Frontend Guide

Intro and Structure

The frontend is written in TypeScript using React, bootstrapped with Create React App. It is organized in directories each responsible for different tasks.

  • frontend/src/api -> custom TanStack Query hooks to communicate with the backend via the automatically generated API client in openapi.
  • frontend/src/components -> reusable UI components
  • frontend/src/features -> components that implement logic and access the API to build a feature. These features are used across the app and are not specific to a certain route.
  • frontend/src/layouts -> different page layouts used by the routing library.
  • frontend/src/plugins -> configuration files to customize the behavior of various plugins.
  • frontend/src/router -> configuration of the React Router.
  • frontend/src/store -> configuration of the global store - Redux Toolkit.
  • frontend/src/view -> the main directory of the app. Every subfolder corresponds to a route. All features specific only to the corresponding route are implemented here.

Development Setup

Requirements

  • Linux machine with Nvidia GPU
  • Docker
  • Node v18

Step-by-step instructions

  1. Refer to the backend section to set up services running inside docker.
  2. Install dependencies with npm: cd frontend && npm install -f.
  3. Configure the frontend. In frontend/.env.development, set REACT_APP_SERVER and REACT_APP_CONTENT to match the ports you exposed in docker/.env (API_EXPOSED and CONTENT_SERVER_EXPOSED); see the sketch after this list.
  4. Load the frontend configuration into your current shell using set -o allexport; source .env.development; set +o allexport.
  5. Run npm start to start the frontend. Check http://localhost:3030 to see if everything is working.
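For illustration, if you kept API_EXPOSED=5500 and CONTENT_SERVER_EXPOSED=5501 in docker/.env, frontend/.env.development would contain something like the following (values are examples only):

REACT_APP_SERVER=http://localhost:5500   # backend API (API_EXPOSED)
REACT_APP_CONTENT=http://localhost:5501  # content server (CONTENT_SERVER_EXPOSED)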

Build for Production With Docker

When deploying the frontend, the React code is first bundled and then served via an NGINX web server. In production mode, the NGINX web server also acts as a reverse proxy for communication with the backend. frontend/.env.production is configured to make the backend API available at /api and the content server available at /content. The matching configuration of the NGINX web server is located at /docker/nginx.conf.
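The proxy rules in /docker/nginx.conf are authoritative; conceptually, they route the two path prefixes to the backend services, roughly like this sketch (upstream names and ports are hypothetical):

# sketch only -- see /docker/nginx.conf for the real configuration
location /api/ {
    proxy_pass http://dwts-backend-api:5500/;      # hypothetical upstream
}
location /content/ {
    proxy_pass http://dwts-content-server:5501/;   # hypothetical upstream
}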

  1. Build the docker image: docker build -f Dockerfile -t uhhlt/dwts_frontend:latest .
  2. (optional) Push the docker image: docker push uhhlt/dwts_frontend:latest

Communication with the backend

The backend uses FastAPI to serve an accessible API that follows the OpenAPI standard. We consume this API by automatically generating an OpenAPI client with OpenAPI Typescript Codegen and building hooks around it with TanStack Query, which can be conveniently used in all components.

In case the backend is updated and offers new API endpoints, the following steps must be performed to make them available in the frontend:

  1. Download the new OpenAPI specification of the backend. Run npm run update-api. You probably have to set the correct backend API_PORT in frontend/package.json scripts > update-api for this to work.
  2. Generate the new client. Run npm run generate-dev. This command deletes everything in frontend/src/api/openapi, generates new code, and formats it with prettier.
  3. Implement a new Hook in frontend/src/api.

Deployment

Docker-compose is used to orchestrate the frontend, the API, celery workers, databases, and other services that are used in the D-WISE Tool Suite.

  1. Configure ports and credentials in /docker/.env
  2. (optional) Note the environment variable DWISE_BACKEND_CONFIG, which points to backend/src/configs/default_localhost_dev.yaml by default; additional adjustments can be made there.
  3. Set up the folder structure with sh /docker/setup-folders.sh. This script creates models_cache for storing various ML models, spacy_models for storing spaCy models for different languages, backend_repo for storing uploaded documents, and tika, which stores the Java executables for Apache Tika.
  4. Deploy the application with docker compose up -d. Check with docker compose ps that all containers are running.

Versioning

We use Semantic Versioning as explained below:

Given a version number MAJOR.MINOR.PATCH, increment the:

  1. MAJOR version when you make incompatible API changes
  2. MINOR version when you add functionality in a backward compatible manner
  3. PATCH version when you make backward compatible bug fixes
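For example, starting from version 1.4.2: a backward compatible bug fix is released as 1.4.3, new backward compatible functionality as 1.5.0, and an incompatible API change as 2.0.0.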

Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.

For reference, see https://semver.org/.
