Real-time Amazon bot detection with BERT and Deephaven

This example uses Deephaven to predict, in real time, whether an Amazon review was generated by ChatGPT. The data comes from the Amazon Reviews Dataset, collected by Julian McAuley's lab and hosted on Hugging Face.

The model used for bot prediction comes from Vidhi Kishor Waghela's entry in a ChatGPT-generated text detection Kaggle competition. The detector training data, script, and resulting PyTorch model are stored in the detector directory.

This Deephaven example can be run in Jupyter using Deephaven's Python package, or inside a Docker container. We've provided scripts, notebooks, and instructions for both Jupyter and Docker, so pick the path that feels most comfortable to you.

Git LFS

The trained PyTorch model used in this project is stored with Git LFS. To access this model, you need to install Git LFS and enable it for this repository.

1. Install Git LFS by following the Git LFS installation instructions.
2. Configure Git LFS for this repo and use it to pull the PyTorch model:
```
git lfs install
git lfs fetch
git lfs pull
```
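If the pull did not run, the model file on disk is only a small text pointer rather than the real weights. The sketch below is one way to check for that; it relies on the fact that Git LFS pointer files begin with a `version https://git-lfs.github.com/spec/v1` line. The helper names are hypothetical and not part of this repository.

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at `path` is still a Git LFS pointer stub."""
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def unfetched_lfs_files(directory, pattern="*"):
    """List files under `directory` that are still LFS pointers (hypothetical helper)."""
    return sorted(
        p for p in Path(directory).rglob(pattern)
        if p.is_file() and is_lfs_pointer(p)
    )
```

Calling `unfetched_lfs_files("detector")` from the repository root should return an empty list once the model has been pulled.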

Now that the PyTorch model is available, continue to the Jupyter or Docker section to start working with this example.

Jupyter

Deephaven's Python package requires Java 17 or higher to be installed on your machine. See this page for OS-specific instructions on installing Java.
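To confirm the Java requirement before going further, a small check along these lines may help. It is a sketch, not part of this repository: it assumes `java` is on your `PATH` and that `java -version` prints a line containing `version "<number>"`, which is the usual format for OpenJDK and Oracle builds. Both function names are hypothetical.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report themselves as "1.8", "1.7", and so on.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major():
    """Run `java -version` and return the major version, or None if unavailable."""
    try:
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True
        )
    except FileNotFoundError:
        return None  # no `java` executable on PATH
    # `java -version` typically writes to stderr, so check both streams.
    return parse_java_major(result.stderr + result.stdout)
```

If `installed_java_major()` returns `None` or a number below 17, install or upgrade Java before continuing.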

Set up the environment

Navigate to the jupyter subdirectory:
```
cd jupyter
```
Then, execute a script to set up the environment:
```
chmod +x create-venv.sh
./create-venv.sh
```
This creates a Python virtual environment called dh-amazon-venv and installs all of the required Python packages into that environment.

Next, activate the environment and start Jupyter:

```
source dh-amazon-venv/bin/activate
jupyter notebook
```

Once you've started Jupyter, you're ready to go!

Download the data

This step only needs to be done once. It can take quite a while, depending on the speed of your internet connection and the processing power of your machine. It took about 20 minutes on an M2 MacBook Pro with 8 cores.

1. Open the download_data.ipynb notebook and select the dh-amazon-venv kernel.
2. Set the NUM_PROC variable at the top of the second cell to the number of processors available to you. This has a significant impact on the download speed.
3. Run the whole notebook. This will download the Amazon data, filter it for 2023, and write it to the amazon-data directory in Parquet format.
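If you are unsure what value to use for NUM_PROC, one option is to derive it from the number of logical CPUs Python can see. This is only a sketch: the `suggested_num_proc` helper and its `reserve` parameter are hypothetical, and leaving one core free for the rest of the system is a judgment call rather than a requirement of the notebook.

```python
import os


def suggested_num_proc(reserve=1):
    """Suggest a worker count: logical CPUs minus `reserve`, never below 1."""
    total = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, total - reserve)


# Value you could assign to NUM_PROC in the notebook.
NUM_PROC = suggested_num_proc()
```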

Run the example

Finally, navigate to the detect_bots.ipynb notebook and select the dh-amazon-venv kernel. This notebook walks you through the whole example, and gives you the opportunity to play with Deephaven. We hope you learn something new!

Docker

To run this example with Docker, you must have Docker installed on your machine. See this guide for OS-specific instructions.

Start the Deephaven server

Navigate to the docker subdirectory:
```
cd docker
```
Build and run the Docker image using Docker Compose:
```
docker compose up
```
Once the image is built and the server has started, navigate to the Deephaven IDE at http://localhost:10000/ide/.
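The first build can take a few minutes, and the IDE is not reachable until the server starts. If you would rather poll than refresh the browser, the sketch below waits for something to accept TCP connections on port 10000, the port from the URL above. The function names, retry count, and delay are hypothetical. An open port only shows that a process is listening, not that the IDE has finished loading.

```python
import socket
import time


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host="localhost", port=10000, attempts=30, delay=2.0):
    """Poll until the port accepts connections; return False if it never does."""
    for _ in range(attempts):
        if port_open(host, port):
            return True
        time.sleep(delay)
    return False
```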

The Deephaven IDE contains all of the scripts associated with this example. Let's get started!

Download the data

This step only needs to be done once. It can take quite a while, depending on the speed of your internet connection and the processing power of your machine. It took about 20 minutes on an M2 MacBook Pro with 8 cores. You may need to allocate more resources to the Docker engine to access the full capabilities of your machine. This can be done using Docker Desktop. See this guide for more details.

1. In the right-hand sidebar, open the download_data.py script.
2. Set the NUM_PROC variable on line 8 to the number of processors available to you. This has a significant impact on the download speed.
3. Run the script using the "play" button at the top of the screen. This will download the Amazon data, filter it for 2023, and write it to the amazon-data directory in Parquet format.
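To make the "filter it for 2023" step concrete, here is a minimal sketch of such a filter in plain Python. It is not the code in download_data.py. It assumes each review is a dict with a `timestamp` field holding milliseconds since the Unix epoch, and it uses UTC year boundaries. Both are assumptions, so check them against the actual dataset schema and script.

```python
from datetime import datetime, timezone

# Year boundaries in milliseconds since the Unix epoch (UTC, an assumption).
START_2023_MS = int(datetime(2023, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)
START_2024_MS = int(datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)


def is_from_2023(review):
    """Return True if the review's (assumed) millisecond timestamp falls in 2023."""
    return START_2023_MS <= review["timestamp"] < START_2024_MS


def filter_2023(reviews):
    """Keep only the reviews written in 2023."""
    return [r for r in reviews if is_from_2023(r)]
```

The half-open interval keeps the first millisecond of 2023 and excludes the first millisecond of 2024, so no review is counted in two years if the same pattern is reused for other years.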

Run the example

Once you've downloaded the data, you're ready to start working with the example. The code is divided between two scripts, stream_data.py and detect_bots.py. Running detect_bots.py will also execute stream_data.py, so you can start there if you'd like. We hope you enjoy this example!
