Skip to content

Latest commit

 

History

History
144 lines (111 loc) · 13.4 KB

README.md

File metadata and controls

144 lines (111 loc) · 13.4 KB

German Text Embedding Clustering Benchmark

Shortcut: Datasets - Results - Installation - Usage - Citation

Remarks

This repository contains code to evaluate language models for clustering word embeddings as used in neural topic modelling (see for example BERTopic) specifically for German. This work builds on Massive Text Embedding Benchmark (MTEB), which provides benchmark datasets and results for a wide range of tasks.

More specifically, this work contributes to mteb in the following ways:

  • clustering datasets in German (MTEB only consider English datasets)
  • the evaluation of more clustering algorithms

🏆 Note that you can contribute results to the MTEB leaderboard as our datasets are officially part of MTEB (apart from the Reddit datasets, see below)! You can either use this library or MTEB directly to produce results. If you run into any problems, please raise an issue. 🏆

Datasets

Currently, we provide 4 datasets. The datasets are built similarly to the English clustering datasets in MTEB. Unfortunately, there are fewer datasets available for German and, therefore, we were not able to build as many datasets (e.g. Arxiv only contains very few German papers). However, we plan to add more datasets in the future.

Name Hub URL Description
BlurbsClusteringS2S
(data ref.)
slvnwhrl/blurbs-clustering-s2s Clustering of book titles: 17'726 unqiue samples, 28 splits with 177 to 16'425 samples and 4 to 93 unique classes (as represented by genres, e.g. fantasy). On average, a sample is 23.17 chars long. Splits are built similarly to MTEB's ArxivClusteringS2S (Paper).
BlurbsClusteringP2P
(data ref.)
slvnwhrl/blurbs-clustering-p2p Clustering of book blurbs (title + blurb): Clustering of book titles: 18'084 unqiue samples, 28 splits with 177 to 16'425 samples and 4 to 93 unique classes as represented by genres, e.g. fantasy. On average, a sample is 663.91 chars long. Splits are built similarly to MTEB's ArxivClusteringP2P (paper).
TenKGNADClusteringS2S
(data ref.)
slvnwhrl/tenkgnad-clustering-s2s Clustering of news article titles: 10'267 unique samples, 10 splits with 1'436 to 9'962 samples and 9 unique classes (as represented by, e.g. politics). On average, a sample is 50.97 chars long. Splits are built similarly to MTEB's TwentyNewsgroupsClustering (paper).
TenKGNADClusteringP2P
(data ref.)
slvnwhrl-tenkgnad-clustering-p2p Clustering of news articles (title + article body): 10'275 unique samples, 10 splits with 1'436 to 9'962 samples and 9 unique classes (as represented by, e.g. politics). On average, a sample is 2648.46 chars long. Splits are built similarly to MTEB's TwentyNewsgroupsClustering (paper).

Reddit datasets

We also include two Reddit datasets in the benchmark (similar to MTEB's RedditClustering and RedditClusteringP2P datasets). However, we only provide ids, and if you want to use these datasets, you need to download the data yourself (see Including the Reddit dataset for instructions). The datasets contain "hot" and "top" submissions to 80 popular German subreddits and were extracted using PRAW.

Name Description
RedditClusteringS2S Clustering of reddit submission titles: 40'181 unique samples, 10 splits with 9'288 to 26'221 samples and 10 to 50 unique classes (as represented by subbredits, e.g. r/Finanzen). On average, a sample is 52.16 chars long. Splits are built similarly to MTEB's RedditClustering (paper).
RedditClusteringP2P Clustering of reddit submissions (title + body): 40'305 unique samples, 10 splits with 9'288 to 26'221 samples and 10 to 50 unique classes (as represented by subbredits, e.g. r/Finanzen). On average, a sample is 901.78 chars long. Splits are built similarly to MTEB's RedditClusteringP2P (paper).

Important: As of June 19, 2023 new Data API Terms become effective for Reddit. Most likely, it will not be allowed anymore to use Reddit data for such purposes (see especially "2.4 User Content" in the terms). Make sure you understand these terms and use Reddit data accordingly.

Results

All results show the V-measure (multiplied by 100 and rounded to two decimal points).

k-means (same as MTEB)

Model BlurbsClusteringS2S BlurbsClusteringP2P TenKGNADClusteringS2S TenKGNADClusteringP2P RedditClusteringS2S RedditClusteringP2P AVG
deepset/gbert-base 11.27 35.36 24.23 37.16 28.57 35.30 28.66
deepset/gbert-large 13.34 39.30 34.97 41.69 34.35 44.61 34.71
deepset/gelectra-base 7.74 10.06 4.11 9.02 6.59 7.73 7.54
deepset/gelectra-large 7.57 13.96 3.91 11.49 7.59 10.54 9.18
uklfr/gottbert-base 8.37 34.49 9.34 33.66 16.07 19.46 20.23
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 14.33 32.46 22.26 36.13 33.33 44.59 30.52
sentence-transformers/paraphrase-multilingual-mpnet-base-v2 15.81 34.38 22.00 35.96 36.39 48.43 32.16
T-Systems-onsite/cross-en-de-roberta-sentence-transformer 12.69 30.81 10.94 23.50 27.98 33.01 23.16
sentence-transformers/use-cmlm-multilingual 15.24 29.63 25.64 37.10 33.62 49.70 31.82
sentence-transformers/sentence-t5-base 11.57 30.59 18.11 44.88 31.99 45.80 30.49
sentence-transformers/sentence-t5-xxl 15.94 39.91 19.69 43.43 38.54 55.90 35.57
xlm-roberta-large 7.29 29.84 6.16 32.46 10.19 23.50 18.24

additional clustering algorithms

In addition to k-means, we evaluate the following different clustering algorithms:

Inspired by BERTopic, we also evaluate PCA and UMAP as a "preprocessing" step.

If you want add/evaluate more algorithms, please have a look at FlexibleClusteringEvaluator.py on how to achieve that.

UMAP + {k-means, HDBSCAN, DBSTREAM}

For all results have a look at results/tecb-de-full-results.csv.

Model Algorithm BlurbsClusteringS2S BlurbsClusteringP2P 10KGNADClusteringS2S 10KGNADClusteringP2P RedditClusteringS2S RedditClusteringP2P AVG
deepset/gbert-base k-means
HDBSCAN
DBSTREAM
12.81
14.31
12.70
38.81
22.83
37.06
29.31
05.44
28.92
43.61
32.45
42.74
31.77
17.21
31.70
46.06
31.99
44.84
33.73
20.71
32.99
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 k-means
HDBSCAN
DBSTREAM
13.80
20.00
15.77
34.16
30.67
32.47
25.22
24.90
26.44
43.75
35.53
41.31
32.64
32.23
33.14
47.46
39.70
46.47
32.84
30.51
32.60

Installation

If you want to run the code from this repository (for creating the Reddit dataset or model evaluation), clone this repository and move to the downloaded folder

git clone https://github.com/ClimSocAna/tecb-de.git 

and create a new environment with the necessary packages

python -m venv tecb-de

source tecb-de/bin/activate # Linux, MacOS
venv\Scripts\activate.bat # Windows

pip install -r requirements.txt

Usage

Running the evaluation

Simply run python scripts/run_cteb_de.py. This will produce an results folder. You can modify the script to run the evaluation for models and clustering algorithms (and configuration) of your choosing.

Including the Reddit dataset

If you want to use the reddit dataset, you first have to download the data

# move to the reddit_data folder in tecb-de
# make sure you have PRAW and tdqm installed: pip install praw, pip install tqdm

# downloads the data and saves it to submissions.tsv
python download.py

Note that for this to work, you have to edit the reddit_data/praw.ini with your client data. You can find instructions here.

Then you can create the datasets

# creates the splits for both tasks (RedditClusteringS2S and ReddictClusteringP2P)
# and saves them in the reddit_data folder
python create_splits.py

Finally, you can can run the evaluation using the --include-reddit flag

# assuming your position is in the top-level folder
python scripts/run_cteb_de.py --reddit-flag

Adaptive pre-training

If you want to experiment with adaptive pre-training, you can have a look at scripts/run_fine_tuned_cteb_de.py. Basically, it allows you to train models using whole word masking (WWM) and TSDAE and to evaluate on a clustering algorithm during training.

Citation

If you make use of this work, please cite:

@inproceedings{wehrli-etal-2023-german,
    title = "{G}erman Text Embedding Clustering Benchmark",
    author = "Wehrli, Silvan  and
      Arnrich, Bert  and
      Irrgang, Christopher",
    editor = "Georges, Munir  and
      Herygers, Aaricia  and
      Friedrich, Annemarie  and
      Roth, Benjamin",
    booktitle = "Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023)",
    month = sep,
    year = "2023",
    address = "Ingolstadt, Germany",
    publisher = "Association for Computational Lingustics",
    url = "https://aclanthology.org/2023.konvens-main.20",
    pages = "187--201",
}