Report: link
This repository contains our code for the COLIEE 2024 challenge. We focused on Task 1, case law retrieval: given a query case and a candidate (evidence) case, the task is to predict whether the evidence is relevant to the query, i.e. whether the query cites it.
The dataset, containing the training and test corpora, is provided by the organizers of the challenge. It is not included in this repository, but can be requested from the organizers. For more information on the challenge, visit the official website.
To process the corpus, we first apply standard text preprocessing: we remove special characters, recurrent tags, multiple spaces, and sentences that bias the prediction.
Then, since the texts are fragmented, we perform sentence segmentation with spaCy. Every part of a document is originally preceded by a tag (a number in square brackets) that marks the beginning of a new paragraph; these tags are preserved during sentence segmentation.
Finally, each sentence is checked for French with the lingua-language-detector package and, if needed, translated to English with argostranslate.
The sentences are then concatenated to reconstruct the texts, so that every document has the following structure:
```
[1]
... text in English ...
[2]
... text in English ...
[...]
[N]
... text in English ...
```
where `N` is the number of paragraphs in the document.
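The tag-preserving cleanup can be sketched with stdlib regexes alone (the real pipeline uses spaCy for sentence splitting and lingua-language-detector plus argostranslate for translation; the specific regexes below are illustrative assumptions, not the exact ones we use):

```python
import re

def clean_text(text: str) -> str:
    # Drop special characters, but keep the [N] paragraph tags
    # and basic punctuation (illustrative character class).
    text = re.sub(r"[^\w\s\[\].,;:!?'-]", " ", text)
    # Collapse runs of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

def split_paragraphs(text: str) -> list[tuple[int, str]]:
    # Split on the [N] tags, keeping each paragraph number with its text.
    parts = re.split(r"\[(\d+)\]", text)
    # re.split yields: [prefix, "1", body1, "2", body2, ...]
    return [(int(n), body.strip()) for n, body in zip(parts[1::2], parts[2::2])]
```

Each recovered paragraph can then be segmented into sentences, translated if French, and re-joined under its original tag.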
We evaluated the following methods:
- Random: randomly predicts whether each evidence is relevant or not.
- All-ones: predicts that every evidence is relevant.
- TF-IDF: every document is represented by a TF-IDF vector; the cosine similarity between the query and each evidence is computed, and the top n evidences with the highest similarity are selected.
- Okapi BM25: the BM25 ranking function scores the relevance of each evidence to the query; the top n evidences with the highest score are selected.
- GPT text-embedding-3-small: the OpenAI embedding model generates embeddings for the query and the evidences; the cosine similarity between the embeddings is computed, and the top n evidences with the highest similarity are selected.
- Embedding Head: the GPT embeddings are passed through a feed-forward neural network that reduces their dimensionality and is fine-tuned on the training data with a contrastive loss; the top n evidences with the highest cosine similarity between the projected embeddings are selected.
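As a reference for the lexical baseline, Okapi BM25 can be implemented self-containedly as follows (`k1` and `b` are the commonly used default values, an assumption here, not necessarily the ones we tuned):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    # docs: list of token lists; returns one BM25 score per document.
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                      # document frequency of each term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            # Okapi BM25 idf with the +1 inside the log to avoid negatives.
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            denom = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / denom
        scores.append(s)
    return scores
```

The top n evidences are then the documents with the largest scores.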
Notation:
- `e`: embedding of a document;
- `ē`: mean of the embeddings of a document;
- `e*`: projection of `ē` into the latent space of the feed-forward neural network;
- `D`: set of all documents in the corpus (queries and evidences);
- `q_i`: query i, belonging to `D`;
- `D_i`: `D` without `q_i`;
- `d_ij`: evidence j, belonging to `D_i`;
- `s_ij`: similarity score between `q_i` and `d_ij`;
- `D^k_i`: top k evidences with the highest similarity to `q_i`;
- `d^k_ij`: evidence j, belonging to `D^k_i`;
- `E_i`: evidences selected for query i, a subset of `D^k_i`.
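In this notation, the recall stage computes `s_ij` for every evidence and keeps `D^k_i`. A minimal NumPy sketch of that step (function and variable names are ours, for illustration):

```python
import numpy as np

def recall_top_k(q_emb: np.ndarray, evidence_embs: np.ndarray, k: int):
    # Normalize so the dot product equals the cosine similarity s_ij.
    q = q_emb / np.linalg.norm(q_emb)
    E = evidence_embs / np.linalg.norm(evidence_embs, axis=1, keepdims=True)
    s = E @ q                     # s_ij for every d_ij in D_i
    top = np.argsort(-s)[:k]     # indices of the evidences forming D^k_i
    return top, s[top]
```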
Embedding Head (`m`) — how `e*` is computed for a document:
```
┌──────────┐
│ Document │
└────┬─────┘
     ▼
  ┌─────┐
  │ GPT │
  └──┬──┘
     ▼
   ┌───┐
  ┌┴──┐│
 ┌┴──┐├┘
 │ e ├┘
 └─┬─┘
   │ mean
   ▼
 ┌───┐
 │ ē │
 └─┬─┘
   ▼
┌───────────────┐
│ Feed-forward  │
│ NN trained on │
│ metric `m`    │
└───────┬───────┘
        ▼
     ┌────┐
     │ e* │
     └────┘
```
Recall stage — selection of `D^k_i`:
```
               ┌──────┐
              ┌┴─────┐│
┌─────┐      ┌┴─────┐├┘
│ q_i │      │ d_ij ├┘
└──┬──┘      └──┬───┘
   ▼            ▼
Pre-processing  Pre-processing
   │            │
   ▼            ▼
┌───────────┐  ┌───────────┐
│ Embedding │  │ Embedding │
│ Head      │  │ Head      │
│ (recall)  │  │ (recall)  │
└─────┬─────┘  └─────┬─────┘
      └──────┬───────┘
             ▼
  ┌───────────────────┐
  │ Cosine similarity │
  └─────────┬─────────┘
            ▼
          s_ij
            │
            ▼
        ┌───────┐
        │ Top k │
        └───┬───┘
            ▼
        ┌───────┐
        │ D^k_i │
        └───────┘
```
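The projection `ē → e*` is a small feed-forward network. The sketch below is a forward-pass illustration only: 1536 is the dimensionality of text-embedding-3-small, while the hidden and latent sizes (and the random weights) are hypothetical stand-ins for the trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 1536-d GPT embeddings projected to a 256-d latent space.
D_IN, D_HID, D_OUT = 1536, 512, 256
W1 = rng.normal(0.0, 0.02, (D_HID, D_IN))
W2 = rng.normal(0.0, 0.02, (D_OUT, D_HID))

def embedding_head(e_bar: np.ndarray) -> np.ndarray:
    # e* = W2 · relu(W1 · ē): project the mean embedding into the latent space.
    h = np.maximum(W1 @ e_bar, 0.0)
    return W2 @ h
```

In training, the weights are fit with a contrastive loss so that cited (query, evidence) pairs end up close in the latent space.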
Final ranking stage — selection of `E_i`:
```
┌─────┐       ┌────────┐
│ q_i │      ┌┴───────┐│
└──┬──┘     ┌┴───────┐├┘
   │        │ d^k_ij ├┘
   │        └───┬────┘
   ▼            ▼
Pre-processing  Pre-processing
   │            │
   └──────┬─────┘
          ▼
┌─────────────────────────────────────────────┐
│ Six similarity features per (q_i, d^k_ij):  │
│   Embedding Head (F1) → Cosine similarity   │
│   Embedding Head (F1) → Dot product         │
│   GPT with mean       → Cosine similarity   │
│   GPT with mean       → Dot product         │
│   TF-IDF                                    │
│   BM25                                      │
└──────────────────────┬──────────────────────┘
                       ▼
              ┌────────────────┐
              │ CatBoostRanker │
              └───────┬────────┘
                      ▼
              ┌────────────────┐
              │ Date Filtering │
              └───────┬────────┘
                      ▼
            ┌───────────────────┐
            │ Dynamic Threshold │
            └─────────┬─────────┘
                      ▼
                   ┌─────┐
                   │ E_i │
                   └─────┘
```
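The two post-processing steps after the ranker can be sketched as follows. Date filtering drops evidences decided after the query case, since a case cannot cite a later decision; the particular dynamic-threshold rule shown (keep scores within a fraction of the best score) is an assumption for illustration, not necessarily the rule we use:

```python
from datetime import date

def select_evidences(candidates, query_date, ratio=0.9):
    # candidates: list of (evidence_id, score, decision_date) triples for one query.
    # 1. Date filtering: an evidence cannot be cited before it exists.
    valid = [c for c in candidates if c[2] <= query_date]
    if not valid:
        return []
    # 2. Dynamic threshold (assumed form): keep candidates whose score
    #    is at least `ratio` of the best remaining score.
    best = max(s for _, s, _ in valid)
    return [eid for eid, s, _ in valid if s >= ratio * best]
```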
The following table reports the results of all the models, each optimized on the F1 score, for the fine-grained predictions. The last row shows the results obtained at the end of the full pipeline.
Method | Recall | Precision | F1 score |
---|---|---|---|
Random | 0.0134 | 0.0021 | 0.0036 |
All-ones | 0.0314 | 0.0049 | 0.0085 |
TF-IDF | 0.3681 | 0.1437 | 0.2068 |
BM25 | 0.2887 | 0.2255 | 0.2532 |
GPT only | 0.2350 | 0.1835 | 0.2061 |
Embedding Head | 0.1933 | 0.2131 | 0.2028 |
CatBoost | 0.2708 | 0.2424 | 0.2558 |
Since the employed ensemble model is explainable, CatBoost provides the feature importances shown in the following table. This information helps to understand how the model behaves when the predictor values change.
Feature | Importance |
---|---|
Embedding Head (F1 model & Cosine similarity) | 71.0026 |
Embedding Head (F1 model & Dot Product) | 22.0739 |
GPT Only (Cosine similarity) | 3.3902 |
GPT Only (Dot Product) | 2.8006 |
TF-IDF | 0.5462 |
BM25 | 0.1865 |
The points represent the document embeddings projected into the space learned by UMAP. The coloring of the points follows the approach described in the UMAP documentation:

> The essence of the approach is that we can use PCA, which preserves global structure, to reduce the data to three dimensions. If we scale the results to fit in a 3D cube we can convert the 3D PCA coordinates of each point into an RGB description of a color. By then coloring the points in the UMAP embedding with the colors induced by the PCA it is possible to get a sense of how some of the more large scale global structure has been represented in the embedding.
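The PCA-to-RGB coloring described above can be reproduced in a few lines of NumPy (a sketch of the documented technique, not our exact plotting code):

```python
import numpy as np

def pca_rgb(X: np.ndarray) -> np.ndarray:
    # Project the embeddings onto their top 3 principal components via SVD.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Xc @ Vt[:3].T
    # Rescale each component to [0, 1] so the 3D coordinates act as RGB colors.
    P = (P - P.min(axis=0)) / (P.max(axis=0) - P.min(axis=0))
    return P
```

Each row of the result is the RGB color for one point in the UMAP scatter plot.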