Bm25 #61

Merged 36 commits on Aug 30, 2023
68b4434
organizing into classes with inheritance
Maxwell1447 Feb 21, 2023
181cd56
organizing into classes with inheritance
Maxwell1447 Feb 21, 2023
50ebddc
rename
Maxwell1447 Feb 21, 2023
f182122
add template class filter_matches
Maxwell1447 Feb 22, 2023
5963398
add template class filter_matches
Maxwell1447 Feb 22, 2023
f3677eb
progress in the computation of BM25 score and serialization
Maxwell1447 Feb 22, 2023
a16d52d
implemented BM25 score computation and sentence ranking
Maxwell1447 Feb 23, 2023
609c39f
updated gitignore
Maxwell1447 Feb 23, 2023
f4cc32c
bm25 seeming to working. Archiving versionning still deficient
Maxwell1447 Feb 24, 2023
9d584ca
merge with master
Maxwell1447 Mar 24, 2023
962745d
loading errors to be fixed
Maxwell1447 Mar 24, 2023
1227bf0
filter to be fixed
Maxwell1447 Mar 24, 2023
8f45422
fixed archive versionning
Maxwell1447 Mar 24, 2023
827d8f2
added params for bm25 buffer
Maxwell1447 Mar 24, 2023
f66d803
mini heap for faster k-best bm25 scores in register_pattern
Maxwell1447 Mar 24, 2023
1c6cbe7
clean + tests + debug: working version
Maxwell1447 Mar 27, 2023
321aa3a
working version of BM25, but still slow
Maxwell1447 Mar 30, 2023
01a9b4f
still slow but working version with no warning
Maxwell1447 Mar 31, 2023
9fdae21
cleaned code
Maxwell1447 Mar 31, 2023
c1d5727
updated README
Maxwell1447 Mar 31, 2023
071ec22
Made BM25 dependant on Eigen in CMake. If not found, BM25 unusable.
Maxwell1447 Mar 31, 2023
af8d65f
Eigen Marco on test.cc
Maxwell1447 Mar 31, 2023
4837455
Fixed macros
Maxwell1447 Mar 31, 2023
2f4cf1b
Fixed macros
Maxwell1447 Mar 31, 2023
71fba4e
Fixed macros
Maxwell1447 Mar 31, 2023
efaf0d2
Fixed CMake
Maxwell1447 Mar 31, 2023
41f0a48
removed warning in CMake
Maxwell1447 Mar 31, 2023
0ee7d94
Merge branch 'master' into BM25
Maxwell1447 Apr 14, 2023
7576ec3
memory leak fixed
Maxwell1447 Apr 25, 2023
ccee704
speedup of candidate selection with inversed index
Maxwell1447 Apr 26, 2023
c762499
progress count rate + EIGEN flag updated
Maxwell1447 Apr 26, 2023
a2db372
using multiplication instead to make the process faster with eigen sp…
Maxwell1447 May 3, 2023
3eaa617
cleaned old bm25 computation
Maxwell1447 May 3, 2023
ca0e8b4
BM25 idf ratio better explained
Maxwell1447 Aug 30, 2023
4393b74
fixed merge conflicts
Maxwell1447 Aug 30, 2023
d29e3e0
fixed arguments in apply_stream
Maxwell1447 Aug 30, 2023
2 changes: 1 addition & 1 deletion .gitignore
@@ -1,2 +1,2 @@
build/
build*
*.sh
32 changes: 25 additions & 7 deletions README.md
@@ -6,12 +6,14 @@

Simplest command is the following:
```
FuzzyMatch-cli -c CORPUS [--penalty-tokens (none|tag,sep/jnr,pct,cas,nbr)] [--max-tokens-in-pattern N]
FuzzyMatch-cli -c CORPUS [--penalty-tokens (none|tag,sep/jnr,pct,cas,nbr)] [--max-tokens-in-pattern N] [--filter-type (suffix-array|bm25)] [--bm25-ratio-idf BM25_IDF_RATIO]
```

* `CORPUS` can be a single file, in which case the index of each segment is simply the sentence id, or you can provide a target file using `-c CORPUSSOURCE,CORPUSTARGET` and add option `--add-target` to include the actual target sentence in the index (format ID=target). This is useful when the index should fully contain the translation memory. It is not useful if the translation memory is stored in a separate database.
* `--penalty-tokens` (default `tag,cas,nbr`) is either `none` or a comma-separated list of `tag`, `sep`, `jnr`, `pct`, `nbr`, `cas`. These modify normalization (`cas` performs case normalization and `nbr` triggers number normalization), remove some tokens from the index (`tag` for tags and `pct` for punctuation), or generate spacers/joiners (`sep`/`jnr`). In each case, a penalty token is added.
* `--max-tokens-in-pattern` (default: 300) limits how long the pattern can be. This is necessary to prevent poor match performance, because the edit distance computation runs in O(T^2) where T is the number of tokens in the pattern.
* `--filter-type` switches between `suffix-array` and `bm25` (default `suffix-array`).
* `BM25_IDF_RATIO` (default: 0.5) is a prefilter that speeds up BM25. The value is in (0, 1) and is the maximum ratio of sentences in memory that may contain a given term for that term to be included in the reverse index. A ratio around 0.1 can be reasonable, which implies that indexed terms appear in at most 10% of the sentences.

This option used in index forces the same logic in matching.
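
As a rough illustration of the `--bm25-ratio-idf` prefilter described above, the sketch below builds a term-to-sentence reverse index and then drops every term that appears in more than the given ratio of sentences. The function name `build_filtered_inverse_index` and the data layout are hypothetical and rely only on the standard library; the implementation in this PR may be organized differently.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not the library's API): maps each term id to the ids
// of the sentences containing it, keeping only terms that occur in at most
// `ratio` of all sentences.
std::unordered_map<unsigned, std::vector<unsigned>>
build_filtered_inverse_index(const std::vector<std::vector<unsigned>>& sentences,
                             double ratio) {
  assert(ratio > 0.0 && ratio < 1.0);
  std::unordered_map<unsigned, std::vector<unsigned>> index;
  for (unsigned s_id = 0; s_id < sentences.size(); ++s_id) {
    // Deduplicate so that a sentence is listed once per term.
    std::unordered_set<unsigned> seen(sentences[s_id].begin(), sentences[s_id].end());
    for (unsigned term : seen)
      index[term].push_back(s_id);
  }
  // Drop terms that are too frequent to be discriminative.
  const double max_df = ratio * static_cast<double>(sentences.size());
  for (auto it = index.begin(); it != index.end();) {
    if (static_cast<double>(it->second.size()) > max_df)
      it = index.erase(it);
    else
      ++it;
  }
  return index;
}
```

With four sentences and a ratio of 0.5, a term present in all four sentences is dropped, while a term present in two of them is kept.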

@@ -20,17 +22,20 @@ The above command generates a file `CORPUS.fmi`.
## Fuzzy Lookup

```
FuzzyMatch-cli -i CORPUS.fmi -a match -f FUZZY -N NTHREAD -n NMATCH [--ml ML] [--mr MR] --idf-penalty IDFPENALTYRATIO --insert-cost ICOST --delete-cost DCOST --replace-cost RCOST --contrast CONTRASTFACTOR --contrast-buffer CONTRASTBUFF < INPUTFILE > MATCHES
FuzzyMatch-cli -i CORPUS.fmi -a match -f FUZZY -N NTHREAD -n NMATCH --filter-type FILTERTYPE [--ml ML] [--mr MR] [--bm25-buffer BM25BUFF] [--bm25-cutoff BM25CUTOFF] --idf-penalty IDFPENALTYRATIO --insert-cost ICOST --delete-cost DCOST --replace-cost RCOST --contrast CONTRASTFACTOR --contrast-buffer CONTRASTBUFF < INPUTFILE > MATCHES
```

* `CORPUS.fmi` path to the complete generated index file
* `FUZZY`, the fuzzy threshold in [0,1]. Not really relevant < 0.5.
* `NTHREAD` number of threads to use - default 4. Scales well with the number of threads.
* `NMATCH` number of match to return
* `NMATCH` number of matches to return.
* `FILTERTYPE` switches between `suffix-array` and `bm25` (default `suffix-array`).
* `ML` minimal length of the longest subsequence (in tokens) - default 3. If the pattern size is strictly less than `ML`, then this parameter is ignored.
* `MR` minimal ratio of the longest subsequence (in tokens) - default 0. Interesting to use for lowest fuzzy - for instance a value of 0.5, used with fuzzy threshold 0.5, will guarantee the presence of at least 50% of the sentence length
* `MR` minimal ratio of the longest subsequence (in tokens) - default 0. Interesting to use for lowest fuzzy - for instance a value of 0.5, used with fuzzy threshold 0.5, will guarantee the presence of at least 50% of the sentence length.
* `BM25BUFF` number of best BM25-scoring sentences to rerank with edit distance. The default is 10.
* `BM25CUTOFF` minimum BM25 score threshold cutoff. The default is 0.
* `IDFPENALTYRATIO` if non-zero, adds an extra IDF-weighted penalty for missing words: a value of 1 is equivalent to a penalty of one additional missing word for a word appearing only once in the whole translation memory.
* `ICOST`, `DCOST`, `RCOST`, positive real values, respectively costs for *insertion*, *deletion* and *replace* in the edit distance. The defalut are 1, 1, 1. For coverage similarity, choose 1, 0, 1.
* `ICOST`, `DCOST`, `RCOST`, positive real values, respectively costs for *insertion*, *deletion* and *replace* in the edit distance. The defaults are 1, 1, 1. For coverage similarity, choose 1, 0, 1.
* `CONTRAST` contrastive factor for iterative contrastive retrieval (see [paper](https://aclanthology.org/2022.emnlp-main.235/)). Default is 0. The greater, the more diversity in the retrieved sequences.
* `CONTRASTBUFF` contrastive buffer (default `NMATCH`, only useful when `CONTRAST`>0) is the number of candidates with highest matches considered for contrastive reranking. If not set, it will just rerank the `NMATCH` scores.
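
To make the effect of `ICOST`, `DCOST` and `RCOST` concrete, here is a minimal sketch of a token-level edit distance with configurable costs. It assumes that edits transform the retrieved sentence into the pattern, so that a deletion removes a sentence token that is absent from the pattern; this direction, the `Costs` struct and the function name are assumptions made for illustration and are not taken from the library.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical cost container; the library has its own fuzzy::EditCosts.
struct Costs {
  float insert_cost;
  float delete_cost;
  float replace_cost;
};

// Cost of turning `sentence` into `pattern` (assumed direction).
float weighted_edit_distance(const std::vector<std::string>& pattern,
                             const std::vector<std::string>& sentence,
                             const Costs& c) {
  const std::size_t n = sentence.size();
  const std::size_t m = pattern.size();
  // dp[i][j]: cost of turning the first i sentence tokens
  // into the first j pattern tokens.
  std::vector<std::vector<float>> dp(n + 1, std::vector<float>(m + 1, 0.f));
  for (std::size_t i = 1; i <= n; ++i) dp[i][0] = dp[i - 1][0] + c.delete_cost;
  for (std::size_t j = 1; j <= m; ++j) dp[0][j] = dp[0][j - 1] + c.insert_cost;
  for (std::size_t i = 1; i <= n; ++i) {
    for (std::size_t j = 1; j <= m; ++j) {
      const float substitution =
          dp[i - 1][j - 1] + (sentence[i - 1] == pattern[j - 1] ? 0.f : c.replace_cost);
      dp[i][j] = std::min({dp[i - 1][j] + c.delete_cost,
                           dp[i][j - 1] + c.insert_cost,
                           substitution});
    }
  }
  return dp[n][m];
}
```

Under this assumption, costs of 1, 0, 1 make extra tokens in the retrieved sentence free, so the distance only counts pattern tokens that the sentence fails to cover.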

@@ -77,9 +82,13 @@ Also, the more generous is the "normalization", the better the fuzzy matcher wil
Fuzzy matching is performed on normalized forms, but final score is calculated on real forms with potential penalties.


## Suffix Arrays and Fuzzy Matching
## Suffix Arrays, BM25 and Edit distance

The base structure used for fuzzy matching index is a suffix array. This structure is implemented in `suffix-array.hh`and works as following: each sentence can be seen as a sequence of suffix. For instance the sentence `A B C A D` is containing suffixes `A B C A D`, `B C A D`, `C A D`, `A D` and `D`. In a suffix array the suffix are sorted by lexicographical order and suffixes are simply indicated as their position in the sentence.
### Suffix Arrays

The base structure used for fuzzy matching index is a suffix array.
Its role is to act as a filter to avoid computing edit distance with every single candidate. We identify the candidates containing a common n-gram with the query (pattern) that covers at least a certain ratio and has a minimal length.
The suffix array structure is implemented in `suffix-array.hh` and works as follows: each sentence can be seen as a sequence of suffixes. For instance, the sentence `A B C A D` contains the suffixes `A B C A D`, `B C A D`, `C A D`, `A D` and `D`. In a suffix array, the suffixes are sorted in lexicographical order and each suffix is simply represented by its position in the sentence.
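
The ordering for a single sentence can be reproduced with the naive sketch below, which sorts suffix start positions by comparing the token sequences they point to. The name `sorted_suffix_positions` is hypothetical; the structure in `suffix-array.hh` indexes many sentences at once and is not reproduced here.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Naive illustration: returns the start positions of all suffixes of
// `tokens`, sorted by lexicographical order of the suffixes.
std::vector<std::size_t> sorted_suffix_positions(const std::vector<std::string>& tokens) {
  std::vector<std::size_t> positions(tokens.size());
  std::iota(positions.begin(), positions.end(), std::size_t{0});
  std::sort(positions.begin(), positions.end(),
            [&tokens](std::size_t a, std::size_t b) {
              return std::lexicographical_compare(tokens.begin() + a, tokens.end(),
                                                  tokens.begin() + b, tokens.end());
            });
  return positions;
}
```

For `A B C A D` this yields positions 0, 3, 1, 2, 4, that is `A B C A D`, `A D`, `B C A D`, `C A D`, `D`.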

The suffix array corresponding to the sentences 0: `A B C A D` and 1: `D A B A` is:

@@ -144,6 +153,15 @@ These occurrences might overlap, so a second step in the process is building a `

During the match process, we can restrict the candidate sentences by looking at their length. Indeed, a 70% fuzzy match on a 100-token pattern cannot match a sentence shorter than 70 tokens or longer than 130 tokens.
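
The length restriction can be sketched as follows, assuming that the allowed number of edits is `(1 - fuzzy)` times the pattern length, which is the reading that reproduces the 70 and 130 bounds in the example above. The function name is hypothetical and the exact rule used by the library may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>

// Hypothetical helper: returns the (min, max) candidate sentence length
// compatible with a fuzzy threshold, assuming edits are normalized by the
// pattern length.
std::pair<std::size_t, std::size_t> candidate_length_bounds(std::size_t pattern_length,
                                                            double fuzzy) {
  assert(fuzzy >= 0.0 && fuzzy <= 1.0);
  // Small epsilon guards against results such as 9.999999999999998.
  const double allowed = (1.0 - fuzzy) * static_cast<double>(pattern_length) + 1e-9;
  const std::size_t max_edits = static_cast<std::size_t>(std::floor(allowed));
  return std::make_pair(pattern_length - max_edits, pattern_length + max_edits);
}
```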

### BM25

An alternative to the suffix array filter is Okapi BM25, a ranking function using TF-IDF-like statistics ([see https://en.wikipedia.org/wiki/Okapi_BM25](https://en.wikipedia.org/wiki/Okapi_BM25)).
The idea is to compute the edit distance only on the k best scoring sentences as well as the sentences scoring above a certain threshold.
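
One way to picture this selection is the min-heap sketch below. It assumes that the buffer size and the cutoff are combined by keeping at most `k` sentences among those scoring strictly above the cutoff, which is one possible reading of `--bm25-buffer` and `--bm25-cutoff`; the PR may combine them differently. The function name and the `Scored` alias are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

using Scored = std::pair<float, unsigned>;  // (BM25 score, sentence id)

// Keeps at most `k` sentences with the highest scores, ignoring scores that
// are not strictly above `cutoff` (assumed semantics).
std::vector<Scored> k_best_above_cutoff(const std::vector<float>& scores,
                                        std::size_t k, float cutoff) {
  // Min-heap: the weakest retained candidate sits on top and is evicted first.
  std::priority_queue<Scored, std::vector<Scored>, std::greater<Scored>> heap;
  for (unsigned s_id = 0; s_id < scores.size(); ++s_id) {
    if (scores[s_id] <= cutoff)
      continue;
    heap.emplace(scores[s_id], s_id);
    if (heap.size() > k)
      heap.pop();
  }
  std::vector<Scored> best;
  while (!heap.empty()) {
    best.push_back(heap.top());
    heap.pop();
  }
  std::reverse(best.begin(), best.end());  // best score first
  return best;
}
```

The heap never holds more than `k + 1` entries, so selecting the candidates costs O(n log k) for n scored sentences instead of a full sort.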

The indexed BM25 scores for each (term, sentence_id) pair are stored in a sparse matrix, which requires the Eigen package. If Eigen is not found, BM25 is not compiled and cannot be used.
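
The sketch below illustrates the idea of precomputing one BM25 weight per (term, sentence) pair and then scoring a pattern by summing the weights of its terms, which is what a sparse matrix by sparse vector product amounts to. It uses standard containers instead of Eigen, the textbook Okapi BM25 formula with a `ln(1 + ...)` IDF variant, and `k1 = 1.5`, `b = 0.75` as in the constants passed in `FuzzyMatch-cli.cc`. The type and function names are hypothetical, and the exact IDF used in this PR is not shown here.

```cpp
#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// term id -> list of (sentence id, BM25 weight of the term in that sentence)
using SparseBm25 = std::unordered_map<unsigned, std::vector<std::pair<unsigned, float>>>;

SparseBm25 build_sparse_bm25(const std::vector<std::vector<unsigned>>& sentences,
                             float k1 = 1.5f, float b = 0.75f) {
  SparseBm25 index;
  if (sentences.empty()) return index;
  const float n = static_cast<float>(sentences.size());
  float avg_len = 0.f;
  for (const auto& s : sentences) avg_len += static_cast<float>(s.size());
  avg_len /= n;

  // Term frequency per sentence and document frequency per term.
  std::vector<std::unordered_map<unsigned, unsigned>> tf(sentences.size());
  std::unordered_map<unsigned, unsigned> df;
  for (std::size_t s = 0; s < sentences.size(); ++s)
    for (unsigned term : sentences[s])
      if (tf[s][term]++ == 0) ++df[term];

  for (std::size_t s = 0; s < sentences.size(); ++s) {
    const float len_norm = 1.f - b + b * static_cast<float>(sentences[s].size()) / avg_len;
    for (const auto& entry : tf[s]) {
      const float f = static_cast<float>(entry.second);
      const float d = static_cast<float>(df[entry.first]);
      const float idf = std::log(1.f + (n - d + 0.5f) / (d + 0.5f));  // assumed IDF variant
      const float weight = idf * f * (k1 + 1.f) / (f + k1 * len_norm);
      index[entry.first].emplace_back(static_cast<unsigned>(s), weight);
    }
  }
  return index;
}

// Sums the precomputed weights of the distinct pattern terms per sentence.
std::vector<float> score_pattern(const SparseBm25& index,
                                 const std::vector<unsigned>& pattern,
                                 std::size_t num_sentences) {
  std::vector<float> scores(num_sentences, 0.f);
  const std::unordered_set<unsigned> terms(pattern.begin(), pattern.end());
  for (unsigned term : terms) {
    const auto it = index.find(term);
    if (it == index.end()) continue;
    for (const auto& posting : it->second) scores[posting.first] += posting.second;
  }
  return scores;
}
```

Only sentences sharing at least one indexed term with the pattern receive a non-zero score, so most entries stay at zero, which is why a sparse representation fits.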

### Edit distance

The last phase of the fuzzy match is to perform a standard edit distance between the *unnormalized* tokens to obtain the actual fuzzy match.

The following rules apply to calculate the actual fuzzy match when looking for a specific pattern:
52 changes: 44 additions & 8 deletions cli/src/FuzzyMatch-cli.cc
@@ -14,6 +14,8 @@
#include <chrono>
#include <ctime>

#include <fuzzy/index.hh>
#include <fuzzy/filter.hh>
#include <fuzzy/costs.hh>
#include <fuzzy/fuzzy_match.hh>
#include <fuzzy/fuzzy_matcher_binarization.hh>
@@ -127,7 +129,10 @@ std::pair<int, int> process_stream(const Function& function,
if (!res.empty())
count_nonempty++;
out << res << std::endl;
if (count_nonempty % 100 == 0)
std::cerr << "\rPROGRESS: " << count_nonempty << " " << std::flush;
}
std::cerr << std::endl;
return std::make_pair(count_nonempty, count_total);
}

@@ -158,6 +163,8 @@ std::pair<int, int> process_stream(const Function& function,
count_nonempty++;
out << res << std::endl;
futures.pop();
if (count_nonempty % 100 == 0)
std::cerr << "\rPROGRESS: " << count_nonempty << " " << std::flush;
}
};

@@ -179,6 +186,8 @@ std::pair<int, int> process_stream(const Function& function,

if (!futures.empty())
pop_results(/*blocking=*/true);

std::cerr << std::endl;

{
std::lock_guard<std::mutex> lock(mutex);
@@ -199,8 +208,10 @@ class processor {
float idf_penalty, bool subseq_idf_weighting,
size_t max_tokens_in_pattern, fuzzy::EditCosts edit_cost,
std::string contrastive_reduce_str,
int contrastive_buffer):
_fuzzyMatcher(pt, max_tokens_in_pattern),
int contrastive_buffer,
fuzzy::IndexType filter_type,
int bm25_buffer, float bm25_cutoff, const fuzzy::FilterIndexParams& filter_index_params):
_fuzzyMatcher(pt, max_tokens_in_pattern, filter_type, filter_index_params),
_fuzzy(fuzzy),
_contrastive_factor(contrastive_factor),
_nmatch(nmatch),
@@ -210,7 +221,10 @@ class processor {
_idf_penalty(idf_penalty),
_subseq_idf_weighting(subseq_idf_weighting),
_cost(edit_cost),
_contrastive_buffer(contrastive_buffer) {
_contrastive_buffer(contrastive_buffer),
_filter_type(filter_type),
_bm25_buffer(bm25_buffer),
_bm25_cutoff(bm25_cutoff) {
if (contrastive_reduce_str == "max")
_contrastive_reduce = fuzzy::ContrastReduce::MAX;
else
@@ -221,7 +235,8 @@ class processor {

_fuzzyMatcher.match(sentence, _fuzzy, _nmatch, _no_perfect, matches,
_min_subseq_length, _min_subseq_ratio, _idf_penalty, _cost,
_contrastive_factor, _contrastive_reduce, _contrastive_buffer);
_contrastive_factor, _contrastive_reduce, _contrastive_buffer,
_filter_type, _bm25_buffer, _bm25_cutoff);

std::string out;
for(const fuzzy::FuzzyMatch::Match &m: matches) {
@@ -276,6 +291,9 @@ class processor {
fuzzy::EditCosts _cost;
fuzzy::ContrastReduce _contrastive_reduce;
int _contrastive_buffer;
fuzzy::IndexType _filter_type;
int _bm25_buffer;
float _bm25_cutoff;
};

int main(int argc, char** argv)
@@ -299,16 +317,20 @@ int main(int argc, char** argv)
std::string index_file;
std::string penalty_tokens;
std::string contrastive_reduce;
std::string filter_type_str;
float idf_penalty;
float insert_cost;
float delete_cost;
float replace_cost;
float fuzzy;
float contrastive_factor;
float bm25_cutoff;
float bm25_ratio_idf;
int nmatch;
int nthreads;
int min_subseq_length;
int contrastive_buffer;
int bm25_buffer;
float min_subseq_ratio;
size_t max_tokens_in_pattern;
fuzzyOptions.add_options()
@@ -337,8 +359,12 @@ int main(int argc, char** argv)
("subseq-idf-weighting,w", po::bool_switch(), "use idf weighting in finding longest subsequence")
("max-tokens-in-pattern", po::value(&max_tokens_in_pattern)->default_value(fuzzy::DEFAULT_MAX_TOKENS_IN_PATTERN), "Patterns containing more tokens than this value are ignored")
("contrast", po::value(&contrastive_factor)->default_value(0.f), "Contrastive factor for contrastive fuzzy retrieval")
("contrast-reduce", po::value(&contrastive_reduce)->default_value("mean"), "Contrastive factor for contrastive fuzzy retrieval")
("contrast-reduce", po::value(&contrastive_reduce)->default_value("mean"), "Contrastive factor for contrastive fuzzy retrieval (mean, max)")
("filter-type", po::value(&filter_type_str)->default_value("suffix-array"), "Type of filter used (suffix-array, bm25)")
("contrast-buffer", po::value(&contrastive_buffer)->default_value(-1), "number of fuzzy matches to place in the buffer")
("bm25-ratio-idf", po::value(&bm25_ratio_idf)->default_value(0.5f), "filter in the reverse index to consider only terms rare enough (close to 0 = ignores a lot : close to 1 = considers a lot)")
("bm25-buffer", po::value(&bm25_buffer)->default_value(10), "number of best BM25 to rerank")
("bm25-cutoff", po::value(&bm25_cutoff)->default_value(0.f), "minimum BM25 score threshold cutoff")
("nthreads,N", po::value(&nthreads)->default_value(4), "number of thread to use for match")
;

@@ -408,11 +434,21 @@ int main(int argc, char** argv)
}

fuzzy::EditCosts edit_cost(insert_cost, delete_cost, replace_cost);
fuzzy::IndexType filter_type;
if (filter_type_str == "bm25")
filter_type = fuzzy::IndexType::BM25;
else
filter_type = fuzzy::IndexType::SUFFIX;
#ifdef NO_EIGEN
assert(filter_type != fuzzy::IndexType::BM25);
#endif
const fuzzy::FilterIndexParams filter_index_params(bm25_ratio_idf, 1.5, 0.75);
processor O(pt, fuzzy, contrastive_factor, nmatch, no_perfect,
min_subseq_length, min_subseq_ratio,
idf_penalty, subseq_idf_weighting,
max_tokens_in_pattern, edit_cost,
contrastive_reduce, contrastive_buffer);
contrastive_reduce, contrastive_buffer,
filter_type, bm25_buffer, bm25_cutoff, filter_index_params);

if (index_file.length()) {
TICK("Loading index_file: "+index_file);
@@ -428,8 +464,8 @@ int main(int argc, char** argv)
return 2;
}

TICK("Sorting Index");
O._fuzzyMatcher.sort();
TICK("Preparing Index");
O._fuzzyMatcher.prepare();

// work
if (action == "index")
95 changes: 95 additions & 0 deletions include/fuzzy/bm25.hh
@@ -0,0 +1,95 @@
#pragma once

#include <string>
#include <vector>
#include <iostream>
#include <ostream>
#include <algorithm>
#include <unordered_set>
#include <unordered_map>
#include <math.h>
#include <Eigen/Sparse>

#include <fuzzy/utils.hh>
#include <fuzzy/filter.hh>

#include <boost/multi_array.hpp>
#include <boost/format.hpp>
#include <boost/container/vector.hpp>
#include <boost/unordered_map.hpp>
#include <boost/serialization/unordered_map.hpp>
#include <boost/serialization/serialization.hpp>
#include <boost/serialization/vector.hpp>
#include <boost/serialization/split_member.hpp>
#include <boost/serialization/version.hpp>
#include <boost/serialization/array.hpp>

namespace fuzzy
{
typedef Eigen::SparseMatrix<float> SpMat;
typedef Eigen::Triplet<float> Triplet;
// Sentence ID -> BM25-score
class BM25 : public Filter
{
public:
BM25(const FilterIndexParams &params=FilterIndexParams());
~BM25();
unsigned add_sentence(const std::vector<unsigned>& sentence) override;

using Filter::dump;
using Filter::num_sentences;
using Filter::get_sentence;

void prepare(size_t vocab_size);

std::ostream& dump(std::ostream&) const;

unsigned get_sentence_length(size_t s_id) const;

float bm25_score_pattern(
unsigned s_id,
std::vector<unsigned> pattern_wids) const;

float bm25_score(
int term,
int s_id,
float avg_doc_length,
float tf,
std::vector<float>& idf);

std::vector<std::vector<int>> get_vec_candidates(const std::vector<unsigned>& pattern_wids) const;

inline int get_vocab_size() const { return _vocab_size; }
Eigen::SparseVector<float> compute_product(const Eigen::SparseVector<float>& pattern_voc) const;

private:
size_t _vocab_size;

// inverse index to access sentences that contain a given term, to be serialized
std::unordered_map<int, std::vector<int>> _inverse_index;
// BM25 (t, d) to be serialized
std::vector<std::pair<std::pair<int, int>, float>> _key_value_bm25;
// Sparse matrix of BM25 (t, d) cache
SpMat _bm25_inverse_index;

// BM25 usual parameters
const float _k1;
const float _b;
// Prefilter reverse index idf ratio
const float _ratio_idf;

friend class boost::serialization::access;

template<class Archive>
void save(Archive&, unsigned int version) const;

template<class Archive>
void load(Archive&, unsigned int version);

BOOST_SERIALIZATION_SPLIT_MEMBER()
};
}

BOOST_CLASS_VERSION(fuzzy::BM25, 1)

#include "fuzzy/bm25.hxx"