SYSTRAN · guillaumekln · Aug 30, 2023 · Feb 21, 2023
diff --git a/.gitignore b/.gitignore
@@ -1,2 +1,2 @@
-build/
+build*
 *.sh
diff --git a/README.md b/README.md
@@ -6,12 +6,14 @@
 
 Simplest command is the following:
 ```
-FuzzyMatch-cli -c CORPUS [--penalty-tokens (none|tag,sep/jnr,pct,cas,nbr)] [--max-tokens-in-pattern N]
+FuzzyMatch-cli -c CORPUS [--penalty-tokens (none|tag,sep/jnr,pct,cas,nbr)] [--max-tokens-in-pattern N] [--filter-type (suffix-array|bm25)] [--bm25-ratio-idf BM25_IDF_RATIO]
 ```
 
 * `CORPUS` can be a single file - in which case, the index of each segment is simply the sentence id - or you can provide a target file using `-c CORPUSSOURCE,CORPUSTARGET` and add option `--add-target` to include in the index the actual target sentence (format ID=target). This is useful for having the index fully containing the translation memory. Not useful, if the translation memory is saved in side database.
 * `--penalty-tokens` (default `tag,cas,nbr`) is either `none` or comma-separated list of `tag`, `sep`, `jnr`, `pct`, `nbr`, `cas` modifying normalization (for `cas` performing case normalization, and `nbr` triggering number normalization), removing some tokens from index (`tag` for tags, and `pct` for punctuations), or generates spacer/joiner (`sep`/`jnr`). In each case, a penalty tokens is added.
 * `--max-tokens-in-pattern` (default: 300) limits how long the pattern can be. This is necessary to prevent poor match performance, because the edit distance computation runs in O(T^2) where T is the number of tokens in the pattern.
+* `--filter-type` (default: `suffix-array`) selects the candidate filter, either `suffix-array` or `bm25`.
+* `--bm25-ratio-idf` (default: 0.5) is a prefilter that speeds up BM25. The value lies in (0, 1) and is the maximum ratio of sentences in memory that may contain a given term for that term to be included in the reverse index. A ratio around 0.1 is reasonable; it means that indexed terms appear in at most 10% of the sentences.
 
 This option used in index forces the same logic in matching.
 
@@ -20,17 +22,20 @@ The above command generates a file `CORPUS.fmi`.
 ## Fuzzy Lookup
 
 ```
-FuzzyMatch-cli -i CORPUS.fmi -a match -f FUZZY -N NTHREAD -n NMATCH [--ml ML] [--mr MR] --idf-penalty IDFPENALTYRATIO --insert-cost ICOST --delete-cost DCOST --replace-cost RCOST --contrast CONTRASTFACTOR --contrast-buffer CONTRASTBUFF < INPUTFILE > MATCHES 
+FuzzyMatch-cli -i CORPUS.fmi -a match -f FUZZY -N NTHREAD -n NMATCH --filter-type FILTERTYPE [--ml ML] [--mr MR] [--bm25-buffer BM25BUFF] [--bm25-cutoff BM25CUTOFF] --idf-penalty IDFPENALTYRATIO --insert-cost ICOST --delete-cost DCOST --replace-cost RCOST --contrast CONTRASTFACTOR --contrast-buffer CONTRASTBUFF < INPUTFILE > MATCHES 
 ```
 
 * `CORPUS.fmi` path to the complete generated index file
 * `FUZZY`, the fuzzy threshold in [0,1]. Not really relevant < 0.5.
 * `NTHREAD` number of thread to use - default 4. Scales well with the number of threads.
-* `NMATCH` number of match to return
+* `NMATCH` number of matches to return.
+* `FILTERTYPE` filter used to select candidates, either `suffix-array` or `bm25` (default `suffix-array`).
 * `ML` minimal length of the longest subsequence (in tokens) - defaut 3. If the pattern size is strictly less than `ML`, then this parameter is ignored.
-* `MR` minimal ratio of the longest subsequence (in tokens) - default 0. Interesting to use for lowest fuzzy - for instance a value of 0.5, used with fuzzy threshold 0.5, will guarantee the presence of at least 50% of the sentence length
+* `MR` minimal ratio of the longest subsequence (in tokens) - default 0. Interesting to use for lowest fuzzy - for instance a value of 0.5, used with fuzzy threshold 0.5, will guarantee the presence of at least 50% of the sentence length.
+* `BM25BUFF` number of best BM25-scoring candidates to rerank with edit distance. The default is 10.
+* `BM25CUTOFF` minimum BM25 score used as a cutoff threshold. The default is 0.
 * `IDFPENALTYRATIO` if not null, gives extra penalty to word missing weighted on IDF: a value of 1 is equivalent to give a penalty of one additional missing word for a word appearing only once in all the translation memory.
-* `ICOST`, `DCOST`, `RCOST`, positive real values, respectively costs for *insertion*, *deletion* and *replace* in the edit distance. The defalut are 1, 1, 1. For coverage similarity, choose 1, 0, 1.
+* `ICOST`, `DCOST`, `RCOST`, positive real values, respectively the costs for *insertion*, *deletion* and *replacement* in the edit distance. The defaults are 1, 1, 1. For coverage similarity, choose 1, 0, 1.
 * `CONTRAST` contrastive factor for iterative contrastive retrieval (see [paper](https://aclanthology.org/2022.emnlp-main.235/)). Default is 0. The greater, the more diversity in the retrieved sequences.
 * `CONTRASTBUFF` contrastive buffer (default `NMATCH`, only useful when `CONTRAST`>0) is the number of candidates with highest matches considered for contrastive reranking. If not set, it will just rerank the `NMATCH` scores.
 
@@ -77,9 +82,13 @@ Also, the more generous is the "normalization", the better the fuzzy matcher wil
 Fuzzy matching is performed on normalized forms, but final score is calculated on real forms with potential penalties.
 
 
-## Suffix Arrays and Fuzzy Matching
+## Suffix Arrays, BM25 and Edit distance
 
-The base structure used for fuzzy matching index is a suffix array. This structure is implemented in `suffix-array.hh`and works as following: each sentence can be seen as a sequence of suffix. For instance the sentence `A B C A D` is containing suffixes `A B C A D`, `B C A D`, `C A D`, `A D` and `D`. In a suffix array the suffix are sorted by lexicographical order and suffixes are simply indicated as their position in the sentence.
+### Suffix Arrays
+
+The base structure used for fuzzy matching index is a suffix array.
+Its role is to act as a filter to avoid computing edit distance with every single candidate. We identify the candidates containing a common n-gram with the query (pattern) that covers at least a certain ratio and has a minimal length.
+The suffix array is implemented in `suffix-array.hh` and works as follows: each sentence can be seen as a sequence of suffixes. For instance, the sentence `A B C A D` contains the suffixes `A B C A D`, `B C A D`, `C A D`, `A D` and `D`. In a suffix array, the suffixes are sorted in lexicographical order and each suffix is simply represented by its position in the sentence.
 
 The suffix array corresponding to the sentences 0: `A B C A D` and 1: `D A B A` is:
 
@@ -144,6 +153,15 @@ These occurrences might overlap, so a second step in the process is building a `
 
 During the match process, we can restrict the candidate sentences by looking at their length. Indeed a 70% fuzzy on a 100 tokens match can not match sentence shorter than 70 tokens or longer than 130 tokens.
 
+### BM25
+
+An alternative to the suffix array filter is Okapi BM25, a ranking function using TF-IDF-like statistics (see [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25)).
+The idea is to compute the edit distance only on the k best-scoring sentences, as well as on the sentences scoring above a certain threshold.
+
+The indexed BM25 scores, keyed by (term, sentence_id), are stored in a sparse matrix, which requires the Eigen package. If Eigen is not found, BM25 is not compiled and cannot be used.
+
+### Edit distance
+
 Last phase of the fuzzy match is to actually perform a standard edit distance between the *unnormalized* tokens to obtain actual fuzzy match.
 
 Following rules apply to calculate the actual fuzzy match when looking for a specific pattern:

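Note on the README's BM25 section: the text names Okapi BM25 but does not spell out the scoring. The following is a minimal sketch of the textbook formula, not the code from this PR. The helper names (`bm25_idf`, `bm25_term_score`, `score_sentence`) are hypothetical, the IDF variant is an assumption, and the defaults `k1 = 1.5`, `b = 0.75` are taken from the constants the CLI passes to `FilterIndexParams`, assuming they are the usual BM25 parameters.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical helpers illustrating textbook Okapi BM25 (not the PR's code).

// Inverse document frequency, using the "+1" variant that stays positive.
inline double bm25_idf(std::size_t num_docs, std::size_t docs_with_term) {
  assert(docs_with_term <= num_docs);
  return std::log((num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1.0);
}

// Contribution of a single term to the score of a single sentence.
inline double bm25_term_score(double tf, double doc_len, double avg_doc_len,
                              double idf, double k1 = 1.5, double b = 0.75) {
  // Length normalization: longer-than-average sentences are penalized.
  const double norm = 1.0 - b + b * (doc_len / avg_doc_len);
  return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm);
}

// Score of a sentence against a pattern: sum over the pattern's distinct terms.
inline double score_sentence(const std::vector<unsigned>& sentence,
                             const std::vector<unsigned>& pattern,
                             const std::unordered_map<unsigned, double>& idf,
                             double avg_doc_len, double k1 = 1.5, double b = 0.75) {
  // Term frequencies inside the candidate sentence.
  std::unordered_map<unsigned, double> tf;
  for (unsigned term : sentence)
    tf[term] += 1.0;

  const std::unordered_set<unsigned> distinct(pattern.begin(), pattern.end());
  double score = 0.0;
  for (unsigned term : distinct) {
    const auto tf_it = tf.find(term);
    const auto idf_it = idf.find(term);
    if (tf_it == tf.end() || idf_it == idf.end())
      continue;  // term absent from the sentence or not indexed
    score += bm25_term_score(tf_it->second, static_cast<double>(sentence.size()),
                             avg_doc_len, idf_it->second, k1, b);
  }
  return score;
}
```

With a sentence `{1, 2, 3, 1}`, a pattern `{1, 4}` and an average length of 4, only term `1` contributes, with a term frequency of 2. Terms missing from the sentence or from the IDF table add nothing, which is what makes a sparse representation a natural fit.
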
diff --git a/cli/src/FuzzyMatch-cli.cc b/cli/src/FuzzyMatch-cli.cc
@@ -14,6 +14,8 @@
 #include <chrono>
 #include <ctime>
 
+#include <fuzzy/index.hh>
+#include <fuzzy/filter.hh>
 #include <fuzzy/costs.hh>
 #include <fuzzy/fuzzy_match.hh>
 #include <fuzzy/fuzzy_matcher_binarization.hh>
@@ -127,7 +129,10 @@ std::pair<int, int> process_stream(const Function& function,
       if (!res.empty())
         count_nonempty++;
       out << res << std::endl;
+      if (count_nonempty % 100 == 0)
+        std::cerr << "\rPROGRESS: " << count_nonempty << "  " << std::flush;
     }
+    std::cerr << std::endl;
     return std::make_pair(count_nonempty, count_total);
   }
 
@@ -158,6 +163,8 @@ std::pair<int, int> process_stream(const Function& function,
         count_nonempty++;
       out << res << std::endl;
       futures.pop();
+      if (count_nonempty % 100 == 0)
+        std::cerr << "\rPROGRESS: " << count_nonempty << "  " << std::flush;
     }
   };
 
@@ -179,6 +186,8 @@ std::pair<int, int> process_stream(const Function& function,
 
   if (!futures.empty())
     pop_results(/*blocking=*/true);
+
+  std::cerr << std::endl;
 
   {
     std::lock_guard<std::mutex> lock(mutex);
@@ -199,8 +208,10 @@ class processor {
             float idf_penalty, bool subseq_idf_weighting,
             size_t max_tokens_in_pattern, fuzzy::EditCosts edit_cost,
             std::string contrastive_reduce_str,
-            int contrastive_buffer):
-             _fuzzyMatcher(pt, max_tokens_in_pattern),
+            int contrastive_buffer,
+            fuzzy::IndexType filter_type,
+            int bm25_buffer, float bm25_cutoff, const fuzzy::FilterIndexParams& filter_index_params):
+             _fuzzyMatcher(pt, max_tokens_in_pattern, filter_type, filter_index_params),
              _fuzzy(fuzzy),
              _contrastive_factor(contrastive_factor),
              _nmatch(nmatch),
@@ -210,7 +221,10 @@ class processor {
              _idf_penalty(idf_penalty),
              _subseq_idf_weighting(subseq_idf_weighting),
              _cost(edit_cost),
-             _contrastive_buffer(contrastive_buffer) {
+             _contrastive_buffer(contrastive_buffer),
+             _filter_type(filter_type),
+             _bm25_buffer(bm25_buffer),
+             _bm25_cutoff(bm25_cutoff) {
     if (contrastive_reduce_str == "max")
       _contrastive_reduce = fuzzy::ContrastReduce::MAX;
     else
@@ -221,7 +235,8 @@ class processor {
 
     _fuzzyMatcher.match(sentence, _fuzzy, _nmatch, _no_perfect, matches,
                         _min_subseq_length, _min_subseq_ratio, _idf_penalty, _cost,
-                        _contrastive_factor, _contrastive_reduce, _contrastive_buffer);
+                        _contrastive_factor, _contrastive_reduce, _contrastive_buffer,
+                        _filter_type, _bm25_buffer, _bm25_cutoff);
 
     std::string   out;
     for(const fuzzy::FuzzyMatch::Match &m: matches) {
@@ -276,6 +291,9 @@ class processor {
   fuzzy::EditCosts _cost;
   fuzzy::ContrastReduce _contrastive_reduce;
   int _contrastive_buffer;
+  fuzzy::IndexType _filter_type;
+  int _bm25_buffer;
+  float _bm25_cutoff;
 };
 
 int main(int argc, char** argv)
@@ -299,16 +317,20 @@ int main(int argc, char** argv)
   std::string index_file;
   std::string penalty_tokens;
   std::string contrastive_reduce;
+  std::string filter_type_str;
   float idf_penalty;
   float insert_cost;
   float delete_cost;
   float replace_cost;
   float fuzzy;
   float contrastive_factor;
+  float bm25_cutoff;
+  float bm25_ratio_idf;
   int nmatch;
   int nthreads;
   int min_subseq_length;
   int contrastive_buffer;
+  int bm25_buffer;
   float min_subseq_ratio;
   size_t max_tokens_in_pattern;
   fuzzyOptions.add_options()
@@ -337,8 +359,12 @@ int main(int argc, char** argv)
     ("subseq-idf-weighting,w", po::bool_switch(), "use idf weighting in finding longest subsequence")
     ("max-tokens-in-pattern", po::value(&max_tokens_in_pattern)->default_value(fuzzy::DEFAULT_MAX_TOKENS_IN_PATTERN), "Patterns containing more tokens than this value are ignored")
     ("contrast", po::value(&contrastive_factor)->default_value(0.f), "Contrastive factor for contrastive fuzzy retrieval")
-    ("contrast-reduce", po::value(&contrastive_reduce)->default_value("mean"), "Contrastive factor for contrastive fuzzy retrieval")
+    ("contrast-reduce", po::value(&contrastive_reduce)->default_value("mean"), "Contrastive factor for contrastive fuzzy retrieval (mean, max)")
+    ("filter-type", po::value(&filter_type_str)->default_value("suffix-array"), "Type of filter used (suffix-array, bm25)")
     ("contrast-buffer", po::value(&contrastive_buffer)->default_value(-1), "number of fuzzy matches to place in the buffer")    
+    ("bm25-ratio-idf", po::value(&bm25_ratio_idf)->default_value(0.5f), "maximum ratio of sentences containing a term for it to be kept in the BM25 reverse index (close to 0 = ignores many terms, close to 1 = keeps most terms)")
+    ("bm25-buffer", po::value(&bm25_buffer)->default_value(10), "number of best BM25 candidates to rerank")
+    ("bm25-cutoff", po::value(&bm25_cutoff)->default_value(0.f), "minimum BM25 score threshold cutoff")
     ("nthreads,N", po::value(&nthreads)->default_value(4), "number of thread to use for match")
     ;
 
@@ -408,11 +434,21 @@ int main(int argc, char** argv)
   }
 
   fuzzy::EditCosts edit_cost(insert_cost, delete_cost, replace_cost);
+  fuzzy::IndexType filter_type;
+  if (filter_type_str == "bm25")
+    filter_type = fuzzy::IndexType::BM25;
+  else
+    filter_type = fuzzy::IndexType::SUFFIX;
+#ifdef NO_EIGEN
+  assert(filter_type != fuzzy::IndexType::BM25);
+#endif
+  const fuzzy::FilterIndexParams filter_index_params(bm25_ratio_idf, 1.5, 0.75);
   processor O(pt, fuzzy, contrastive_factor, nmatch, no_perfect,
               min_subseq_length, min_subseq_ratio,
               idf_penalty, subseq_idf_weighting,
               max_tokens_in_pattern, edit_cost,
-              contrastive_reduce, contrastive_buffer);
+              contrastive_reduce, contrastive_buffer,
+              filter_type, bm25_buffer, bm25_cutoff, filter_index_params);
 
   if (index_file.length()) {
     TICK("Loading index_file: "+index_file);
@@ -428,8 +464,8 @@ int main(int argc, char** argv)
       return 2;
     }
 
-    TICK("Sorting Index");
-    O._fuzzyMatcher.sort();
+    TICK("Preparing Index");
+    O._fuzzyMatcher.prepare();
 
     // work
     if (action == "index")

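Note on the `--filter-type` handling above: any value other than `bm25` falls back to the suffix array, and the `NO_EIGEN` guard relies on `assert`, which is compiled out when `NDEBUG` is defined. A stricter mapping could look like the sketch below. It is only an illustration: `parse_filter_type` and its `bm25_available` flag are hypothetical, and the local `IndexType` enum stands in for `fuzzy::IndexType` so the snippet compiles on its own.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Local stand-in for fuzzy::IndexType so the sketch is self-contained.
enum class IndexType { SUFFIX, BM25 };

// Hypothetical helper: map the --filter-type string to an IndexType.
// Unknown names and an unavailable BM25 backend are reported as errors
// instead of silently falling back to the suffix array.
inline IndexType parse_filter_type(const std::string& name,
                                   bool bm25_available = true) {
  if (name == "suffix-array")
    return IndexType::SUFFIX;
  if (name == "bm25") {
    if (!bm25_available)
      throw std::invalid_argument("bm25 filter requested but the build has no Eigen support");
    return IndexType::BM25;
  }
  throw std::invalid_argument("unknown filter type: " + name);
}
```

In `main`, the flag could be derived from the `NO_EIGEN` macro, and the exception turned into an error message and a non-zero exit code. This is a suggestion, not something the PR does.
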
diff --git a/include/fuzzy/bm25.hh b/include/fuzzy/bm25.hh
@@ -0,0 +1,95 @@
+#pragma once
+
+#include <string>
+#include <vector>
+#include <iostream>
+#include <ostream>
+#include <algorithm>
+#include <unordered_set>
+#include <unordered_map>
+#include <math.h> 
+#include <Eigen/Sparse>
+
+#include <fuzzy/utils.hh>
+#include <fuzzy/filter.hh>
+
+#include <boost/multi_array.hpp>
+#include <boost/format.hpp>
+#include <boost/container/vector.hpp>
+#include <boost/unordered_map.hpp>
+#include <boost/serialization/unordered_map.hpp>
+#include <boost/serialization/serialization.hpp>
+#include <boost/serialization/vector.hpp>
+#include <boost/serialization/split_member.hpp>
+#include <boost/serialization/version.hpp>
+#include <boost/serialization/array.hpp>
+
+namespace fuzzy
+{
+  typedef Eigen::SparseMatrix<float> SpMat;
+  typedef Eigen::Triplet<float> Triplet;
+  // Sentence ID -> BM25-score
+  class BM25 : public Filter
+  {
+  public:
+    BM25(const FilterIndexParams &params=FilterIndexParams());
+    ~BM25();
+    unsigned add_sentence(const std::vector<unsigned>& sentence) override;
+
+    using Filter::dump;
+    using Filter::num_sentences;
+    using Filter::get_sentence;
+
+    void prepare(size_t vocab_size);
+
+    std::ostream& dump(std::ostream&) const;
+
+    unsigned get_sentence_length(size_t s_id) const;
+
+    float bm25_score_pattern(
+      unsigned s_id,
+      std::vector<unsigned> pattern_wids) const;
+
+    float bm25_score(
+      int term,
+      int s_id,
+      float avg_doc_length,
+      float tf,
+      std::vector<float>& idf);
+
+    std::vector<std::vector<int>> get_vec_candidates(const std::vector<unsigned>& pattern_wids) const;
+
+    inline int get_vocab_size() const { return _vocab_size; }
+    Eigen::SparseVector<float> compute_product(const Eigen::SparseVector<float>& pattern_voc) const;
+
+  private:
+    size_t _vocab_size;
+
+    // inverse index to access sentences that contain a given term, to be serialized
+    std::unordered_map<int, std::vector<int>> _inverse_index;
+    // BM25 (t, d) to be serialized
+    std::vector<std::pair<std::pair<int, int>, float>> _key_value_bm25;
+    // Sparse matrix of BM25 (t, d) cache
+    SpMat _bm25_inverse_index;
+
+    // BM25 usual parameters
+    const float _k1;
+    const float _b;
+    // Prefilter reverse index idf ratio
+    const float _ratio_idf;
+
+    friend class boost::serialization::access;
+
+    template<class Archive>
+    void save(Archive&, unsigned int version) const;
+
+    template<class Archive>
+    void load(Archive&, unsigned int version);
+
+    BOOST_SERIALIZATION_SPLIT_MEMBER()
+  };
+}
+
+BOOST_CLASS_VERSION(fuzzy::BM25, 1)
+
+#include "fuzzy/bm25.hxx"
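
Note on `_inverse_index` and `_ratio_idf`: the implementation lives in `bm25.hxx`, which is not part of this excerpt. As a rough illustration of the prefilter described in the README, here is a sketch that builds an inverse index and drops terms occurring in too many sentences. The function name `build_inverse_index` and the inclusive threshold are assumptions, and the real code may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// term id -> ids of the sentences containing that term
using InverseIndex = std::unordered_map<unsigned, std::vector<unsigned>>;

// Hypothetical sketch: build the inverse index, then discard every term that
// appears in more than max_ratio * (number of sentences) sentences.
inline InverseIndex build_inverse_index(
    const std::vector<std::vector<unsigned>>& sentences, double max_ratio) {
  assert(max_ratio > 0.0 && max_ratio < 1.0);

  InverseIndex index;
  for (unsigned s_id = 0; s_id < sentences.size(); ++s_id) {
    // Deduplicate so each sentence is listed at most once per term.
    const std::unordered_set<unsigned> seen(sentences[s_id].begin(),
                                            sentences[s_id].end());
    for (unsigned term : seen)
      index[term].push_back(s_id);
  }

  // Prefilter: frequent terms carry little information and are expensive to score.
  const double max_postings = max_ratio * static_cast<double>(sentences.size());
  for (auto it = index.begin(); it != index.end();) {
    if (static_cast<double>(it->second.size()) > max_postings)
      it = index.erase(it);
    else
      ++it;
  }
  return index;
}
```

With four sentences and a ratio of 0.5, a term present in all four sentences is dropped, while a term present in exactly two is kept. Lowering the ratio shrinks the index and the number of candidates to score, at the cost of ignoring more terms.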