
Use deep learning .. woohoo #28

Merged
90 commits merged on Jul 16, 2024
a3c52ce
chore: update baselines
dhdaines Jun 26, 2024
42a0f71
refactor: make filter_tab reusable
dhdaines Jun 26, 2024
056bdb6
refactor: remove useless results
dhdaines Jun 27, 2024
eca4df8
docs: update todo
dhdaines Jun 27, 2024
5d715d8
feat: support tokenization
dhdaines Jun 27, 2024
d88b7a1
feat: RNN training
dhdaines Jun 28, 2024
e825586
feat: same eval as train_crf, results not too bad
dhdaines Jun 28, 2024
f9775bc
refactor: reorganize 1
dhdaines Jun 28, 2024
1b1222f
fix: do cross validation properly
dhdaines Jun 28, 2024
6fce712
feat: add (scaled) vector features
dhdaines Jul 2, 2024
e66c0a6
fix: slightly better scaling
dhdaines Jul 2, 2024
d061594
feat: more features
dhdaines Jul 2, 2024
a8f9b82
docs: record of scores
dhdaines Jul 2, 2024
957a9f5
fix: featnames redundant
dhdaines Jul 2, 2024
6094af7
feat: use the knowledge based features too
dhdaines Jul 3, 2024
7bf8370
fix: use 4 dimensions
dhdaines Jul 3, 2024
9f23f46
feat: embed things and report
dhdaines Jul 3, 2024
9ae6742
feat: take best model
dhdaines Jul 3, 2024
32dcbc2
fix: scale by page height for robust
dhdaines Jul 3, 2024
977e09a
feat: 64
dhdaines Jul 3, 2024
dc3679c
feat: weight up B
dhdaines Jul 3, 2024
163cba1
fix: smaller
dhdaines Jul 3, 2024
59464ad
fix: same width for all embeddings is best
dhdaines Jul 3, 2024
da4a603
docs: some results
dhdaines Jul 3, 2024
2c9cafd
docs: more todo
dhdaines Jul 3, 2024
7607b4b
docs: todo
dhdaines Jul 3, 2024
98de1a2
refactor: refactor
dhdaines Jul 3, 2024
20f7a48
feat: CRF output layer, working finally!
dhdaines Jul 3, 2024
1beb864
fix: horrible python error
dhdaines Jul 3, 2024
615ce0a
fix: lower lr
dhdaines Jul 3, 2024
0ed5d58
feat: more parameters
dhdaines Jul 3, 2024
b3db683
docs: scores
dhdaines Jul 4, 2024
46fdd5c
docs: more todos
dhdaines Jul 6, 2024
2ea590a
refactor: move CSV without PDF to a subdirectory
dhdaines Jul 7, 2024
bdece2c
chore: retrain
dhdaines Jul 7, 2024
e4f75a0
feat: derp lerning for the win!
dhdaines Jul 7, 2024
d9b5b10
feat: refactor out rnn/rnncrf stuff and add dimensions
dhdaines Jul 7, 2024
0ca9a71
feat: parameters (and best results of search)
dhdaines Jul 8, 2024
aacf31d
docs: tdo
dhdaines Jul 8, 2024
e6a70c3
feat: better updates
dhdaines Jul 8, 2024
7b8bede
feat: data processing for LayoutLM
dhdaines Jul 8, 2024
27a39e1
fix: Figure doesn't belong there
dhdaines Jul 8, 2024
b9cb436
fix: normalize box
dhdaines Jul 8, 2024
e9d69f9
fix: ensure box
dhdaines Jul 8, 2024
e57c4d4
fix: remove bogus line
dhdaines Jul 9, 2024
73e4a13
fix: make repeatable so it does not seem like random crashing!
dhdaines Jul 9, 2024
ee1f2f5
Merge remote-tracking branch 'origin/more_derp_lerning' into more_der…
dhdaines Jul 9, 2024
b404332
fix: fix some errors
dhdaines Jul 9, 2024
012aa57
fix(test): fix test
dhdaines Jul 9, 2024
35221fa
fix: dropout not useful
dhdaines Jul 9, 2024
bcf5591
chore: update scores
dhdaines Jul 9, 2024
168177c
feat: enable test mode for rnn
dhdaines Jul 9, 2024
5066606
docs: todo
dhdaines Jul 9, 2024
9f07412
fix: poutyne removal
dhdaines Jul 9, 2024
d93d469
feat: RNN/CRF equivalence for segmentation
dhdaines Jul 9, 2024
ff48c8b
fix: if there is no zoning
dhdaines Jul 9, 2024
c75ab0c
fix: no need for batch in predict
dhdaines Jul 9, 2024
d1e0ee2
feat: do not early stop by default
dhdaines Jul 9, 2024
8a1c6ee
feat: cross validation
dhdaines Jul 9, 2024
519e7da
docs: cross validation layoutlm
dhdaines Jul 9, 2024
c8b13d5
feat: try to weight
dhdaines Jul 10, 2024
cbc0a0a
feat: weight labels
dhdaines Jul 10, 2024
3a8539c
chore: updates
dhdaines Jul 10, 2024
8329e8b
feat: support bonly, tonly, iobonly
dhdaines Jul 10, 2024
2bffab3
feat: use tonly
dhdaines Jul 10, 2024
5346c47
docs: various only results
dhdaines Jul 10, 2024
10852f9
docs: more scores
dhdaines Jul 10, 2024
921b85d
Merge branch 'main' into more_derp_lerning
dhdaines Jul 10, 2024
774fbc2
Revert "fix: Figure doesn't belong there"
dhdaines Jul 10, 2024
820da70
refactor: patches to patches
dhdaines Jul 10, 2024
c7b7a10
fix: better crf-rnn with allennlp
dhdaines Jul 11, 2024
65d20ae
feat: try label weights
dhdaines Jul 11, 2024
ac9a8ac
feat: make train_rnn_crf work with acc/f1 and weights
dhdaines Jul 12, 2024
f0c1ba6
feat: synchronize train_rnn_crf with train_rnn
dhdaines Jul 12, 2024
cd8de0f
feat: standardize rnn and rnncrf scripts
dhdaines Jul 15, 2024
1460096
feat: test with majority vote rnn
dhdaines Jul 15, 2024
144a8fa
fix: label weights are exponential it seems
dhdaines Jul 15, 2024
0c9da54
fix: weight transitions too (it is better)
dhdaines Jul 15, 2024
48ea479
docs: minor updates
dhdaines Jul 15, 2024
c0558d4
docs: scores for best rnn-crf
dhdaines Jul 15, 2024
d1b63a8
fix: no need for separate test_rnn_crf
dhdaines Jul 15, 2024
b8a75eb
feat: add voting for CRF
dhdaines Jul 15, 2024
947c350
feat: reuse RNN code
dhdaines Jul 15, 2024
66accf3
feat: enable --labels bonly and decoding
dhdaines Jul 16, 2024
ce8ca79
feat: initialize RNN-CRF from RNN (helps a lot)
dhdaines Jul 16, 2024
fe126a7
feat: train and support RNN and RNN-CRF
dhdaines Jul 16, 2024
2f47f04
feat: add rnn+crf training
dhdaines Jul 16, 2024
c81e9f4
chore: retrain
dhdaines Jul 16, 2024
5cf98a1
feat: workflow (will it work...flow?)
dhdaines Jul 16, 2024
25e789e
fix: format and lint
dhdaines Jul 16, 2024
3 changes: 2 additions & 1 deletion .github/workflows/analyse.yml
@@ -27,6 +27,7 @@ jobs:
- name: Install
run: |
python3 -m pip install --upgrade pip
python3 -m pip install torch --index-url https://download.pytorch.org/whl/cpu
python3 -m pip install -e .
- name: Cache downloads
id: cache-downloads
@@ -49,7 +50,7 @@ jobs:
done
- name: Extract
run: |
alexi -v extract -m download/index.json download/*.pdf
alexi -v extract --model alexi/models/rnn_crf.pt -m download/index.json download/*.pdf
- name: Index
run: |
alexi -v index export
2 changes: 1 addition & 1 deletion README.md
@@ -101,7 +101,7 @@ annotations.

Once satisfied with the result, simply copy `1314-page1.csv`
to the `data` directory and retrain the model with
`scripts/retrain.sh`.
`hatch run train`.

Extraction of relevant zoning categories
----------------------------------------------
124 changes: 96 additions & 28 deletions TODO.md
@@ -1,24 +1,3 @@
Tracking some time
------------------

- fix sous-section links 1h
- test case 15 min
- fuzzy matching of element numbers w/sequence 30 min
- deploy 15 min

- links to categories 1h
- test case 15 min
- collect cases and implement 30 min

- links to zones 2h
- implement zone query in ZONALDA (with centroid) 1h30
- extract zone links (multiple usually) 30min

- links to usages 1h30
- test case 15 min
- analysis function 45 min
- linking as above 30 min

Immediate fixes/enhancements
----------------------------

@@ -38,11 +17,102 @@ Immediate fixes/enhancements
DERP LERNING
------------

- Segmentation
- Retokenize CSVs using CamemBERT tokenizer (spread features on pieces)
- Train PyTorch-CRF: https://pytorch-crf.readthedocs.io/en/stable/
- possibly use Skorch to do evaluation: https://skorch.readthedocs.io/en/stable/
Segmentation
============

- Retokenize CSVs using CamemBERT tokenizer (spread features on pieces) DONE
- Train a BiLSTM model with vsl features DONE
- Learning rate decay and early stopping DONE
- Embed words and categorical features DONE
- Use same evaluator as CRF training for comparison DONE
- Scale layout features by page size and include as vector DONE
- Retrain from full dataset + patches
- early stopping? sample a dev set?
- Do extraction and qualitative evaluation
- sort for batch processing then unsort afterwards
- CRF output layer DONE
- Ensemble RNN DONE
- Viterbi decoding (with allowed transitions only) on RNN outputs DONE
- Could *possibly* train a CRF to do this, in fact DONE
- Tokenize from chars
- Add functionality to pdfplumber
- Use Transformers for embeddings
- Heuristic pre-chunking as described below
- Either tokenize from chars (above) or use first embedding per word
- Probably project 768 dimensions down to something smaller
- Do prediction with Transformers (LayoutLM) DONE
- heuristic chunking based on line gap (not indent) DONE
- Do prediction with Transformers (CamemBERT)
- Do prediction with Transformers (CamemBERT + vector feats)

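The first item in the list above — retokenizing with the CamemBERT tokenizer and spreading word-level features onto subword pieces — can be sketched roughly as below. This is a hypothetical illustration, not the project's actual code: `spread_features` and the `subword_lengths` argument stand in for the piece counts a CamemBERT (SentencePiece) tokenizer would produce.

```python
# Hypothetical sketch: copy each word's features onto its subword pieces,
# marking continuation pieces so that B- tags are not duplicated.

def spread_features(words, features, subword_lengths):
    """Repeat each word's feature dict across its subword pieces."""
    pieces = []
    for word, feats, n in zip(words, features, subword_lengths):
        for i in range(n):
            piece = dict(feats)
            piece["word"] = word
            piece["is_continuation"] = i > 0  # only the first piece keeps a B- tag
            pieces.append(piece)
    return pieces

pieces = spread_features(
    ["Règlement", "de", "zonage"],
    [{"tag": "B-Titre"}, {"tag": "I-Titre"}, {"tag": "I-Titre"}],
    [3, 1, 2],  # made-up piece counts, e.g. "Règlement" splits into 3 pieces
)
print(len(pieces))  # 6
```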

Segmentation results
====================

- Things that helped
- RNN helps overall, particularly on unseen data (using the
"patches" as a test set)
- use all the manually created features and embed them with >=4 dimensions
- deltas and delta-deltas
- scale all the things by page size (slightly less good than by
abs(max(feats)) but probably more robust)
- upweight B- tags by 2.0
- weight all tags by inverse frequency (works even better than B- * 2.0)
- taking the best model using f1_macro (requires ensemble or dev set)
- ensemble of cross-validation folds (allows early stopping as well)
- in *theory* dropout would give us this benefit too, but in practice it did not
- Training CRF on top of pre-trained RNN
- Don't constrain transitions (see below)
- Do freeze all RNN parameters
- Can just do it for one epoch if you want (if not, save the RNN outputs...)
- Inconclusive
- GRU or plain RNN with lower learning rate
- LSTM is maybe overparameterized?
- Improves label accuracy quite a lot but mean F1 not really
- This seems to be a consequence of lower learning rate, not cell type
- LayoutLM
- pretrained on wrong language
- layout features possibly suboptimal for this task
- but need to synchronize evaluation metrics to be sure!
- Things that did not help
- CamemBERT tokenizer doesn't work well for CRFs, possibly due to:
- all subwords have the same position, so layout features are wrong
- hand-crafted features maybe don't work the same on subwords (the leading SentencePiece underscore)
- weighting classes by inverse frequency (just upweight B as it's what we care about)
- more LSTM layers
- much wider LSTM
- much narrower LSTM
- dropout on LSTM layers
- extra feedforward layer
- dropout on extra feedforward layer
- wider word embeddings
- CRF output layer (trained end-to-end)
- Training is *much* slower
- Raw accuracy is consistently a bit better.
- Macro-F1 though is not as good (over B- tags)
- Imbalanced data is an issue and weighting is more difficult
- Definitely weight transitions and emissions (helps)
- Have to weight "up", can't weight "down"
- Weighting by exp(1.0 / count) better than nothing
- Weighting by exp(1.0 / B-count) not helpful
- Weighting by exp(1.0 / (B-count + I-count)) not helpful
- Applying Viterbi to RNN output shows why
- Sequence constraints favour accuracy of I over B
- Weighted RNN training favours correct Bs, but Is can change
mid-sequence, which we don't care about
- Decoding with constraints forces B and I to agree, improving
overall accuracy by fixing incorrect Is but flipping some
correct Bs in the process
- Confirmed, Viterbi with --labels bonly gives (nearly) same
results as non-Viterbi
- Training RNN-CRF with --labels bonly
- Not sure why, since it does help for the discrete CRF?!
- Things yet to be tried
- pre-trained or pre-computed word embeddings
- label smoothing
- feedforward layer before RNN
- dropout in other places
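The inverse-frequency tag weighting noted under "Things that helped" can be sketched as follows. This is a minimal illustration using the sklearn-style "balanced" formula; the exp(1.0 / count) variant mentioned above differs only in the formula, and the tag counts here are made up.

```python
from collections import Counter

def inverse_frequency_weights(tags):
    """Per-tag weights proportional to inverse frequency: rare B- tags
    get upweighted, frequent I-/O tags stay near (or below) 1."""
    counts = Counter(tags)
    total = len(tags)
    n_classes = len(counts)
    return {tag: total / (n_classes * count) for tag, count in counts.items()}

tags = ["O"] * 90 + ["B-Titre"] * 2 + ["I-Titre"] * 8
w = inverse_frequency_weights(tags)
print(round(w["B-Titre"] / w["O"], 1))  # 45.0 — B- weighted 45x relative to O
```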
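The constrained-decoding behaviour described above (forcing B and I to agree, sometimes flipping a token to B to license a following I) can be illustrated with a toy Viterbi decoder. The emission scores below are invented; the real pipeline would run this over RNN log-probabilities rather than hand-written dicts.

```python
import math

def viterbi_bio(emissions, labels):
    """Decode the best label sequence from per-token log-probabilities,
    allowing I-X only after B-X or I-X (and never at sequence start)."""
    def ok(prev, cur):
        if cur.startswith("I-"):
            return prev[:2] in ("B-", "I-") and prev[2:] == cur[2:]
        return True

    # Initialize: I- tags cannot start a sequence.
    trellis = [{
        lab: (-math.inf if lab.startswith("I-") else emissions[0][lab], None)
        for lab in labels
    }]
    for t in range(1, len(emissions)):
        col = {}
        for cur in labels:
            score, back = max(
                ((trellis[-1][prev][0]
                  + (emissions[t][cur] if ok(prev, cur) else -math.inf),
                  prev)
                 for prev in labels),
                key=lambda s: s[0],
            )
            col[cur] = (score, back)
        trellis.append(col)

    # Backtrace from the best final label.
    lab = max(labels, key=lambda l: trellis[-1][l][0])
    path = [lab]
    for t in range(len(trellis) - 1, 0, -1):
        lab = trellis[t][lab][1]
        path.append(lab)
    return path[::-1]

# Token 0 slightly prefers I-T, but I-T cannot start a sequence, so the
# decoder flips it to B-T to license the strongly preferred I-T at token 1.
emissions = [
    {"O": -0.1, "B-T": -3.0, "I-T": -0.05},
    {"O": -5.0, "B-T": -4.0, "I-T": -0.1},
]
print(viterbi_bio(emissions, ["O", "B-T", "I-T"]))  # ['B-T', 'I-T']
```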

Documentation
-------------

@@ -69,9 +139,7 @@ Unprioritized future stuff
- have to hack the crfsuite file to do this
- not at all easy to do with sklearn-crfsuite magical pickling
- otherwise ... treat I-B as B-B when following O or I-A (as before)
- workflow for correcting individual pages
- convenience functions for "visual debugging" in pdfplumber style
- instructions to identify and extract CSV for page
- investigate using a different CRF library
- tune regularization (some more)
- compare memory footprint of main branch versus html_output
- levels of lists
10 changes: 7 additions & 3 deletions alexi/__init__.py
@@ -23,7 +23,7 @@
from .label import Identificateur
from .search import search
from .segment import DEFAULT_MODEL as DEFAULT_SEGMENT_MODEL
from .segment import Segmenteur
from .segment import RNNSegmenteur, Segmenteur

LOGGER = logging.getLogger("alexi")
VERSION = "0.4.0"
@@ -59,7 +59,11 @@ def convert_main(args: argparse.Namespace):

def segment_main(args: argparse.Namespace):
"""Segmenter un CSV"""
crf = Segmenteur(args.model)
crf: Segmenteur
if args.model.suffix == ".pt":
crf = RNNSegmenteur(args.model)
else:
crf = Segmenteur(args.model)
reader = csv.DictReader(args.csv)
write_csv(crf(reader), sys.stdout)

@@ -140,7 +144,7 @@ def make_argparse() -> argparse.ArgumentParser:
"segment", help="Segmenter et étiquetter les segments d'un CSV"
)
segment.add_argument(
"--model", help="Modele CRF", type=Path, default=DEFAULT_SEGMENT_MODEL
"--model", help="Modele CRF ou RNN", type=Path, default=DEFAULT_SEGMENT_MODEL
)
segment.add_argument(
"csv",
13 changes: 8 additions & 5 deletions alexi/convert.py
@@ -5,9 +5,8 @@
import logging
import operator
from collections import deque
from io import BufferedReader, BytesIO
from pathlib import Path
from typing import Any, Iterable, Iterator, Optional, TextIO, Union
from typing import Any, Iterable, Iterator, Optional, TextIO

from pdfplumber import PDF
from pdfplumber.page import Page
@@ -123,15 +122,17 @@ def get_word_features(

class Converteur:
pdf: PDF
path: Path
tree: Optional[PDFStructTree]
y_tolerance: int

def __init__(
self,
path_or_fp: Union[str, Path, BufferedReader, BytesIO],
path: Path,
y_tolerance: int = 2,
):
self.pdf = PDF.open(path_or_fp)
self.pdf = PDF.open(path)
self.path = path
self.y_tolerance = y_tolerance
try:
# Get the tree for the *entire* document since elements
@@ -180,7 +181,9 @@ def extract_words(self, pages: Optional[Iterable[int]] = None) -> Iterator[T_obj
continue
if word["x1"] > page.width or word["bottom"] > page.height:
continue
yield get_word_features(word, page, chars, elmap)
feats = get_word_features(word, page, chars, elmap)
feats["path"] = str(self.path)
yield feats

def make_bloc(
self, el: PDFStructElement, page_number: int, mcids: Iterable[int]
11 changes: 8 additions & 3 deletions alexi/extract.py
@@ -19,7 +19,7 @@
from alexi.label import Identificateur
from alexi.link import Resolver
from alexi.segment import DEFAULT_MODEL as DEFAULT_SEGMENT_MODEL
from alexi.segment import DEFAULT_MODEL_NOSTRUCT, Segmenteur
from alexi.segment import DEFAULT_MODEL_NOSTRUCT, RNNSegmenteur, Segmenteur
from alexi.types import T_obj

LOGGER = logging.getLogger("extract")
@@ -39,7 +39,7 @@ def add_arguments(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
help="Ne pas utiliser le CSV de référence",
action="store_true",
)
parser.add_argument("--segment-model", help="Modele CRF", type=Path)
parser.add_argument("--segment-model", help="Modele CRF/RNN", type=Path)
parser.add_argument(
"--label-model", help="Modele CRF", type=Path, default=DEFAULT_LABEL_MODEL
)
@@ -329,6 +329,8 @@ def make_doc_tree(docs: list[Document], outdir: Path) -> dict[str, dict[str, str


class Extracteur:
crf: Segmenteur

def __init__(
self,
outdir: Path,
@@ -340,7 +342,10 @@ def __init__(
self.outdir = outdir
self.crf_s = Identificateur()
if segment_model is not None:
self.crf = Segmenteur(segment_model)
if segment_model.suffix == ".pt":
self.crf = RNNSegmenteur(segment_model)
else:
self.crf = Segmenteur(segment_model)
self.crf_n = None
else:
self.crf = Segmenteur(DEFAULT_SEGMENT_MODEL)
2 changes: 2 additions & 0 deletions alexi/link.py
Original file line number Diff line number Diff line change
@@ -71,6 +71,8 @@ def __call__(
return self.resolve_internal(text, srcpath, doc)

def resolve_zonage(self, text: str, srcpath: str) -> Optional[str]:
if self.metadata.get("zonage") is None:
return None
m = MILIEU_RE.search(text)
if m is None:
return None
Binary file modified alexi/models/crfseq.joblib.gz