Use deep learning .. woohoo (#28)
* chore: update baselines

* refactor: make filter_tab reusable

* refactor: remove useless results

* docs: update todo

* feat: support tokenization

But! Don't do that ... it doesn't work well

* feat: RNN training

* feat: same eval as train_crf, results not too bad

* refactor: reorganize 1

* fix: do cross validation properly

* feat: add (scaled) vector features

* fix: slightly better scaling

* feat: more features

* docs: record of scores

* fix: featnames redundant

* feat: use the knowledge based features too

* fix: use 4 dimensions

* feat: embed things and report

* feat: take best model

* fix: scale by page height for robustness

* feat: 64

* feat: weight up B

* fix: smaller

* fix: same width for all embeddings is best

* docs: some results

* docs: more todo

* docs: todo

* refactor: refactor

* feat: CRF output layer, working finally!

* fix: horrible python error

* fix: lower lr

* feat: more parameters

* docs: scores

* docs: more todos

* refactor: move CSV without PDF to a subdirectory

* chore: retrain

* feat: derp lerning for the win!

* feat: refactor out rnn/rnncrf stuff and add dimensions

* feat: parameters (and best results of search)

* docs: todo

* feat: better updates

* feat: data processing for LayoutLM

* fix: Figure doesn't belong there

* fix: normalize box

* fix: ensure box

* fix: remove bogus line

* fix: make repeatable so it does not seem like random crashing!

* fix: fix some errors

* fix(test): fix test

* fix: dropout not useful

* chore: update scores

* feat: enable test mode for rnn

* docs: todo

* fix: poutyne removal

* feat: RNN/CRF equivalence for segmentation

* fix: if there is no zonage

* fix: no need for batch in predict

* feat: do not early stop by default

* feat: cross validation

* docs: cross validation layoutlm

* feat: try to weight

* feat: weight labels

* chore: updates

* feat: support bonly, tonly, iobonly

* feat: use tonly

* docs: various only results

* docs: more scores

* Revert "fix: Figure a pas daffaire la"

This reverts commit 27a39e1.

* refactor: patches to patches

* fix: better crf-rnn with allennlp

* feat: try label weights

* feat: make train_rnn_crf work with acc/f1 and weights

* feat: synchronize train_rnn_crf with train_rnn

* feat: standardize rnn and rnncrf scripts

* feat: test with majority vote rnn

* fix: label weights are exponential it seems

* fix: weight transitions too (it is better)

* docs: minor updates

* docs: scores for best rnn-crf

* fix: no need for separate test_rnn_crf

* feat: add voting for CRF

* feat: reuse RNN code

* feat: enable --labels bonly and decoding

* feat: initialize RNN-CRF from RNN (helps a lot)

* feat: train and support RNN and RNN-CRF

* feat: add rnn+crf training

* chore: retrain

* feat: workflow (will it work...flow?)

* fix: format and lint
dhdaines authored Jul 16, 2024
1 parent 33e1cfe commit ac4274a
Showing 71 changed files with 20,905 additions and 480 deletions.
3 changes: 2 additions & 1 deletion .github/workflows/analyse.yml
@@ -27,6 +27,7 @@ jobs:
- name: Install
run: |
python3 -m pip install --upgrade pip
python3 -m pip install torch --index-url https://download.pytorch.org/whl/cpu
python3 -m pip install -e .
- name: Cache downloads
id: cache-downloads
@@ -49,7 +50,7 @@ jobs:
done
- name: Extract
run: |
alexi -v extract -m download/index.json download/*.pdf
alexi -v extract --model alexi/models/rnn_crf.pt -m download/index.json download/*.pdf
- name: Index
run: |
alexi -v index export
2 changes: 1 addition & 1 deletion README.md
@@ -101,7 +101,7 @@ annotations.

Once you are satisfied with the result, simply copy `1314-page1.csv`
to the `data` directory and retrain the model with
`scripts/retrain.sh`.
`hatch run train`.

Extraction of relevant zoning categories
----------------------------------------------
124 changes: 96 additions & 28 deletions TODO.md
@@ -1,24 +1,3 @@
Tracking some time
------------------

- fix sous-section links 1h
- test case 15 min
- fuzzy matching of element numbers w/sequence 30 min
- deploy 15 min

- links to categories 1h
- test case 15 min
- collect cases and implement 30 min

- links to zones 2h
- implement zone query in ZONALDA (with centroid) 1h30
- extract zone links (multiple usually) 30min

- links to usages 1h30
- test case 15 min
- analysis function 45 min
- linking as above 30 min

Immediate fixes/enhancements
----------------------------

@@ -38,11 +17,102 @@ Immediate fixes/enhancements
DERP LERNING
------------

- Segmentation
- Retokenize CSVs using CamemBERT tokenizer (spread features on pieces)
- Train PyTorch-CRF: https://pytorch-crf.readthedocs.io/en/stable/
- possibly use Skorch to do evaluation: https://skorch.readthedocs.io/en/stable/
Segmentation
============

- Retokenize CSVs using CamemBERT tokenizer (spread features on pieces; see the sketch after this list) DONE
- Train a BiLSTM model with vsl features DONE
- Learning rate decay and early stopping DONE
- Embed words and categorical features DONE
- Use same evaluator as CRF training for comparison DONE
- Scale layout features by page size and include as vector DONE
- Retrain from full dataset + patches
- early stopping? sample a dev set?
- Do extraction and qualitative evaluation
- sort for batch processing then unsort afterwards
- CRF output layer DONE
- Ensemble RNN DONE
- Viterbi decoding (with allowed transitions only) on RNN outputs DONE
- Could *possibly* train a CRF to do this, in fact DONE
- Tokenize from chars
- Add functionality to pdfplumber
- Use Transformers for embeddings
- Heuristic pre-chunking as described below
- Either tokenize from chars (above) or use first embedding per word
- Probably project 768 dimensions down to something smaller
- Do prediction with Transformers (LayoutLM) DONE
- heuristic chunking based on line gap (not indent) DONE
- Do prediction with Transformers (CamemBERT)
- Do prediction with Transformers (CamemBERT + vector feats)
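
A minimal sketch of the "spread features on pieces" retokenization idea marked DONE above, assuming the HuggingFace transformers fast-tokenizer API; the word rows and feature names are illustrative only, not the project's actual CSV schema:

```python
from transformers import AutoTokenizer

# Hypothetical word-level rows as produced by the CSV conversion step;
# "x0", "top" and "tag" are stand-ins for the real feature columns.
words = [
    {"text": "ARTICLE", "x0": 72.0, "top": 100.0, "tag": "B-Article"},
    {"text": "12", "x0": 130.0, "top": 100.0, "tag": "I-Article"},
]

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
enc = tokenizer([w["text"] for w in words], is_split_into_words=True)

pieces = []
for token_id, word_id in zip(enc["input_ids"], enc.word_ids()):
    if word_id is None:  # skip special tokens (<s>, </s>)
        continue
    feats = dict(words[word_id])  # every piece inherits its word's features
    feats["piece"] = tokenizer.convert_ids_to_tokens(token_id)
    # Note: this keeps the word's B- tag on *all* of its pieces; whether to
    # demote non-initial pieces to I- is a separate design decision.
    pieces.append(feats)

for p in pieces:
    print(p["piece"], p["x0"], p["tag"])
```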


Segmentation results
====================

- Things that helped
- RNN helps overall, particularly on unseen data (using the
"patches" as a test set)
- use all the manually created features and embed them with >=4 dimensions
- deltas and delta-deltas
- scale all the things by page size (slightly less good than by
abs(max(feats)) but probably more robust)
- upweight B- tags by 2.0
- weight all tags by inverse frequency (works even better than B- * 2.0; sketched after this list)
- taking the best model using f1_macro (requires ensemble or dev set)
- ensemble of cross-validation folds (allows early stopping as well)
- in *theory* dropout would give us this benefit too but no
- Training CRF on top of pre-trained RNN
- Don't constrain transitions (see below)
- Do freeze all RNN parameters
- Can just do it for one epoch if you want (if not, save the RNN outputs...)
- Inconclusive
- GRU or plain RNN with lower learning rate
- LSTM is maybe overparameterized?
- Improves label accuracy quite a lot but mean F1 not really
- This seems to be a consequence of lower learning rate, not cell type
- LayoutLM
- pretrained on wrong language
- layout features possibly suboptimal for this task
- but need to synchronize evaluation metrics to be sure!
- Things that did not help
- CamemBERT tokenizer doesn't work well for CRFs, possibly due to:
- all subwords have the same position, so layout features are wrong
- hand-crafted features maybe don't work the same on subwords (leading _ thing)
- weighting classes by inverse frequency (just upweight B as it's what we care about)
- more LSTM layers
- much wider LSTM
- much narrower LSTM
- dropout on LSTM layers
- extra feedforward layer
- dropout on extra feedforward layer
- wider word embeddings
- CRF output layer (trained end-to-end)
- Training is *much* slower
- Raw accuracy is consistently a bit better.
- Macro-F1 though is not as good (over B- tags)
- Imbalanced data is an issue and weighting is more difficult
- Definitely weight transitions and emissions (helps)
- Have to weight "up", can't weight "down"
- Weighting by exp(1.0 / count) better than nothing
- Weighting by exp(1.0 / B-count) not helpful
- Weighting by exp(1.0 / (B-count + I-count)) not helpful
- Applying Viterbi to RNN output shows why
- Sequence constraints favour accuracy of I over B
- Weighted RNN training favours correct Bs, but Is can change
mid-sequence, which we don't care about
- Decoding with constraints forces B and I to agree, improving
overall accuracy by fixing incorrect Is but flipping some
correct Bs in the process (see the Viterbi sketch after this list)
- Confirmed, Viterbi with --labels bonly gives (nearly) same
results as non-Viterbi
- Training RNN-CRF with --labels bonly
- Not sure why since it does help for discrete CRF?!
- Things yet to be tried
- pre-trained or pre-computed word embeddings
- label smoothing
- feedforward layer before RNN
- dropout in other places
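
To make the inverse-frequency tag weighting under "Things that helped" concrete, here is a sketch only (not the repository's code), assuming a PyTorch cross-entropy loss over IOB tags; the tag inventory and counts are made up, and the notes above also mention exponential weights for the CRF variant:

```python
from collections import Counter

import torch
from torch import nn

# Hypothetical tag inventory and corpus counts; the real numbers would
# come from the training CSVs.
tags = ["O", "B-Article", "I-Article", "B-Alinea", "I-Alinea"]
counts = Counter({"O": 120_000, "B-Article": 900, "I-Article": 30_000,
                  "B-Alinea": 2_500, "I-Alinea": 45_000})

total = sum(counts.values())
# Inverse-frequency weights: the rare B- tags we care about get weighted up.
weights = torch.tensor([total / counts[t] for t in tags], dtype=torch.float)
weights = weights / weights.mean()  # keep the overall loss scale comparable

criterion = nn.CrossEntropyLoss(weight=weights, ignore_index=-100)

# logits: (batch, seq_len, n_tags) from the BiLSTM; targets: (batch, seq_len)
logits = torch.randn(2, 8, len(tags))
targets = torch.randint(0, len(tags), (2, 8))
loss = criterion(logits.reshape(-1, len(tags)), targets.reshape(-1))
print(float(loss))
```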

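To make the constrained-decoding point concrete, here is a small self-contained Viterbi sketch over RNN output scores (not the allennlp-based code in the repository) in which an I-X tag is only allowed after B-X or I-X of the same type; the tag set and emission scores are fabricated:

```python
import numpy as np

tags = ["O", "B-Article", "I-Article"]
NEG_INF = -1e9

def allowed(prev: str, cur: str) -> bool:
    # An I-X tag may only follow B-X or I-X of the same type.
    if cur.startswith("I-"):
        kind = cur[2:]
        return prev in (f"B-{kind}", f"I-{kind}")
    return True

# Transition scores: 0 for allowed transitions, a large penalty otherwise.
transition = np.array([[0.0 if allowed(p, c) else NEG_INF for c in tags]
                       for p in tags])

def viterbi(emissions: np.ndarray) -> list[str]:
    """emissions: (seq_len, n_tags) log-probabilities from the RNN."""
    n, k = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + transition + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    best = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        best.append(int(back[t, best[-1]]))
    return [tags[i] for i in reversed(best)]

# Fabricated emissions whose raw argmax would be [O, I-Article, I-Article],
# i.e. an illegal O -> I-Article jump at the second position.
emissions = np.log(np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.1, 0.5],
    [0.1, 0.1, 0.8],
]))
print(viterbi(emissions))  # ['B-Article', 'I-Article', 'I-Article']
```
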
Documentation
-------------

@@ -69,9 +139,7 @@ Unprioritized future stuff
- have to hack the crfsuite file to do this
- not at all easy to do with sklearn-crfsuite magical pickling
- otherwise ... treat I-B as B-B when following O or I-A (as before)
- workflow for correcting individual pages
- convenience functions for "visual debugging" in pdfplumber style
- instructions to identify and extract CSV for page
- investigate using a different CRF library
- tune regularization (some more)
- compare memory footprint of main branch versus html_output
- levels of lists
10 changes: 7 additions & 3 deletions alexi/__init__.py
@@ -23,7 +23,7 @@
from .label import Identificateur
from .search import search
from .segment import DEFAULT_MODEL as DEFAULT_SEGMENT_MODEL
from .segment import Segmenteur
from .segment import RNNSegmenteur, Segmenteur

LOGGER = logging.getLogger("alexi")
VERSION = "0.4.0"
@@ -59,7 +59,11 @@ def convert_main(args: argparse.Namespace):

def segment_main(args: argparse.Namespace):
"""Segmenter un CSV"""
crf = Segmenteur(args.model)
crf: Segmenteur
if args.model.suffix == ".pt":
crf = RNNSegmenteur(args.model)
else:
crf = Segmenteur(args.model)
reader = csv.DictReader(args.csv)
write_csv(crf(reader), sys.stdout)

@@ -140,7 +144,7 @@ def make_argparse() -> argparse.ArgumentParser:
"segment", help="Segmenter et étiquetter les segments d'un CSV"
)
segment.add_argument(
"--model", help="Modele CRF", type=Path, default=DEFAULT_SEGMENT_MODEL
"--model", help="Modele CRF ou RNN", type=Path, default=DEFAULT_SEGMENT_MODEL
)
segment.add_argument(
"csv",
13 changes: 8 additions & 5 deletions alexi/convert.py
@@ -5,9 +5,8 @@
import logging
import operator
from collections import deque
from io import BufferedReader, BytesIO
from pathlib import Path
from typing import Any, Iterable, Iterator, Optional, TextIO, Union
from typing import Any, Iterable, Iterator, Optional, TextIO

from pdfplumber import PDF
from pdfplumber.page import Page
@@ -123,15 +122,17 @@ def get_word_features(

class Converteur:
pdf: PDF
path: Path
tree: Optional[PDFStructTree]
y_tolerance: int

def __init__(
self,
path_or_fp: Union[str, Path, BufferedReader, BytesIO],
path: Path,
y_tolerance: int = 2,
):
self.pdf = PDF.open(path_or_fp)
self.pdf = PDF.open(path)
self.path = path
self.y_tolerance = y_tolerance
try:
# Get the tree for the *entire* document since elements
@@ -180,7 +181,9 @@ def extract_words(self, pages: Optional[Iterable[int]] = None) -> Iterator[T_obj
continue
if word["x1"] > page.width or word["bottom"] > page.height:
continue
yield get_word_features(word, page, chars, elmap)
feats = get_word_features(word, page, chars, elmap)
feats["path"] = str(self.path)
yield feats

def make_bloc(
self, el: PDFStructElement, page_number: int, mcids: Iterable[int]
11 changes: 8 additions & 3 deletions alexi/extract.py
@@ -19,7 +19,7 @@
from alexi.label import Identificateur
from alexi.link import Resolver
from alexi.segment import DEFAULT_MODEL as DEFAULT_SEGMENT_MODEL
from alexi.segment import DEFAULT_MODEL_NOSTRUCT, Segmenteur
from alexi.segment import DEFAULT_MODEL_NOSTRUCT, RNNSegmenteur, Segmenteur
from alexi.types import T_obj

LOGGER = logging.getLogger("extract")
@@ -39,7 +39,7 @@ def add_arguments(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
help="Ne pas utiliser le CSV de référence",
action="store_true",
)
parser.add_argument("--segment-model", help="Modele CRF", type=Path)
parser.add_argument("--segment-model", help="Modele CRF/RNN", type=Path)
parser.add_argument(
"--label-model", help="Modele CRF", type=Path, default=DEFAULT_LABEL_MODEL
)
@@ -329,6 +329,8 @@ def make_doc_tree(docs: list[Document], outdir: Path) -> dict[str, dict[str, str


class Extracteur:
crf: Segmenteur

def __init__(
self,
outdir: Path,
@@ -340,7 +342,10 @@ def __init__(
self.outdir = outdir
self.crf_s = Identificateur()
if segment_model is not None:
self.crf = Segmenteur(segment_model)
if segment_model.suffix == ".pt":
self.crf = RNNSegmenteur(segment_model)
else:
self.crf = Segmenteur(segment_model)
self.crf_n = None
else:
self.crf = Segmenteur(DEFAULT_SEGMENT_MODEL)
2 changes: 2 additions & 0 deletions alexi/link.py
@@ -71,6 +71,8 @@ def __call__(
return self.resolve_internal(text, srcpath, doc)

def resolve_zonage(self, text: str, srcpath: str) -> Optional[str]:
if self.metadata.get("zonage") is None:
return None
m = MILIEU_RE.search(text)
if m is None:
return None
Binary file modified alexi/models/crfseq.joblib.gz