
Use deep learning .. woohoo #28

Merged
90 commits merged on Jul 16, 2024
a3c52ce
chore: update baselines
dhdaines Jun 26, 2024
42a0f71
refactor: make filter_tab reusable
dhdaines Jun 26, 2024
056bdb6
refactor: remove useless results
dhdaines Jun 27, 2024
eca4df8
docs: update todo
dhdaines Jun 27, 2024
5d715d8
feat: support tokenization
dhdaines Jun 27, 2024
d88b7a1
feat: RNN training
dhdaines Jun 28, 2024
e825586
feat: same eval as train_crf, results not too bad
dhdaines Jun 28, 2024
f9775bc
refactor: reorganize 1
dhdaines Jun 28, 2024
1b1222f
fix: do cross validation properly
dhdaines Jun 28, 2024
6fce712
feat: add (scaled) vector features
dhdaines Jul 2, 2024
e66c0a6
fix: slightly better scaling
dhdaines Jul 2, 2024
d061594
feat: more features
dhdaines Jul 2, 2024
a8f9b82
docs: record of scores
dhdaines Jul 2, 2024
957a9f5
fix: featnames redundant
dhdaines Jul 2, 2024
6094af7
feat: use the knowledge based features too
dhdaines Jul 3, 2024
7bf8370
fix: use 4 dimensions
dhdaines Jul 3, 2024
9f23f46
feat: embed things and report
dhdaines Jul 3, 2024
9ae6742
feat: take best model
dhdaines Jul 3, 2024
32dcbc2
fix: scale by page height for robust
dhdaines Jul 3, 2024
977e09a
feat: 64
dhdaines Jul 3, 2024
dc3679c
feat: weight up B
dhdaines Jul 3, 2024
163cba1
fix: smaller
dhdaines Jul 3, 2024
59464ad
fix: same width for all embeddings is best
dhdaines Jul 3, 2024
da4a603
docs: some results
dhdaines Jul 3, 2024
2c9cafd
docs: more todo
dhdaines Jul 3, 2024
7607b4b
docs: todo
dhdaines Jul 3, 2024
98de1a2
refactor: refactor
dhdaines Jul 3, 2024
20f7a48
feat: CRF output layer, working finally!
dhdaines Jul 3, 2024
1beb864
fix: horrible python error
dhdaines Jul 3, 2024
615ce0a
fix: lower lr
dhdaines Jul 3, 2024
0ed5d58
feat: more parameters
dhdaines Jul 3, 2024
b3db683
docs: scores
dhdaines Jul 4, 2024
46fdd5c
docs: more todos
dhdaines Jul 6, 2024
2ea590a
refactor: move CSV without PDF to a subdirectory
dhdaines Jul 7, 2024
bdece2c
chore: retrain
dhdaines Jul 7, 2024
e4f75a0
feat: derp lerning for the win!
dhdaines Jul 7, 2024
d9b5b10
feat: refactor out rnn/rnncrf stuff and add dimensions
dhdaines Jul 7, 2024
0ca9a71
feat: parameters (and best results of search)
dhdaines Jul 8, 2024
aacf31d
docs: tdo
dhdaines Jul 8, 2024
e6a70c3
feat: better updates
dhdaines Jul 8, 2024
7b8bede
feat: data processing for LayoutLM
dhdaines Jul 8, 2024
27a39e1
fix: Figure doesn't belong there
dhdaines Jul 8, 2024
b9cb436
fix: normalize box
dhdaines Jul 8, 2024
e9d69f9
fix: ensure box
dhdaines Jul 8, 2024
e57c4d4
fix: remove bogus line
dhdaines Jul 9, 2024
73e4a13
fix: make repeatable so it does not seem like random crashing!
dhdaines Jul 9, 2024
ee1f2f5
Merge remote-tracking branch 'origin/more_derp_lerning' into more_der…
dhdaines Jul 9, 2024
b404332
fix: fix some errors
dhdaines Jul 9, 2024
012aa57
fix(test): fix test
dhdaines Jul 9, 2024
35221fa
fix: dropout not useful
dhdaines Jul 9, 2024
bcf5591
chore: update scores
dhdaines Jul 9, 2024
168177c
feat: enable test mode for rnn
dhdaines Jul 9, 2024
5066606
docs: todo
dhdaines Jul 9, 2024
9f07412
fix: poutyne removal
dhdaines Jul 9, 2024
d93d469
feat: RNN/CRF equivalence for segmentation
dhdaines Jul 9, 2024
ff48c8b
fix: if there is no zoning
dhdaines Jul 9, 2024
c75ab0c
fix: no need for batch in predict
dhdaines Jul 9, 2024
d1e0ee2
feat: do not early stop by default
dhdaines Jul 9, 2024
8a1c6ee
feat: cross validation
dhdaines Jul 9, 2024
519e7da
docs: cross validation layoutlm
dhdaines Jul 9, 2024
c8b13d5
feat: try to weight
dhdaines Jul 10, 2024
cbc0a0a
feat: weight labels
dhdaines Jul 10, 2024
3a8539c
chore: updates
dhdaines Jul 10, 2024
8329e8b
feat: support bonly, tonly, iobonly
dhdaines Jul 10, 2024
2bffab3
feat: use tonly
dhdaines Jul 10, 2024
5346c47
docs: various only results
dhdaines Jul 10, 2024
10852f9
docs: more scores
dhdaines Jul 10, 2024
921b85d
Merge branch 'main' into more_derp_lerning
dhdaines Jul 10, 2024
774fbc2
Revert "fix: Figure doesn't belong there"
dhdaines Jul 10, 2024
820da70
refactor: patches to patches
dhdaines Jul 10, 2024
c7b7a10
fix: better crf-rnn with allennlp
dhdaines Jul 11, 2024
65d20ae
feat: try label weights
dhdaines Jul 11, 2024
ac9a8ac
feat: make train_rnn_crf work with acc/f1 and weights
dhdaines Jul 12, 2024
f0c1ba6
feat: synchronize train_rnn_crf with train_rnn
dhdaines Jul 12, 2024
cd8de0f
feat: standardize rnn and rnncrf scripts
dhdaines Jul 15, 2024
1460096
feat: test with majority vote rnn
dhdaines Jul 15, 2024
144a8fa
fix: label weights are exponential it seems
dhdaines Jul 15, 2024
0c9da54
fix: weight transitions too (it is better)
dhdaines Jul 15, 2024
48ea479
docs: minor updates
dhdaines Jul 15, 2024
c0558d4
docs: scores for best rnn-crf
dhdaines Jul 15, 2024
d1b63a8
fix: no need for separate test_rnn_crf
dhdaines Jul 15, 2024
b8a75eb
feat: add voting for CRF
dhdaines Jul 15, 2024
947c350
feat: reuse RNN code
dhdaines Jul 15, 2024
66accf3
feat: enable --labels bonly and decoding
dhdaines Jul 16, 2024
ce8ca79
feat: initialize RNN-CRF from RNN (helps a lot)
dhdaines Jul 16, 2024
fe126a7
feat: train and support RNN and RNN-CRF
dhdaines Jul 16, 2024
2f47f04
feat: add rnn+crf training
dhdaines Jul 16, 2024
c81e9f4
chore: retrain
dhdaines Jul 16, 2024
5cf98a1
feat: workflow (will it work...flow?)
dhdaines Jul 16, 2024
25e789e
fix: format and lint
dhdaines Jul 16, 2024
3 changes: 2 additions & 1 deletion .github/workflows/analyse.yml
@@ -27,6 +27,7 @@ jobs:
- name: Install
run: |
python3 -m pip install --upgrade pip
python3 -m pip install torch --index-url https://download.pytorch.org/whl/cpu
python3 -m pip install -e .
- name: Cache downloads
id: cache-downloads
@@ -49,7 +50,7 @@ jobs:
done
- name: Extract
run: |
alexi -v extract -m download/index.json download/*.pdf
alexi -v extract --model alexi/models/rnn_crf.pt -m download/index.json download/*.pdf
- name: Index
run: |
alexi -v index export
2 changes: 1 addition & 1 deletion README.md
@@ -101,7 +101,7 @@ annotations.

Once satisfied with the result, simply copy `1314-page1.csv`
to the `data` directory and retrain the model with
`scripts/retrain.sh`.
`hatch run train`.

Extraction of relevant zoning categories
----------------------------------------------
124 changes: 96 additions & 28 deletions TODO.md
@@ -1,24 +1,3 @@
Tracking some time
------------------

- fix sous-section links 1h
- test case 15 min
- fuzzy matching of element numbers w/sequence 30 min
- deploy 15 min

- links to categories 1h
- test case 15 min
- collect cases and implement 30 min

- links to zones 2h
- implement zone query in ZONALDA (with centroid) 1h30
- extract zone links (multiple usually) 30min

- links to usages 1h30
- test case 15 min
- analysis function 45 min
- linking as above 30 min

Immediate fixes/enhancements
----------------------------

@@ -38,11 +17,102 @@ Immediate fixes/enhancements
DERP LERNING
------------

- Segmentation
- Retokenize CSVs using CamemBERT tokenizer (spread features on pieces)
- Train PyTorch-CRF: https://pytorch-crf.readthedocs.io/en/stable/
- possibly use Skorch to do evaluation: https://skorch.readthedocs.io/en/stable/
Segmentation
============

- Retokenize CSVs using CamemBERT tokenizer (spread features on pieces) DONE
- Train a BiLSTM model with vsl features DONE
- Learning rate decay and early stopping DONE
- Embed words and categorical features DONE
- Use same evaluator as CRF training for comparison DONE
- Scale layout features by page size and include as vector DONE
- Retrain from full dataset + patches
- early stopping? sample a dev set?
- Do extraction and qualitative evaluation
- sort for batch processing then unsort afterwards
- CRF output layer DONE
- Ensemble RNN DONE
- Viterbi decoding (with allowed transitions only) on RNN outputs DONE
- Could *possibly* train a CRF to do this, in fact DONE
- Tokenize from chars
- Add functionality to pdfplumber
- Use Transformers for embeddings
- Heuristic pre-chunking as described below
- Either tokenize from chars (above) or use first embedding per word
- Probably project 768 dimensions down to something smaller
- Do prediction with Transformers (LayoutLM) DONE
- heuristic chunking based on line gap (not indent) DONE
- Do prediction with Transformers (CamemBERT)
- Do prediction with Transformers (CamemBERT + vector feats)

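The first item in the list above — retokenizing with the CamemBERT tokenizer and spreading word-level features onto subword pieces — can be sketched roughly as below. This is a hypothetical illustration, not the project's actual code: `spread_features` and the `subword_lengths` argument stand in for the piece counts a CamemBERT (SentencePiece) tokenizer would produce.

```python
# Hypothetical sketch: copy each word's features onto its subword pieces,
# marking continuation pieces so that B- tags are not duplicated.

def spread_features(words, features, subword_lengths):
    """Repeat each word's feature dict across its subword pieces."""
    pieces = []
    for word, feats, n in zip(words, features, subword_lengths):
        for i in range(n):
            piece = dict(feats)
            piece["word"] = word
            piece["is_continuation"] = i > 0  # only the first piece keeps a B- tag
            pieces.append(piece)
    return pieces

pieces = spread_features(
    ["Règlement", "de", "zonage"],
    [{"tag": "B-Titre"}, {"tag": "I-Titre"}, {"tag": "I-Titre"}],
    [3, 1, 2],  # made-up piece counts, e.g. "Règlement" splits into 3 pieces
)
print(len(pieces))  # 6
```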

Segmentation results
====================

- Things that helped
- RNN helps overall, particularly on unseen data (using the
"patches" as a test set)
- use all the manually created features and embed them with >=4 dimensions
- deltas and delta-deltas
- scale all the things by page size (slightly less good than by
abs(max(feats)) but probably more robust)
- upweight B- tags by 2.0
- weight all tags by inverse frequency (works even better than B- * 2.0)
- taking the best model using f1_macro (requires ensemble or dev set)
- ensemble of cross-validation folds (allows early stopping as well)
- in *theory* dropout would give us this benefit too, but in practice it did not
- Training CRF on top of pre-trained RNN
- Don't constrain transitions (see below)
- Do freeze all RNN parameters
- Can just do it for one epoch if you want (if not, save the RNN outputs...)
- Inconclusive
- GRU or plain RNN with lower learning rate
- LSTM is maybe overparameterized?
- Improves label accuracy quite a lot but mean F1 not really
- This seems to be a consequence of lower learning rate, not cell type
- LayoutLM
- pretrained on wrong language
- layout features possibly suboptimal for this task
- but need to synchronize evaluation metrics to be sure!
- Things that did not help
- CamemBERT tokenizer doesn't work well for CRFs, possibly due to:
- all subwords have the same position, so layout features are wrong
- hand-crafted features maybe don't work the same on subwords (the leading SentencePiece underscore)
- weighting classes by inverse frequency (just upweight B as it's what we care about)
- more LSTM layers
- much wider LSTM
- much narrower LSTM
- dropout on LSTM layers
- extra feedforward layer
- dropout on extra feedforward layer
- wider word embeddings
- CRF output layer (trained end-to-end)
- Training is *much* slower
- Raw accuracy is consistently a bit better.
- Macro-F1 though is not as good (over B- tags)
- Imbalanced data is an issue and weighting is more difficult
- Definitely weight transitions and emissions (helps)
- Have to weight "up", can't weight "down"
- Weighting by exp(1.0 / count) better than nothing
- Weighting by exp(1.0 / B-count) not helpful
- Weighting by exp(1.0 / (B-count + I-count)) not helpful
- Applying Viterbi to RNN output shows why
- Sequence constraints favour accuracy of I over B
- Weighted RNN training favours correct Bs, but Is can change
mid-sequence, which we don't care about
- Decoding with constraints forces B and I to agree, improving
overall accuracy by fixing incorrect Is but flipping some
correct Bs in the process
- Confirmed, Viterbi with --labels bonly gives (nearly) same
results as non-Viterbi
- Training RNN-CRF with --labels bonly
- Not sure why, since it does help for the discrete CRF?!
- Things yet to be tried
- pre-trained or pre-computed word embeddings
- label smoothing
- feedforward layer before RNN
- dropout in other places
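The inverse-frequency tag weighting noted under "Things that helped" can be sketched as follows. This is a minimal illustration using the sklearn-style "balanced" formula; the exp(1.0 / count) variant mentioned above differs only in the formula, and the tag counts here are made up.

```python
from collections import Counter

def inverse_frequency_weights(tags):
    """Per-tag weights proportional to inverse frequency: rare B- tags
    get upweighted, frequent I-/O tags stay near (or below) 1."""
    counts = Counter(tags)
    total = len(tags)
    n_classes = len(counts)
    return {tag: total / (n_classes * count) for tag, count in counts.items()}

tags = ["O"] * 90 + ["B-Titre"] * 2 + ["I-Titre"] * 8
w = inverse_frequency_weights(tags)
print(round(w["B-Titre"] / w["O"], 1))  # 45.0 — B- weighted 45x relative to O
```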
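The constrained-decoding behaviour described above (forcing B and I to agree, sometimes flipping a token to B to license a following I) can be illustrated with a toy Viterbi decoder. The emission scores below are invented; the real pipeline would run this over RNN log-probabilities rather than hand-written dicts.

```python
import math

def viterbi_bio(emissions, labels):
    """Decode the best label sequence from per-token log-probabilities,
    allowing I-X only after B-X or I-X (and never at sequence start)."""
    def ok(prev, cur):
        if cur.startswith("I-"):
            return prev[:2] in ("B-", "I-") and prev[2:] == cur[2:]
        return True

    # Initialize: I- tags cannot start a sequence.
    trellis = [{
        lab: (-math.inf if lab.startswith("I-") else emissions[0][lab], None)
        for lab in labels
    }]
    for t in range(1, len(emissions)):
        col = {}
        for cur in labels:
            score, back = max(
                ((trellis[-1][prev][0]
                  + (emissions[t][cur] if ok(prev, cur) else -math.inf),
                  prev)
                 for prev in labels),
                key=lambda s: s[0],
            )
            col[cur] = (score, back)
        trellis.append(col)

    # Backtrace from the best final label.
    lab = max(labels, key=lambda l: trellis[-1][l][0])
    path = [lab]
    for t in range(len(trellis) - 1, 0, -1):
        lab = trellis[t][lab][1]
        path.append(lab)
    return path[::-1]

# Token 0 slightly prefers I-T, but I-T cannot start a sequence, so the
# decoder flips it to B-T to license the strongly preferred I-T at token 1.
emissions = [
    {"O": -0.1, "B-T": -3.0, "I-T": -0.05},
    {"O": -5.0, "B-T": -4.0, "I-T": -0.1},
]
print(viterbi_bio(emissions, ["O", "B-T", "I-T"]))  # ['B-T', 'I-T']
```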

Documentation
-------------

@@ -69,9 +139,7 @@ Unprioritized future stuff
- have to hack the crfsuite file to do this
- not at all easy to do with sklearn-crfsuite magical pickling
- otherwise ... treat I-B as B-B when following O or I-A (as before)
- workflow for correcting individual pages
- convenience functions for "visual debugging" in pdfplumber style
- instructions to identify and extract CSV for page
- investigate using a different CRF library
- tune regularization (some more)
- compare memory footprint of main branch versus html_output
- levels of lists
10 changes: 7 additions & 3 deletions alexi/__init__.py
@@ -23,7 +23,7 @@
from .label import Identificateur
from .search import search
from .segment import DEFAULT_MODEL as DEFAULT_SEGMENT_MODEL
from .segment import Segmenteur
from .segment import RNNSegmenteur, Segmenteur

LOGGER = logging.getLogger("alexi")
VERSION = "0.4.0"
@@ -59,7 +59,11 @@ def convert_main(args: argparse.Namespace):

def segment_main(args: argparse.Namespace):
"""Segmenter un CSV"""
crf = Segmenteur(args.model)
crf: Segmenteur
if args.model.suffix == ".pt":
crf = RNNSegmenteur(args.model)
else:
crf = Segmenteur(args.model)
reader = csv.DictReader(args.csv)
write_csv(crf(reader), sys.stdout)

@@ -140,7 +144,7 @@ def make_argparse() -> argparse.ArgumentParser:
"segment", help="Segmenter et étiquetter les segments d'un CSV"
)
segment.add_argument(
"--model", help="Modele CRF", type=Path, default=DEFAULT_SEGMENT_MODEL
"--model", help="Modele CRF ou RNN", type=Path, default=DEFAULT_SEGMENT_MODEL
)
segment.add_argument(
"csv",
13 changes: 8 additions & 5 deletions alexi/convert.py
@@ -5,9 +5,8 @@
import logging
import operator
from collections import deque
from io import BufferedReader, BytesIO
from pathlib import Path
from typing import Any, Iterable, Iterator, Optional, TextIO, Union
from typing import Any, Iterable, Iterator, Optional, TextIO

from pdfplumber import PDF
from pdfplumber.page import Page
@@ -123,15 +122,17 @@ def get_word_features(

class Converteur:
pdf: PDF
path: Path
tree: Optional[PDFStructTree]
y_tolerance: int

def __init__(
self,
path_or_fp: Union[str, Path, BufferedReader, BytesIO],
path: Path,
y_tolerance: int = 2,
):
self.pdf = PDF.open(path_or_fp)
self.pdf = PDF.open(path)
self.path = path
self.y_tolerance = y_tolerance
try:
# Get the tree for the *entire* document since elements
@@ -180,7 +181,9 @@ def extract_words(self, pages: Optional[Iterable[int]] = None) -> Iterator[T_obj
continue
if word["x1"] > page.width or word["bottom"] > page.height:
continue
yield get_word_features(word, page, chars, elmap)
feats = get_word_features(word, page, chars, elmap)
feats["path"] = str(self.path)
yield feats

def make_bloc(
self, el: PDFStructElement, page_number: int, mcids: Iterable[int]
11 changes: 8 additions & 3 deletions alexi/extract.py
@@ -19,7 +19,7 @@
from alexi.label import Identificateur
from alexi.link import Resolver
from alexi.segment import DEFAULT_MODEL as DEFAULT_SEGMENT_MODEL
from alexi.segment import DEFAULT_MODEL_NOSTRUCT, Segmenteur
from alexi.segment import DEFAULT_MODEL_NOSTRUCT, RNNSegmenteur, Segmenteur
from alexi.types import T_obj

LOGGER = logging.getLogger("extract")
@@ -39,7 +39,7 @@ def add_arguments(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
help="Ne pas utiliser le CSV de référence",
action="store_true",
)
parser.add_argument("--segment-model", help="Modele CRF", type=Path)
parser.add_argument("--segment-model", help="Modele CRF/RNN", type=Path)
parser.add_argument(
"--label-model", help="Modele CRF", type=Path, default=DEFAULT_LABEL_MODEL
)
@@ -329,6 +329,8 @@ def make_doc_tree(docs: list[Document], outdir: Path) -> dict[str, dict[str, str


class Extracteur:
crf: Segmenteur

def __init__(
self,
outdir: Path,
@@ -340,7 +342,10 @@ def __init__(
self.outdir = outdir
self.crf_s = Identificateur()
if segment_model is not None:
self.crf = Segmenteur(segment_model)
if segment_model.suffix == ".pt":
self.crf = RNNSegmenteur(segment_model)
else:
self.crf = Segmenteur(segment_model)
self.crf_n = None
else:
self.crf = Segmenteur(DEFAULT_SEGMENT_MODEL)
2 changes: 2 additions & 0 deletions alexi/link.py
Original file line number Diff line number Diff line change
@@ -71,6 +71,8 @@ def __call__(
return self.resolve_internal(text, srcpath, doc)

def resolve_zonage(self, text: str, srcpath: str) -> Optional[str]:
if self.metadata.get("zonage") is None:
return None
m = MILIEU_RE.search(text)
if m is None:
return None
Binary file modified alexi/models/crfseq.joblib.gz