Visual FUDGE: Form Understanding via Dynamic Graph Editing

This is the code for our ICDAR 2021 paper "Visual FUDGE: Form Understanding via Dynamic Graph Editing" (http://arxiv.org/abs/2105.08194)

Video: https://youtu.be/dUZvm8MP-58

This code is licensed under GNU GPL v3. If you would like it distributed to you under a different license, please contact me (briandavis@byu.net).

[Figure: example FUNSD result and NAF result]

Requirements

Pre-trained Model weights

See pretrained.tar.gz in "Releases"
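After downloading, the archive can be unpacked with the usual tar command (where to place the extracted checkpoints depends on which config you point at them):

tar -xzf pretrained.tar.gz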

Reproducibility instructions

Getting the datasets

NAF: see https://github.com/herobd/NAF_dataset

FUNSD: see https://guillaumejaume.github.io/FUNSD/

The configs expect the datasets to be at ../data/NAF_dataset/ and ../data/FUNSD/
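A sketch of setting this up (the NAF clone URL is its repository above; FUNSD is downloaded manually from its site):

mkdir -p ../data
git clone https://github.com/herobd/NAF_dataset.git ../data/NAF_dataset
# download FUNSD from the link above and extract it to ../data/FUNSD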

Pretraining the detector networks

FUNSD: python train.py -c configs/cf_FUNSDLines_detect_augR_staggerLighter.json

NAF: python train.py -c configs/cf_NAF_detect_augR_staggerLighter.json

Training the full networks

FUNSD: python train.py -c configs/cf_FUNSDLines_pair_graph663rv_new.json

Word-FUDGE: python train.py -c configs/cf_FUNSDLinesAndWords_pair_graph663rv_new.json

NAF: python train.py -c configs/cf_NAF_pair_graph663rv_new.json

The ablation uses the following configs:

  • cf_FUNSDLines_pair_binary333rv_new.json
  • cf_FUNSDLines_pair_graph9rv_ablate.json
  • cf_FUNSDLines_pair_graph77rv_ablate.json
  • cf_FUNSDLines_pair_graph222rv_ablate.json
  • cf_NAF_pair_binary333rv_new.json

Wait, how long does this take to train?

If trained to the full 700,000 iterations, it takes a couple of weeks, depending on your GPU. I used a batch size of 1 due to hardware limitations. The batch size of 1 is also hard-coded, so you have to use it as well (GCNs handle batches specially and I didn't want to code that up).

However, based on an experiment I ran, I think you can get the same results with only 250,000 iterations by accumulating the gradient to simulate a batch size of 5. This is done by adding "accum_grad_steps": 5 to the trainer section of the config JSON. Yes, that means it only updates the weights 50,000 times. It never hurts to train a bit more; it doesn't overfit in my experience.
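For example, the trainer section of a config would then contain (a sketch; other trainer keys stay as in the example config shown later in this README):

  "trainer": {
      "accum_grad_steps": 5,              # accumulate gradients over 5 iterations per weight update
      "iterations": 250000                # stop iteration, per the note above
  }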

Evaluating

If you want to run on GPU, add -g #, where # is the GPU number.

Remove the -T flag to run on the validation set.

Generally (works for detection and full model): python eval.py -c path/to/checkpoint.pth -T

Word-FUDGE needs to be told to evaluate using the GT word boxes: python eval.py -c path/to/checkpoint.pth -T -a useDetect=word_bbs

For the ablation using line-of-sight proposal: python eval.py -c path/to/checkpoint.pth -T -a model=change_relationship_proposal=line_of_sight

For the ablation preventing merges: python eval.py -c path/to/checkpoint.pth -T -a model=graph_config=0=merge_thresh=1.1,model=graph_config=1=merge_thresh=1.1,model=graph_config=2=merge_thresh=1.1

To compare to DocStruct, -a gtGroups=1,useDetect=1 is used.
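Putting the pieces together, the DocStruct comparison on the test set would look like:

python eval.py -c path/to/checkpoint.pth -T -a gtGroups=1,useDetect=1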

Usage

train.py

This is the script that executes training based on a configuration file. The training code is found in trainer/. The config file specifies which trainer is used.

The usage is: python train.py -c CONFIG.json (see below for example config file)

A training session can be resumed with: python train.py -r CHECKPOINT.pth

If you want to override the config file on a resume, also use the -c flag and be sure the config has "override": true.
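For example (configs/my_override.json is a hypothetical config that includes "override": true):

python train.py -r CHECKPOINT.pth -c configs/my_override.json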

eval.py

This script runs a trained model (from a snapshot) through the dataset and prints its scores. It is also used to save images with the predictions on them.

Usage: python eval.py -c CHECKPOINT.pth -f OVERRIDE_CONFIG.json -g (gpu number) -n (number of images to save) -d (directory to save images) -T

The only required flag is -c or -f.

If -T is omitted it will run on the validation set instead of the test set.

If you want it to generate images (like in the paper), use both the -d and -n flags.

There is an additional -a flag which allows overriding specific values of the config file using the format key1=nestedkey=value,key2=value. It also allows setting these special options (which are part of the config); a combined example follows the list below:

Evaluating detector:

  • -a pretty=true: Makes the drawn picture cleaner (fewer details)

Evaluating pairing:

  • -a useDetect=True|word_bbs: Whether to use GT detection line boxes or word boxes
  • -a gtGroups=True: Force use of GT groupings (for DocStruct comparison)
  • -a draw_verbosity=0-3: Different ways of displaying the results.
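Combining several of these, evaluating a Word-FUDGE model on the test set while saving 10 annotated images might look like (the output directory and the verbosity choice here are arbitrary):

python eval.py -c path/to/checkpoint.pth -T -n 10 -d out_images -a useDetect=word_bbs,draw_verbosity=2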

run.py

This will run a model on a single input image and then produce an annotated image.

Usage: python run.py input/image.png output.png -c path/to/checkpoint.pth

If running the NAF model, you'll also want to include the argument --scale-image 0.52 to resize the image appropriately. If you run a detection model, add the -d flag (note: it won't perform non-maximal suppression when doing this).
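For example, running the NAF model on an image (the checkpoint path is a placeholder):

python run.py input/image.png output.png -c path/to/NAF_checkpoint.pth --scale-image 0.52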

File Structure

This code is based on victoresque's PyTorch template.


│
├── train.py - Training script
├── eval.py - Evaluation and display script
│
├── configs/ - where the config files are
│
├── base/ - abstract base classes
│   ├── base_data_loader.py - abstract base class for data loaders
│   ├── base_model.py - abstract base class for models
│   └── base_trainer.py - abstract base class for trainers
│
├── data_loader/ - 
│   └── data_loaders.py - This provides access to all the dataset objects
│
├── datasets/ - default datasets folder
│   ├── box_detect.py - base class for detection datasets
│   ├── forms_box_detect.py - detection for NAF dataset
│   ├── funsd_box_detect.py - detection for FUNSD dataset
│   ├── graph_pair.py - base class for pairing datasets
│   ├── forms_graph_pair.py - pairing for NAF dataset
│   ├── funsd_graph_pair.py - pairing for FUNSD dataset
│   └── test*.py - scripts to test the datasets and display the images for visual inspection
│
├── logger/ - for training process logging
│   └── logger.py
│
├── model/ - models, losses, and metrics
│   ├── binary_pair_real.py - Provides the classifying network for pairing and the final prediction network for detection. Can also have a secondary classifier that uses only non-visual features.
│   ├── coordconv.py - Implements a few variations of CoordConv. I didn't get better results using it.
│   ├── csrc/ - Contains Facebook's implementation for ROIAlign from https://github.com/facebookresearch/maskrcnn-benchmark
│   ├── roi_align.py - End point for ROIAlign code
│   ├── loss.py - Imports all loss functions
│   ├── net_builder.py - Defines basic layers and interprets config syntax into networks.
│   ├── optimize.py - pairing decision optimization code
│   ├── pairing_graph.py - pairing network class
│   ├── simpleNN.py - defines non-convolutional network
│   ├── yolo_box_detector.py - detector network class
│   └── yolo_loss.py - loss used by detector
│
├── saved/ - default checkpoints folder
│
├── trainer/ - trainers
│   ├── box_detect_trainer.py - detector training code
│   └── graph_pair_trainer.py - pairing training code
│
├── evaluators/ - used to evaluate the models
│   ├── draw_graph.py - draws the predictions onto the image
│   ├── funsdboxdetect_eval.py - for detectors
│   └── funsdgraphpair_eval.py - for pairing networks
│
└── utils/
    ├── util.py
    ├── augmentation.py - coloring and contrast augmentation
    ├── crop_transform.py - handles random cropping, especially tracking which text is cropped
    ├── forms_annotations.py - functions for processing NAF dataset
    ├── funsd_annotations.py - functions for processing FUNSD dataset
    ├── group_pairing.py - helper functions dealing with groupings
    ├── img_f.py - I originally used OpenCV, but was running into issues with the Anaconda installation. This wraps SciKit Image with OpenCV function signatures.
    └── yolo_tools.py - Non-maximal suppression and pred-to-GT aligning functions

Config file format

Config files are in .json format. Example:

{
  "name": "pairing",                      # Checkpoints will be saved in saved/name/checkpoint-...pth
  "cuda": true,                           # Whether to use GPU
  "gpu": 0,                               # GPU number. Only single GPU supported.
  "save_mode": "state_dict",              # Whether to save/load just state_dict, or whole object in checkpoint
  "override": true,                       # Override a checkpoints config
  "super_computer":false,                 # Whether to mute training info printed
  "data_loader": {
      "data_set_name": "FormsGraphPair",  # Class of dataset
      "special_dataset": "simple",        # Use partial dataset. "simple" is the set used for pairing in the paper
      "data_dir": "../data/NAF_dataset",  # Directory of dataset
      "batch_size": 1,
      "shuffle": true,
      "num_workers": 1,
      "crop_to_page":false,
      "color":false,
      "rescale_range": [0.4,0.65],        # Form images are randomly resized in this range
      "crop_params": {
          "crop_size":[652,1608],         # Crop size for training instance
      "pad":0
      },
      "no_blanks": true,                  # Removed fields that are blank
      "swap_circle":true,                 # Treat text that should be circled/crossed-out as pre-printed text
      "no_graphics":true,                 # Images not considered elements
      "cache_resized_images": true,       # Cache images at maximum size of rescale_range to make reading them faster
      "rotation": false,                  # Bounding boxes are converted to axis-aligned rectangles
      "only_opposite_pairs": true         # Only label-value pairs


  },
  "validation": {                         # Enherits all values from data_loader, specified values are changed
      "shuffle": false,
      "rescale_range": [0.52,0.52],
      "crop_params": null,
      "batch_size": 1
  },

  
  "lr_scheduler_type": "none",

  "optimizer_type": "Adam",
  "optimizer": {                          # Any parameters of the optimizer object go here
      "lr": 0.001,
      "weight_decay": 0
  },
  "loss": {                               # Name of functions (in loss.py) for various components
      "box": "YoloLoss",                  # Detection loss
      "edge": "sigmoid_BCE_loss",         # Pairing loss
      "nn": "MSE",                        # Num neighbor loss
      "class": "sigmoid_BCE_loss"         # Class of detections loss
  },
  "loss_weights": {                       # Respective weighting of losses (multiplier)
      "box": 1.0,
      "edge": 0.5,
      "nn": 0.25,
      "class": 0.25
  },
  "loss_params": 
      {
          "box": {"ignore_thresh": 0.5,
                  "bad_conf_weight": 20.0,
                  "multiclass":true}
      },
  "metrics": [],
  "trainer": {
      "class": "GraphPairTrainer",        # Training class name 
      "iterations": 125000,               # Stop iteration
      "save_dir": "saved/",               # save directory
      "val_step": 5000,                   # Run validation set every X iterations
      "save_step": 25000,                 # Save distinct checkpoint every X iterations
      "save_step_minor": 250,             # Save 'latest' checkpoint (overwrites) every X iterations
      "log_step": 250,                    # Print training metrics every X iterations
      "verbosity": 1,
      "monitor": "loss",
      "monitor_mode": "none",
      "warmup_steps": 1000,               # Defines length of ramp up from 0 learning rate
      "conf_thresh_init": 0.5,            
      "conf_thresh_change_iters": 0,      # Allows slowly lowering of detection conf thresh from higher value
      "retry_count":1,

      "unfreeze_detector": 2000,          # Iteration to unfreeze detector network
      "partial_from_gt": 0,               # Iteration to start using detection predictions
      "stop_from_gt": 20000,              # When to maximize predicted detection use
      "max_use_pred": 0.5,                # Maximum predicted detection use
      "use_all_bb_pred_for_rel_loss": true,

      "use_learning_schedule": true,
      "adapt_lr": false
  },
  "arch": "PairingGraph",                 # Class name of model
  "model": {
      "detector_checkpoint": "saved/detector/checkpoint-iteration150000.pth",
      "conf_thresh": 0.5,
      "start_frozen": true,
  "use_rel_shape_feats": "corner",
      "use_detect_layer_feats": 16,       # Assumes this is from final level of detection network
      "use_2nd_detect_layer_feats": 0,    # Specify conv after pool
      "use_2nd_detect_scale_feats": 2,    # Scale (from pools)
      "use_2nd_detect_feats_size": 64,
      "use_fixed_masks": true,
      "no_grad_feats": true,

      "expand_rel_context": 150,          # How much to pad around relationship candidates before passing to conv layers
      "featurizer_start_h": 32,           # Size ROIPooling resizes relationship crops to
      "featurizer_start_w": 32,
      "featurizer_conv": ["sep128","M","sep128","sep128","M","sep256","sep256","M",238], # Network for featurizing relationship, see below for syntax
      "featurizer_fc": null,

      "pred_nn": true,                    # Predict a new num neighbors for detections
      "pred_class": false,                # Predict a new class for detections
      "expand_bb_context": 150,           # How much to pad around detections
      "featurizer_bb_start_h": 32,        # Size ROIPooling resizes detection crops to
      "featurizer_bb_start_w": 32,
      "bb_featurizer_conv": ["sep64","M","sep64","sep64","M","sep128","sep128","M",250], # Network for featurizing detections

      "graph_config": {
          "arch": "BinaryPairReal",
          "in_channels": 256,
          "layers": ["FC256","FC256"],    # Relationship classifier
          "rel_out": 1,                   # one output, probability of true relationship
          "layers_bb": ["FC256"]          # Detection predictor
          "bb_out": 1,                    # one output, num neighbors
      }
  }
}

Config network layer syntax:

  • [int]: Regular 3x3 convolution with specified output channels, normalization (if any), and ReLU
  • "ReLU"
  • "drop[float]"/"dropout[float]": Dropout, if no float amount is 0.5
  • "M"": Maxpool (2x2)
  • "R[int]": Residual block with specified output channels, two 3x3 convs with correct ReLU+norm ordering (expects non-acticated input)
  • "k[int]-[int]": conv, norm, relu. First int specifies kernel size, second specifier output channels.
  • "d[int]-[int]": dilated conv, norm, relu. First int specifies dilation, second specifier output channels.
  • "[h/v]d[int]-[int]": horizontal or vertical dilated conv, norm, relu (horizontal is 1x3 and vertical is 3x1 kernel). First int specifies dilation, second specifier output channels.
  • "sep[int]": Two conv,norm,relu blocks, the first is depthwise seperated, the second is (1x1). The int is the out channels
  • "cc[str]-k[int],d[int],[hd/vd]-[int]": CoordConv, str is type, k int is kernel size (default 3), d is dilation size (default 1), hd makes it horizontal (kernel is height 1), vd makes it vertical, final int is out channels
  • "FC[int]": Fully-connected layer with given output channels

The checkpoints will be saved in save_dir/name.

The config file is saved in the same folder (as a reference only; on a resume the config is loaded from the checkpoint).

Note: checkpoints contain:

{
  'arch': arch,
  'iteration': iteration,
  'logger': self.train_logger,
  'state_dict': self.model.state_dict(),
  'swa_state_dict': self.swa_model.state_dict(),
  'optimizer': self.optimizer.state_dict(),
  'config': self.config
}
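A minimal sketch of inspecting one of these checkpoints with PyTorch (the path is a placeholder; loading may require the repository's classes to be importable if whole objects were pickled rather than state_dicts):

import torch

# Load onto the CPU so no GPU is needed just to inspect the checkpoint.
checkpoint = torch.load('path/to/checkpoint.pth', map_location='cpu')
print(checkpoint['arch'], checkpoint['iteration'])
config = checkpoint['config']            # config the model was trained with
state_dict = checkpoint['state_dict']    # model weights (when save_mode is "state_dict")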
