Skip to content
This repository has been archived by the owner on Mar 30, 2022. It is now read-only.

Latest commit

 

History

History
179 lines (138 loc) · 6.62 KB

training.md

File metadata and controls

179 lines (138 loc) · 6.62 KB

ScienceBeam Parser Model Training

ScienceBeam Parser uses machine learning models to parse documents. Pre-trained models are provided and referenced by the configuration. In order to get performance for the particular domain, it may be necessary to train models with domain specific documents.

For the sequence model (delft etc) the general workflow looks like:

  • Generate training data
  • Annotate generated training data
  • Train model (using sciencebeam-trainer-delft)
  • Use and evaluate model:
    • Configure new model in sciencebeam-parser
    • Convert documents
    • Evaluate converted documents (using sciencebeam-judge)

Generate training data

The training data for the sequential models follows the GROBID training data format.

Note: all commands below also support input and output files in google cloud storage (using gs:// for the path url)

Generate tei training data for the sequence models

Currently training data will be generated for the following models:

  • segmentation
  • header
  • affiliation_address
  • fulltext
  • reference_segmenter
  • citation (references)
  • figure
  • table
  • name (author names for header and citations)
python -m sciencebeam_parser.training.cli.generate_data \
    --source-path="test-data/*.pdf" \
    --output-path="./data/generated-training-data"

Using the configured models to pre-annotate the training data:

python -m sciencebeam_parser.training.cli.generate_data \
    --use-model \
    --source-path="test-data/*.pdf" \
    --output-path="./data/generated-training-data"

Note: as the models are hierachical, the parent model needs to be used in order to generate data for the child model. For example the segmentation model will be required for the header model.

The output could also be organised into a folder structure by model and type of file:

python -m sciencebeam_parser.training.cli.generate_data \
    --use-model \
    --use-directory-structure \
    --source-path="test-data/*.pdf" \
    --output-path="./data/generated-training-data"

Additionally the --gzip argument can be passed in, resulting in gzip (.gz) compressed output files.

Annotating tei training data for the sequence models

After the tei training data has been generated, it should get reviewed and manually annotated. Alternatively it can auto-annotated using sciencebeam-trainer-grobid-tools.

Generate delft training data for the sequence models

From the annotated tei training data, we can generate the delft training data used for training.

It will do one of the following:

  • For models with only tei XML files (no layout feature), it will parse the tei and generate data using the data generator.
  • For models with additional layout data files, it will align the parsed tei with the layout data file and add the label to it.

Example command for segmentation model

python -m sciencebeam_parser.training.cli.generate_delft_data \
    --model-name="segmentation" \
    --tei-source-path="data/generated-training-data/segmentation/corpus/tei/*.tei.xml" \
    --raw-source-path="data/generated-training-data/segmentation/corpus/raw/" \
    --delft-output-path="./data/generated-training-data/delft/segmentation/corpus/segmentation.data"

Example command for header model

python -m sciencebeam_parser.training.cli.generate_delft_data \
    --model-name="header" \
    --tei-source-path="data/generated-training-data/header/corpus/tei/*.tei.xml" \
    --raw-source-path="data/generated-training-data/header/corpus/raw/" \
    --delft-output-path="./data/generated-training-data/delft/header/corpus/header.data"

Example command for fulltext model

python -m sciencebeam_parser.training.cli.generate_delft_data \
    --model-name="fulltext" \
    --tei-source-path="data/generated-training-data/fulltext/corpus/tei/*.tei.xml" \
    --raw-source-path="data/generated-training-data/fulltext/corpus/raw/" \
    --delft-output-path="./data/generated-training-data/delft/fulltext/corpus/fulltext.data"

Example command for figure model

python -m sciencebeam_parser.training.cli.generate_delft_data \
    --model-name="figure" \
    --tei-source-path="data/generated-training-data/figure/corpus/tei/*.tei.xml" \
    --raw-source-path="data/generated-training-data/figure/corpus/raw/" \
    --delft-output-path="./data/generated-training-data/delft/figure/corpus/figure.data"

Example command for table model

python -m sciencebeam_parser.training.cli.generate_delft_data \
    --model-name="table" \
    --tei-source-path="data/generated-training-data/table/corpus/tei/*.tei.xml" \
    --raw-source-path="data/generated-training-data/table/corpus/raw/" \
    --delft-output-path="./data/generated-training-data/delft/table/corpus/table.data"

Example command for reference_segmenter model

python -m sciencebeam_parser.training.cli.generate_delft_data \
    --model-name="reference_segmenter" \
    --tei-source-path="data/generated-training-data/reference-segmenter/corpus/tei/*.tei.xml" \
    --raw-source-path="data/generated-training-data/reference-segmenter/corpus/raw/" \
    --delft-output-path="./data/generated-training-data/delft/reference-segmenter/corpus/reference-segmenter.data"

Example command for affiliation_address model

python -m sciencebeam_parser.training.cli.generate_delft_data \
    --model-name="affiliation_address" \
    --tei-source-path="data/generated-training-data/affiliation-address/corpus/*.tei.xml" \
    --delft-output-path \
    "./data/generated-training-data/delft/affiliation-address/corpus/affiliation-address.data"

Example command for name model

python -m sciencebeam_parser.training.cli.generate_delft_data \
    --model-name="name_header" \
    --tei-source-path="data/generated-training-data/name/header/corpus/*.tei.xml" \
    --delft-output-path \
    "./data/generated-training-data/delft/name/header/corpus/name.data"
python -m sciencebeam_parser.training.cli.generate_delft_data \
    --model-name="name_citation" \
    --tei-source-path="data/generated-training-data/name/citation/corpus/*.tei.xml" \
    --delft-output-path \
    "./data/generated-training-data/delft/name/citation/corpus/name.data"

Example command for citation model

python -m sciencebeam_parser.training.cli.generate_delft_data \
    --model-name="citation" \
    --tei-source-path="data/generated-training-data/citation/corpus/*.tei.xml" \
    --delft-output-path \
    "./data/generated-training-data/delft/citation/corpus/citation.data"