This repository has been archived by the owner on Mar 29, 2022. It is now read-only.

ScienceBeam Trainer Tools for GROBID

⚠️ Under new stewardship

eLife have handed over stewardship of ScienceBeam to The Coko Foundation. You can now find the updated code repository at https://gitlab.coko.foundation/sciencebeam/sciencebeam-trainer-grobid-tools and continue the conversation on Coko's Mattermost chat server: https://mattermost.coko.foundation/

For more information on why we're doing this read our latest update on our new technology direction: https://elifesciences.org/inside-elife/daf1b699/elife-latest-announcing-a-new-technology-direction

Overview

Whereas sciencebeam-trainer-grobid is a lightweight wrapper around GROBID, intended to be used as a Docker container, this project provides additional tools to prepare training data for GROBID and to complete the process after training (e.g. building a new Docker container with the trained model).

The intention is to use cloud storage to pass data between the steps, but one could also just use a data volume.

Prerequisites

Recommended

Development

Example End-to-End

make example-data-processing-end-to-end

Uses a sample dataset and trains a GROBID model with it.

Note: the sample dataset is currently not public (but the intention is to provide a public dataset in the future)
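Assuming the end-to-end target simply chains together the individual steps described in the sections below, the sequence is approximately:

```shell
# Approximate sequence of the individual steps (each described below);
# the actual end-to-end make target may differ in detail.
make get-example-data
make generate-grobid-training-data
make auto-annotate-header
make copy-auto-annotate-header-training-data-to-tei
make train-header-model
```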

Get Example Data

make get-example-data

Downloads a sample dataset and prepares it in the data Docker volume.

Note: see above regarding dataset not being public at the moment.

Generate GROBID Training Data

make generate-grobid-training-data

Converts the previously downloaded PDFs from the data volume to GROBID training data. The generated TEI files will be stored in the tei-raw sub-directory of the corpus in the dataset. Training on those raw files would be of little use, as they only contain the annotations the model already produces. Usually one would review and correct the generated XML files following the annotation guidelines. The final TEI files should be stored in the tei sub-directory of the corpus. In our case we will use auto-annotation based on the JATS XML instead.
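As a sketch of the resulting layout (the tei-raw and tei directory names are from the steps above; the dataset path and file names are hypothetical examples):

```shell
#!/bin/sh
# Sketch of the corpus layout around this step. The tei-raw and tei
# directory names come from the README; the file names are hypothetical.
set -e
corpus=$(mktemp -d)   # stand-in for the dataset's corpus directory
mkdir -p "$corpus/tei-raw" "$corpus/tei"
# generate-grobid-training-data writes the model-generated files here:
touch "$corpus/tei-raw/article1.header.tei.xml"
# reviewed (or auto-annotated) files end up here, ready for training:
touch "$corpus/tei/article1.header.tei.xml"
find "$corpus" -mindepth 1 | sort
```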

Upload Dataset (optional)

make upload-dataset

Uploads the local dataset to the cloud. This makes it possible to run the individual steps separately.

Auto-annotate Header

make auto-annotate-header

Auto-annotates the files in tei-raw (produced by generate-grobid-training-data) using the corresponding JATS XML. The result is stored in tei-auto.

Copy Raw Header Training Data to TEI

make copy-auto-annotate-header-training-data-to-tei

This copies the auto-annotated TEI XML files from tei-auto to tei. Alternatively, you could review the files in tei-auto before copying them over.
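In plain shell terms (paths hypothetical; the make target is the supported way to do this), the copy step amounts to:

```shell
#!/bin/sh
# Hypothetical corpus path for illustration; the make target
# copy-auto-annotate-header-training-data-to-tei does this for you.
set -e
corpus=$(mktemp -d)   # stand-in for the dataset's corpus directory
mkdir -p "$corpus/tei-auto" "$corpus/tei"
touch "$corpus/tei-auto/article1.header.tei.xml"   # from auto-annotate-header
# promote the auto-annotated files to the final training directory:
cp "$corpus"/tei-auto/*.xml "$corpus"/tei/
ls "$corpus/tei"
```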

Train Header Model with Dataset

make train-header-model

Trains the header model on the dataset produced in the previous steps. The output is the trained GROBID header model.

Upload Header Model

make CLOUD_MODELS_PATH=gs://bucket/path/to/model upload-header-model

Uploads the final header model to a location in the cloud. This assumes that the necessary credentials are mounted into the container. Because the Google Cloud SDK also has some support for AWS S3, you could also specify an S3 location, e.g. `make CLOUD_MODELS_PATH=s3://bucket/path/to/model upload-header-model`.