Training datasets for GROBID sale catalogue models

Each directory of this repository contains datasets created to train GROBID sale catalogue models. Datasets are first divided by the institution that holds the original documents, then organized by author or auction house.

Annotated files are in the TEI-XML format.
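As an illustration of the organization described above, here is a minimal sketch that counts annotated files per holding institution and per author or auction house. The layout `<root>/<institution>/<author>/.../<file>.xml` and the function name `count_annotated_files` are assumptions made for this example; they are not taken from the repository's toolbox.

```python
from collections import Counter
from pathlib import Path


def count_annotated_files(root):
    """Count .xml files per (institution, author or auction house).

    Assumes a hypothetical layout: root/<institution>/<author>/.../<file>.xml.
    Files that sit higher in the tree than that are ignored.
    """
    root = Path(root)
    counts = Counter()
    for xml_file in root.rglob("*.xml"):
        parts = xml_file.relative_to(root).parts
        # parts = (institution, author, ..., file name)
        if len(parts) >= 3:
            counts[(parts[0], parts[1])] += 1
    return counts


if __name__ == "__main__":
    # "datasets" is the directory name used in this repository; the nesting
    # below it is assumed, so adjust the depth check above if it differs.
    for (institution, author), n in sorted(count_annotated_files("datasets").items()):
        print(f"{institution}/{author}: {n} file(s)")
```

Keying the counter on the first two path components keeps the sketch independent of how deep the files sit below the author or auction house directory.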

Naming convention

BnF files are named after their Gallica ARK identifier.
INHA files are named after the digital identifier ("identifiant numérique") given in their online catalogue record.

GROBID models

Segmentation: the segmentation model aims to produce a high-level segmentation of the catalogues.

Data quality

Before being pushed to the main branch, annotated files have been proofread at least once and are validated against an XSD schema by a GitHub Action.
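For a quick local check before pushing, a sketch along the following lines may help. It only tests XML well-formedness with the Python standard library, which is weaker than the XSD validation run by the GitHub Action; full schema validation would need a third-party library such as lxml. The function name `check_well_formed` is hypothetical and is not the repository's checker.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, root element name) if xml_text parses, else (False, error).

    This is a well-formedness check only; it does NOT validate against an XSD.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, f"not well-formed: {err}"
    # Strip a "{namespace}" prefix, if any, to get the local element name.
    local_name = root.tag.rsplit("}", 1)[-1]
    return True, local_name


if __name__ == "__main__":
    import sys

    # Usage: python check_xml.py file1.xml file2.xml ...
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            ok, detail = check_well_formed(handle.read())
        print(f"{path}: {'OK' if ok else 'ERROR'} ({detail})")
```

A file that passes this check can still fail the schema validation, so it is best treated as a first filter rather than a replacement for the automated check.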

Toolbox

This repository also contains a set of tools that can be used on the training sets.

PDF Preprocessing
Quality assessment
XML validity checker (used by a GitHub Action)

DataCatalogue/grobid-datacat-TrainingData

Folders and files

Latest commit

History

Repository files navigation

Training datasets for training GROBID sale catalogues models

Naming convention

GROBID models

Data quality

Toolbox

About

Topics

Resources

License

Stars

Watchers

Forks

Languages