GitHub - sonar-idh/nerdl: Named Entity Recognition, Disambiguation and Linking of Digitized Newspapers for Historical Network Analysis

About

The workflow includes the following steps:

1. Access images of digitized newspapers from Zefys
2. Apply OCR to the images using the OCR-D framework
3. Transform the OCR output into TSV format
4. Recognize named entities in the OCRed text
5. Disambiguate and link entities to Wikidata-IDs
6. Manually inspect or edit the results in a browser
7. Transform the results for use in a graph db

Prerequisites

To install and test the workflow, the following prerequisites must be met.

required

Python3

zefys

You need either local or remote access to the digitised newspaper images from Zefys

local

mkdir zefys
mount -o ro,noload /zefys/archive zefys

remote

Download images using the API

ocrd

Install OCR-D via ocrd-galley

git clone https://github.com/qurator-spk/ocrd-galley
cd ocrd-galley
./build

You can now use zdb2ocr to OCR digitised newspapers from Zefys based on their zdb-id (with any hyphens removed) and their date of issue (yyyymmdd):

zdb2ocr 27974534 19010712
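The two arguments can also be derived programmatically. The following is a minimal sketch, assuming the zdb-id is available in its hyphenated form and the date of issue as a Python date. The helper name `zdb2ocr_args` and the hyphenated id `2797453-4` are illustrative assumptions and not part of zdb2ocr.

```python
from datetime import date


def zdb2ocr_args(zdb_id: str, issue_date: date) -> list[str]:
    """Build the argument list for a zdb2ocr call (hypothetical helper)."""
    # Remove any hyphens from the zdb-id, as the command expects.
    clean_id = zdb_id.replace("-", "")
    # Format the date of issue as yyyymmdd.
    return ["zdb2ocr", clean_id, issue_date.strftime("%Y%m%d")]


# Illustrative input: hyphenated id and the issue date from the example above.
args = zdb2ocr_args("2797453-4", date(1901, 7, 12))
print(" ".join(args))
```

If zdb2ocr is on your PATH, the resulting list could be passed to `subprocess.run`.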

page2tsv

Install page2tsv

git clone https://github.com/qurator-spk/page2tsv
cd page2tsv
pip install .

You can now use page2tsv to transform the PAGE-XML output of the OCR into a tab-separated-values (tsv) format

page2tsv SNP27974534-19010712-0-1-0-0.xml SNP27974534-19010712-0-1-0-0.tsv

If the images are served via IIIF, the OCR coordinates can be used to generate the corresponding image URLs by also providing the --image-url option:

page2tsv SNP27974534-19010712-0-1-0-0.xml SNP27974534-19010712-0-1-0-0.tsv \
--image-url=https://content.staatsbibliothek-berlin.de/zefys/SNP27974534-19010712-0-1-0-0/full/full/0/default.jpg
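To illustrate how OCR coordinates relate to IIIF image URLs, here is a minimal sketch that swaps the region segment of a full-image URL for a pixel box. It assumes the URL follows the IIIF Image API pattern `{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The function name `iiif_region_url` and the coordinate values are hypothetical, and this is not necessarily how page2tsv builds its URLs internally.

```python
def iiif_region_url(full_url: str, left: int, top: int, width: int, height: int) -> str:
    """Replace the region segment of a IIIF Image API URL (hypothetical helper)."""
    # The last four path segments are region, size, rotation and quality.format.
    base, _region, size, rotation, quality = full_url.rsplit("/", 4)
    # IIIF expresses a pixel region as "x,y,w,h".
    region = f"{left},{top},{width},{height}"
    return "/".join([base, region, size, rotation, quality])


FULL = (
    "https://content.staatsbibliothek-berlin.de/zefys/"
    "SNP27974534-19010712-0-1-0-0/full/full/0/default.jpg"
)

# Illustrative box: 300x50 pixels at offset (100, 200).
print(iiif_region_url(FULL, 100, 200, 300, 50))
```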

⚠️ The following steps assume you have access to, or have set up local instances of, sbb_ner and sbb_ned.

sbb_ner

Apply named entity recognition with sbb_ner

page2tsv SNP27974534-19010712-0-1-0-0.tsv --ner-rest-endpoint

sbb_ned

Apply named entity disambiguation and linking with sbb_ned

page2tsv SNP27974534-19010712-0-1-0-0.tsv --ned-rest-endpoint

neat

Use the browser-based neat to inspect, correct or annotate tsv files

git clone https://github.com/qurator-spk/neat
cd neat
firefox neat.html

trs

Install trs

git clone https://github.com/sonar-idh/Transformer

Follow the instructions provided

TSV documentation

Information provided by the tsv filename:

SNP{zdb-id}-{yyyymmdd}-{issue}-{page}-{article}-{version}.tsv

zdb-id (any - removed)
date of issue (yyyymmdd)
issue number (0 = morning issue, 1 = evening issue etc., default 0)
page/image number
article id (not used, default 0)
version number (not used, default 0)

Example: SNP27974534-19010712-0-1-0-0.tsv
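The filename schema can be parsed with a regular expression. The following is a minimal sketch, assuming all fields are purely numeric as in the example above. The names `TsvName` and `parse_tsv_name` are illustrative and not part of any of the tools.

```python
import re
from typing import NamedTuple


class TsvName(NamedTuple):
    zdb_id: str   # zdb-id with any hyphens removed
    date: str     # date of issue, yyyymmdd
    issue: int    # 0 = morning issue, 1 = evening issue, ...
    page: int     # page/image number
    article: int  # not used, default 0
    version: int  # not used, default 0


# Assumption: every field consists of digits only.
PATTERN = re.compile(r"SNP(\d+)-(\d{8})-(\d+)-(\d+)-(\d+)-(\d+)\.tsv")


def parse_tsv_name(filename: str) -> TsvName:
    """Split a tsv filename into its documented fields."""
    match = PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"unexpected tsv filename: {filename!r}")
    zdb_id, day, *rest = match.groups()
    return TsvName(zdb_id, day, *(int(x) for x in rest))


print(parse_tsv_name("SNP27974534-19010712-0-1-0-0.tsv"))
```

The zdb-id and date are kept as strings so that leading zeros, if any, are preserved.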

Information provided in the tsv file columns:

iiif_url placeholder injected as a comment under the column headers
No. indicates the token's position within its sentence (≥1; 0 marks sentence boundaries)
TOKEN contains the token text (utf-8 encoded)
NE-TAG contains the surface entity label (BIO chunking)
NE-EMB contains the embedded entity label (BIO chunking)
ID contains the surface entity wikidata ID (ranked candidates are separated by |)
url_id is replaced with the iiif_url
left,top,width,height hold the token OCR coordinates as absolute pixel values

Example (see also the example folder):

No.  TOKEN       NE-TAG  NE-EMB  ID         url_id  left  top  width  height
# https://iiif.url
36   bekannter   O       O       -          -       157   181  643    660
37   Comédie     B-ORG   B-LOC   Q61460498  -       197   262  643    661
38   françaiſe   I-ORG   I-LOC   Q61460498  -       277   345  642    661
39   anvertraut  O       O       -          -       359   440  644    659
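To show how the columns can be consumed, here is a minimal sketch that reads such a file with Python's csv module, skips the comment line, and groups BIO-tagged tokens from the NE-TAG column into surface entities. It assumes the file is tab-separated and that comment lines start with `#`. The sample string re-types the rows above, and the function names are hypothetical.

```python
import csv
import io

# The example rows from above, re-typed as tab-separated text.
SAMPLE = (
    "No.\tTOKEN\tNE-TAG\tNE-EMB\tID\turl_id\tleft\ttop\twidth\theight\n"
    "# https://iiif.url\n"
    "36\tbekannter\tO\tO\t-\t-\t157\t181\t643\t660\n"
    "37\tComédie\tB-ORG\tB-LOC\tQ61460498\t-\t197\t262\t643\t661\n"
    "38\tfrançaiſe\tI-ORG\tI-LOC\tQ61460498\t-\t277\t345\t642\t661\n"
    "39\tanvertraut\tO\tO\t-\t-\t359\t440\t644\t659\n"
)


def read_rows(text):
    """Parse tsv text into dicts keyed by column header, skipping comments."""
    lines = [ln for ln in io.StringIO(text) if not ln.startswith("#")]
    return list(csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE))


def surface_entities(rows):
    """Group B-/I- tagged tokens into (type, text, candidate ids) tuples."""
    entities = []
    current = None
    for row in rows:
        tag = row["NE-TAG"]
        if tag.startswith("B-"):
            # A new entity starts; ranked candidates are separated by "|".
            current = {"type": tag[2:], "tokens": [row["TOKEN"]],
                       "ids": row["ID"].split("|")}
            entities.append(current)
        elif tag.startswith("I-") and current is not None:
            current["tokens"].append(row["TOKEN"])
        else:
            current = None
    return [(e["type"], " ".join(e["tokens"]), e["ids"]) for e in entities]


print(surface_entities(read_rows(SAMPLE)))
```

The same grouping could be applied to the NE-EMB column to obtain the embedded entities.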
