buda-base/nlm-numbers-model


nlm-numbers-model

Training data and code for a model to detect accession numbers in images from the National Library of Mongolia.

Description of the project

The Buddhist Digital Resource Center (BDRC) is the main digitization partner of the National Library of Mongolia (NLM; for more, see this blog post). Scans of about 6,000 volumes are already openly accessible on the BDRC website.

So far, BDRC has only imported information on the first text of each volume, but this is just a drop in the ocean, since the NLM identified 70,000 different texts in these volumes. The catalog of these texts is accessible to BDRC, but in order to import it properly, BDRC needs to map the titles to the images, a task that would take months or years to finish by hand.

Fortunately, the NLM cataloguers wrote an accession number on the first page of each text, as exemplified in the following image:

First page of an NLM volume.

This project leverages AI to detect these numbers and automate the mapping from the NLM catalog to the images themselves.

Creation of the data (BDRC)

nlm-volumeinfos.csv contains information about each volume of the NLM collection, including a list of all the numbers that should appear on the images. Note that the model does not need to be used on the 613 volumes having only one text. Warning: not all volumes have images yet.

The imageinfos/ directory contains csv files named after each volume, listing the images in the volume with the following columns:

  • the original image file name
  • the BDRC image number
  • width of the original image
  • height of the original image

Since the NLM cataloguers seem to have been very consistent in where and how they wrote the number, we prepare each image for processing by:

  • cropping the right side (80% of the image)
  • rotating the images 90° counter-clockwise
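The geometry of these two steps can be sketched as follows. This is only an illustration: it reads "the right side (80% of the image)" as keeping the rightmost 80% of the width, which is an assumption, and the function names are hypothetical rather than taken from the repository's scripts.

```python
def right_crop_box(width, height, fraction=0.8):
    """Return (left, top, right, bottom) for the rightmost `fraction` of the width.

    Assumption: the crop keeps the right part of the image at full height.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    left = width - round(width * fraction)
    return (left, 0, width, height)


def size_after_quarter_turn(box):
    """Size (width, height) of the cropped region once rotated by 90 degrees."""
    left, top, right, bottom = box
    return (bottom - top, right - left)


box = right_crop_box(4000, 1200)     # made-up image dimensions
size = size_after_quarter_turn(box)  # width and height are swapped
```

If Pillow is used for the actual pixel operations, a box in this form can be passed to `Image.crop`, and `Image.rotate(90, expand=True)` performs a counter-clockwise rotation that keeps the whole image.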

Running the inference (BDRC)

When a new batch of scans becomes available:

Add the results of the following query to allw.csv:

select ?w ?i {
  ?w :inCollection bdr:PR1NLM00 .
  FILTER(strstarts(str(?w), 'http://purl.bdrc.io/resource/W1NLM'))
  ?wadm adm:adminAbout ?w ;
        adm:status bda:StatusReleased .
  ?w :instanceHasVolume ?i .
  ?i :volumePagesTotal ?vpt .
  FILTER(?vpt > 2)
}

Run create_initial_csvs.py.

Run nlm-numbers-private/create-ai-dataset.py and copy nlm-volumeinfos.csv into this repository.

Put the total number of images for each volume in w-vpt.csv. This can be done with the following query:

select ?w ?vpt {
  ?w :inCollection bdr:PR1NLM00 .
  FILTER(strstarts(str(?w), 'http://purl.bdrc.io/resource/W1NLM'))
  ?wadm adm:adminAbout ?w ;
        adm:status bda:StatusReleased .
  ?w :instanceHasVolume ?i .
  ?i :volumePagesTotal ?vpt .
  FILTER(?vpt > 2)
}
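If the query is run against an endpoint that returns the standard SPARQL JSON results format, the bindings can be flattened into CSV rows along the following lines. This is a sketch under assumptions: the exact layout expected for w-vpt.csv is not described here, so the two-column form and the stripping of the resource prefix are guesses, and the identifier in the sample is a made-up placeholder.

```python
import csv
import io

# Resource namespace used in the query's FILTER clause.
PREFIX = "http://purl.bdrc.io/resource/"


def bindings_to_rows(results, variables):
    """Flatten SPARQL JSON results into rows, one value per requested variable."""
    rows = []
    for binding in results["results"]["bindings"]:
        row = []
        for var in variables:
            value = binding[var]["value"]
            if value.startswith(PREFIX):
                value = value[len(PREFIX):]  # assumption: keep the local name only
            row.append(value)
        rows.append(row)
    return rows


# Made-up response with a placeholder identifier, for illustration only.
sample = {
    "head": {"vars": ["w", "vpt"]},
    "results": {"bindings": [
        {"w": {"type": "uri", "value": PREFIX + "W1NLM0000"},
         "vpt": {"type": "literal", "value": "120"}},
    ]},
}

rows = bindings_to_rows(sample, ["w", "vpt"])
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(rows)
csv_text = out.getvalue()
```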

Run nlm_classifier_infer.py on a machine with a GPU, then copy the new files from s3://image-processing.bdrc.io/nlm-numbers/Aresults/xce_model/ into a directory in results/.

Analyzing the results (BDRC or ALL)

Run analyze-results.py a few times (TODO), looking at the different lists for debugging.

This produces two files:

  • outline.csv with the image numbers for all the volumes where the number of detected stamps matches the number of texts in the catalog
  • outline_needs_review.csv with the best approximation of image numbers for the volumes where the number of detected stamps is different from the number of texts in the catalog
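The split between the two files can be pictured with the sketch below. It is not the logic of analyze-results.py: the function name and data shapes are hypothetical, pairing numbers with images in order assumes that texts appear in catalog order, and the "best approximation" step for mismatched volumes is not attempted.

```python
def split_volumes(expected, detected):
    """Separate volumes whose detected stamp count matches the catalog.

    expected: volume id -> accession numbers listed in the catalog
    detected: volume id -> image numbers where a stamp was detected
    """
    matched, needs_review = {}, {}
    for volume, numbers in expected.items():
        images = detected.get(volume, [])
        if len(images) == len(numbers):
            # Assumption: the n-th detected stamp belongs to the n-th catalog text.
            matched[volume] = list(zip(numbers, images))
        else:
            needs_review[volume] = images
    return matched, needs_review


# Placeholder identifiers and image numbers, for illustration only.
expected = {"volA": ["n1", "n2"], "volB": ["n3", "n4", "n5"]}
detected = {"volA": [3, 57], "volB": [3, 40]}
matched, needs_review = split_volumes(expected, detected)
```

Here `volA` would go to the first file, while `volB`, with two detections for three catalog texts, would be flagged for review.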

These two files should then be copied into nlm-numbers-private.
