Scripts for Processing the Digital Second Edition of Judaica Americana

This repository contains various Python scripts used in the creation of datasets from Robert Singerman's Judaica Americana: A Bibliography of Publications to 1900. These datasets were used as the foundation of the Digital Second Edition of Judaica Americana.

These scripts include:

extract_singerman.py: extracts the data from the Judaica Americana (JA) draft and writes it to a CSV file
flip-index-headers.py: creates a CSV file of Singerman IDs and their corresponding index headers
extract_singerman_serials.py: extracts the serials data from the JA draft and writes it to a CSV file
tess.py: forked from tess, written by Jonathan Scott Enderle. tess was used to OCR the index of the JA print publication
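
As a rough illustration of what "flipping" an index can mean, the sketch below inverts a mapping of index headers to Singerman IDs into a mapping of Singerman IDs to index headers and writes it out as CSV. It is not taken from flip-index-headers.py: the function names, the column names, the "; " separator, and the sample headers and IDs are all hypothetical, and the real script's input format is not described here.

```python
import csv
import io
from collections import defaultdict


def flip_index(index):
    """Invert {header: [ids]} into {id: [headers]}, keeping header order."""
    flipped = defaultdict(list)
    for header, ids in index.items():
        for singerman_id in ids:
            flipped[singerman_id].append(header)
    return dict(flipped)


def write_flipped_csv(flipped, out):
    """Write one row per Singerman ID; column names are assumptions."""
    writer = csv.writer(out)
    writer.writerow(["singerman_id", "index_headers"])
    for singerman_id in sorted(flipped):
        writer.writerow([singerman_id, "; ".join(flipped[singerman_id])])


# Made-up sample data, not real entries from the bibliography.
sample_index = {
    "Header A": ["0001", "0002"],
    "Header B": ["0002"],
}

flipped = flip_index(sample_index)
buffer = io.StringIO()
write_flipped_csv(flipped, buffer)
print(buffer.getvalue())
```

Joining multiple headers into one cell keeps a single row per ID; writing one row per (ID, header) pair would be an equally reasonable layout.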

More Information on tess

tess is an extremely basic Python script that converts PDFs to TIFFs and performs OCR on them with Tesseract.
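
The two steps described above (PDF to TIFF, then OCR) can be sketched as a pair of command lines built in Python. This is a minimal sketch under stated assumptions, not the code in tess.py: the helper names, the 300 dpi density, and the output file naming are hypothetical. It assumes ImageMagick 7 is available as `magick` and Tesseract as `tesseract`.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_path, language="eng", density=300):
    """Return the (convert, ocr) command lists for one PDF.

    The density value and output names are assumptions for illustration.
    """
    pdf = Path(pdf_path)
    tiff = pdf.with_suffix(".tiff")   # e.g. sample.pdf -> sample.tiff
    out_base = pdf.with_suffix("")    # tesseract appends .txt itself
    convert_cmd = ["magick", "-density", str(density), str(pdf), str(tiff)]
    ocr_cmd = ["tesseract", str(tiff), str(out_base), "-l", language]
    return convert_cmd, ocr_cmd


def run_ocr(pdf_path, language="eng"):
    """Run both steps in order, stopping on the first failure."""
    for cmd in build_commands(pdf_path, language):
        subprocess.run(cmd, check=True)


convert_cmd, ocr_cmd = build_commands("testdata/1961_Alessandria.pdf", "ita")
print(" ".join(convert_cmd))
print(" ".join(ocr_cmd))
```

Building the commands separately from running them makes it easy to inspect what would be executed before any external tool is called.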

To run the script, first ensure that ImageMagick 7 and Tesseract 4 are installed and can be run from the command line. (Later versions may work, but the script has only been tested with the versions above.)

You'll also need to ensure that the language models for the languages you want to OCR are installed. The Tesseract wiki has installation instructions for various operating systems.
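
One quick way to confirm that both programs can be run from the command line is to look them up on the PATH. The snippet below is a small sketch, not part of tess.py; the function name is hypothetical, and it assumes the executables are named `magick` (ImageMagick 7) and `tesseract`.

```python
import shutil

# Executable names are assumptions based on default installs.
REQUIRED_TOOLS = ("magick", "tesseract")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All required tools were found.")
```

This only checks that the programs exist, not their versions. Running `tesseract --list-langs` should show which language models are installed.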

Once you have the software installed, you can run the script:

tess.py [--language LANGUAGE] files [files ...]

---------

files:          One or more PDF files to process
--language:     The tesseract language ID code for the language model
                    to use. E.g. eng (English), deu (German) or 
                    ita (Italian). The default is eng.
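
For readers curious how a command-line interface like the one above is typically declared, here is a hedged sketch using Python's argparse. It mirrors the usage line and the documented default, but the help strings and the `build_parser` name are illustrative and are not copied from tess.py.

```python
import argparse


def build_parser():
    """Build a parser matching: tess.py [--language LANGUAGE] files [files ...]"""
    parser = argparse.ArgumentParser(
        prog="tess.py",
        description="Convert PDFs to TIFFs and OCR them with tesseract.",
    )
    # nargs="+" requires at least one file and collects them into a list.
    parser.add_argument("files", nargs="+",
                        help="One or more PDF files to process")
    parser.add_argument("--language", default="eng",
                        help="The tesseract language ID code, e.g. eng, deu or ita")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["--language", "ita", "testdata/1961_Alessandria.pdf"]
)
print(args.language, args.files)
```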

An Italian-language sample file is provided in the testdata folder. To process it, run the following command:

tess.py --language ita testdata/1961_Alessandria.pdf

This is a Judaica Digital Humanities at the Penn Libraries repository.

Judaica Digital Humanities at the Penn Libraries (also referred to as Judaica DH) is a robust program of projects and tools for experimental digital scholarship with Judaica collections, informed by digital humanities, Jewish studies, and cultural heritage approaches. Visit our website.

