A set of related projects to extract text from PDFs.
A prebuilt version of the most recent version of TextWorks can be fetched like so:
From the github release page https://github.com/iesl/watr-works/releases/
Download the latest binary release (under 'Assets'): textworks-app.tar.gz
Extract the application, e.g.,
tar xvfz textworks-app.tar.gz
The extracted app will have the following structure:
textworks-app
├── bin
│ └── textworks
└── lib
└── *.jar
Place the bin/ directory on your path, or run it as
/path/to/textworks-app/bin/textworks --help
to see options.
textworks --file document.pdf --output document.json
Specify --overwrite if the output file already exists
Textworks expects the PDFs to be organized such that there is one subdirectory per PDF, and all generated artifacts (as well as the original PDF) are grouped together in the same directory. The textworks app can perform the initializions step by creating appropriately named subdirectories (hereafter called artifact directories), and stashing each pdf in its artifact directory.
To initialize a corpus, place the PDFs in some directory:
./corpus/root
├── alpha.pdf
├── beta.pdf
└── gamma.pdf
Then run:
textworks init --corpus ./path/to/corpus/root
This will result in the following structure:
./corpus/root
├── .corpus-root
├── alpha.pdf.d
│ └── alpha.pdf
├── beta.pdf.d
│ └── beta.pdf
└── gamma.pdf.d
└── gamma.pdf
There is a marker file named '.corpus-root' placed in the root directory. It is created by textworks, when initializing a corpus, and later checked when processing corpus entries.
New PDFs may be added to the root of the corpus directory and the init step may be re-run at any time, without disturbing existing entries.
By default, textworks will process everything in the specified corpus directory.
textworks --corpus ./path/to/corpus/root [--overwrite]
This will write a file named textgrid.json to the root of each pdf artifact directory. It will check for the existence of this file first, skipping PDFs that have already been processed, unless --overwrite is specified.
After processing, the directory will look like so:
./corpus/root
├── alpha.pdf.d
│ ├── textgrid.json
│ └── alpha.pdf
├── beta.pdf.d
│ ├── textgrid.json
│ └── beta.pdf
└── gamma.pdf.d
├── textgrid.json
└── gamma.pdf
Segmentation produces a json file like this:
{"description" : "Extracted Pages for 10.1101-056275.d",
"documentId" : "10.1101-056275.d",
"pages" : [{"pageGeometry" : [0, 0, 61200, 79200],
"textgrid" : {"stableId" : "10.1101-056275.d",
"rows" : [{"offset" : 0,
"text" : "Prediction of allosteric sites and mediating interactions through bond-to-bond",
"loci" : [{"g" : [["P", 0, [7792, 5383, 798, 820]]]
{ "pages": [
{"page": 0,
"blocks": [
{"labels": ["header", "publisher"],
"lines": [
[["©","2014","Elsevier","B.V.","All","rights","reserved."], [0,1,2,3,4,5,6]]
]},
Each page consists of a list of blocks, where each block is a labeled set of text lines. The labels identify the section of the paper where the text appeared (header, body, or references), and, if it can be determined, the role of that block of text (section-heading, paragraph, caption, table). The text blocks appear in reading-order, as best as can be determined by layout analysis.
Each line in a block of text is a comma separated list of tokens, followed by an equal length list of IDs for each token. At the end of the json file is an "id" section, which lists, every token id with the format
[idnum, [pagenum, [top,left,width,height]]]
which describes the page number and bounding box for that particular token.
Some portions of a token may have special formatting information, like super- or sub-script annotation. If this is the case, The token will begin and end with braces, like so:
"{PhSiH_{3}}"
Within the token, superscripts are designated with ^{...}, subscripts with _{...}.
The following characters are always escaped: standard json escapes: '"', '', .. braces: '{', '}'
Inside of a tex-formatted token, '_', '^' are also escaped.