PDF processing #16

mrchristian · 2019-09-19T09:11:17Z

Can you point me to the part of ContentMine, or the instructions, for processing PDFs and extracting their parts? Also, is there an example of a source document and its outputs?

I am asking as some colleagues have a PDF document set that they need to extract and enrich components from.

petermr · 2019-09-19T09:46:34Z

ami-pdf will read the PDFs in bulk and split them into characters and images. After that we need to know the application. Try http://discuss.contentmine.org/t/cm-ucl-ii-semantic-content-enhancement-of-table-data/396/2 for an overview of extracting tables. You need to be able to run the latest ami-pdf, which is available in the ami-jars repo (https://github.com/petermr/ami-jars). There is no simple tutorial. For text only I would use GROBID; for tables and diagrams, AMI. In haste - more later.

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

mrchristian · 2019-09-19T09:57:45Z

will have a go, much appreciated

petermr · 2019-09-19T10:04:25Z

Much of this is available through the Java tests in petermr/normami, now moved to petermr/ami3. ami3 has the tests but not the data. It's image-based, so probably of limited value. Back in 20 mins.


petermr · 2019-09-19T10:25:14Z

How many documents do you have? The first step is to turn them into a CProject. Put them in a directory, e.g. simon20190919. Then ami-makeproject on its own gives the help, and ami-makeproject -p simon20190919 -f pdf should do it. Please record everything here, including the new CProject.
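The steps above could be sketched as a short shell session. This is only a minimal sketch, assuming the ami tools are already on your PATH. The source directory in `PDF_SRC` is a hypothetical location, and the `-p` and `-f` options are the ones quoted in this comment.

```shell
#!/bin/sh
# Sketch only: collect PDFs into a directory and turn it into a CProject.
# PDF_SRC is a hypothetical location; point it at wherever your PDFs live.
PDF_SRC="${PDF_SRC:-$HOME/Downloads/pdfs}"
CPROJECT="${CPROJECT:-simon20190919}"

mkdir -p "$CPROJECT"

# Copy any PDFs found in PDF_SRC (the loop copies nothing if there are none).
for pdf in "$PDF_SRC"/*.pdf; do
    if [ -f "$pdf" ]; then
        cp "$pdf" "$CPROJECT"/
    fi
done

# Run ami-makeproject only if it is installed; otherwise say what is missing.
if command -v ami-makeproject >/dev/null 2>&1; then
    ami-makeproject -p "$CPROJECT" -f pdf
else
    echo "ami-makeproject not found on PATH; install the ami tools first" >&2
fi
```

Keeping the directory name in a variable makes it easy to record the exact CProject name in this thread, as requested.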


mrchristian · 2019-09-19T11:06:36Z

25k docs I think, very mixed over multiple decades :-) I'll send you a sample doc and quickly describe what we want to extract. And thank you for your time. If you can give your view on the doc I send it might shortcut things a little. You can just say 'yay', 'nay' if we're going to have any luck.

petermr · 2019-09-19T11:06:46Z

Here's a stack of ami commands

#! /bin/sh

# your path should include the /bin directory of the appassembler distrib, e.g.
# ami-forestplot => /Users/pm286/workspace/cmdev/normami/target/appassembler/bin/ami-forestplot

# edit this to your own directory
# STATA="/Users/pm286/projects/forestplots/stataforestplots"
# STATA="/Users/pm286/projects/forestplots/_stataok"
WORKSPACE=$HOME/workspace/
FOREST_TOP=$WORKSPACE/projects/forestplots
MID_DIR=test20190804
FOREST_MID=$FOREST_TOP/$MID_DIR
LOW_DIR=_stataok
FOREST_DIR=$FOREST_MID/$LOW_DIR

CPROJECT=$FOREST_DIR
CTREE_NAME=PMC6127950
#CTREE_NAME=PMC5882397
CTREE=$CPROJECT/$CTREE_NAME

echo CTREE $CTREE

while getopts p:t: option
do
case "${option}"
in
p) CPROJECT=${OPTARG};;
t) CTREE=${OPTARG};;
esac
done


# choose the first SOURCE to run a single CTree, the second to run a CProject (long). 
# Comment in the one you want
SOURCE=" -t $CTREE"
# SOURCE=" -p $CPROJECT"
echo $CTREE
ls $CTREE

# images 
RAW=raw
RAW230DS=raw_thr_230_ds
RAWS4230DS=raw_s4_thr_230_ds
#subimages

# regions of image
HEADER=header
BODY=body
LTABLE=ltable
RTABLE=rtable
SCALE=scale

HEADERS120D=${HEADER}"_s4_thr_120_ds"
LTABLES120D=${LTABLE}"_s4_thr_120_ds"
RTABLES120D=${RTABLE}"_s4_thr_120_ds"

SLEEP1=1
SLEEP5=5

# make project from a directory (CPROJECT) containing PDFs. 
# a no-op here as EuPMC has already done this

ami-makeproject -p $CPROJECT --rawfiletypes pdf

# convert PDFs to CTrees

ami-pdf $SOURCE

# image processing at 3 threshold levels (later will try to make this an AMI loop)

ami-image $SOURCE --sharpen sharpen4 --threshold 150 --despeckle true
ami-image $SOURCE --sharpen sharpen4 --threshold 230 --despeckle true
ami-image $SOURCE --sharpen sharpen4 --threshold 240 --despeckle true

echo "===============Finished AmiImage============="
sleep $SLEEP1

# run OCR both types

ami-ocr $SOURCE --gocr      /usr/local/bin/gocr      --extractlines gocr               --forcemake
ami-ocr $SOURCE --tesseract /usr/local/bin/tesseract --extractlines hocr --html false  --forcemake

echo "===============Finished AmiOcr============="
sleep $SLEEP1

# extract the pixels and project onto axes to get subimage regions
# further project the scale subimage (y(2)) to get the tick values 
# in this case do it for the threshold 230 version only
# the stylesheet location (xsl) is hard coded into the distrib but it could be 
# more general.
# This *generates* raw_thr_230_ds/template.xml . Its variables (e.g. $RAW.$HEADER) are specified 
# in the stylesheet and values computed from applying ami-pixel to the images

ami-pixel $SOURCE --projections --yprojection 0.8 --xprojection 0.5 \
                --minheight -1 --rings -1 --islands 0 \
                --inputname $RAW230DS \
                --subimage statascale y 2 delta 10 projection x \
                --templateinput $RAW230DS/projections.xml \
                --templateoutput template.xml \
                --templatexsl /org/contentmine/ami/tools/stataTemplate.xsl

echo "===============Finished AmiPixel============="
sleep $SLEEP5

# use the generated template.xml in each CTree/*/image*/raw_thr_230_ds/ directory to segment the image
# this will create subimages $RAW.$HEADER, $RAW.$BODY.$LTABLE, raw.body.graph, $RAW.$BODY.$RTABLE and raw.scale
# these subimages will be written to *.png in the CTree/*/image* directory
			    
ami-forestplot $SOURCE --template $RAW230DS/template.xml

echo "===============Finished AmiForest============="
sleep $SLEEP5

#now re-run ami-image to enhance each subimage separately

ami-image $SOURCE --inputname $RAW.$HEADER --sharpen sharpen4 --threshold 120 --despeckle true
ami-image $SOURCE --inputname $RAW.$BODY.$LTABLE --sharpen sharpen4 --threshold 120 --despeckle true
ami-image $SOURCE --inputname $RAW.$BODY.$RTABLE --sharpen sharpen4 --threshold 120 --despeckle true

echo "===============Finished Sharpen Threshold============="
sleep $SLEEP5

# and rerun tesseract on each subimage (suspect Tesseract gets confused by the whole
# image including the graph and lines).

ami-ocr $SOURCE --inputname $RAW.$HEADERS120D      --tesseract /usr/local/bin/tesseract --extractlines hocr
ami-ocr $SOURCE --inputname $RAW.$BODY.$LTABLES120D --tesseract /usr/local/bin/tesseract --extractlines hocr
ami-ocr $SOURCE --inputname $RAW.$BODY.$RTABLES120D --tesseract /usr/local/bin/tesseract --extractlines hocr

echo "===============Finished Tesseract ============="
sleep $SLEEP5

ami-ocr $SOURCE --inputname $RAW.$HEADERS120D      --gocr /usr/local/bin/gocr --extractlines gocr
ami-ocr $SOURCE --inputname $RAW.$BODY.$LTABLES120D --gocr /usr/local/bin/gocr --extractlines gocr
ami-ocr $SOURCE --inputname $RAW.$BODY.$RTABLES120D --gocr /usr/local/bin/gocr --extractlines gocr

echo "===============Finished GOCR ============="
sleep $SLEEP5
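
The script above repeats `ami-image` for three thresholds and notes that this might later become an AMI loop. Until then, a plain shell loop would do the same job. This is a minimal sketch, assuming the same `ami-image` options as in the script. The `run` helper and the `DRY_RUN` variable are hypothetical additions, not part of ami: with `DRY_RUN=1` (the default here) each command is printed rather than executed.

```shell
#!/bin/sh
# Sketch only: loop ami-image over the three thresholds used in the script above.
# DRY_RUN and run() are illustrative helpers, not part of ami.
DRY_RUN="${DRY_RUN:-1}"
CTREE="${CTREE:-PMC6127950}"
THRESHOLDS="150 230 240"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"      # show the command without running it
    else
        "$@"           # execute the command as given
    fi
}

for thr in $THRESHOLDS; do
    run ami-image -t "$CTREE" --sharpen sharpen4 --threshold "$thr" --despeckle true
done
```

Setting `DRY_RUN=0` runs the commands for real. Adding another threshold is then a one-word change to `THRESHOLDS`.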

petermr · 2019-09-19T11:07:57Z

Don't send it.
Add it in a new folder here, unless there are copyright issues.

petermr · 2019-09-19T11:10:17Z

from the 25K try to select ca 20 which are:

newish (old docs are problematic, but maybe that is the point)
born digital if possible
OPEN (we cannot have takedowns)
show the range of problems
make clear what needs to be extracted

mrchristian · 2019-09-19T11:10:42Z

I think there are copyright questions, yes, but I'll check first.

petermr · 2019-09-19T11:12:42Z

If it's publicly visible I'm happy. We did that with phylotrees.
We are allowed to extract data if we can legally read it somewhere. It doesn't have to be CC BY. Also, I don't think stopping climate research is good PR.

petermr · 2019-09-19T11:13:10Z

happy to talk on phone/skype if helps

petermr · 2019-09-19T11:14:15Z

If you have 100-year-old records as bitmaps I am happy to try those, but they must be homogeneous in type.

mrchristian · 2019-09-19T11:18:39Z

I need to wait for colleagues to get docs :-)

petermr · 2019-09-19T12:07:50Z

see table extraction at http://discuss.contentmine.org/t/ami-eppi-cm-ucl-table-extraction-project/322/14

petermr · 2019-09-19T12:08:49Z

even one doc would be a useful start.
can tackle it in next 1.5 hours

petermr · 2019-09-19T12:09:36Z

Would like to show something for my school visit in 10 days.

hauschke · 2019-09-19T12:36:03Z

https://edocs.tib.eu/files/e01fb19/1676027963.pdf has https://creativecommons.org/licenses/by/3.0/de. I'll look for some more, might take some minutes.

hauschke · 2019-09-19T12:57:45Z

http://creativecommons.org/licenses/by/4.0/, https://edocs.tib.eu/files/e01fb19/1666373214.pdf
http://creativecommons.org/licenses/by-sa/4.0/, https://edocs.tib.eu/files/e01fb19/1670198502.pdf
https://creativecommons.org/licenses/by-nc-nd/4.0/, https://edocs.tib.eu/files/e01fb19/1667335782.pdf
http://creativecommons.org/licenses/by-sa/4.0/, https://edocs.tib.eu/files/e01fb19/1665279796.pdf
http://creativecommons.org/licenses/by-sa/4.0/, https://edocs.tib.eu/files/e01fb19/166506773X.pdf

Some more for testing. Sorry, I could deliver some dozens more, but I hope that's enough for a trial.
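
To use the PDFs listed above as input for `ami-makeproject`, they first need to sit in one directory. This is a minimal sketch, assuming `curl` is installed. The directory name in `OUT_DIR` and the `DRY_RUN` switch are hypothetical. The URLs are the ones posted in this comment.

```shell
#!/bin/sh
# Sketch only: fetch the sample PDFs into one directory for ami-makeproject.
# OUT_DIR and DRY_RUN are illustrative; DRY_RUN=1 (default) only prints the commands.
OUT_DIR="${OUT_DIR:-tib20190919}"
DRY_RUN="${DRY_RUN:-1}"
BASE_URL="https://edocs.tib.eu/files/e01fb19"
PDFS="1666373214 1670198502 1667335782 1665279796 166506773X"

mkdir -p "$OUT_DIR"

for id in $PDFS; do
    url="$BASE_URL/$id.pdf"
    if [ "$DRY_RUN" = "1" ]; then
        echo "curl -fsSL -o $OUT_DIR/$id.pdf $url"
    else
        # -f: fail on HTTP errors, -sS: quiet but show errors, -L: follow redirects
        curl -fsSL -o "$OUT_DIR/$id.pdf" "$url"
    fi
done
```

With `DRY_RUN=0` the files are downloaded, and `ami-makeproject -p` can then be pointed at the resulting directory.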

petermr · 2019-09-19T12:59:47Z

I have processed your first PDF and uploaded the results. It extracts the bitmaps and characters as SVG. I will revisit my SVG-to-text conversion. See if you can make some sense of it. The SVG is in pages.


petermr · 2019-09-19T13:10:51Z

The next 5 don't seem very relevant to climate change. It's not clear what would be extracted. I want to stick to climate and specific types of information, e.g. tables/graphs vs time.


mrchristian · 2019-09-19T14:46:43Z

We'll assemble a small climate change collection, though it will take a few days. We will also get hold of an example list of the items we want to extract. The context is wanting to make final research reports more visible, so that they become part of the research corpus in a more usable way. The climate change related reports would sit within the bigger body of research reports. If you can share back the current SVG outputs that would be great.

mrchristian · 2019-09-26T08:08:58Z

Here is a set of 10 research reports that are CC licensed. This is not a priority, but interesting to know some time if entities like 'Abstract, Introduction and Conclusion' can be extracted. The context is in terms of making German research reports more visible, usable, and obviously help future research. The ambition is to make the national collection easier to use, and well if it can be done for one collection, why not more.

Files

Oh, some more context :-) https://twitter.com/Lambo/status/1176901945249939463

petermr · 2019-09-27T07:41:39Z

I am back in Cambridge so can start working on this.

