PDF processing #16

Open
mrchristian opened this issue Sep 19, 2019 · 23 comments

Comments

@mrchristian
Contributor

Can you point me to the part of ContentMine, or the instructions, for processing and extracting PDF parts? Also, is there an example of a source document and the outputs?

I am asking as some colleagues have a PDF document set that they need to extract and enrich components from.

@petermr
Owner

petermr commented Sep 19, 2019 via email

@mrchristian
Contributor Author

will have a go, much appreciated

@petermr
Owner

petermr commented Sep 19, 2019 via email

@petermr
Owner

petermr commented Sep 19, 2019 via email

@mrchristian
Contributor Author

25k docs I think, very mixed over multiple decades :-) I'll send you a sample doc and quickly describe what we want to extract. And thank you for your time. If you can give your view on the doc I send, it might shortcut things a little. You can just say 'yay' or 'nay' on whether we're going to have any luck.

@petermr
Owner

petermr commented Sep 19, 2019

Here's a stack of ami commands

#! /bin/sh

# your path should include the /bin directory of the appassembler distrib, e.g.
# ami-forestplot => /Users/pm286/workspace/cmdev/normami/target/appassembler/bin/ami-forestplot

# edit this to your own directory
# STATA="/Users/pm286/projects/forestplots/stataforestplots"
# STATA="/Users/pm286/projects/forestplots/_stataok"
WORKSPACE=$HOME/workspace/
FOREST_TOP=$WORKSPACE/projects/forestplots
MID_DIR=test20190804
FOREST_MID=$FOREST_TOP/$MID_DIR
LOW_DIR=_stataok
FOREST_DIR=$FOREST_MID/$LOW_DIR

CPROJECT=$FOREST_DIR
CTREE_NAME=PMC6127950
#CTREE_NAME=PMC5882397
CTREE=$CPROJECT/$CTREE_NAME

echo CTREE $CTREE

while getopts p:t: option
do
    case "${option}" in
        p) CPROJECT=${OPTARG};;
        t) CTREE=${OPTARG};;
    esac
done


# choose the first SOURCE to run a single CTree, the second to run a whole CProject (slow).
# Uncomment the one you want
SOURCE=" -t $CTREE"
# SOURCE=" -p $CPROJECT"
echo $CTREE
ls $CTREE

# images 
RAW=raw
RAW230DS=raw_thr_230_ds
RAWS4230DS=raw_s4_thr_230_ds
#subimages

# regions of image
HEADER=header
BODY=body
LTABLE=ltable
RTABLE=rtable
SCALE=scale

HEADERS120D=${HEADER}"_s4_thr_120_ds"
LTABLES120D=${LTABLE}"_s4_thr_120_ds"
RTABLES120D=${RTABLE}"_s4_thr_120_ds"

SLEEP1=1
SLEEP5=5

# make project from a directory (CPROJECT) containing PDFs. 
# a no-op here as EuPMC has already done this

ami-makeproject -p $CPROJECT --rawfiletypes pdf

# convert PDFs to CTrees

ami-pdf $SOURCE

# image processing at 3 threshold levels (later will try to make this an AMI loop)

ami-image $SOURCE --sharpen sharpen4 --threshold 150 --despeckle true
ami-image $SOURCE --sharpen sharpen4 --threshold 230 --despeckle true
ami-image $SOURCE --sharpen sharpen4 --threshold 240 --despeckle true

echo "===============Finished AmiImage============="
sleep $SLEEP1

# run both types of OCR (GOCR and Tesseract)

ami-ocr $SOURCE --gocr      /usr/local/bin/gocr      --extractlines gocr               --forcemake
ami-ocr $SOURCE --tesseract /usr/local/bin/tesseract --extractlines hocr --html false  --forcemake

echo "===============Finished AmiOcr============="
sleep $SLEEP1

# extract the pixels and project onto axes to get subimage regions
# further project the scale subimage (y(2)) to get the tick values
# in this case do it for the threshold 230 version only
# the stylesheet location (xsl) is hard-coded into the distrib but it could be
# more general.
# This *generates* raw_thr_230_ds/template.xml. Its variables (e.g. $RAW.$HEADER) are specified
# in the stylesheet and their values are computed by applying ami-pixel to the images

ami-pixel $SOURCE --projections --yprojection 0.8 --xprojection 0.5 \
    --minheight -1 --rings -1 --islands 0 \
    --inputname $RAW230DS \
    --subimage statascale y 2 delta 10 projection x \
    --templateinput $RAW230DS/projections.xml \
    --templateoutput template.xml \
    --templatexsl /org/contentmine/ami/tools/stataTemplate.xsl

echo "===============Finished AmiPixel============="
sleep $SLEEP5

# use the generated template.xml in each CTree/*/image*/raw_thr_230_ds/ directory to segment the image
# this will create subimages $RAW.$HEADER, $RAW.$BODY.$LTABLE, raw.body.graph, $RAW.$BODY.$RTABLE and raw.scale
# these subimages will be written to *.png in the CTree/*/image* directory
			    
ami-forestplot $SOURCE --template $RAW230DS/template.xml

echo "===============Finished AmiForest============="
sleep $SLEEP5

# now re-run ami-image to enhance each subimage separately

ami-image $SOURCE --inputname $RAW.$HEADER --sharpen sharpen4 --threshold 120 --despeckle true
ami-image $SOURCE --inputname $RAW.$BODY.$LTABLE --sharpen sharpen4 --threshold 120 --despeckle true
ami-image $SOURCE --inputname $RAW.$BODY.$RTABLE --sharpen sharpen4 --threshold 120 --despeckle true

echo "===============Finished Sharpen Threshold============="
sleep $SLEEP5

# and rerun tesseract on each subimage (suspect Tesseract gets confused by the whole
# image including the graph and lines).

ami-ocr $SOURCE --inputname $RAW.$HEADERS120D      --tesseract /usr/local/bin/tesseract --extractlines hocr
ami-ocr $SOURCE --inputname $RAW.$BODY.$LTABLES120D --tesseract /usr/local/bin/tesseract --extractlines hocr
ami-ocr $SOURCE --inputname $RAW.$BODY.$RTABLES120D --tesseract /usr/local/bin/tesseract --extractlines hocr

echo "===============Finished Tesseract ============="
sleep $SLEEP5

ami-ocr $SOURCE --inputname $RAW.$HEADERS120D      --gocr /usr/local/bin/gocr --extractlines gocr
ami-ocr $SOURCE --inputname $RAW.$BODY.$LTABLES120D --gocr /usr/local/bin/gocr --extractlines gocr
ami-ocr $SOURCE --inputname $RAW.$BODY.$RTABLES120D --gocr /usr/local/bin/gocr --extractlines gocr

echo "===============Finished GOCR ============="
sleep $SLEEP5
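
The script's note about turning the three threshold runs into a loop could be sketched in plain shell roughly as follows. This is only an illustration: `run_image_thresholds` and `DRY_RUN` are hypothetical names, not part of ami, and the `ami-image` flags are copied unchanged from the three lines in the script. With `DRY_RUN=1` (the default here) the commands are only printed, so the loop can be checked without ami installed.

```shell
#!/bin/sh
# Sketch only: loop over the thresholds used by the three ami-image lines.
# run_image_thresholds and DRY_RUN are hypothetical names, not part of ami.

DRY_RUN=${DRY_RUN:-1}   # 1 = print the commands only; 0 = execute them

run_image_thresholds() {
    source_args=$1      # e.g. "-t /path/to/ctree" or "-p /path/to/cproject"
    shift
    for threshold in "$@"
    do
        cmd="ami-image $source_args --sharpen sharpen4 --threshold $threshold --despeckle true"
        if [ "$DRY_RUN" = 1 ]
        then
            echo "$cmd"
        else
            # word splitting is intended here, so paths must not contain spaces
            $cmd || return 1
        fi
    done
}

# same thresholds as in the script
run_image_thresholds "-t ${CTREE:-PMC6127950}" 150 230 240
```

Setting `DRY_RUN=0` in the environment would run the commands instead of printing them; the loop stops at the first `ami-image` call that fails.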

@petermr
Owner

petermr commented Sep 19, 2019

Don't send it;
add it in a new folder here, unless there are copyright issues.

@petermr
Owner

petermr commented Sep 19, 2019

From the 25K, try to select ca. 20 which:

  • are newish (old docs are problematic, but maybe that is the point)
  • are born digital if possible
  • are OPEN (we cannot have takedowns)
  • show the range of problems
  • make clear what needs to be extracted

@mrchristian
Contributor Author

I think there are copyright questions, yes, but I'll check first.

@petermr
Owner

petermr commented Sep 19, 2019

If it's publicly visible I'm happy. We did that with phylotrees.
We are allowed to extract data if we can legally read it somewhere. It doesn't have to be CC BY. Also, I don't think stopping climate research is good PR.

@petermr
Owner

petermr commented Sep 19, 2019

happy to talk on phone/skype if helps

@petermr
Owner

petermr commented Sep 19, 2019

If you have 100-year-old records as bitmaps I am happy to try those, but they must be homogeneous in type.

@mrchristian
Contributor Author

I need to wait for colleagues to get docs :-)

@petermr
Owner

petermr commented Sep 19, 2019

@petermr
Owner

petermr commented Sep 19, 2019

Even one doc would be a useful start.
I can tackle it in the next 1.5 hours.

@petermr
Owner

petermr commented Sep 19, 2019

Would like to show something for my school visit in 10 days.

@hauschke

https://edocs.tib.eu/files/e01fb19/1676027963.pdf is licensed under https://creativecommons.org/licenses/by/3.0/de. I'll look for some more; it might take a few minutes.

@petermr
Owner

petermr commented Sep 19, 2019 via email

@petermr
Owner

petermr commented Sep 19, 2019 via email

@mrchristian
Contributor Author

We'll assemble a small climate change collection, though it will take a few days. We will also get hold of an example list of the items we want to extract. The context is wanting to make final research reports more visible, so that they become part of the research corpus in a more usable way. The climate change related reports would sit within the bigger body of research reports. If you can share back the current SVG outputs, that would be great.

@mrchristian
Contributor Author

Here is a set of 10 research reports that are CC licensed. This is not a priority, but it would be interesting to know at some point whether sections such as 'Abstract', 'Introduction' and 'Conclusion' can be extracted. The context is making German research reports more visible and usable, and so helping future research. The ambition is to make the national collection easier to use, and if it can be done for one collection, why not more?

Files

http://creativecommons.org/licenses/by-sa/3.0/de,https://edocs.tib.eu/files/e01fb19/1676027963.pdf
http://creativecommons.org/licenses/by-nc-nd/4.0,https://edocs.tib.eu/files/e01fb18/1028076258.pdf
http://creativecommons.org/licenses/by-nc-nd/4.0,https://edocs.tib.eu/files/e01fb18/1028076134.pdf
http://creativecommons.org/licenses/by-nc-nd/4.0,https://edocs.tib.eu/files/e01fb18/1027897045.pdf
http://creativecommons.org/licenses/by-nc-nd/4.0,https://edocs.tib.eu/files/e01fb18/1027879500.pdf
http://creativecommons.org/licenses/by-nd/4.0/deed/,https://edocs.tib.eu/files/e01fn18/1018823859.pdf
https://creativecommons.org/licenses/by-nd/4.0/deed.en,https://edocs.tib.eu/files/e01fn17/893648477.pdf
http://creativecommons.org/licenses/by/4.0/,https://edocs.tib.eu/files/e01fb17/881442836.pdf
http://creativecommons.org/licenses/by-nd/3.0/de/,https://edocs.tib.eu/files/e01fn16/864300328.pdf
http://creativecommons.org/licenses/by-nd/3.0/de/,http://edok01.tib.uni-hannover.de/edoks/e01fn17/857413724.pdf
http://creativecommons.org/licenses/by-nc-nd/3.0/de/,https://edocs.tib.eu/files/e01fn13/739959433.pdf
http://creativecommons.org/licenses/by-nc-nd/3.0/de/,https://edocs.tib.eu/files/e01fn13/719349311.pdf
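
Since each line above has the form "licence URL, comma, PDF URL", one way to turn the list into a directory of PDFs might look like the sketch below. It is only an illustration under stated assumptions: the file name `reports.csv`, the directory `tib_reports` and all function names are hypothetical, and it assumes that neither URL contains a comma.

```shell
#!/bin/sh
# Sketch only: split "licence-URL,PDF-URL" lines and fetch each PDF.
# reports.csv, tib_reports and the function names are hypothetical.

# print the licence part (everything before the first comma)
license_of() { echo "${1%%,*}"; }

# print the PDF URL (everything after the last comma)
pdf_url_of() { echo "${1##*,}"; }

# print the local file name, i.e. the last path segment of the PDF URL
pdf_name_of() { basename "$(pdf_url_of "$1")"; }

fetch_reports() {
    list=$1      # text file with one "licence,url" pair per line
    outdir=$2    # directory that will receive the PDFs
    mkdir -p "$outdir" || return 1
    while IFS= read -r line
    do
        [ -n "$line" ] || continue            # skip blank lines
        url=$(pdf_url_of "$line")
        name=$(pdf_name_of "$line")
        echo "$(license_of "$line") $name"    # record which licence applies
        curl -fL -sS -o "$outdir/$name" "$url" || echo "failed: $url" >&2
    done < "$list"
}

# usage (not run here): fetch_reports reports.csv tib_reports
```

The resulting directory could then presumably be handed to `ami-makeproject -p tib_reports --rawfiletypes pdf`, as in the script earlier in this thread.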

Oh, some more context :-) https://twitter.com/Lambo/status/1176901945249939463

@petermr
Owner

petermr commented Sep 27, 2019 via email

3 participants