PDF processing #16
Comments
ami-pdf reads the PDFs in bulk and splits them into characters and images.
What to do after that depends on the application.
Try
http://discuss.contentmine.org/t/cm-ucl-ii-semantic-content-enhancement-of-table-data/396/2
for an overview of extracting tables
You need to be able to run the latest ami-pdf which is available in the
ami-jars repo. https://github.com/petermr/ami-jars
There is no simple tutorial. For text only I would use GROBID; for tables
and diagrams, AMI.
In haste - more later.
On Thu, Sep 19, 2019 at 10:11 AM Simon Worthington wrote:
Can you point me to the part of ContentMine, or the instructions, for
processing and extracting PDF parts? Also, is there an example of a source
document and its outputs?
I am asking because some colleagues have a PDF document set that they need to
extract and enrich components from.
--
Peter Murray-Rust
Founder ContentMine.org
and
Reader Emeritus in Molecular Informatics
Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
|
will have a go, much appreciated |
Much of this is available through the Java tests in petermr/normami, now moved
to petermr/ami3. ami3 has the tests but not the data. It's image-based, so
probably of limited value.
Back in 20 mins
On Thu, Sep 19, 2019 at 10:57 AM Simon Worthington wrote:
will have a go, much appreciated
|
How many documents do you have?
The first step is to turn them into a CProject.
Put them in a directory, e.g. simon20190919,
then run
ami-makeproject
on its own, which gives the help.
Then
ami-makeproject -p simon20190919 -f pdf
should do it.
Please record everything here, including the new CProject.
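The staging step above could be scripted roughly as follows. This is only a sketch: the command name and the -p / -f pdf options are taken from this thread, while the function names stage_pdfs and makeproject_command and the source folder are hypothetical, and nothing here is part of ami itself.

```python
import shutil
from pathlib import Path


def stage_pdfs(source_dir, project_dir):
    """Copy every *.pdf found directly in source_dir into project_dir.

    Returns the list of copied paths, sorted by file name.
    """
    source = Path(source_dir)
    project = Path(project_dir)
    project.mkdir(parents=True, exist_ok=True)
    copied = []
    for pdf in sorted(source.glob("*.pdf")):
        target = project / pdf.name
        shutil.copy2(pdf, target)  # keeps timestamps along with the content
        copied.append(target)
    return copied


def makeproject_command(project_dir):
    """Build the command line quoted in this thread (it is not run here)."""
    return ["ami-makeproject", "-p", str(project_dir), "-f", "pdf"]
```

With ami installed and on the PATH, passing the result of makeproject_command("simon20190919") to subprocess.run(..., check=True) would be one way to run it. Keeping the command as a list avoids shell quoting problems with odd directory names.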
|
25k docs I think, very mixed over multiple decades :-) I'll send you a sample doc and quickly describe what we want to extract. And thank you for your time. If you can give your view on the doc I send it might shortcut things a little. You can just say 'yay', 'nay' if we're going to have any luck. |
Here's a stack of
|
don't send it, |
from the 25K try to select ca 20 which are:
|
I'll check but I think copyright questions, yes. But I'll check first. |
if it's publicly visible I'm happy. We did that with phylotrees |
happy to talk on phone/skype if helps |
if you have 100-year-old records as bitmaps I am happy to try those, but they must be homogeneous in type |
I need to wait for colleagues to get docs :-) |
see table extraction at http://discuss.contentmine.org/t/ami-eppi-cm-ucl-table-extraction-project/322/14 |
even one doc would be a useful start. |
Would like to show something for my school visit in 10 days. |
https://edocs.tib.eu/files/e01fb19/1676027963.pdf is licensed under https://creativecommons.org/licenses/by/3.0/de. I'll look for some more; it might take a few minutes. |
I have processed your first PDF and uploaded the results.
It extracts the bitmaps and characters as SVG. I will revisit my SVG-to-text conversion.
See if you can make some sense of it. The SVG is in pages.
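For a first look at what a page contains, a minimal sketch like the one below may help. It assumes each page is an SVG file whose characters are stored as text elements with numeric x and y attributes; that layout is an assumption, not something confirmed in this thread. The function name svg_page_to_lines is hypothetical, and this is not ami's own SVG-to-text code.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def svg_page_to_lines(svg_text, y_tolerance=1.0):
    """Group the <text> elements of one SVG page into rough text lines.

    Elements whose y values lie within y_tolerance of the first element
    of a line are treated as the same line; each line is then ordered by x.
    """
    root = ET.fromstring(svg_text)
    chars = []
    for el in root.iter(SVG_NS + "text"):
        content = "".join(el.itertext())
        if not content:
            continue
        try:
            x = float(el.get("x", "0"))
            y = float(el.get("y", "0"))
        except ValueError:
            continue  # skip elements without a single numeric position
        chars.append((y, x, content))

    chars.sort()  # top to bottom, then left to right
    lines, current, current_y = [], [], None
    for y, x, content in chars:
        if current_y is None or abs(y - current_y) > y_tolerance:
            if current:
                lines.append(current)
            current, current_y = [], y
        current.append((x, content))
    if current:
        lines.append(current)
    return ["".join(c for _, c in sorted(line)) for line in lines]
```

This does not insert spaces between words or handle rotated text, so it is only useful for checking whether the characters came out at all.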
On Thu, Sep 19, 2019 at 1:57 PM hauschke wrote:
http://creativecommons.org/licenses/by/4.0/, https://edocs.tib.eu/files/e01fb19/1666373214.pdf
http://creativecommons.org/licenses/by-sa/4.0/, https://edocs.tib.eu/files/e01fb19/1670198502.pdf
https://creativecommons.org/licenses/by-nc-nd/4.0/, https://edocs.tib.eu/files/e01fb19/1667335782.pdf
http://creativecommons.org/licenses/by-sa/4.0/, https://edocs.tib.eu/files/e01fb19/1665279796.pdf
http://creativecommons.org/licenses/by-sa/4.0/, https://edocs.tib.eu/files/e01fb19/166506773X.pdf
Some more for testing. Sorry, I could deliver some dozens more, but I hope
that's enough for a trial.
|
The next 5 don't seem very relevant to climate change. It's not clear what
would be extracted.
I want to stick to climate and to specific types of information,
e.g. tables or graphs against time.
|
We'll assemble a small climate change collection, though it will take a few days. We'll also get hold of an example list of the items we want to extract. The context is wanting to make final research reports more visible, so that they become part of the research corpus in a more usable way. The climate change related reports would sit within the bigger body of research reports. If you can share back the current SVG outputs that would be great. |
Can you point me to the part of ContentMine, or the instructions, for processing and extracting PDF parts? Also, is there an example of a source document and its outputs?
I am asking because some colleagues have a PDF document set that they need to extract and enrich components from.