This repository has been archived by the owner on Mar 30, 2022. It is now read-only.

Deprecated Pipeline Examples

The following example pipelines are deprecated in favour of the Simple Pipeline, which can be run via Apache Beam as well as the API Server.

The pipelines documented here can only be run via Apache Beam. However, for complex requirements this might be the right choice.

Grobid Example Pipeline

This pipeline runs Grobid, which is used for the actual conversion.

To run the example conversion with the defaults:

python -m sciencebeam_pipelines.examples.grobid_service_pdf_to_xml --input "/path/to/pdfs/*/*.pdf"

That will automatically download and run a Grobid Service instance.

Or specify the Grobid URL and file suffix (in that case the Grobid Service is assumed to be running):

python -m sciencebeam_pipelines.examples.grobid_service_pdf_to_xml --input "/path/to/pdfs/*/*.pdf" \
 --grobid-url http://localhost:8080 --output-suffix .tei-header.xml

Or specify an XSLT transformation, e.g. using grobid-jats.xsl:

python -m sciencebeam_pipelines.examples.grobid_service_pdf_to_xml --input "/path/to/pdfs/*/*.pdf" \
 --xslt-path grobid-jats.xsl

Assuming you have already authenticated with Google's Cloud SDK you can also work with buckets by specifying the URL:

python -m sciencebeam_pipelines.examples.grobid_service_pdf_to_xml --input "gs://example_bucket/path/to/pdfs/*.pdf"
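If you run several variants of the commands above, it may help to assemble the argument list in a small script. The following is a minimal sketch, not part of the project: `build_command` is a hypothetical helper, and it only uses the module name and the options shown above (`--input`, `--grobid-url`, `--output-suffix`, `--xslt-path`).

```python
import shlex
import sys

# Module name taken from the commands above.
MODULE = "sciencebeam_pipelines.examples.grobid_service_pdf_to_xml"


def build_command(input_pattern, grobid_url=None, output_suffix=None, xslt_path=None):
    """Build the argument list for the example pipeline (hypothetical helper)."""
    command = [sys.executable, "-m", MODULE, "--input", input_pattern]
    # Only add the optional flags that were actually requested.
    if grobid_url:
        command += ["--grobid-url", grobid_url]
    if output_suffix:
        command += ["--output-suffix", output_suffix]
    if xslt_path:
        command += ["--xslt-path", xslt_path]
    return command


if __name__ == "__main__":
    command = build_command(
        "gs://example_bucket/path/to/pdfs/*.pdf",
        grobid_url="http://localhost:8080",
        output_suffix=".tei-header.xml",
    )
    # Print the shell-quoted command for inspection instead of running it.
    print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the printed command looks right.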

Extending the Pipeline (deprecated)

You can use the grobid_service_pdf_to_xml.py example as a template and add your own steps.
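As a rough illustration of what an added step could look like, the sketch below defines a plain function and wires it into a Beam pipeline with `beam.Map`. The internals of grobid_service_pdf_to_xml.py are not shown here, so this assumes the upstream step yields `(filename, xml_content)` pairs; `check_well_formed` and `add_check_step` are hypothetical names.

```python
import xml.etree.ElementTree as ET


def check_well_formed(item):
    """Return (filename, True) if the XML parses, (filename, False) otherwise.

    Assumes each item is a (filename, xml_content) pair; the real pipeline
    may use a different element shape.
    """
    filename, xml_content = item
    try:
        ET.fromstring(xml_content)
        return filename, True
    except ET.ParseError:
        return filename, False


def add_check_step(converted):
    """Append the check as an extra step to an existing PCollection."""
    # Imported lazily so the plain function above can be tested without Beam.
    import apache_beam as beam

    return converted | "CheckWellFormed" >> beam.Map(check_well_formed)
```

Keeping the step a plain function means it can be unit-tested without a Beam runner, and the same function works unchanged inside `beam.Map`.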