
CLI tool to process directories (from OCR JSON output to NER with segmented entries)


soduco/processor-ner


Transform JSON files into a raw XML file

Input format:

  • A directory with layout {directory}/{####}.json

Output format:

  • XML

Usage

usage: __main__.py [-h] -i INPUT_DIR -o OUTPUT_FILE [-f RANGE_FROM] [-u RANGE_UPTO] [--inplace]

NER command line interface

options:
  -h, --help            show this help message and exit
  -i INPUT_DIR, --input-dir INPUT_DIR
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
  -f RANGE_FROM, --range-from RANGE_FROM
                        Lower bound of the range, inclusive. Assumes file names match pattern `.*?[0-9]+.json`.
  -u RANGE_UPTO, --range-upto RANGE_UPTO
                        Upper bound of the range, inclusive. Assumes file names match pattern `.*?[0-9]+.json`.
  --inplace             Edit inplace the json files to add entries inside.
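
The inclusive range options can be illustrated with a minimal Python sketch. It is not the tool's actual implementation: the function name `select_pages` is hypothetical, and the pattern is the one quoted in the help text with the dot escaped and the digits captured, which is an assumption about the intended behaviour.

```python
import re

# Assumption: same shape as the pattern in the help text, but with the
# dot escaped and the page number captured as a group.
PAGE_PATTERN = re.compile(r".*?([0-9]+)\.json")


def select_pages(names, range_from=None, range_upto=None):
    """Return the file names whose page number lies in the inclusive range.

    `select_pages` is a hypothetical helper, not part of the CLI.
    Names that do not match the pattern are skipped.
    """
    selected = []
    for name in names:
        match = PAGE_PATTERN.fullmatch(name)
        if match is None:
            continue
        page = int(match.group(1))
        if range_from is not None and page < range_from:
            continue
        if range_upto is not None and page > range_upto:
            continue
        selected.append((page, name))
    # Sort by page number so the pages come out in reading order.
    return [name for _, name in sorted(selected)]
```

For instance, `select_pages(os.listdir("tests"), 2, 3)` would keep only the files numbered 2 and 3 in the `tests` directory, both bounds included.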

For example, running the command:

pipenv run python -m ner_seg -i tests -o -

will stream the XML to standard output:

<ACT>pot, broches de cuisinières, et tout
ce qui concerne cette partie</ACT>, <LOC>Tem-
ple</LOC>, <CARDINAL>69</CARDINAL>.</ENTRY>
<ENTRY><PER>Caron (P.</PER>), <ACT>ingénieur-mécanicien</ACT>,
ci-devant <LOC>Faub. -St-Martin</LOC>, <CARDINAL>147</CARDINAL>,
...
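
To give an idea of how this output might be consumed, here is a minimal Python sketch that pulls the tagged spans out of a fragment like the one above. The function name `extract_entities` is hypothetical, and the tag list is limited to the tags visible in this sample; the tool may emit others. A regular expression is used rather than an XML parser because the streamed fragment shown here has no single root element.

```python
import re

# Assumption: only the entity tags visible in the sample output.
ENTITY_PATTERN = re.compile(r"<(PER|ACT|LOC|CARDINAL)>(.*?)</\1>", re.DOTALL)


def extract_entities(fragment):
    """Return (tag, text) pairs in document order.

    `extract_entities` is a hypothetical helper. Line breaks inside an
    entity are collapsed to single spaces; hyphenated line breaks are
    left as they are.
    """
    entities = []
    for match in ENTITY_PATTERN.finditer(fragment):
        tag, text = match.group(1), match.group(2)
        entities.append((tag, " ".join(text.split())))
    return entities


sample = (
    "<ENTRY><PER>Caron (P.</PER>), <ACT>ingénieur-mécanicien</ACT>,\n"
    "ci-devant <LOC>Faub. -St-Martin</LOC>, <CARDINAL>147</CARDINAL>,"
)
entities = extract_entities(sample)
```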

Docker

To generate a Docker image ready to run on any Linux platform and use it, follow the process below.

  1. Build the image:
    docker build -t soduco/processor-ner .
  2. (Optional) Export the image from the build machine and import it on the processing machines:
    docker image save soduco/processor-ner | pigz > soduco-processor-ner.tar.gz
    Copy the archive to the target machine, then load it:
    docker image load < soduco-processor-ner.tar.gz
  3. Create a container
  4. Launch the process, using bind-mounted input and output directories:
    docker run --name soduco-processor-ner --rm -it -v /work/soduco/202308-reprocess/input/DIR:/input:ro -v /work/soduco/202308-reprocess/output:/output soduco/processor-ner -o /output/DIR-RLO:RHI.xml -f RLO -u RHI
    Here DIR, RLO and RHI stand for the directory name and the lower and upper bounds of the range. Note that you only need to provide --output-file, --range-from and --range-upto on the command line.
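
When several directories or ranges have to be processed, the launch command can be assembled programmatically. The following Python sketch only builds the argument list; the function name `build_docker_command` and its parameters are hypothetical. It mirrors the command above except for `-it`, which is dropped on the assumption that the script runs without a terminal.

```python
def build_docker_command(input_dir, output_dir, dir_name, range_from, range_upto):
    """Build the `docker run` argument list for one directory and range.

    Hypothetical helper: the mount points and the output file naming
    follow the command shown in step 4.
    """
    return [
        "docker", "run",
        "--name", "soduco-processor-ner",
        "--rm",
        "-v", f"{input_dir}:/input:ro",   # input is mounted read-only
        "-v", f"{output_dir}:/output",
        "soduco/processor-ner",
        "-o", f"/output/{dir_name}-{range_from}:{range_upto}.xml",
        "-f", str(range_from),
        "-u", str(range_upto),
    ]


command = build_docker_command("/data/in/DIR", "/data/out", "DIR", 1, 50)
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where the image has been loaded.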
