
Alignment: Usage

Isaac Schifferer edited this page Nov 18, 2023 · 2 revisions

Aligning texts

An alignment experiment folder needs only src.txt and trg.txt files to run. A config file is generated automatically for the experiment, but you can still create one manually to customize the alignment.
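
As a minimal sketch of the folder layout, the helper below creates an experiment folder that contains only src.txt and trg.txt. The helper name `create_experiment`, the experiment name, and the sample sentences are hypothetical. The folder structure follows the SIL_NLP_DATA_PATH > Alignment > experiments path given on this page, and the one-segment-per-line file format is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Sequence


def create_experiment(data_root: Path, name: str, src: Sequence[str], trg: Sequence[str]) -> Path:
    """Create <data_root>/Alignment/experiments/<name> with src.txt and trg.txt."""
    # Assumption: the two files are parallel, so line i of src.txt pairs with line i of trg.txt.
    if len(src) != len(trg):
        raise ValueError("src and trg must contain the same number of lines")
    exp_dir = data_root / "Alignment" / "experiments" / name
    exp_dir.mkdir(parents=True, exist_ok=True)
    (exp_dir / "src.txt").write_text("\n".join(src) + "\n", encoding="utf-8")
    (exp_dir / "trg.txt").write_text("\n".join(trg) + "\n", encoding="utf-8")
    return exp_dir


if __name__ == "__main__":
    # A temporary directory stands in for the real data root here.
    with tempfile.TemporaryDirectory() as tmp:
        exp = create_experiment(Path(tmp), "demo", ["In the beginning"], ["Au commencement"])
        print(sorted(p.name for p in exp.iterdir()))
```

In real use, `data_root` would be the directory that SIL_NLP_DATA_PATH refers to rather than a temporary folder.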

align

Aligns the parallel corpora for the designated experiments.

usage: python -m silnlp.alignment.align [-h] [--aligners [aligner [aligner ...]]]
[--skip-align] [--skip-extract-lexicon]
experiments

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiments | Experiment pattern | The pattern of the experiment subfolders where the configuration files will be generated. The subfolders must be located in the SIL_NLP_DATA_PATH > Alignment > experiments folder. |
| --aligners [aligner [aligner ...]] | List of aligners | List of aligners to use to align each corpus. |
| --skip-align | Skip aligning corpora | Skip aligning corpora. |
| --skip-extract-lexicon | Skip extracting lexicons | Skip extracting lexicons. |
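
The following sketch assembles an argument list for the align command without running it. The function name `build_align_command` is hypothetical, and the aligner name "fast_align" is borrowed from the bulk_align default on this page; whether align accepts it is an assumption. Because --aligners takes a variable number of values, the sketch places the experiments argument first so that the aligner list cannot absorb it.

```python
import sys
from typing import List, Optional, Sequence


def build_align_command(
    experiments: str,
    aligners: Optional[Sequence[str]] = None,
    skip_align: bool = False,
    skip_extract_lexicon: bool = False,
) -> List[str]:
    """Return an argv list for silnlp.alignment.align (nothing is executed)."""
    # The positional argument comes first; the variable-length --aligners list goes last.
    cmd = [sys.executable, "-m", "silnlp.alignment.align", experiments]
    if skip_align:
        cmd.append("--skip-align")
    if skip_extract_lexicon:
        cmd.append("--skip-extract-lexicon")
    if aligners:
        cmd += ["--aligners", *aligners]
    return cmd


if __name__ == "__main__":
    print(build_align_command("demo", aligners=["fast_align"]))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` in an environment where silnlp is installed.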

bulk_align

Aligns a source Bible to a defined set of Bibles.

usage: python -m silnlp.alignment.bulk_align [-h] src_path trg_dir
output_dir [--aligner ALIGNER] [--multiprocess]

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| src_path | Path to source Bible text | Path to source Bible text. |
| trg_dir | Folder of Bibles to align to | Folder of Bibles to align to. |
| output_dir | Folder to contain Bible alignments | Folder to contain Bible alignments. |
| --aligner ALIGNER | Aligner to use | Aligner to use for extraction. Default is "fast_align". |
| --multiprocess | Use multiple processes | Use multiple processes, if the chosen alignment algorithm does not already do so. |
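
Since bulk_align takes three paths, it can help to check them before launching a long run. The sketch below is one hedged way to do that; `build_bulk_align_command` is a hypothetical helper, and creating the output folder ahead of time is an assumption made for convenience, not documented behavior of the tool.

```python
import sys
from pathlib import Path
from typing import List


def build_bulk_align_command(
    src_path: Path,
    trg_dir: Path,
    output_dir: Path,
    aligner: str = "fast_align",  # default stated in the argument table
    multiprocess: bool = False,
) -> List[str]:
    """Validate the paths, then return an argv list for silnlp.alignment.bulk_align."""
    if not src_path.is_file():
        raise FileNotFoundError(f"source Bible text not found: {src_path}")
    if not trg_dir.is_dir():
        raise NotADirectoryError(f"target folder not found: {trg_dir}")
    # Assumption: preparing the output folder in advance is harmless.
    output_dir.mkdir(parents=True, exist_ok=True)
    cmd = [
        sys.executable, "-m", "silnlp.alignment.bulk_align",
        str(src_path), str(trg_dir), str(output_dir),
        "--aligner", aligner,
    ]
    if multiprocess:
        cmd.append("--multiprocess")
    return cmd
```

The function only builds the command; running it is left to the caller.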

test

Tests generated alignments against gold standard alignments.

usage: python -m silnlp.alignment.test [-h] [--combine-pattern PATTERN]
[--test-size SIZE] [--books [book [book ...]]] [--by-book]
experiments

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiment | Experiment name | The name of the experiment to test. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > Alignment > experiments folder. |
| --combine-pattern PATTERN | Combine pattern | Combine pattern. |
| --test-size SIZE | Test size | Set the number of verse alignments to test. If the test size is smaller than the total number of verses, the verses tested will be selected randomly. |
| --books [book [book ...]] | Books to score | Specifies one or more books to be scored. When this option is used, the test tool will generate predictions for the entire target language test set, but provide a score only for the specified book(s). Books must be specified using the 3-character abbreviations from the USFM 3.0 standard (e.g., "GEN" for Genesis). |
| --by-book | Score individual books | In addition to providing an overall score for all the books in the test set, provide individual scores for each book in the test set. If this option is used in combination with the --books option, individual scores are provided for each of the specified books. |
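
To illustrate how --books and --by-book interact, here is a small sketch that reports an overall score plus per-book scores, optionally restricted to a set of books. It is not the test tool's implementation: the function name `score_by_book`, the "GEN 1:1" verse reference format, the per-verse scores, and the use of a plain mean are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Optional, Tuple


def score_by_book(
    verse_scores: Dict[str, float],
    books: Optional[Iterable[str]] = None,
) -> Tuple[float, Dict[str, float]]:
    """Return (overall score, per-book scores), restricted to `books` when given."""
    wanted = {b.upper() for b in books} if books else None
    per_book: Dict[str, List[float]] = defaultdict(list)
    for vref, score in verse_scores.items():
        book = vref.split()[0].upper()  # assumed format: "GEN 1:1" -> "GEN"
        if wanted is None or book in wanted:
            per_book[book].append(score)
    if not per_book:
        raise ValueError("no verses match the requested books")
    overall = mean(s for scores in per_book.values() for s in scores)
    return overall, {book: mean(scores) for book, scores in per_book.items()}


if __name__ == "__main__":
    scores = {"GEN 1:1": 0.5, "GEN 1:2": 1.0, "EXO 1:1": 0.0}  # made-up values
    print(score_by_book(scores))
    print(score_by_book(scores, books=["GEN"]))
```

Every verse is scored in both calls; the books filter only changes which scores are reported.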

Miscellaneous commands

preprocess

Preprocesses Clear gold standard alignments.

usage: python -m silnlp.alignment.preprocess [-h] experiments

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiments | Experiment pattern | The pattern of the experiment subfolders where the configuration files will be generated. The subfolders must be located in the SIL_NLP_DATA_PATH > Alignment > experiments folder. |
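
An experiment pattern selects one or more subfolders of the experiments folder. The sketch below shows one plausible reading, in which the pattern is treated as a glob. That interpretation and the helper name `resolve_experiments` are assumptions; the tools may match patterns differently.

```python
from pathlib import Path
from typing import List


def resolve_experiments(data_root: Path, pattern: str) -> List[Path]:
    """List experiment subfolders under <data_root>/Alignment/experiments matching a glob."""
    exp_root = data_root / "Alignment" / "experiments"
    # Only folders count as experiments; stray files that match the pattern are ignored.
    return sorted(p for p in exp_root.glob(pattern) if p.is_dir())
```

For example, with subfolders named en-fr, en-de, and es-pt, the pattern "en-*" would select the first two under this interpretation.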

generate_clear_models

Generates a translation model for Clear from an alignment model.

usage: python -m silnlp.alignment.generate_clear_models [-h] --aligner ALIGNER --output PATH experiments

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiments | Experiment pattern | The pattern of the experiment subfolders where the configuration files will be generated. The subfolders must be located in the SIL_NLP_DATA_PATH > Alignment > experiments folder. |
| --aligner ALIGNER | Aligner | Aligner to use. |
| --output PATH | Output directory | Output directory. |
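
Unlike the other commands, this one lists --aligner and --output without brackets in its usage line, which means both are required. The stand-in parser below mirrors that usage line to show the effect; it is a sketch for illustration, not the silnlp source, and the names `make_parser` and `parse` are hypothetical.

```python
import argparse
from typing import List, Optional


def make_parser() -> argparse.ArgumentParser:
    # Stand-in parser that mirrors the documented usage line.
    parser = argparse.ArgumentParser(prog="generate_clear_models")
    parser.add_argument("experiments", help="Experiment pattern")
    parser.add_argument("--aligner", required=True, metavar="ALIGNER", help="Aligner to use")
    parser.add_argument("--output", required=True, metavar="PATH", help="Output directory")
    return parser


def parse(argv: List[str]) -> Optional[argparse.Namespace]:
    """Return the parsed arguments, or None when a required argument is missing."""
    try:
        return make_parser().parse_args(argv)
    except SystemExit:
        # argparse prints a usage message and exits when parsing fails.
        return None


if __name__ == "__main__":
    print(parse(["demo", "--aligner", "fast_align", "--output", "out"]))
    print(parse(["demo"]))  # both required options are missing
```

The aligner name "fast_align" and the output folder "out" are placeholder values.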

test_size

Finds the optimal size for a gold standard.

usage: python -m silnlp.alignment.test_size [-h] [--threshold THRESHOLD]
[--test-size SIZE] [--books [book [book ...]]] experiments

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiments | Experiment pattern | The pattern of the experiment subfolders where the configuration files will be generated. The subfolders must be located in the SIL_NLP_DATA_PATH > Alignment > experiments folder. |
| --threshold THRESHOLD | Similarity threshold | Similarity threshold. |
| --test-size SIZE | Test size | Set the number of verse alignments to test. If the test size is smaller than the total number of verses, the verses tested will be selected randomly. |
| --books [book [book ...]] | Books to score | Specifies one or more books to be scored. When this option is used, the test tool will generate predictions for the entire target language test set, but provide a score only for the specified book(s). Books must be specified using the 3-character abbreviations from the USFM 3.0 standard (e.g., "GEN" for Genesis). |
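
The random selection behind --test-size can be pictured with the sketch below. It assumes that sampling only takes place when fewer verses are requested than are available, since a larger sample cannot be drawn without replacement. The function name `select_test_verses`, the fixed seed, and the verse reference format are hypothetical and not taken from the tool.

```python
import random
from typing import List, Optional, Sequence


def select_test_verses(vrefs: Sequence[str], test_size: Optional[int] = None, seed: int = 0) -> List[str]:
    """Pick `test_size` verses at random, keeping them in their original order."""
    if test_size is not None and test_size < 0:
        raise ValueError("test_size must be non-negative")
    if test_size is None or test_size >= len(vrefs):
        return list(vrefs)  # nothing to sample: use every verse
    rng = random.Random(seed)  # seeded so that repeated runs pick the same verses
    chosen = set(rng.sample(range(len(vrefs)), test_size))
    return [v for i, v in enumerate(vrefs) if i in chosen]


if __name__ == "__main__":
    verses = [f"GEN 1:{i}" for i in range(1, 11)]
    print(select_test_verses(verses, test_size=3))
```

Seeding the generator is a design choice for reproducibility; whether the tool does the same is not stated on this page.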

visualize_similarity

Visualizes the similarity of languages/projects.

usage: python -m silnlp.alignment.visualize_similarity [-h] --corpus PATH --metadata PATH
--scores PATH [--image PATH] [--country COUNTRY] [--family FAMILY]
[--aligner ALIGNER] [--recompute] [--graph-type TYPE]
[--data-type TYPE] [--threshold THRESHOLD]

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| --corpus PATH | The corpus folder | The corpus folder. |
| --metadata PATH | The metadata file | The metadata file. |
| --scores PATH | The similarity scores file | The similarity scores file. |
| --image PATH | The image file | The image file. |
| --country COUNTRY | The country to include | The country to include. |
| --family FAMILY | The language family to include | The language family to include. |
| --aligner ALIGNER | The alignment model | The alignment model. |
| --recompute | Recompute similarity scores | Recompute similarity scores. |
| --graph-type TYPE | Type of graph | Type of graph. Can be "tree" or "network". |
| --data-type TYPE | Type of data | Type of data. Can be "language" or "project". |
| --threshold THRESHOLD | Similarity threshold | Similarity threshold. |
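
With this many options, it can be convenient to keep them in one object and check the restricted values before building the command. The sketch below does that for a subset of the options (--country, --family, and --aligner are left out for brevity). The class name `VisualizeOptions` and the example file names are hypothetical; the allowed values for the graph and data types come from the argument list above.

```python
import sys
from dataclasses import dataclass
from typing import List, Optional

GRAPH_TYPES = ("tree", "network")
DATA_TYPES = ("language", "project")


@dataclass
class VisualizeOptions:
    corpus: str
    metadata: str
    scores: str
    image: Optional[str] = None
    graph_type: Optional[str] = None
    data_type: Optional[str] = None
    threshold: Optional[float] = None
    recompute: bool = False

    def to_argv(self) -> List[str]:
        """Validate restricted values and return an argv list (nothing is executed)."""
        if self.graph_type is not None and self.graph_type not in GRAPH_TYPES:
            raise ValueError(f"graph type must be one of {GRAPH_TYPES}")
        if self.data_type is not None and self.data_type not in DATA_TYPES:
            raise ValueError(f"data type must be one of {DATA_TYPES}")
        argv = [
            sys.executable, "-m", "silnlp.alignment.visualize_similarity",
            "--corpus", self.corpus,
            "--metadata", self.metadata,
            "--scores", self.scores,
        ]
        optional = {
            "--image": self.image,
            "--graph-type": self.graph_type,
            "--data-type": self.data_type,
            "--threshold": self.threshold,
        }
        for flag, value in optional.items():
            if value is not None:  # unset options are simply omitted
                argv += [flag, str(value)]
        if self.recompute:
            argv.append("--recompute")
        return argv


if __name__ == "__main__":
    opts = VisualizeOptions("corpus", "metadata.csv", "scores.csv", graph_type="network", threshold=0.5)
    print(opts.to_argv())
```

Rejecting an unknown graph or data type up front avoids starting a run that would fail on argument parsing.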