GitHub - ai-ku/fastsubs: Generate most likely substitutes for words in a given text based on an n-gram language model.

ai-ku / fastsubs Public

Notifications You must be signed in to change notification settings
Fork 3
Star 10

Generate most likely substitutes for words in a given text based on an n-gram language model.

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 109 Commits
.gitignore		.gitignore
ChangeLog		ChangeLog
LICENSE		LICENSE
Makefile		Makefile
README		README
dlib.c		dlib.c
dlib.h		dlib.h
fastsubs-main.c		fastsubs-main.c
fastsubs-omp.c		fastsubs-omp.c
fastsubs-test.c		fastsubs-test.c
fastsubs.c		fastsubs.c
fastsubs.h		fastsubs.h
heap.c		heap.c
heap.h		heap.h
lm-test.c		lm-test.c
lm.c		lm.c
lm.h		lm.h
lmheap-test.c		lmheap-test.c
ngram.c		ngram.c
ngram.h		ngram.h
normalize-subs.pl		normalize-subs.pl
sentence-test.c		sentence-test.c
sentence.c		sentence.c
sentence.h		sentence.h
token.h		token.h
wordsub-n.c		wordsub-n.c
wordsub.c		wordsub.c

Repository files navigation

FASTSUBS		            Copyright (c) 2012-2014 Deniz Yuret

Usage: fastsubs [-n <n> | -p <p>] model.lm[.gz] < input.txt

Fastsubs is a program that finds the most likely substitutes for words
in input.txt using the language model in model.lm (ARPA format).  The
number of substitutes is controlled by two parameters: -n specifies
the number of substitutes for each word, -p specifies a threshold on
the sum of the substitute probabilities.  The program stops generating
substitutes for a word when either limit is satisfied.  The output has
an input word and its substitutes with the logarithm (base 10) of
their unnormalized probabilities in the following format:

word <tab> sub1 <space> lprob1 <tab> sub2 <space> lprob2 <tab> ...

Please see the file LICENSE for terms of use.  A paper with the
description of the algorithm and other related resources can be found
at http://goo.gl/jzKH0.  The code is mostly C99 compliant and has been
compiled and tested on a linux system with gcc.  The few GNU
extensions used are optional and can be turned off by editing the
beginning of dlib.h.

Update of Jan 9, 2014: The latest version gets rid of all glib
dependencies (glib is broken, it blows up your code without warning if
your arrays or hashes get too big).  It also fixes a bug: The previous
versions of the code and the paper assumed that the log back-off
weights were upper bounded by 0 which they are not.  In my standard
test of generating the top 100 substitutes for all the 1,222,974
positions in the PTB, this caused a total of 66 low probability
substitutes to be missed and 31 to be listed out of order.  Finally a
multithreaded version, fastsubs-omp is implemented.  The number of
threads can be controlled using the environment variable
OMP_NUM_THREADS.

fastsubs-omp can be used with a few additional run-time options:

Usage: fastsubs-omp [-n <n> | -p <p> | -m <max-threads> | -t | -z ] model.lm[.gz] < input.txt

-m specifies the maximum number of threads that can be used
-z specifies that the output probability values would be normalized probabilities that sum up to 1 (instead of the default unnormalized log probabilities)
-t specifies a one-target-per-text-line input format (instead of the default free text format). 
In this format every input line is: 
target_name <tab> target_id <tab> target_index <tab> text_line 
Then for every input line fastsubs-omp outputs two lines:
1) the input line
2) substitutes for the text_line[target_index] (i.e. the word in the target_index position in text_line) in the format:
sub1 <space> lprob1 <tab> sub2 <space> lprob2 <tab> ...


Other test and utility executables that can be built using the
Makefile:

* fastsubs-omp: Multithreaded version.

* wordsub: Takes the output of fastsubs and samples random substitutes
  for each word.

* normalize-subs.pl: Takes fastsubs output and converts the
  unnormalized log10(p) entries to normalized p entries that add up to
  1.0 (which may not be accurate if fastsubs did not output the whole
  vocabulary.)

* fastsubs-test: Takes a model file and interactively runs fastsubs on
  sentences typed by the user.

* lmheap-test: Takes a model file and tests the heaps in the lm data
  structure by interactively reading n-grams from the user and
  printing out the contents of the heap for each position.

* sentence-test: Takes a model file and tests the sentence and lm data
  structures by reading sentences from the user and printing out logp
  for each word.

* lm-test: Takes a model file and tests the lm data structure by
  reading n-grams from the user and printing out their logp and
  back-off weights.