
PmatchContainer

eaxelson edited this page May 14, 2018 · 17 revisions

class PmatchContainer

A class for performing pattern matching.

Probably the easiest way to perform pattern matching is with the functions hfst.compile_pmatch_expression and hfst.compile_pmatch_file.
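As a minimal sketch of that route, the helper below compiles a pmatch source file and wraps the result in a PmatchContainer. The name build_container is hypothetical, and the sketch assumes that the transducers returned by hfst.compile_pmatch_file can be passed to the constructor as they are, without first being written to a file.

```python
def build_container(pmatch_source_path):
    """Compile a pmatch source file and wrap the result in a PmatchContainer."""
    # Imported lazily so that this sketch can be loaded without HFST installed.
    import hfst

    # Assumption: the compiled definitions can be handed to the constructor directly.
    defs = hfst.compile_pmatch_file(pmatch_source_path)
    return hfst.PmatchContainer(defs)
```

With a file such as the streets.txt shown further down this page, build_container('streets.txt') would give a container whose match method can be called on input strings.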


__init__ (self)

Initialize an empty PmatchContainer. (It is an open question whether this constructor is needed.)


__init__ (self, defs)

Create a PmatchContainer based on definitions defs.

  • defs: A tuple of transducers of type HFST_OLW_TYPE that define how pattern matching is performed.

An example:

If we have a file named streets.txt that contains:

define CapWord UppercaseAlpha Alpha* ;
define StreetWordFr [{avenue} | {boulevard} | {rue}] ;
define DeFr [ [{de} | {du} | {des} | {de la}] Whitespace ] | [{d'} | {l'}] ;
define StreetFr StreetWordFr (Whitespace DeFr) CapWord+ ;
regex StreetFr EndTag(FrenchStreetName) ;

and which has already been compiled and stored in the file streets.pmatch.hfst.ol with:

import hfst

# Compile the pmatch definitions and write each transducer to the output file.
defs = hfst.compile_pmatch_file('streets.txt')
ostr = hfst.HfstOutputStream(filename='streets.pmatch.hfst.ol', type=hfst.ImplementationType.HFST_OLW_TYPE)
for tr in defs:
    ostr.write(tr)
ostr.close()

we can read the pmatch definitions from the file and perform string matching with:

import hfst

# Read every transducer from the file, then build the container from them.
istr = hfst.HfstInputStream('streets.pmatch.hfst.ol')
defs = []
while not istr.is_eof():
    defs.append(istr.read())
istr.close()
cont = hfst.PmatchContainer(defs)
assert cont.match("Je marche seul dans l'avenue des Ternes.") == "Je marche seul dans l'<FrenchStreetName>avenue des Ternes</FrenchStreetName>."

See also: hfst.compile_pmatch_file, hfst.compile_pmatch_expression


match (self, input, time_cutoff = 0)

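Match the string input and return the result as a string in which the matched parts are tagged, as in the example above. The time_cutoff parameter defaults to 0.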
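The string returned by match can be post-processed with ordinary string tools. The sketch below pulls tagged spans out of a result such as the one in the streets example above. The helper extract_tagged_spans is hypothetical and not part of HFST, and it assumes that tag names consist of word characters and that tags are not nested.

```python
import re

# Matches <Tag>text</Tag> pairs such as those produced by EndTag(Tag).
# Assumption: tag names are word characters and tags are never nested.
TAGGED_SPAN = re.compile(r"<(\w+)>(.*?)</\1>")


def extract_tagged_spans(matched):
    """Return (tag, text) pairs found in a string returned by match()."""
    return [(m.group(1), m.group(2)) for m in TAGGED_SPAN.finditer(matched)]


output = "Je marche seul dans l'<FrenchStreetName>avenue des Ternes</FrenchStreetName>."
print(extract_tagged_spans(output))
# prints [('FrenchStreetName', 'avenue des Ternes')]
```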


get_profiling_info (self)

todo


set_verbose (self, b)

todo


set_extract_tags_mode (self, b)

todo


set_profile (self, b)

todo


locate(self, input, time_cutoff, weight_cutoff)

Return the locations of the pmatched strings in the string input, with the results limited by time_cutoff and weight_cutoff.

  • input : The input string.
  • time_cutoff : Time cutoff, defaults to zero, i.e. no cutoff.
  • weight_cutoff : Weight cutoff, defaults to infinity, i.e. no cutoff.

Returns: A tuple of tuples of Location.
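The nested return value can be flattened for inspection. The sketch below is one possible way to do that. The helper summarize_locations is hypothetical, and the attribute names it reads (start, length, input, output, tag, weight) are assumptions about the Location class that this page does not document; a missing attribute is reported as None.

```python
def summarize_locations(results):
    """Flatten a locate() result into a list of plain dictionaries."""
    # Attribute names are assumptions about Location, not taken from this page.
    fields = ("start", "length", "input", "output", "tag", "weight")
    rows = []
    for group in results:      # outer tuple: one inner tuple per group of results
        for loc in group:      # inner tuple: the Location objects in that group
            rows.append({name: getattr(loc, name, None) for name in fields})
    return rows
```

Given a container cont, summarize_locations(cont.locate(text)) would then give one dictionary per Location, which is convenient for printing or filtering by tag.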


tokenize(self, input)

Tokenize input and return a list of tokens, i.e. strings.

  • input: The string to be tokenized.

get_tokenized_output(self, input, **kwargs)

Tokenize input and get a string representation of the tokenization (essentially the same output that the command line tool hfst-tokenize would give).

  • input: The input string to be tokenized.
  • kwargs: Possible parameters are: output_format, max_weight_classes, dedupe, print_weights, print_all, time_cutoff, verbose, beam, tokenize_multichar.
  • output_format: The format of output; possible values are tokenize, xerox, cg, finnpos, giellacg, conllu and visl; tokenize being the default.
  • max_weight_classes: Maximum number of best weight classes to output (where analyses with equal weight constitute a class), defaults to None, i.e. no limit.
  • dedupe: Whether duplicate analyses are removed, defaults to False.
  • print_weights: Whether weights are printed, defaults to False.
  • print_all: Whether nonmatching text is printed, defaults to False.
  • time_cutoff: Number of seconds that may be used per input before the search is limited.
  • verbose: Whether input is processed verbosely, defaults to True.
  • beam: Beam within which analyses must be to get printed.
  • tokenize_multichar: Whether input is tokenized into multicharacter symbols present in the transducer, defaults to False.
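
Because the options are passed as free-form keyword arguments, it can help to check them before the call. The sketch below validates option names and the output format against the lists given above. The helper tokenizer_options is hypothetical and not part of HFST, and the commented-out call assumes a PmatchContainer named cont already exists.

```python
# Output formats listed in the parameter description above.
OUTPUT_FORMATS = ("tokenize", "xerox", "cg", "finnpos", "giellacg", "conllu", "visl")

# Keyword names listed in the parameter description above.
KNOWN_OPTIONS = {
    "output_format", "max_weight_classes", "dedupe", "print_weights",
    "print_all", "time_cutoff", "verbose", "beam", "tokenize_multichar",
}


def tokenizer_options(**kwargs):
    """Check keyword arguments intended for get_tokenized_output."""
    unknown = set(kwargs) - KNOWN_OPTIONS
    if unknown:
        raise ValueError("unknown option(s): " + ", ".join(sorted(unknown)))
    fmt = kwargs.get("output_format", "tokenize")
    if fmt not in OUTPUT_FORMATS:
        raise ValueError("unsupported output_format: " + str(fmt))
    return dict(kwargs)


options = tokenizer_options(output_format="cg", print_weights=True, verbose=False)
# Assuming a PmatchContainer named cont:
# text = cont.get_tokenized_output("Je marche seul.", **options)
```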