A pure Python 2.7+ implementation of a PATRICIA trie for effcient matching of string collections on text.
Note that you probably first want to have a look at the Python wrapper marisa-trie or its PyPi package before using particia-trie; according to simple timeit comparisons, these wrappers for the C-based MARISA library are about twice as fast as this pure Python implementation.
patricia-trie does have its merits, however - it is small, clear, and has a very clean interface that imitates the dict API and works with Py3k.
pip install patricia-trie
>>> T = trie('root', key='value', king='kong') # a root value and two pairs >>> T['four'] = None # setting new values as in a dict >>> '' in T # check if the value exits (note: the [empty] root is '') True >>> 'kong' in T # existence checks as in a dict False >>> T['king'] # get the value for an exact key ... as in a dict 'kong' >>> T['kong'] # error from non-existing keys (as in a dict) Traceback (most recent call last): ... KeyError: 'kong' >>> len(T) # count keys ("terminals") in the tree 4 >>> sorted(T) # plus "traditional stuff": .keys(), .values(), and .items() ['', 'four', 'key', 'king'] >>> # scanning a string S with key(S), value(S), and item(S): >>> S = 'keys and kewl stuff' >>> T.key(S) # report the (longest) key that is a prefix of S 'key' >>> T.value(S, 9) # using offsets; NB: a root value always matches! 'root' >>> del T[''] # interlude: deleting keys (here, the root) >>> T.item(S, 9) # raise error if no key is a prefix of S Traceback (most recent call last): ... KeyError: 'k' >>> # info: the error string above contains the matched path so far >>> T.item(S, 9, default=None) # avoid the error by specifying a default (None, None) >>> # iterate all matching content with keys(S), values(S), and items(S): >>> list(T.items(S)) [('key', 'value')] >>> T.isPrefix('k') # reverse lookup: check if S is a prefix of any key True >>> T.isPrefix('kong') False >>> sorted(T.iter('k')) # and get all keys that have S as prefix ['key', 'king']
Deleting entries is a "half-supported" operation only. The key appears "removed", but the trie is not actually changed, only the node state is changed from terminal to non-terminal. I.e., if you frequently delete keys, the compaction will become fragmented and less efficient. To mitigate this effect, make a copy of the trie (using a copy constructor idiom):
T = trie(**T)
If you are only interested in scanning for the presence of keys, but do not
care about mapping a value to each key, using None
as the value of your
keys and scanning with key(S, None, start=i)
at every offset i
in the
string S
is perfectly fine (because the return value will be the key
string iff a full match was made and None
otherwise):
>>> T = trie(present=None) >>> T.key('is absent here', None, start=3) # start scanning at offset 3 >>> T.key('is present here', None, start=3) # start scanning at offset 3 'present'
- trie(
*value
,**branch
) - Create a new tree node.Any arguments will be used as the
value
of this node.If keyword arguments are given, they initialize a wholebranch
.Note that None is a valid value for a node. - trie.isPrefix(
prefix
) - Return True if any key starts with
prefix
. - trie.item(
string
,start=0
,end=None
,default=NULL
) - Return the key, value pair of the longest key that is a prefix of
string
(beginning atstart
and ending atend
).If no key matches, raise a KeyError or return the None,default
pair if anydefault
value was set. - trie.items([
string
[,start
[,end
]]]) - Return all key, value pairs (for keys that are a prefix of
string
(beginning atstart
(and terminating beforeend
))). - trie.iter(
prefix
) - Return an iterator over all keys that start with
prefix
. - trie.key(
string
,start=0
,end=None
,default=NULL
) - Return the longest key that is a prefix of
string
(beginning atstart
and ending atend
).If no key matches, raise a KeyError or return thedefault
value if it was set. - trie.keys([
string
[,start
[,end
]]]) - Return all keys (that are a prefix of
string
(beginning atstart
(and terminating beforeend
))). - trie.value(
string
,start=0
,end=None
,default=NULL
) - Return the value of the longest key that is a prefix of
string
(beginning atstart
and ending atend
).If no key matches, raise a KeyError or return thedefault
value if it was set. - trie.values([
string
[,start
[,end
]]]) - Return all values (for keys that are a prefix of
string
(beginning atstart
(and terminating beforeend
))).
Initial release.
Update: Full documentation and corrections.
Feature: optional keyword parameters to indicate an offset
start
when scanning a string with the methods key(), keys(), item(), items(), value(), and values(), so it is not necessary to slice strings for each scan:>>> # Old usage to scan 'string' in 'the string' was: >>> T.keys('the string'[4:]) >>> # With the new optional keyword parameter: >>> T.keys('the string', start=4)
Important API change: item() now returns key, value pairs even when a default value is given, using
None
as the "key":>>> # Old behaviour was: >>> T.item('string', default=False) False >>> # While now, the same call produces: >>> T.item('string', default=False) None, False
Improvement: Switched from using dictionaries to two-tuple lists internally (thanks to Pedro Gaio for the suggestion!) to improve the overall performance a bit (about 20% faster on simple tests).
Bugfix: When splitting edges while adding a new key that is shorter than the current edge, a index error would have occurred.
Feature: Added optional keyword parameter
end
to the methods key(), keys(), item(), items(), value(), and values(), so it is not necessary to scan within a window:T.key('string', start=2, end=3, default=None) T.keys('string', start=2, end=3)
Improvement: Switched back to a very efficient internal dictionary implementation; Runs about two- to three times as fast as the two-tuple list from update 4 against the simple (and newly added)
time_patricia.py
"benchmark".Bugfix: Correct behavior when using a negative start index. Added a comparison to marisa-trie - by now, it seems, patricia-trie is roughly only a factor two slower than the marisa-trie PyPI version wrapping a C library. Also makes it nice to compare the two usages.
Bugfix (15/09/2014): Correct behaviour when using an exactly matching prefix as query (issue described in #1 by @zachrahan). Also fixes code-smells (PEP8, code complexity) and a failing test case code.
Bugfix (14/12/2014): Added the missing README to PyPI package. (MANIFEST.in)
Copyright 2013, Florian Leitner. All rights reserved.