The HashSplitter plugin is an N-gram-like tokenizer that generates non-overlapping, prefixed tokens.
In order to install the plugin, simply run:
bin/plugin -install yakaz/elasticsearch-analysis-hashsplitter/0.2.0
--------------------------------------------------
| HashSplitter Analysis Plugin | ElasticSearch   |
--------------------------------------------------
| master                       | 0.19 -> master  |
--------------------------------------------------
| 0.2.0                        | 0.19 -> master  |
--------------------------------------------------
| 0.1.0                        | 0.19 -> master  |
--------------------------------------------------
It supports a wide variety of requests such as:
- exact match
- query by analyzed (prefixed) terms
- wildcard query
- range query
- prefix query
Here's a concrete example of the analysis performed:
chunk_length: 4
prefixes: ABCDEFGH
input: d41d8cd98f00b204e9800998ecf8427e
output:
- Ad41d
- B8cd9
- C8f00
- Db204
- Ee980
- F0998
- Gecf8
- H427e
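If you configure an analyzer with these settings (for instance the your_hash_analyzer from the sample configuration further below), you can reproduce this output with the standard Analyze API; a minimal sketch, assuming a local node and an existing index named your_index:

curl -XGET 'localhost:9200/your_index/_analyze?analyzer=your_hash_analyzer&pretty=true' -d 'd41d8cd98f00b204e9800998ecf8427e'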
It is aimed at making hashes (or any fixed-length value that can be split into equally sized chunks) partially searchable in an efficient way, without having a plain wildcard query enumerate tons of terms. It can also help reduce the index size.
However, depending on your configuration, if you do not need wildcard searches, you may experience slightly decreased performance. See http://elasticsearch-users.115913.n3.nabble.com/Advices-indexing-MD5-or-same-kind-of-data-td2867646.html for more information.
The plugin provides:
- hashsplitter field type
- hashsplitter analyzer
- hashsplitter tokenizer
- hashsplitter token filter
- hashsplitter_term query/filter (same syntax as the regular term query/filter)
- hashsplitter_wildcard query/filter (same syntax as the regular wildcard query/filter)
The plugin also provides correct support of the hashsplitter field type for the standard:
- field query/filter (used by the term query/filter)
- prefix query/filter
- range query/filter
The plugin does not support:
- fuzzy query/filter
The plugin cannot currently support (as of ElasticSearch 0.19.0):
- term query/filter: This gets mapped to a field query by ElasticSearch. Use the hashsplitter_term query instead.
Note that a query_string query calls the field, prefix, range and fuzzy capabilities of the hashsplitter field automatically.
But make sure you actually use the hashsplitter field type and direct the query to that field (and not to the _all field, for example).
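For example, a query_string query explicitly targeted at the hash field could look like this sketch (your_hash_field being the sample field name used in the mapping below):

{
    "query_string" : {
        "default_field" : "your_hash_field",
        "query" : "d41d8cd98f00b204e9800998ecf8427e"
    }
}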
It is recommended that you use the hashsplitter field type, as this enables custom querying easily.
It is also the only way of using the field, prefix and range queries/filters.
The alternative would be to use the hashsplitter analysis on the field, and to pay extra attention to the way you query the field.
Here is a sample mapping (in config/mapping/your_index/your_mapping_type.json):
{
    "your_mapping_type" : {
        "properties" : {
            [...]
            "your_hash_field" : {
                "type" : "hashsplitter",
                "settings" : {
                    "chunk_length" : 4,
                    "prefixes" : "abcd",
                    "size" : 16,
                    "wildcard_one" : "?",
                    "wildcard_any" : "*"
                }
            },
            [...]
        }
    }
}
This will define the your_hash_field field within the your_mapping_type as having the hashsplitter type.
Notice the unusual settings section. It will be parsed by the plugin in order to configure the tokenization according to your needs.
- chunk_length: The length of the chunks generated by the analysis. The input "0123456789" with a chunk_length of 2 will be split into [01, 23, 45, 67, 89], and with a chunk_length of 3 it will be split into [012, 345, 678, 9]. Note that the last chunk can be shorter than chunk_length characters.
- prefixes: The positional prefixes to prepend to each chunk. Each individual character in the given string will be used, in turn. The chunks [000, 111, 222, 333] with "abc" prefixes will generate the following terms: [a000, b111, c222, a333]. Note how it wraps around if there are not enough prefix characters available. You want to avoid this, as it will make a000 and a333 indistinguishable.
- size: How long the input hashes are supposed to be, as an integer, or "variable". This won't prevent bad values from being analyzed at all. This information is solely used by the wildcard query/filter in order to expand *s properly.
- wildcard_one: Which character to use as a single-character wildcard. A single-character string. This may help you if the default ? is a genuine input character. It is solely used in the wildcard query/filter.
- wildcard_any: Which character to use as an any-string wildcard. A single-character string. This may help you if the default * is a genuine input character. It is solely used in the wildcard query/filter.
All parameters are optional, and so is the settings section. The default values are:
- chunk_length: 1
- prefixes: "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789,."
- size: "variable"
- wildcard_one: "?"
- wildcard_any: "*"
The hashsplitter analyzer, tokenizer and token filter will merely split the input into chunks of fixed size and prefix them. Each of them has two parameters that you will want to define in the configuration.
Here is a sample configuration (in config/elasticsearch.yml):
index:
  analysis:
    analyzer:
      your_hash_analyzer:
        type: hashsplitter
        chunk_length: 4
        prefixes: ABCDEFGH
    tokenizer:
      your_hash_tokenizer:
        type: hashsplitter
        chunk_length: 4
        prefixes: ABCDEFGH
    filter:
      your_hash_tokenfilter:
        type: hashsplitter
        chunk_length: 4
        prefixes: ABCDEFGH
This will configure an analyzer, a tokenizer and a token filter (all of them being separate).
You can then create your own custom analyzer using the newly configured tokenizer and/or token filter.
Note that this custom analyzer will have type: custom.
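For example, such custom analyzers could be declared alongside the analysis settings above in config/elasticsearch.yml; a sketch (the whitespace tokenizer is only an illustration of combining the hashsplitter token filter with another tokenizer):

index:
  analysis:
    analyzer:
      your_custom_hash_analyzer:
        type: custom
        tokenizer: your_hash_tokenizer
      your_other_hash_analyzer:
        type: custom
        tokenizer: whitespace
        filter: [your_hash_tokenfilter]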
The parameters accepted by these hashsplitter analysis components are:
- chunk_length
- prefixes
See the hashsplitter field type parameters for more information.
{
    "term" : {
        "your_hash_field" : "d41d8cd98f00b204e9800998ecf8427e"
    }
}
Note: The length is not checked. However, if your field values are always of the same fixed length and your query value has that same length too, then you're safe.
Understanding how this query works will clarify this warning a bit.
The same analysis is performed when indexing the field, and when processing this query. The searched term will get split into terms, which will be merely AND-ed together. Hence, any additional terms (a longer field value) won't prevent the match.
However, if the last term chunk is not of the correct size, no match will occur! (eg: "d41d8" would generate the query +Ad41d +B8, and B8 will never match.)
Positive side-effect: If the field length is not a multiple of the chunk length, then the match will only include same-length hashes, as a longer hash would have a longer (hence different, non matching) last term.
Do not use the default term query: contrary to what the documentation states, the provided term is analyzed, hence the provided value gets chunked, prefixed and AND-ed.
{
    "hashsplitter_term" : {
        "your_hash_field" : "H427e"
    }
}
This query allows you to match the generated terms exactly. No analysis is performed. A pure TermQuery is generated with the given field and term.
This query is the only way to specify the prefix yourself, along with the chunk value to be queried.
Note that with the default term query, contrary to what the documentation states, the provided term is analyzed, hence the provided value gets chunked and prefixed, and the pieces are AND-ed together.
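Since hashsplitter_term is also available as a filter (same syntax as the regular term filter), it can be combined with other queries; a sketch, reusing the field from the sample mapping:

{
    "filtered" : {
        "query" : { "match_all" : { } },
        "filter" : {
            "hashsplitter_term" : {
                "your_hash_field" : "H427e"
            }
        }
    }
}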
{
    "prefix" : {
        "your_hash_field" : "d41d8"
    }
}
Assuming chunk_length = 4, this will generate the query +Ad41d +PREFIX, where PREFIX is the prefix query B8*, filtered to only return terms whose size is between 2 and 5 (or equal to the remaining size, if size is fixed).
{
    "range" : {
        "your_hash_field" : {
            "from" : "d4000000000000000000000000000000",
            "include_lower" : true,
            "to" : "d4200000000000000000000000000000",
            "include_upper" : false
        }
    }
}
The generated range queries will be optimized to only query the terms at the required level, like Lucene's NumericRangeQuery does. (With the difference that in Lucene the whole term up to the cut level is included, whereas we only include a middle chunk without the previous ones.)
The lexicographical ordering of terms will be used. The prefixes used won't have any influence, but the length of the terms will. For instance the range [d400 TO d42] (both inclusive) will match d400 0000 0000 ... but not d420 0000 0000 ... (spaces added to visualize the generated chunks), because d42 sorts before d420, hence the latter is not included within the range.
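As an illustration, the [d400 TO d42] range from the previous paragraph could be written as the following sketch:

{
    "range" : {
        "your_hash_field" : {
            "from" : "d400",
            "include_lower" : true,
            "to" : "d42",
            "include_upper" : true
        }
    }
}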
{
    "hashsplitter_wildcard" : {
        "your_hash_field" : "d41?8*27e"
    }
}
Note: The ? and * wildcards must match the ones configured in the field type mapping (these are the default values).
The * wildcard is restricted to one usage per pattern, and text may appear after it if and only if the field type mapping uses a fixed size. Using * at the end is always possible and equates to a prefix query with possible ? wildcards.
This restriction arises from the fact that prefixes are used to “locate” chunks, hence all characters in the pattern must be located precisely. Using more than one * makes it impossible to deterministically perform this localization. A simple fallback will however be used: the particular case where all * wildcards match a zero-length string. But this is likely to be of no help...
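For instance, a trailing * (always allowed, as mentioned above) behaves like a prefix query on the given leading characters; a sketch:

{
    "hashsplitter_wildcard" : {
        "your_hash_field" : "d41d8cd9*"
    }
}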
As long as you query against your_hash_field (the field of type hashsplitter), the generated queries should function like the ones described above.
Sophisticated queries can often generate several of the above queries, as they use complex lexical analysis to express combinations of multiple queries (eg. "+ANDed_token ORed_token -NOT_token [from_token TO to_token]").
Note that the default wildcard query won't function in the intended way. Don't use it, not even indirectly through sophisticated queries with analyze_wildcard.