An n-gram database written in Go, optimized for write-once, read-many use.
Gloomy supports the following text formats (specified via the sourceType configuration value):
- vertical format
- plain text files
gloomy -ngram-size 3 create-index ./config.json
where config.json looks like this:
{
"inputFilePath": "/path/to/a/vertical/file",
"sourceType": "vertical",
"filterArgs": [],
"ngramIgnoreStructs": [],
"ngramStopStrings": [".", ":"],
"ngramIgnoreStrings": ["\"", ","],
"tmpDir": "/tmp/gloomy",
"procChunkSize": 1000000,
"outDirectory": "/path/to/an/output/directory",
"args": {
"doc.file": "col8",
"doc.n": "col8",
"head.type": "col8"
}
}
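The configuration above could be loaded and checked in Go roughly as follows. This is a minimal sketch, not Gloomy's own loader: the struct name IndexConf, the function parseIndexConf and the Go field types are assumptions inferred from the example values, and filterArgs is kept as raw JSON because its structure is not shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// IndexConf mirrors the keys of the example config.json.
// The Go types are assumptions inferred from the example values.
type IndexConf struct {
	InputFilePath      string            `json:"inputFilePath"`
	SourceType         string            `json:"sourceType"`
	FilterArgs         json.RawMessage   `json:"filterArgs"` // structure not shown; kept raw
	NgramIgnoreStructs []string          `json:"ngramIgnoreStructs"`
	NgramStopStrings   []string          `json:"ngramStopStrings"`
	NgramIgnoreStrings []string          `json:"ngramIgnoreStrings"`
	TmpDir             string            `json:"tmpDir"`
	ProcChunkSize      int               `json:"procChunkSize"`
	OutDirectory       string            `json:"outDirectory"`
	Args               map[string]string `json:"args"`
}

// parseIndexConf decodes a JSON document and checks that sourceType
// is one of the two documented values ("plain" or "vertical").
func parseIndexConf(data string) (*IndexConf, error) {
	var conf IndexConf
	if err := json.Unmarshal([]byte(data), &conf); err != nil {
		return nil, err
	}
	if conf.SourceType != "plain" && conf.SourceType != "vertical" {
		return nil, fmt.Errorf("unsupported sourceType %q", conf.SourceType)
	}
	return &conf, nil
}

func main() {
	conf, err := parseIndexConf(`{
		"sourceType": "vertical",
		"ngramStopStrings": [".", ":"],
		"procChunkSize": 1000000,
		"args": {"doc.file": "col8"}
	}`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(conf.SourceType, conf.ProcChunkSize, conf.NgramStopStrings, conf.Args)
}
```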
In search mode, a gloomy.conf file (by default in the working directory) is expected:
{
"dataPath": "/path/to/indices/data",
"serverPort": 8090,
"serverAddress": "127.0.0.1"
}
gloomy search corpname phrase
Start a server:
gloomy search-service
Test a client:
curl -XGET "http://localhost:8090/search?corpus=susanne&q=from"
The current version supports only a search by the first token.
Exact search:
gloomy search susanne absolute
... searches for all the n-grams with the first token equal to absolute.
Search by a prefix:
gloomy search susanne abs*
... searches for all the n-grams where the first token starts with abs.
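The difference between the exact and the prefix query can be illustrated with a small Go function. It only models the semantics described above (a match against the first token, with a trailing * marking a prefix query); the name matchesFirstToken is hypothetical and the code says nothing about how Gloomy looks values up in its index.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesFirstToken reports whether the first token of ngram satisfies query.
// A trailing "*" turns the query into a prefix query; otherwise the first
// token must be equal to the query. Illustration only, not Gloomy's code.
func matchesFirstToken(ngram []string, query string) bool {
	if len(ngram) == 0 {
		return false
	}
	if strings.HasSuffix(query, "*") {
		return strings.HasPrefix(ngram[0], strings.TrimSuffix(query, "*"))
	}
	return ngram[0] == query
}

func main() {
	ngram := []string{"absolute", "power", "corrupts"}
	fmt.Println(matchesFirstToken(ngram, "absolute"))
	fmt.Println(matchesFirstToken(ngram, "abs*"))
	fmt.Println(matchesFirstToken(ngram, "power"))
}
```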
Search by a regular expression:
gloomy search -qtype regexp susanne "dogs?"
Please note that Gloomy's support of regular expressions is limited:
- . (dot), [abc], a?, a.*, a+, (foo)
- no shorthand character classes (e.g. \w, \s)
- alternation (the | operator) behaves differently from common regexp engines, as it has the highest priority:
- foo|bar translates into either fooar or fobar
- use (foo)|(bar) to get either foo or bar
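To make the alternation rule concrete, the two Gloomy patterns above can be rewritten in the syntax of Go's standard regexp package, where | has the lowest priority. The sketch below is only a translation for comparison (the variable names are made up) and does not use Gloomy's own regexp engine.

```go
package main

import (
	"fmt"
	"regexp"
)

// tightAlternation is the standard-syntax equivalent of Gloomy's "foo|bar",
// where | binds only to its immediate neighbours: fo(o|b)ar.
var tightAlternation = regexp.MustCompile(`^fo(o|b)ar$`)

// groupedAlternation is the standard-syntax equivalent of Gloomy's "(foo)|(bar)".
var groupedAlternation = regexp.MustCompile(`^((foo)|(bar))$`)

func main() {
	for _, word := range []string{"foo", "bar", "fooar", "fobar"} {
		fmt.Printf("%-6s tight=%v grouped=%v\n",
			word,
			tightAlternation.MatchString(word),
			groupedAlternation.MatchString(word))
	}
}
```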
To retrieve structural attributes along with the results, list them on the command line:
gloomy search --attrs doc.file,doc.n susanne absolute
In HTTP server mode, repeat the attrs parameter once per attribute:
http://localhost:8090/search?corpus=susanne&q=from&attrs=doc.file&attrs=doc.n
http://localhost:8090/search?corpus=susanne&qtype=regexp&q=dogs%3F&attrs=doc.file&attrs=doc.n
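Such URLs are easier to get right when they are built programmatically, because special characters in the query (such as the ? in dogs?) must be percent-encoded. A minimal sketch using Go's net/url package follows; the function name buildSearchURL is hypothetical, while the parameter names are the ones used in the URLs above. Note that url.Values.Encode sorts parameters by key, so their order differs from the examples, which should not matter for a query string.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildSearchURL assembles a search-service URL. The attrs parameter is
// repeated once per requested structural attribute; qtype is omitted
// when empty.
func buildSearchURL(base, corpus, query, qtype string, attrs []string) string {
	params := url.Values{}
	params.Set("corpus", corpus)
	params.Set("q", query)
	if qtype != "" {
		params.Set("qtype", qtype)
	}
	for _, attr := range attrs {
		params.Add("attrs", attr)
	}
	return base + "/search?" + params.Encode()
}

func main() {
	fmt.Println(buildSearchURL(
		"http://localhost:8090", "susanne", "dogs?", "regexp",
		[]string{"doc.file", "doc.n"}))
}
```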
inputFilePath - path to a source file in a plain text or zipped plain text format
sourceType - plain/vertical
filterArgs - a CNF-encoded set of rules applied to structural attributes as a filter
ngramIgnoreStructs - a list of structs to ignore
ngramStopStrings - a list of strings to end an n-gram (typically: ".", "!" etc.)
ngramIgnoreStrings - a list of strings to be completely ignored
tmpDir - a directory where Gloomy may store temporary data when dealing with large data; the directory need not exist, as Gloomy will create it if needed
procChunkSize - number of n-grams per temporary chunk file when dealing with large data
outDirectory - output directory
args - structural attributes to be imported
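The interplay of ngramStopStrings and ngramIgnoreStrings can be sketched as a sliding window over a token stream. The code below is an illustration under stated assumptions, not Gloomy's implementation: it assumes that ignored strings are simply skipped, and that a stop string closes the current window without becoming part of any n-gram. The function name extractNgrams is hypothetical.

```go
package main

import "fmt"

// toSet converts a slice of strings into a lookup set.
func toSet(items []string) map[string]bool {
	set := make(map[string]bool, len(items))
	for _, item := range items {
		set[item] = true
	}
	return set
}

// extractNgrams returns all n-grams of size n found in tokens.
// Assumptions: tokens listed in ignore are skipped entirely; a token
// listed in stop resets the window, so no n-gram spans across it.
func extractNgrams(tokens []string, n int, stop, ignore []string) [][]string {
	if n <= 0 {
		return nil
	}
	stopSet, ignoreSet := toSet(stop), toSet(ignore)
	var out [][]string
	var window []string
	for _, tok := range tokens {
		if ignoreSet[tok] {
			continue
		}
		if stopSet[tok] {
			window = nil
			continue
		}
		window = append(window, tok)
		if len(window) > n {
			window = window[1:]
		}
		if len(window) == n {
			ngram := make([]string, n)
			copy(ngram, window) // copy so later appends cannot alter the result
			out = append(out, ngram)
		}
	}
	return out
}

func main() {
	tokens := []string{"a", ",", "b", "c", ".", "d", "e", "f"}
	for _, ngram := range extractNgrams(tokens, 2, []string{"."}, []string{","}) {
		fmt.Println(ngram)
	}
}
```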
To filter out specific n-grams, Gloomy can call a custom external function that tests the current n-gram (and its PoS tags, if available).
package main
import (
"regexp"
"github.com/tomachalek/gloomy/index/builder/filter"
)
var (
tagPattern = regexp.MustCompile("^.{14}8.")
)
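// filterF1 returns false as soon as any tag of the n-gram matches
// tagPattern (checking every position is an assumption).
func filterF1(words []string, tags []string) bool {
	for _, tag := range tags {
		if tagPattern.MatchString(tag) {
			return false
		}
	}
	return true
}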
var FilterF1 = filter.CustomFilter(filterF1)
Compile the function(s) with
go build -buildmode=plugin
Then update your configuration JSON:
{
"inputFilePath": "/path/to/a/vertical/file",
"...": "...",
"ngramFilter": {
"lib": "/path/to/your/filter.so",
"fn": "FilterF1"
}
}
It is also possible to extract n-grams to a raw text file instead of building an index:
gloomy -ngram-size 3 extract-ngrams ./config.json