README.ja.md is written in Japanese.
neologd-solr-elasticsearch-synonyms is Japanese noun synonyms file which is written in Solr synonyms format.
This synonyms file includes many orthographic variant strings of nouns, which are common orthographic variant strings with mecab-ipadic-NEologd.
When you want to define synonyms of common nouns, neologisms and Named Entities, to try to apply this synonyms file is one of better choice.
- Recorded about 65,500 pairs (mapping from about 0.33 million tokens) of orthographic variant of nouns.
- One of the largest Japanese OSS synonyms file
- Update process of this synonyms file will semi-automatically run on development server.
- I'm planning to update this synonyms file at the timing of updating a orthographic variant dictionary of mecab-ipadic-NEologd.
- The location of symbolic link to recent synonyms file is permanent
- When you will updating synonyms file, it's unnecessary to update a configure file of search server.
- Can't distinguish between orthographic variant of nouns and same spelling homonyms.
- Not support other synonyms file format.
This synonyms file can use with following open source search server.
- Elasticsearch
OR
- Solr
A synonyms file will distribute via GitHub repository.
In first time, you should execute the following command to 'git clone'.
$ git clone --depth 1 https://github.com/neologd/neologd-solr-elasticsearch-synonyms.git
OR
$ git clone --depth 1 git@github.com:neologd/neologd-solr-elasticsearch-synonyms.git
If you need all log of mecab-ipadic-neologd.git, you should clone the repository without '--depth 1'
Move to neologd-solr-elasticsearch-synonyms directory which was cloned in the above preparation.
$ cd neologd-solr-elasticsearch-synonyms
You can install or can update(overwritten) the recent neologd-solr-elasticsearch-synonyms by following command.
$ ./bin/install-neologd-solr-elasticsearch-synonyms -n
You should check content of neologd-synonyms.txt.
Default install location is neologd-solr-elasticsearch-synonyms/synonyms directory.
$ cd neologd-solr-elasticsearch-synonyms
$ cat synonyms/neologd-synonyms.txt | grep "お好み焼き"
お好み焼き, おこのみやき, おこのみ焼, おこのみ焼き, お好みやき, お好み焼, お好やき, お好焼,\
お好焼き, オコノミヤキ, オコノミ焼, オコノミ焼キ, オ好ミヤキ, オ好ミ焼, オ好ミ焼キ,\
オ好ヤキ, オ好焼, オ好焼キ
Our installer creates a permanent symbolic link to recent synonyms file(neologd-synonyms.YYYYMMDD.txt).
$ ls -al neologd-solr-elasticsearch-synonyms/synonyms
合計 5756
drwxrwxr-x 2 overlast overlast 4096 2月 10 18:26 2016 .
drwxrwxr-x 8 overlast overlast 4096 2月 10 19:00 2016 ..
-rw-rw-r-- 1 overlast overlast 5878409 2月 10 18:26 2016 neologd-synonyms.20160209.txt
lrwxrwxrwx 1 overlast overlast 99 2月 10 18:26 2016 neologd-synonyms.txt -> /any/where/neologd-solr-elasticsearch-synonyms/bin/../synonyms/neologd-synonyms.20160209.txt
If you want to install recent synonyms file to optional location, you can use "-p" option.
$ ./bin/install-neologd-solr-elasticsearch-synonyms -n -p /absolute/path/where/you/want/to/install
You can check useful command line option using "-h" option.
$ ./bin/install-mecab-ipadic-neologd -h
When you want to use neologd-solr-elasticsearch-synonyms, you should set the path of synonyms file as a value of synonyms_path property of Elasticsearch/Solr.
When you will updating synonyms file, it's unnecessary to update a configure file of search server, because the location of symbolic link to recent synonyms file is permanent.
And you should set boolean value(true/false) to a value of 'expand' attribute explicitly.
If format of a synonym file is CSV and a value of 'format' attribute is null, Elasticsearch and Solr will generate mappings between each strings in a synonym entry using addInternal() method of SolrSynonymParser class of Lucene.
There are two mapping methods. You should select a mapping method using 'expand' attribute.
In the following, we show the example of mapping result for a case of a synonym entry is 'お好み焼き,お好み焼,お好焼'.
A value of 'expand' attribute | An entry of synonym file(CSV format) | Generated pairs |
---|---|---|
true | お好み焼き,お好み焼,お好焼 | [お好み焼き=>お好み焼, お好み焼き=>お好焼, お好み焼=>お好み焼き, お好み焼=>お好焼, お好焼=>お好み焼き, お好焼=>お好み焼] |
false | お好み焼き,お好み焼,お好焼 | [お好み焼き=>お好み焼き, お好み焼=>お好み焼き, お好焼=>お好み焼き] |
In a case of 'expand = true', addInternal() will generates all pairs of values of synonym entry.
In a case of 'expand = false', addInternal() will generates combination a value of first column and each value of all columns.
We develop neologd-solr-elasticsearch-synonyms on the assumption that the value of 'expand' attribute will mainly be 'false'.
In the following, we show code examples of a configure file for loading a synonym file which can load on Elasticseach or Solr with setting false to a value of 'expand' attribute.
{
"index" : {
"analysis" : {
"analyzer" : {
"synonym" : {
"tokenizer" : "kuromoji_neologd_tokenizer",
"filter" : ["synonym"]
}
},
"filter" : {
"synonym" : {
"type" : "synonym",
"expand": "false",
"synonyms_path" : "/absolute/path/of/neologd-synonym.txt"
}
}
}
}
}
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.JapaneseTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="/absolute/path/of/neologd-synonym.txt" ignoreCase="true" expand="false">
</analyzer>
</fieldType>
Please use the following bibtex, when you refer mecab-ipadic-NEologd from your papers.
@misc{sato2016neologdsolrelasticsearchsynonym
title = {neologd-solr-elasticsearch-synonyms - Japanese noun synonyms file for Elasticsearch and Solr},
author = {Toshinori, Sato},
url = {https://github.com/neologd/neologd-solr-elasticsearch-synonyms},
year = {2015}
}
Please star this github repository if mecab-ipadic-NEologd is very useful to your project ;)
This project is depending to other OSS projects. Please check following link.
Copyright (c) 2015-2016 Toshinori Sato (@overlast) All rights reserved.
We select the 'Apache License, Version 2.0'. Please check following link.