IPADIC NEologd dictionary builder for Lindera. This project is forked from fulmicoton's kuromoji-rs.
% cargo install lindera-ipadic-neologd-builder
The following products are required to build:
- Rust >= 1.46.0
- mecab >= 0.996 (for building a dictionary)
% cargo build --release
This builder has only been tested with mecab-ipadic-NEologd data.
NOTE: This builder skips two words, カブシキガイシャ and タカラヅカカゲキダンキセイ, to avoid a dictionary build failure. These words are listed in SKIP_WORDS in src/lib.rs.
Build a dictionary with the lindera-ipadic-neologd command:
% curl -L https://github.com/neologd/mecab-ipadic-neologd/archive/master.zip > ./mecab-ipadic-neologd-master.zip
% unzip -o mecab-ipadic-neologd-master.zip
% ./mecab-ipadic-neologd-master/bin/install-mecab-ipadic-neologd --create_user_dic -p $(pwd)/mecab-ipadic-neologd-master/tmp -y
% IPADIC_VERSION=$(find ./mecab-ipadic-neologd-master/build/mecab-ipadic-*-neologd-* -type d | awk -F "-" '{print $6"-"$7}')
% NEOLOGD_VERSION=$(find ./mecab-ipadic-neologd-master/build/mecab-ipadic-*-neologd-* -type d | awk -F "-" '{print $NF}')
% lindera-ipadic-neologd ./mecab-ipadic-neologd-master/build/mecab-ipadic-${IPADIC_VERSION}-neologd-${NEOLOGD_VERSION} lindera-ipadic-${IPADIC_VERSION}-neologd-${NEOLOGD_VERSION}
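To see what the two version-extraction commands above produce, the following sketch runs the same awk expressions against a fixed example path instead of `find` output. The path is an assumption based on the dictionary name used elsewhere in this README; the versions you get from a fresh build will differ.

```shell
# Example build directory (assumed; the date stamps change with each NEologd release).
dir="./mecab-ipadic-neologd-master/build/mecab-ipadic-2.7.0-20070801-neologd-20200130"

# Splitting on "-" puts the IPADIC version in fields 6 and 7.
ipadic_version=$(echo "$dir" | awk -F "-" '{print $6"-"$7}')

# The NEologd date stamp is always the last field.
neologd_version=$(echo "$dir" | awk -F "-" '{print $NF}')

echo "IPADIC_VERSION=${ipadic_version}"
echo "NEOLOGD_VERSION=${neologd_version}"
```

The field numbers only line up because the leading `./mecab-ipadic-neologd-master/build/` part of the path itself contains three hyphens. If the archive is unpacked under a different directory name, `$6` and `$7` need to be adjusted accordingly.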
Refer to the manual for details on the IPADIC dictionary format and part-of-speech tags.
Index | Name (Japanese) | Name (English) | Notes |
---|---|---|---|
0 | 品詞 | part-of-speech | |
1 | 品詞細分類1 | sub POS 1 | |
2 | 品詞細分類2 | sub POS 2 | |
3 | 品詞細分類3 | sub POS 3 | |
4 | 活用型 | conjugation type | |
5 | 活用形 | conjugation form | |
6 | 原形 | base form | |
7 | 読み | reading | |
8 | 発音 | pronunciation | |
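As a rough illustration of how the indexes in this table map onto a comma-separated feature string, the sketch below pulls out a few fields with `cut`. The `features` value is a sample string in the same nine-field layout, and the variable names are chosen here for illustration only. Note that `cut` counts fields from 1, while the table counts from 0.

```shell
# Sample feature string with the nine fields described in the table.
features="名詞,固有名詞,一般,*,*,*,羽田空港,ハネダクウコウ,ハネダクーコー"

# Table index 0 (part-of-speech) is cut field 1, and so on.
pos=$(printf '%s\n' "$features" | cut -d ',' -f 1)
base_form=$(printf '%s\n' "$features" | cut -d ',' -f 7)
reading=$(printf '%s\n' "$features" | cut -d ',' -f 8)
pronunciation=$(printf '%s\n' "$features" | cut -d ',' -f 9)

echo "pos=${pos} base=${base_form} reading=${reading} pronunciation=${pronunciation}"
```

This simple split assumes that no field contains a comma of its own. Entries whose fields hold quoted commas would need a real CSV parser.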
You can tokenize text with the lindera command, using the dictionary you built:
% echo "羽田空港限定トートバッグ" | lindera -d ./lindera-ipadic-2.7.0-20070801-neologd-20200130
羽田空港 名詞,固有名詞,一般,*,*,*,羽田空港,ハネダクウコウ,ハネダクーコー
限定 名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ 名詞,固有名詞,一般,*,*,*,トートバッグ,トートバッグ,トートバッグ
EOS
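If only the surface forms are needed, output in this shape can be post-processed with standard tools. The sketch below is one possible approach, not part of lindera itself: it stores the sample output shown above in a variable, so it runs without lindera installed, and keeps the first whitespace-separated column of every line except the `EOS` marker.

```shell
# Sample tokenizer output copied from this README (one token per line, then EOS).
output='羽田空港 名詞,固有名詞,一般,*,*,*,羽田空港,ハネダクウコウ,ハネダクーコー
限定 名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ 名詞,固有名詞,一般,*,*,*,トートバッグ,トートバッグ,トートバッグ
EOS'

# Print the surface form (first column) of each token, skipping the EOS line.
surfaces=$(printf '%s\n' "$output" | awk '$1 != "EOS" { print $1 }')

echo "$surfaces"
```

In a real pipeline the `printf` stage would be replaced by the `lindera` invocation. A token whose surface form is literally `EOS` would also be dropped by this filter, so treat it as a convenience for quick inspection only.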
For more details about the lindera command, please refer to the following URL:
The API reference is available. Please see the following URL: