Requirements:
pandas
numpy
jieba
scipy
nltk
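To install the dependencies (assuming pip is used; this list does not pin specific versions):
pip install pandas numpy jieba scipy nltk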
This implementation tries to discover four types of new words based on four parameters.
Four types of new words:
- Latin words, including:
  - pure digits (2333, 12315, 12306)
  - pure letters (iphone, vivo)
  - a mixture of both (iphone7, mate9)
- 2-Chinese-character unigrams (unigrams are defined as the elements produced by the segmenter): (马蓉, 优酷, 杨洋)
- 3-Chinese-character unigrams: (李易峰, 张一山, 井柏然)
- bigrams, which are composed of two unigrams: (图片大全, 英雄联盟, 公交车路线, 穿越火线)
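As an illustration of how such candidates might be told apart, here is a minimal Python sketch; the function name, the regular expressions, and the type labels are my own assumptions rather than part of this project:

```python
import re

# Hypothetical helper: classify a candidate term into one of the four types.
LATIN_RE = re.compile(r'^[A-Za-z0-9]+$')        # digits, letters, or a mixture (2333, vivo, iphone7)
CHINESE_RE = re.compile(r'^[\u4e00-\u9fff]+$')  # Chinese characters only

def classify_candidate(term, is_bigram=False):
    """Return the candidate type, or None if it matches none of the four."""
    if is_bigram:
        return 'bigram'            # e.g. 图片大全 = 图片 + 大全 (two unigrams)
    if LATIN_RE.match(term):
        return 'latin'             # e.g. 12306, iphone, mate9
    if CHINESE_RE.match(term):
        if len(term) == 2:
            return 'unigram_2'     # e.g. 马蓉, 优酷
        if len(term) == 3:
            return 'unigram_3'     # e.g. 李易峰, 井柏然
    return None

if __name__ == '__main__':
    for t in ['iphone7', '优酷', '李易峰', '我']:
        print(t, classify_candidate(t))
```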
Four parameters:
- term frequency (tf): the number of occurrences of a word. A larger tf indicates a larger confidence in the following three parameters.
- aggregation coefficient (agg_coef): a larger agg_coef indicates a larger possibility that the two words genuinely co-occur rather than appearing together by chance (see the formula sketch below), where C(w_1, w_2) denotes the count of the pattern in which w_1 is followed by w_2, and C(w_1) and C(w_2) denote the counts of w_1 and w_2 respectively.
- minimum neighboring entropy
- maximum neighboring entropy
The minimum and maximum neighboring entropy are, respectively, the minimum and maximum of the left neighboring entropy and the right neighboring entropy.
A larger neighboring entropy of a word w indicates that w collocates with more possible words, which in turn indicates that w is an independent word. For instance, "我是" has a large tf and a large agg_coef but a small minimum neighboring entropy, so it is not a word.
Left neighboring entropy: see the formula sketch below, where W_l is the set of unigrams that appear to the left of word w. The same formula applies to the right neighboring entropy.
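For reference, a plausible form of the two quantities, assuming a PMI-style aggregation coefficient and the standard Shannon-entropy definition of the left neighboring entropy (the exact forms and normalization used by run_discover.py may differ):

```latex
% Aggregation coefficient (assumed PMI-style form; N is the total number of unigram tokens):
\mathrm{agg\_coef}(w_1, w_2)
  = \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}
  \approx \frac{N \cdot C(w_1, w_2)}{C(w_1)\,C(w_2)}

% Left neighboring entropy (W_l is the set of unigrams appearing immediately to the left of w):
H_l(w) = -\sum_{w_l \in W_l} P(w_l \mid w)\,\log P(w_l \mid w),
\qquad
P(w_l \mid w) = \frac{C(w_l, w)}{\sum_{w' \in W_l} C(w', w)}
```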
An example invocation (note that the double quotes cannot be omitted if the provided path contains spaces):
python run_discover.py "G:\Documents\Exp Data\CCF_sogou_2016\sogouu8.txt" "G:\Documents\Exp Data\CCF_sogou_2016\reports" --latin 50 0 0 0 --bigram 20 80 0 1.5 --unigram_2 20 40 0 1 --unigram_3 20 41 0 1 --iteration 2 --verbose 2
Run
python run_discover.py --help
for further information.
Each iteration includes the following 11 steps:
- cutting
- counting characters
- counting unigrams
- counting bigrams
- counting trigrams
- calculating aggregation coefficients (for unigrams)
- counting neighboring words (for unigrams)
- calculating boundary entropy (for unigrams)
- calculating aggregation coefficients (for bigrams)
- counting neighboring words (for bigrams)
- calculating boundary entropy (for bigrams)
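For reference, here is a minimal, self-contained sketch of how the cutting, counting, aggregation-coefficient, and boundary-entropy steps could be implemented, assuming jieba for cutting and the formulas sketched above; this is an illustration of the general technique, not the project's actual code:

```python
import math
from collections import Counter

import jieba

def analyze(lines):
    """Illustrative versions of the cutting / counting / entropy steps."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    left_neighbors = {}   # word -> Counter of unigrams appearing immediately to its left

    for line in lines:
        tokens = list(jieba.cut(line))          # cutting
        unigram_counts.update(tokens)           # counting unigrams
        for left, right in zip(tokens, tokens[1:]):
            bigram_counts[(left, right)] += 1   # counting bigrams
            left_neighbors.setdefault(right, Counter())[left] += 1

    total = sum(unigram_counts.values())

    def agg_coef(w1, w2):
        # Assumed PMI-style aggregation coefficient (see the formula sketch above).
        return (total * bigram_counts[(w1, w2)]) / (unigram_counts[w1] * unigram_counts[w2])

    def left_entropy(w):
        # Shannon entropy over the distribution of left neighbors of w.
        neighbors = left_neighbors.get(w, Counter())
        n = sum(neighbors.values())
        return -sum(c / n * math.log(c / n) for c in neighbors.values()) if n else 0.0

    return unigram_counts, bigram_counts, agg_coef, left_entropy
```

For example, calling analyze(open(corpus_path, encoding='utf-8')) on a raw corpus would yield the counters plus the two scoring functions used to filter candidates.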
After each iteration, you will get four files reporting the new words of each type (Latin words, 2-Chinese-character unigrams, 3-Chinese-character unigrams, and bigrams). After the program exits, you will additionally get four files, each of which merges one type of new words across all iterations.
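As an illustration of the final merging step, assuming each per-iteration report is a plain text file with one word per line and a naming scheme like iterN_<type>.txt (both are assumptions; the actual file names and format are defined by run_discover.py), the merged file for one word type could be produced as follows:

```python
from pathlib import Path

def merge_reports(report_dir, word_type, n_iterations):
    """Hypothetical merge: union of the words reported for `word_type` across all iterations."""
    merged = set()
    for i in range(1, n_iterations + 1):
        # Hypothetical file naming scheme, e.g. reports/iter1_bigram.txt
        path = Path(report_dir) / f'iter{i}_{word_type}.txt'
        merged.update(path.read_text(encoding='utf-8').split())
    out = Path(report_dir) / f'merged_{word_type}.txt'
    out.write_text('\n'.join(sorted(merged)), encoding='utf-8')
    return out
```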
If you encounter any problems, feel free to open an issue or contact me (rayarrow@qq.com).