How to train a model #10

Open
huan415 opened this issue Oct 20, 2022 · 2 comments
Comments

huan415 commented Oct 20, 2022

Word2Vec.trainJavaModel("data/train.txt", "data/test.model");

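Hello, could you provide sample files for data/train.txt and data/test.model?

For example: suppose I have 10 sentences. After word segmentation, what should they look like in train.txt?
Should similar words be separated by spaces and placed on the same line? Or should the 10 sentences go one per line, with the words separated by spaces?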

@jsksxs360
Owner

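Hello. data/test.model is the path where the model is saved once training finishes. data/train.txt is the word-segmented training corpus: each line is one text, and each text is a sequence of words separated by spaces, for example: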

doc1_word1 doc1_word2 doc1_word3...
doc2_word1 doc2_word2 doc2_word3...
...
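
The format described above can be produced with a few lines of Java. The following is a minimal sketch, not part of the library: the `CorpusWriter` class, its helper methods, and the two example sentences are hypothetical, the sentences are assumed to be already segmented by some external tokenizer, and UTF-8 is assumed as the file encoding (the thread does not state which encoding the trainer expects).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: writes a segmented corpus as one document per line,
// with the words of each document separated by single spaces.
public class CorpusWriter {

    // Joins the tokens of one document into a single space-separated line.
    static String toLine(List<String> tokens) {
        return String.join(" ", tokens);
    }

    // Converts every document into its corresponding corpus line.
    static List<String> toLines(List<List<String>> docs) {
        List<String> lines = new ArrayList<>();
        for (List<String> doc : docs) {
            lines.add(toLine(doc));
        }
        return lines;
    }

    // Writes the corpus to disk; UTF-8 is an assumption, not confirmed by the thread.
    static void writeCorpus(List<List<String>> docs, Path out) throws IOException {
        if (out.getParent() != null) {
            Files.createDirectories(out.getParent());
        }
        Files.write(out, toLines(docs), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Two made-up, already segmented sentences (one inner list per document).
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("我", "喜欢", "自然", "语言", "处理"),
                Arrays.asList("词", "向量", "很", "有用"));
        writeCorpus(docs, Paths.get("data/train.txt"));
        // The resulting file can then be passed to the call from the question:
        // Word2Vec.trainJavaModel("data/train.txt", "data/test.model");
    }
}
```

With 10 sentences, the file would therefore contain 10 lines, one per sentence.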

@jsksxs360
Owner

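I would rather recommend training the model directly with Google's official code. It is currently widely regarded as the most accurate word2vec implementation, and the models it produces have exactly the same format as those trained with the Java version, so they can also be loaded with this library afterwards. See: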

Training a Google-version model
Training word vectors on the Chinese Wikipedia corpus: processing the Chinese Wikipedia corpus
