
Using DrQA on a Chinese dataset #22

Open
kaihuchen opened this issue Oct 3, 2018 · 3 comments

Comments

@kaihuchen

Is it expected that this code can be applied to a Chinese language dataset with only minor changes?

I understand that I will need to provide the following:

  • Chinese train/dev data files in the SQuAD format
  • GloVe word vectors trained on the Chinese language
  • Spacy Chinese language models
  • Changes in prepro.py to handle things such as tokenization, adding encoding="utf8" to file read/write statements, etc.

I would very much appreciate any insights if there are any known reasons why this would not work.
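For reference, the encoding changes I have in mind look roughly like this; the function names below are hypothetical, not the actual ones in prepro.py:

```python
# Minimal sketch of the UTF-8 handling needed for Chinese data.
# Function names are hypothetical placeholders, not the repo's real API.
import json

def load_squad(path):
    # Explicit UTF-8 so Chinese text loads correctly regardless of the
    # platform's default locale encoding
    with open(path, encoding="utf8") as f:
        return json.load(f)["data"]

def save_preprocessed(rows, path):
    with open(path, "w", encoding="utf8") as f:
        # ensure_ascii=False keeps Chinese characters human-readable
        # instead of \uXXXX escapes
        json.dump(rows, f, ensure_ascii=False)
```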

@hitvoice
Owner

hitvoice commented Oct 3, 2018

Yes, only the modifications you mentioned should be needed. The trickiest part is likely the Chinese spaCy models, which are not officially supported.

@kaihuchen
Author

@hitvoice Many thanks for the confirmation!
One more question: given that Chinese has no natural word boundaries, does it make any difference to DrQA whether the dataset is tokenized first (i.e., word segmentation, 分词, using a tool such as Jieba)? Or can I assume that, since spaCy does its own tokenization, I don't have to do anything special here?

@hitvoice
Owner

You should tokenize your Chinese data first. Prepare your data as "这是 一个 分词 后 的 样例" ("this is a segmented example"; separate tokens with spaces) and provide the corresponding POS and NER tags. This is not an easy copy-and-paste job; a lot of work and modification is needed for Chinese support.
