Recurrent Convolutional Neural Networks for Chinese Question Classification on BQuLD

A deep learning-based Chinese question classifier (Keras implementation) on BQuLD

Contents

Model Architecture Overview

[Figure: model architecture diagram]

For more details Click Here.

Bilingual Question Labelling Dataset (BQuLD)

BQuLD is a bilingual (traditional Chinese and English) question labelling dataset designed for NLP researchers.
The question type definitions are borrowed from the Intelligent Agent Systems Lab.

The original dataset consists of 1216 pairs of questions and question labels, first published by the author of this GitHub repository, tim5go.
There are 9 question types in total:

  1. NUMBER
  2. PERSON
  3. LOCATION
  4. ORGANIZATION
  5. ARTIFACT
  6. TIME
  7. PROCEDURE
  8. AFFIRMATION
  9. CAUSALITY
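
A classifier needs these labels as numeric targets. The sketch below shows one way to map the nine question types to indices and one-hot vectors. The zero-based ordering simply follows the list above and is an assumption, not necessarily the encoding used in this repository; `LABEL_TO_INDEX` and `one_hot` are illustrative names.

```python
# The nine BQuLD question types, in the order listed above.
QUESTION_TYPES = [
    "NUMBER", "PERSON", "LOCATION", "ORGANIZATION", "ARTIFACT",
    "TIME", "PROCEDURE", "AFFIRMATION", "CAUSALITY",
]

# Assumed encoding: zero-based index following the list order.
LABEL_TO_INDEX = {label: i for i, label in enumerate(QUESTION_TYPES)}


def one_hot(label):
    """Return a one-hot list of length 9 for a question type name."""
    vec = [0] * len(QUESTION_TYPES)
    vec[LABEL_TO_INDEX[label]] = 1
    return vec


print(one_hot("LOCATION"))
```

A one-hot target like this pairs with a 9-way softmax output trained with categorical cross-entropy.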

Embedding Preparation

In my experiment, I trained a word2vec model on the Whole-Web News Data corpus (全網新聞數據, SogouCA) from Sogou Labs.

For example, in Linux:

  1. Clean the XML tags
$ cat news_tensite_xml.dat | iconv -f gbk -t utf-8 -c | grep "<content>" \
  | sed 's|<content>||' | sed 's|</content>||' > corpus.txt
  2. Word segmentation using the LTP command line
$ cws_cmdline --threads 4 --input corpus.txt --segmentor-model cws.model > corpus.seg.txt
  3. Simplified-to-traditional Chinese conversion using OpenCC
$ opencc -i corpus.seg.txt -o corpus_trad.txt -c s2t.json
  4. word2vec training using Google word2vec
$ nohup ./word2vec -train corpus_trad.txt -output sogou_vectors.bin -cbow 0 \
  -size 200 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 24 -binary 1 -iter 20 -min-count 1 &
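
With `-binary 1`, the resulting `sogou_vectors.bin` starts with a text header (`vocab_size dim`), followed by one entry per word: the word, a space, and `dim` float32 values. The sketch below is a minimal standard-library reader for that layout, shown only to illustrate the file format. It assumes little-endian floats (as written on x86 machines), and `read_word2vec_binary` is a hypothetical helper, not part of this repository. In practice a library loader such as gensim's would usually be preferable.

```python
import io
import struct


def read_word2vec_binary(stream):
    """Parse a word2vec binary stream into a {word: tuple of floats} dict."""
    header = stream.readline().decode("utf-8").split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(vocab_size):
        # Read the word byte by byte up to the separating space.
        chars = []
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("unexpected end of data while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that separates entries
                chars.append(ch)
        word = b"".join(chars).decode("utf-8", errors="replace")
        # Read dim float32 values (assumed little-endian).
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("unexpected end of data while reading a vector")
        vectors[word] = struct.unpack("<%df" % dim, raw)
    return vectors


# Tiny in-memory example with two 3-dimensional vectors (made-up values).
demo = (
    b"2 3\n"
    + "問題".encode("utf-8") + b" " + struct.pack("<3f", 1.0, 2.0, 3.0) + b"\n"
    + "時間".encode("utf-8") + b" " + struct.pack("<3f", 0.5, -1.0, 0.25) + b"\n"
)
demo_vectors = read_word2vec_binary(io.BytesIO(demo))
print(sorted(demo_vectors))
```

For the real file, open it in binary mode (`open("sogou_vectors.bin", "rb")`) and pass the file object to the reader. The parsed vectors can then be stacked into an embedding matrix for the classifier's embedding layer.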

Result

| Training Loss | Training Accuracy | Validation Loss | Validation Accuracy |
| --- | --- | --- | --- |
| 0.7000 | 87.11% | 0.8945 | 77.87% |