
prepro.py , to_id function assigns id using tokens in BOTH train and dev? #8

unilight opened this issue Feb 28, 2018 · 1 comment


Here's a code snippet of prepro.py:

full = train + dev
vocab, counter = build_vocab([row[5] for row in full], [row[1] for row in full])
w2id = {w: i for i, w in enumerate(vocab)}

def to_id(row, unk_id=1):
    context_tokens = row[1]
    context_features = row[2]
    context_tags = row[3]
    context_ents = row[4]
    question_tokens = row[5]
    question_ids = [w2id[w] if w in w2id else unk_id for w in question_tokens]
    context_ids = [w2id[w] if w in w2id else unk_id for w in context_tokens]
    ...

train = list(map(to_id, train))
dev = list(map(to_id, dev))

If I'm reading this correctly, when processing the dev set, the FULL vocab (constructed from train + dev) is used to decide whether words in the dev set are UNK. Shouldn't the vocab be built from the train set only?
Let me know if my interpretation is right :)
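To make the concern concrete, here's a toy sketch of the situation (the internals of build_vocab are an assumption here: suppose it keeps any token from the supplied rows that also appears in GloVe). A word like "pangolin" that occurs only in dev still gets a real id, not UNK, because dev tokens were part of the vocab build:

```python
glove = {"the", "cat", "sat", "pangolin"}   # toy stand-in for the GloVe vocab
train_tokens = ["the", "cat", "sat"]
dev_tokens = ["the", "pangolin"]            # "pangolin" appears only in dev

# vocab is built from train + dev, mirroring `full = train + dev` above
full = train_tokens + dev_tokens
vocab = ["<PAD>", "<UNK>"] + sorted({w for w in full if w in glove})
w2id = {w: i for i, w in enumerate(vocab)}

def to_id(tokens, unk_id=1):
    # same lookup pattern as prepro.py's to_id
    return [w2id.get(w, unk_id) for w in tokens]

# "pangolin" is in the vocab only because dev tokens were included:
assert w2id["pangolin"] != 1
```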

hitvoice (Owner) commented Mar 1, 2018

This will not affect the fairness of the dev set evaluation. Here's why:

  1. Whether a word is UNK is determined entirely by whether it appears in GloVe, which contains no information from the dev set.
  2. Word vectors may be finetuned during training, but a word that appears only in the dev set never receives a gradient update, so its vector stays equal to the original GloVe vector.

Since no information from the dev set affects the training process, this processing is fair. Please share your opinion if my thoughts are wrong :)
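Point 2 can be illustrated with a small sketch (this is a hypothetical illustration, not the repo's actual training code): finetuning is masked so only embeddings of words seen in the training data are updated, and a dev-only word keeps its initial (GloVe-style) vector.

```python
import numpy as np

vocab = ["<PAD>", "<UNK>", "model", "reading", "dev_only_word"]
train_words = {"model", "reading"}          # words seen during training

rng = np.random.default_rng(0)
emb = rng.normal(size=(len(vocab), 4))      # stand-in for GloVe vectors
emb_init = emb.copy()

# Mask: 1.0 for rows that may be finetuned, 0.0 for everything else.
tune_mask = np.array([[1.0 if w in train_words else 0.0] for w in vocab])

# One fake SGD step: a dense gradient arrives, but the mask zeroes out
# the update for any row whose word never occurs in the training data.
grad = rng.normal(size=emb.shape)
emb -= 0.1 * (tune_mask * grad)

# The dev-only word's vector is unchanged; a trained word's vector moved.
assert np.allclose(emb[vocab.index("dev_only_word")],
                   emb_init[vocab.index("dev_only_word")])
```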
