
prepro.py , to_id function assigns id using tokens in BOTH train and dev? #8

unilight opened this issue Feb 28, 2018 · 1 comment


Here's a code snippet of prepro.py:

full = train + dev
vocab, counter = build_vocab([row[5] for row in full], [row[1] for row in full])
w2id = {w: i for i, w in enumerate(vocab)}

def to_id(row, unk_id=1):
    context_tokens = row[1]
    context_features = row[2]
    context_tags = row[3]
    context_ents = row[4]
    question_tokens = row[5]
    question_ids = [w2id[w] if w in w2id else unk_id for w in question_tokens]
    context_ids = [w2id[w] if w in w2id else unk_id for w in context_tokens]
    ...

train = list(map(to_id, train))
dev = list(map(to_id, dev))

If I'm reading this correctly, when processing the dev set, the FULL vocab (constructed from train + dev) is used to decide whether words in the dev set are UNK. Shouldn't the vocab be built from the train set only?
Let me know if my interpretation is right :)
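To make the concern concrete, here's a toy sketch of the situation (the internals of build_vocab are an assumption here: suppose it keeps any token from the supplied rows that also appears in GloVe). A word like "pangolin" that occurs only in dev still gets a real id, not UNK, because dev tokens were part of the vocab build:

```python
glove = {"the", "cat", "sat", "pangolin"}   # toy stand-in for the GloVe vocab
train_tokens = ["the", "cat", "sat"]
dev_tokens = ["the", "pangolin"]            # "pangolin" appears only in dev

# vocab is built from train + dev, mirroring `full = train + dev` above
full = train_tokens + dev_tokens
vocab = ["<PAD>", "<UNK>"] + sorted({w for w in full if w in glove})
w2id = {w: i for i, w in enumerate(vocab)}

def to_id(tokens, unk_id=1):
    # same lookup pattern as prepro.py's to_id
    return [w2id.get(w, unk_id) for w in tokens]

# "pangolin" is in the vocab only because dev tokens were included:
assert w2id["pangolin"] != 1
```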

hitvoice (Owner) commented Mar 1, 2018

This will not affect the fairness of the dev set evaluation. Here's why:

  1. Whether a word is UNK is determined entirely by whether it appears in GloVe, which contains no information from the dev set.
  2. Word vectors may be finetuned during training, but a word that appears only in the dev set never receives a gradient update, so its vector stays equal to the original GloVe vector.

Since no information from the dev set affects the training process, this processing is fair. Please share your opinion if my thoughts are wrong :)
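Point 2 can be illustrated with a small sketch (this is a hypothetical illustration, not the repo's actual training code): finetuning is masked so only embeddings of words seen in the training data are updated, and a dev-only word keeps its initial (GloVe-style) vector.

```python
import numpy as np

vocab = ["<PAD>", "<UNK>", "model", "reading", "dev_only_word"]
train_words = {"model", "reading"}          # words seen during training

rng = np.random.default_rng(0)
emb = rng.normal(size=(len(vocab), 4))      # stand-in for GloVe vectors
emb_init = emb.copy()

# Mask: 1.0 for rows that may be finetuned, 0.0 for everything else.
tune_mask = np.array([[1.0 if w in train_words else 0.0] for w in vocab])

# One fake SGD step: a dense gradient arrives, but the mask zeroes out
# the update for any row whose word never occurs in the training data.
grad = rng.normal(size=emb.shape)
emb -= 0.1 * (tune_mask * grad)

# The dev-only word's vector is unchanged; a trained word's vector moved.
assert np.allclose(emb[vocab.index("dev_only_word")],
                   emb_init[vocab.index("dev_only_word")])
```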
