SGNS results #7

Alaa-Ebshihy · 2018-01-10T12:45:18Z

Hi,

I am having trouble regenerating the SGNS embeddings on the Google Ngram corpus.

I follow these steps:

1. Use histwords/googlengram/pullscripts/posgrab.py to generate counts for 1-grams.
2. Use histwords/googlengram/pullscripts/downloadandsplit.py, then histwords/googlengram/pullscripts/gramgrab.py (with context set to 4).
3. Use histwords/googlengram/pullscripts/runmerge.py on the output of step 2, then histwords/googlengram/pullscripts/indexmerge.py.
4. Use histwords/googlengram/freqperyear.py on the output of step 3.
5. Use histwords/googlengram/makedecades.py on the output of step 3.
6. Use histwords/sgns/makecorpus.py, passing the outputs of steps 1, 4 and 5.
7. Train the embeddings using histwords/sgns/runword2vec.py (with the --sequential option).
8. Use histwords/sgns/postprocessingsgns.py on the trained data.

My problem is that the vectors I generate are not the same as the pre-trained vectors at http://snap.stanford.edu/historical_embeddings/eng-all_sgns.zip. My vocabulary size is about 50000, while yours is 100000.
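
One way to narrow this down would be to compare the two vocabularies directly. The following is only a minimal sketch: `compare_vocabs` is a hypothetical helper, not part of histwords, and it assumes both vocabularies have already been loaded as plain Python lists of words (how to load them depends on the file format and is not shown here).

```python
def compare_vocabs(mine, reference):
    """Summarize how two vocabularies (lists of words) overlap.

    Hypothetical helper for illustration; not part of histwords.
    """
    mine_set = set(mine)
    ref_set = set(reference)
    return {
        "mine_size": len(mine_set),
        "reference_size": len(ref_set),
        "shared": len(mine_set & ref_set),
        # Sorted so the output is stable and easy to inspect by eye.
        "only_in_mine": sorted(mine_set - ref_set),
        "only_in_reference": sorted(ref_set - mine_set),
    }


if __name__ == "__main__":
    # Toy word lists for illustration only; these are not real vocabularies.
    my_vocab = ["the", "of", "and"]
    reference_vocab = ["the", "of", "and", "gay", "broadcast"]
    summary = compare_vocabs(my_vocab, reference_vocab)
    for key, value in summary.items():
        print(key, value)
```

Looking at which words appear only in the reference vocabulary might show whether the missing half is, for example, lower-frequency words, which would point to a frequency cutoff or vocabulary-selection difference rather than a training problem.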

So my question is: is there anything wrong in the steps I followed? Or can you give me any information on why this happens?

Thanks,
