Phonetic Word Similarity

A novel method for comparing the phonetic similarity of words based on phonetic features. This is the official repository for the paper at https://arxiv.org/pdf/2109.14796.pdf

Preparing dataset and environment

Downloading

Download the CMU Pronouncing Dictionary into the data directory.

wget -P data http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b
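For orientation, each entry in cmudict-0.7b is one line holding an uppercase word followed by its ARPAbet phonemes. Comment lines start with ;;; and alternate pronunciations are marked as WORD(1). The following is a minimal sketch of a parser under that assumed layout. The name parse_cmudict is hypothetical and is not part of this repository, and the sample lines are illustrative only.

```python
import re
from collections import defaultdict


# Hypothetical helper for illustration; not part of this repository.
def parse_cmudict(lines):
    """Map each word to a list of pronunciations (lists of ARPAbet phonemes)."""
    entries = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and ;;; comment lines.
        if not line or line.startswith(";;;"):
            continue
        word, *phones = line.split()
        # Fold alternate pronunciations such as READ(1) into READ.
        word = re.sub(r"\(\d+\)$", "", word)
        entries[word].append(phones)
    return dict(entries)


# Illustrative sample in the assumed format.
sample = [
    ";;; comment line",
    "RELATION  R IY0 L EY1 SH AH0 N",
    "READ  R EH1 D",
    "READ(1)  R IY1 D",
]
pronunciations = parse_cmudict(sample)
print(pronunciations["READ"])
```

To run this on the downloaded file, pass an open file handle instead of the sample list. The file contains a few non-ASCII characters, so opening it with encoding="latin-1" is a reasonable assumption.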

Download the SOTA model vocab from the NLP for Hindi git repo.

wget -O data/hindi_lm_large.vocab "https://drive.google.com/uc?export=download&id=1P6r8UBcegvVmr1kBDjqcYppmt_WgnbNt"

Preparing

Add the missing words to the CMU dictionary.

cat data/cmudict-0.7b res/cmudict_missing_words > data/cmudict-0.7b-with-vitz-nonce

Install all the dependencies.

pip install -r src/requirements.txt

Generate the Hindi dictionary from the LM vocab.

python src/preprocess/vocab2dict.py res/hindi_phones.csv data/hindi_lm_large.vocab data/dict_hindi

Algorithm results

results_method.ipynb contains the results for the algorithm. The results include:

Comparison between unigram, bigram, bigram with penalty, and bigram with penalty & vowel weight.

How we obtained the penalty of 2.5.

Comparison between Vitz and Winkler (1973), Parrish's Embeddings (2017), and our methods (with and without vowel weights).

^ The Parrish's Embeddings (PSSVec) results were generated from the author's published git code with numpy.random.seed(0) set in generate.py. We cannot use the author's pretrained vectors because the dictionary they used lacks the word BELATION, which appears in the RELATION dataset of Vitz and Winkler (1973).

The similarity vectors we used to compute the PSSVec results can be downloaded with:

wget -O data/cmudict-0.7b-simvecs "https://drive.google.com/uc?export=download&id=1gCvwI8ldxGM52vCoN70wUKmJfFMdapNl"
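To make the unigram versus bigram comparison listed above more concrete, here is an illustrative sketch of Jaccard similarity over phoneme n-gram sets. It is a simplification and not the repository's implementation. It works on raw phoneme symbols and leaves out the phonetic features, the penalty, and the vowel weight. The function names are hypothetical, and the stress-free phoneme sequences for RELATION and the nonce word BELATION are assumed for the example.

```python
# Hypothetical helpers for illustration; not the repository's implementation.
def ngrams(phones, n):
    """Return the set of contiguous n-grams in a phoneme sequence."""
    return {tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def ngram_similarity(phones_a, phones_b, n):
    """Jaccard similarity between the n-gram sets of two phoneme sequences."""
    return jaccard(ngrams(phones_a, n), ngrams(phones_b, n))


# Assumed phoneme sequences with stress markers stripped.
relation = ["R", "IY", "L", "EY", "SH", "AH", "N"]
belation = ["B", "IY", "L", "EY", "SH", "AH", "N"]

unigram_score = ngram_similarity(relation, belation, 1)  # 6 shared of 8 total
bigram_score = ngram_similarity(relation, belation, 2)   # 5 shared of 7 total
print(unigram_score, bigram_score)
```

Bigrams take phoneme order into account, so a single substituted phoneme changes every bigram it takes part in, whereas it changes only one unigram.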

Train embedding

Embedding scores can be re-generated using src/embedding.py by providing the learned embedding file and the output file.

python src/embedding.py data/cmudict-0.7b-simvecs res/PSSVec_results.csv
python src/embedding.py embedding_english/simvecs res/embedding_score.csv

^ These files are used to generate the scores in the results section using results_method.ipynb.
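Scoring a word pair from learned embeddings usually comes down to a vector similarity. The sketch below uses cosine similarity over a plain dictionary of word vectors. It is an assumption for illustration only: how src/embedding.py reads its input and computes its scores is not described here, the function name is hypothetical, and the toy vectors are made up.

```python
import math


# Hypothetical helper for illustration; not taken from src/embedding.py.
def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_u * norm_v)


# Made-up toy vectors, not real embeddings.
toy_embeddings = {
    "RELATION": [1.0, 2.0, 0.0],
    "BELATION": [2.0, 4.0, 0.0],
    "SKY": [0.0, 0.0, 3.0],
}

close_pair = cosine_similarity(toy_embeddings["RELATION"], toy_embeddings["BELATION"])
far_pair = cosine_similarity(toy_embeddings["RELATION"], toy_embeddings["SKY"])
print(close_pair, far_pair)
```

Cosine similarity ignores vector length, so parallel vectors score close to 1 and orthogonal vectors score 0.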

Embedding results

TSNE Plot for some English words

TSNE Plot for some Hindi words

Pun Dataset (see docs/puns.md for more details)

Docker

Docker is supported for development and training.

Building docker image

make build

Running an interactive docker container.

make develop

This will give you a command prompt inside the Docker container. The current directory will be mounted at /workspace. The container will be destroyed on exit, but all files and changes made in that directory will persist.

You can also start it with GPU support:

make develop_gpu

Removing the image.

make clean

Note that this does not delete the base image. To remove the base image, run:

make clean_base

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

Hat tip to anyone whose code was used
