Complex Word Identification (CWI) for French medical documents

KimChengSHEANG/MedCWI

Install dependencies

Tested on Python 3.7.5 with pyenv.

pip install -r requirements.txt

CNN model

To train the CNN model, run the following script:

python scripts/train_cnn.py

The first run takes around 30 minutes (not including training) to download resources and preprocess the data.

To customize the features, edit the features_args list in each training script, e.g.,

# features containing only CamemBert embedding
features_args = ['CamemBertEmbeddingFeature']

# features containing only FastText Embedding and Word Length
features_args = ['FastTextEmbeddingFeature', 'WordLengthFeature']

# n=1 trains the model once; n=5 trains it 5 times
train_and_evaluate_n_times(features_args, n=1)

All model checkpoints and reports are saved to the folder models/FR/*

Features

  • FastTextEmbeddingFeature
  • CamemBertEmbeddingFeature
  • WordLengthFeature
  • WordSyllableFeature
  • VowelCountFeature
  • TFIDFFeature
  • WordRankFeature
  • LangGenFrequencyFeature
  • ClearFrequencyFeature
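
To give a feel for what the simpler surface features in the list above could compute, here is a minimal sketch. It is not the repository's implementation: the function names, the vowel set, and the vowel-group heuristic for syllables are all assumptions made for illustration, and the real WordLengthFeature, VowelCountFeature and WordSyllableFeature classes may behave differently.

```python
import re
import unicodedata

# Assumed set of French vowels, including common accented forms (illustrative only).
VOWELS = "aeiouyàâäéèêëîïôöùûüÿœæ"


def _normalize(word: str) -> str:
    # Lowercase and compose accents so that "é" is a single character.
    return unicodedata.normalize("NFC", word.lower())


def word_length(word: str) -> int:
    """Number of characters in the word."""
    return len(_normalize(word))


def vowel_count(word: str) -> int:
    """Number of vowel characters in the word."""
    return sum(1 for ch in _normalize(word) if ch in VOWELS)


def rough_syllable_count(word: str) -> int:
    """Crude syllable estimate: the number of maximal runs of vowels."""
    groups = re.findall(f"[{VOWELS}]+", _normalize(word))
    return max(1, len(groups))


def surface_features(word: str) -> list:
    """Hypothetical helper bundling the three values into one vector."""
    return [word_length(word), vowel_count(word), rough_syllable_count(word)]


if __name__ == "__main__":
    for w in ["cardiologie", "hépatite"]:
        print(w, surface_features(w))
```

Counting vowel runs is only a rough approximation of French syllabification (it ignores silent final "e", for instance), but it shows the kind of cheap lexical signal that can complement embedding features.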

CatBoost Model

python scripts/train_catboost.py

Results

  • The report for each training and evaluation run is stored in the folder /models/CNN|CatBoost|XGBoost/*/reports.txt
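
After several runs, it can be handy to gather all the reports in one place. The following is a small sketch that assumes the reports are plain UTF-8 text files named reports.txt somewhere under a models directory relative to the working directory. The helper names find_reports and print_reports are hypothetical and not part of this repository.

```python
from pathlib import Path


def find_reports(models_dir="models"):
    """Return every reports.txt found under models_dir, sorted by path."""
    root = Path(models_dir)
    if not root.is_dir():
        # Nothing has been trained yet, or the path is wrong.
        return []
    return sorted(root.rglob("reports.txt"))


def print_reports(models_dir="models"):
    """Print each report preceded by a header showing where it came from."""
    for path in find_reports(models_dir):
        print(f"=== {path} ===")
        print(path.read_text(encoding="utf-8"))


if __name__ == "__main__":
    print_reports()
```

Searching recursively avoids hard-coding the per-model subfolder layout, so the same helper works regardless of which models have been trained.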
