one sentence from the JNLPBA dataset, visualized with doccano
pip install -r requirements.txt
python -m spacy download en_core_web_sm
python -c "from util.data_io import download_data; download_data('http://nlp.cs.washington.edu/sciIE/data','sciERC_processed.tar.gz','data',unzip_it=True)"
git clone https://github.com/allenai/scibert.git
see scibert/data/ner/JNLPBA
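The JNLPBA files in that folder appear to be plain CoNLL-style: one token and its IOB tag per line, blank lines between sentences. A minimal reader sketch under that assumption (column separator assumed to be whitespace):

```python
# minimal sketch of a CoNLL-style NER reader: token + IOB tag per line,
# blank line between sentences; separator assumed to be whitespace
from typing import List, Tuple

def read_conll(path: str) -> List[Tuple[List[str], List[str]]]:
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # sentence boundary
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            parts = line.split()
            tokens.append(parts[0])
            tags.append(parts[-1])
    if tokens:  # last sentence if the file does not end with a blank line
        sentences.append((tokens, tags))
    return sentences

# usage: sents = read_conll("scibert/data/ner/JNLPBA/train.txt")
```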
- sequence tagger: spaCy features + crfsuite
- 5 repetitions of 10 active-learning "steps" each
- entropy/uncertainty-based sampling does not seem beneficial while the model is still weak (too little training data, or a too shallow model?); see the sampling sketch after this list
- trained on 20% of the train data, evaluated on the test set (which is not split further)
- why does FARM perform so badly? where is the bug?
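As a concrete reference for the uncertainty-based sampling mentioned above: a minimal sketch of entropy-based selection with sklearn-crfsuite. The feature extraction is deliberately simplified (the runs above use spaCy features), and all names here are illustrative, not the actual experiment code.

```python
# minimal sketch of entropy-based (uncertainty) sampling with sklearn-crfsuite
import math
import sklearn_crfsuite

def token_features(sent, i):
    # toy features; the experiments above use spaCy-derived features instead
    word = sent[i]
    return {"lower": word.lower(), "is_title": word.istitle(), "is_digit": word.isdigit()}

def sent_features(sent):
    return [token_features(sent, i) for i in range(len(sent))]

def sentence_entropy(marginals):
    # mean per-token entropy of the CRF's marginal label distributions
    ent = sum(-sum(p * math.log(p + 1e-12) for p in token.values()) for token in marginals)
    return ent / max(len(marginals), 1)

def select_most_uncertain(crf, unlabeled_sents, k=10):
    X = [sent_features(s) for s in unlabeled_sents]
    scores = [sentence_entropy(m) for m in crf.predict_marginals(X)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]  # indices of the k sentences the model is most unsure about

# usage (labeled_sents/labeled_tags/unlabeled_sents assumed to exist):
# crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
# crf.fit([sent_features(s) for s in labeled_sents], labeled_tags)
# to_annotate = select_most_uncertain(crf, unlabeled_sents, k=100)
```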
sequence tagging with transformers + pytorch-lightning
git clone https://github.com/dertilo/transformers.git
git checkout lightning_examples
cd transformers/examples && pip install -r requirements.txt
- on frontend:
OMP_NUM_THREADS=2 wandb init
- on frontend:
OMP_NUM_THREADS=8 bash download_data.sh
- on node (see the preprocessing sketch after these steps):
python preprocess.py --model_name_or_path bert-base-multilingual-cased --max_seq_length 128
- on node:
export PYTHONPATH=~/transformers/examples
- on frontend, to download the pretrained model:
OMP_NUM_THREADS=8 python3 run_pl_ner.py --data_dir ./ --labels ./labels.txt --model_name_or_path $BERT_MODEL --do_train
- on node, the actual training run:
PYTHONPATH=~/transformers/examples WANDB_MODE=dryrun python ~/transformers/examples/token-classification/run_pl_ner.py --data_dir ./ \
--labels ./labels.txt \
--model_name_or_path bert-base-multilingual-cased \
--output_dir germeval2014 \
--max_seq_length 128 \
--num_train_epochs 3 \
--train_batch_size 32 \
--seed 1 \
--do_train \
--do_predict
- sync with wandb:
OMP_NUM_THREADS=2 wandb sync wandb/dryrun-...
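For reference, the preprocessing step above is assumed to work like the upstream token-classification example: re-split sentences so that no example exceeds max_seq_length subword tokens after BERT tokenization. A minimal sketch of that splitting logic (CoNLL-style input, one token+tag per line; names are illustrative, not the fork's actual script):

```python
# minimal sketch of max_seq_length splitting: insert a sentence break whenever the
# running subword count would exceed the budget (room for [CLS]/[SEP] is reserved)
from transformers import AutoTokenizer

def split_long_sentences(in_path, out_path,
                         model_name="bert-base-multilingual-cased", max_len=128):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    budget = max_len - tokenizer.num_special_tokens_to_add()
    subword_count = 0
    with open(in_path, encoding="utf-8") as f_in, open(out_path, "w", encoding="utf-8") as f_out:
        for line in f_in:
            line = line.rstrip()
            if not line:  # original sentence boundary
                f_out.write("\n")
                subword_count = 0
                continue
            n_subwords = len(tokenizer.tokenize(line.split()[0]))
            if subword_count + n_subwords > budget:  # would overflow -> break sentence here
                f_out.write("\n")
                subword_count = 0
            subword_count += n_subwords
            f_out.write(line + "\n")
```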
- results after 3 epochs in ~20 minutes:
TEST RESULTS
{'avg_test_loss': tensor(0.0733),
'f1': 0.8625160051216388,
'precision': 0.8529597974042419,
'recall': 0.8722887665911299,
'val_loss': tensor(0.0733)}
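To sanity-check the fine-tuned model interactively, something like this can be used, assuming a checkpoint in Hugging Face format ends up under the output_dir (the exact export path depends on how run_pl_ner.py saves checkpoints):

```python
# minimal sketch: load the fine-tuned NER model and tag a German sentence;
# the checkpoint directory is an assumption, adjust it to where the model was saved
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_dir = "germeval2014"  # or a subdirectory, depending on how checkpoints are exported
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForTokenClassification.from_pretrained(model_dir)

ner = pipeline("ner", model=model, tokenizer=tokenizer)
print(ner("Angela Merkel besuchte die Technische Universität Berlin."))
```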