ERWT//PEA Language Models

ERWT (Dutch noun; plural: erwten): name of various legumes.

PEA noun: pea; plural noun: peas: a spherical green seed that is eaten as a vegetable or as a pulse when dried.

Models are available on the Living with Machines HuggingFace page.

Consult the Model Cards for more information.

Models

Livingwithmachines/erwt-year: DistilBERT model trained on 0.5 billion tokens (see below for path to dataset). We divided the text into chunks of 100 words (i.e. no sentence splitting).

To use, please preprocess text as follows:

from transformers import pipeline

# load the fill-mask pipeline for the ERWT model
mask_filler = pipeline("fill-mask",
                       model='Livingwithmachines/erwt-year')

# the document year is prepended to the text, separated by the [DATE] special token
preds = mask_filler("1810 [DATE] [MASK] Majesty.")

Livingwithmachines/erwt-year-masked-25: DistilBERT model trained on the same data as above. Dates are added as standard tokens and separated from the text with a [DATE] special token. We masked the year token with a probability of 0.25.

from transformers import pipeline

mask_filler = pipeline("fill-mask",
                       model='Livingwithmachines/erwt-year-masked-25')

preds = mask_filler("1810 [DATE] [MASK] Majesty.")

Livingwithmachines/erwt-year-masked-75: DistilBERT model trained on the same data as above. Dates are added as standard tokens and separated from the text with a [DATE] special token. We masked the year token with a probability of 0.75.

from transformers import pipeline

mask_filler = pipeline("fill-mask",
                       model='Livingwithmachines/erwt-year-masked-75')

preds = mask_filler("1810 [DATE] [MASK] Majesty.")

hmd-text-only (TBC)

Livingwithmachines/hmd-text-only: DistilBERT model trained on HMD without additional prepended metadata.

text = f"[MASK] Majesty."
preds = mask_filler(text)

Model Card under construction 🚧

Livingwithmachines/pea-pol-st: DistilBERT trained with metadata on political leaning. The political labels are

['[lib]', '[con]', '[none]', '[rad]', '[neutr]']

Metadata and text are separated by a special [POL] token.

from transformers import pipeline

mask_filler = pipeline("fill-mask",
                       model='Livingwithmachines/pea-pol-st')

preds = mask_filler("[lib] [POL] [MASK] Majesty.")

Model Card under construction 🚧

Livingwithmachines/pea-pol: DistilBERT trained with metadata on political leaning. The political labels are from the tokenizer's standard vocabulary.

['liberal', 'conservative', 'none', 'radical', 'neutral']

Metadata and text are separated by a special [POL] token.

from transformers import pipeline

mask_filler = pipeline("fill-mask",
                       model='Livingwithmachines/pea-pol')

preds = mask_filler("liberal [POL] [MASK] Majesty.")

Model Card under construction 🚧

Livingwithmachines/pea-year-pol-loc: DistilBERT trained with metadata on year, political leaning and location. The political labels are from the tokenizer's standard vocabulary.

['liberal', 'conservative', 'none', 'radical', 'neutral']

The same applies to the year and location values; the location labels are

['liverpool', 'london']

Metadata and text are separated by the special [DATE], [POL] and [LOC] tokens.

from transformers import pipeline

mask_filler = pipeline("fill-mask",
                       model='Livingwithmachines/pea-year-pol-loc')

preds = mask_filler("1861 [DATE] liberal [POL] london [LOC] [MASK] Majesty.")

Data

Frozen corpus

Main corpus, to be kept constant during all experiments. Saved in Arrow format. To load the dataset, use:

from datasets import load_from_disk
train_val_split = load_from_disk('/datadrive_2/frozen_corpus')

/datadrive_2/frozen_corpus: The frozen_corpus was constructed by chunking the complete HMD into segments of a hundred words, after which we downsampled to approximately half a billion tokens. The content should look as follows.

DatasetDict({
    train: Dataset({
        features: ['year', 'nlp', 'pol', 'loc', 'sentences', 'ocr', 'length'],
        num_rows: 5234550
    })
    test: Dataset({
        features: ['year', 'nlp', 'pol', 'loc', 'sentences', 'ocr', 'length'],
        num_rows: 581857
    })
})

The sentences column contains the hundred-word chunks (the column should be renamed). loc indexes the place of publication as the special tokens [liverpool] and [london]. pol contains the political leaning of the paper ([conservative], [liberal], etc.).

The ocr and length columns are derived from the original article to which the segment belonged and refer to the average OCR quality and the number of tokens in the article, respectively.
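
These columns make it easy to slice the corpus with the standard datasets API; for example, to keep only segments from articles with high OCR quality (the 0.9 threshold below is just an illustration):

from datasets import load_from_disk

train_val_split = load_from_disk('/datadrive_2/frozen_corpus')
# keep only segments whose source article had a high average OCR quality
high_quality = train_val_split['train'].filter(lambda row: row['ocr'] > 0.9)
print(high_quality)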

HMD Enriched

/datadrive_2/HMD_context: original HMD corpus, but with political leaning and NLP identifier added to each example.
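
Assuming it is saved in the same Arrow format as the frozen corpus, it should be loadable in the same way:

from datasets import load_from_disk
hmd_context = load_from_disk('/datadrive_2/HMD_context')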

Code

Code is currently stored in /datadrive/bNERT (should be pushed to GitHub after cleaning). The main notebook is PrepareDataset.ipynb, which includes the data loading, preprocessing and training code. This is still under construction and will be cleaned up later on.

Deprecated

bnert_time

/datadrive_2/bnert_time: DistilBERT model created by earlier experiments. Trained on sentences longer than 10 tokens and with an OCR quality higher than 0.75. For this model we simply prepended the sentences with the year and a [SEP] special token. Example of use:

from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("/datadrive_2/bnert_time")
mask_filler = pipeline(
    "fill-mask", model="/datadrive_2/bnert_time", top_k=10, tokenizer=tokenizer
)

text = "1810 [SEP] [MASK] Majesty."
preds = mask_filler(text)

Original train-test split is lost.

erwt-year-st

Livingwithmachines/erwt-year-st: DistilBERT model trained on 0.5 billion tokens (see below for path to dataset). We divided the text into chunks of 100 words (i.e. no sentence splitting).

To use, please preprocess text as follows:

text = f"[1810] [SEP] [MASK] Majesty."
preds = mask_filler(text)
