The goal of wordpiece.data is to provide stable, versioned data for use in the {wordpiece} tokenizer package.
You can install the released version of wordpiece.data from CRAN with:
install.packages("wordpiece.data")
And the development version from GitHub with:
# install.packages("remotes")
remotes::install_github("macmillancontentscience/wordpiece.data")
The datasets included in this package were retrieved from Hugging Face (specifically, the bert-base-cased and bert-base-uncased vocabularies). They were then processed using the {wordpiece} package. This is slightly circular, because this package is itself a dependency of {wordpiece}.
# Download the cased vocabulary released with bert-base-cased.
vocab_txt <- tempfile(fileext = ".txt")
download.file(
  url = "https://huggingface.co/bert-base-cased/resolve/main/vocab.txt",
  destfile = vocab_txt
)

# Parse the raw vocabulary with {wordpiece}.
parsed_vocab <- wordpiece::load_vocab(vocab_txt)

# Name the file wordpiece_cased_<vocabulary size>.rds.
rds_filename <- paste0(
  paste(
    "wordpiece",
    "cased",
    length(parsed_vocab),
    sep = "_"
  ),
  ".rds"
)

# Save the parsed vocabulary into inst/rds, then remove the download.
saveRDS(parsed_vocab, here::here("inst", "rds", rds_filename))
unlink(vocab_txt)
# Repeat for the uncased vocabulary released with bert-base-uncased.
vocab_txt <- tempfile(fileext = ".txt")
download.file(
  url = "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt",
  destfile = vocab_txt
)
parsed_vocab <- wordpiece::load_vocab(vocab_txt)

# Name the file wordpiece_uncased_<vocabulary size>.rds.
rds_filename <- paste0(
  paste(
    "wordpiece",
    "uncased",
    length(parsed_vocab),
    sep = "_"
  ),
  ".rds"
)

saveRDS(parsed_vocab, here::here("inst", "rds", rds_filename))
unlink(vocab_txt)
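As a quick sanity check (not part of the original pipeline), you can reload a saved file and confirm that the vocabulary round-trips; this sketch assumes the working directory is the package root used above:

# Reload the most recently saved vocabulary and confirm it round-trips.
reloaded <- readRDS(here::here("inst", "rds", rds_filename))
identical(reloaded, parsed_vocab)  # should be TRUE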
You likely won’t ever need to use this package directly. It contains a function, wordpiece_vocab(), to load the vocabulary data used by {wordpiece}.
library(wordpiece.data)
head(wordpiece_vocab())
#> [1] "[PAD]" "[unused0]" "[unused1]" "[unused2]" "[unused3]" "[unused4]"
Please note that the wordpiece.data project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
This is not an officially supported Macmillan Learning product.
Questions or comments should be directed to Jonathan Bratt (jonathan.bratt@macmillan.com) and Jon Harmon (jonthegeek@gmail.com).