Corpus is an R text processing package with full support for international text (Unicode). It includes functions for reading data from newline-delimited JSON files, for normalizing and tokenizing text, for searching for term occurrences, and for computing term occurrence frequencies (including n-grams).
Corpus does not provide any language models, part-of-speech tagging, topic models, or word vectors, but it can be used in conjunction with other packages that provide these features.
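As a rough illustration of the reading and tokenizing features listed above, the following sketch writes a two-record newline-delimited JSON file to a temporary location, reads it back with read_ndjson, and tokenizes the result with text_tokens. The sample records and the field name "text" are made up for this example, and it assumes the package is already installed.

```r
library(corpus)

# Hypothetical sample data: two JSON records, one per line.
path <- tempfile(fileext = ".json")
writeLines(c('{"text": "Hello, World!"}',
             '{"text": "Text processing in R."}'), path)

# Read the records into a data frame, one row per line of the file.
data <- read_ndjson(path)

# Split each text into a character vector of tokens.
tokens <- text_tokens(data$text)
print(tokens)

# Clean up the temporary file.
unlink(path)
```

With the default settings, tokens are case-folded, so "Hello" comes back as "hello", and punctuation marks are kept as separate tokens unless you ask for them to be dropped.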
Corpus is available on CRAN. To install the latest released version, run the following command in R:
install.packages("corpus")
To install the latest development version, run the following:
tmp <- tempfile()
system2("git", c("clone", "--recursive",
shQuote("https://github.com/patperry/r-corpus.git"), shQuote(tmp)))
devtools::install(tmp)
Note that corpus uses a git submodule, so you cannot use devtools::install_github.
Here's how to get the most common non-punctuation, non-stop-word terms in The Federalist Papers:
> term_stats(federalist, drop = stopwords_en, drop_punct = TRUE)
   term         count support
1  government     825      85
2  state          787      85
3  people         612      85
4  one            544      85
5  new            324      85
6  york           151      85
7  publius         85      85
8  may            812      84
9  states         845      82
10 power          606      82
11 must           446      81
12 can            464      78
13 every          350      77
14 part           226      77
15 constitution   462      76
16 might          322      76
17 general        255      76
18 time           249      76
19 great          291      74
20 public         282      74
⋮  (8631 rows total)
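To make the count and support columns easier to read, here is a minimal sketch of the same call on a two-text toy input. The sample sentences are invented for illustration. Judging from the output above, count is the total number of occurrences of a term and support is the number of texts that contain it.

```r
library(corpus)

# Hypothetical toy input; any character vector of texts works here.
texts <- c("A rose is a rose is a rose.",
           "The rose is red.")

# Count terms, ignoring punctuation and English stop words.
stats <- term_stats(texts, drop = stopwords_en, drop_punct = TRUE)
print(stats)
```

In this input "rose" occurs four times across both texts, while "red" occurs once in a single text; the words "a", "is", and "the" are removed because they appear in stopwords_en.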
Here's how to find all instances of tokens that stem to "power":
> text_locate(federalist, "power", stemmer = "en")
text before instance after
1 1 …ay hazard a diminution of the power , emolument,\nand consequence …
2 1 …s. So numerous indeed and so\n powerful are the causes which serve to…
3 1 … of a temper fond of despotic power and\nhostile to the principle…
4 2 …der to vest it with requisite powers . It is well worthy\nof consid…
5 2 …head of each the same kind of powers which they are advised to\npl…
6 2 …\nwithout having been awed by power , or influenced by any passion…
7 3 …ment, vested with sufficient\n powers for all general and national …
8 3 … of nations towards all these powers , and to me it\nappears eviden…
9 3 …he wrong themselves, nor want power or\ninclination to prevent or…
10 3 …it will also be more in their power to\naccommodate and settle th…
11 3 …cy of little consideration or power .\n\nIn the year 1685, the sta…
12 3 …ain, or Britain, or any other POWERFUL nation?\n\nPUBLIUS.\n
13 4 … our advancement in union, in power and\nconsequence by land and …
14 4 …t can apply the resources and power of the whole to the\ndefense …
15 4 …\ncombining and directing the powers and resources of the whole, w…
16 5 …h tend to beget and\nincrease power in one part and to impede its…
17 6 … description are the love of\n power or the desire of pre-eminence…
18 6 …nd dominion--the jealousy of\n power , or the desire of equality an…
19 6 …rest of this enterprising and powerful monarch, he\nprecipitated Eng…
20 6 …rprising a passion as that of power or glory? Have there not\nbee…
⋮ (912 rows total)
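The same search can be tried on a small input to see how stemming widens the match. This is a sketch with made-up sentences: it assumes only the text_locate call shown above, and the variable names are illustrative.

```r
library(corpus)

# Hypothetical toy input with three inflections of the same stem.
texts <- c("Power tends to corrupt.",
           "The powers that be are powerful.")

# Locate every token whose English stem is "power".
hits <- text_locate(texts, "power", stemmer = "en")
print(hits)
```

Each row of the result identifies the text that contains the match and shows the matching token together with its surrounding context, as in the Federalist Papers output above.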
Here's how to get a term frequency matrix of all 1-, 2-, 3-, 4-, and 5-grams:
> system.time(x <- term_matrix(federalist, ngrams = 1:5))
user system elapsed
2.781 0.123 2.906
This computation uses only a single CPU, yet it still completes in under three seconds.
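To see what the resulting matrix looks like, here is a minimal sketch on a toy input, restricted to 1- and 2-grams to keep the output small. The sample texts are invented, and the sketch assumes the Matrix package is available, since the result is understood to be a sparse matrix with one row per text and one column per term.

```r
library(corpus)

# Hypothetical toy input; one row of the matrix per text.
texts <- c("a b c", "a b")

# Count all unigrams and bigrams in each text.
x <- term_matrix(texts, ngrams = 1:2)

# Two texts by five distinct terms: "a", "b", "c", "a b", "b c".
print(dim(x))
print(colnames(x))
```

Multi-word terms appear as column names with the component words separated by a space, so the bigram column for the first two tokens is named "a b".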
For a more complete introduction to the package, see the getting started guide and the other articles at corpustext.com.
Cite corpus with the following BibTeX entry:
@Manual{,
title = {corpus: Text Corpus Analysis},
author = {Patrick O. Perry},
year = {2017},
note = {R package version 0.10.0},
url = {http://corpustext.com},
}
The project maintainer welcomes contributions in the form of feature requests, bug reports, comments, unit tests, vignettes, or other code. If you'd like to contribute, either
- fork the repository and submit a pull request (note the nonstandard instructions for building from source);
- or contact the maintainer via e-mail.
This project is released with a Contributor Code of Conduct, and if you choose to contribute, you must adhere to its terms.
The API and feature set for corpus draw inspiration from quanteda, developed by Ken Benoit and collaborators; stringr, developed by Hadley Wickham; and tidytext, developed by Julia Silge and David Robinson.