docs object expects all word frequencies to be 1 - transformation from dfm object (quanteda) #10

Open
JonasRieger opened this issue Jun 29, 2021 · 2 comments
Labels
usability Enhancement of user friendliness

Comments

@JonasRieger
Owner

For technical reasons, the docs object expects every word entry to have frequency 1. A word that occurs several times in a document must therefore appear several times, each time with frequency 1.
The quanteda package provides dfm objects, which also allow frequencies greater than 1. If you do your preprocessing in quanteda and use quanteda::dfm2lda to convert your object into the required structure, you need one more step to meet the requirements of the docs object. Just execute the following line:

docs = lapply(docs, function(x) rbind(rep(x[1,], x[2,]), 1))

This repeats each word as often as it occurs in the document and avoids the error message all(sapply(docs, function(x) all(x[2, ] == 1))) is not TRUE raised by LDARep and similar functions.
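To make the effect of that line concrete, here is a minimal sketch on a single toy document. The word ids and counts are made up for illustration, and the two-row layout (ids in row 1, counts in row 2) is assumed to match what the conversion from a dfm produces.

```r
# Toy document in the two-row layout: row 1 = word ids, row 2 = counts.
# The ids and counts are invented for illustration only.
x <- rbind(c(0L, 3L, 7L), c(2L, 1L, 3L))

# Repeat each word id as often as its count, then set every count to 1.
expanded <- rbind(rep(x[1, ], x[2, ]), 1)

expanded[1, ]            # word ids: 0 0 3 7 7 7
ncol(expanded)           # 6, the total number of tokens in the document
all(expanded[2, ] == 1)  # TRUE, the condition from the error message
```

A word with count 0 would simply be dropped, since rep() repeats it zero times.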

@JonasRieger JonasRieger added the usability Enhancement of user friendliness label Jun 29, 2021
@abitter

abitter commented Nov 5, 2021

Unfortunately, this yields a numeric (double) matrix (at least in R 4.1.1), whereas LDARep expects an integer matrix.
There might be a more elegant solution, but this did the trick for me:

docs <- lapply(docs, function(x) rbind(rep(as.integer(x[1,]), as.integer(x[2,])), as.integer(1)))

@JonasRieger
Owner Author

Yeah, you're right.

docs = convert(dfmat, "lda")$documents
docs = lapply(docs, function(x) rbind(rep(x[1,], x[2,]), 1L))

should do it as well.
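The difference between 1 and 1L can be checked without quanteda. The sketch below wraps the line above in a helper called expand_docs, which is a hypothetical name and not part of any package, and runs it on made-up documents. It assumes the input matrices are already stored as integers; if they are not, the as.integer() variant from the previous comment is the safer choice.

```r
# Hypothetical helper (not part of any package): expand counts and
# keep integer storage by binding the integer literal 1L.
expand_docs <- function(docs) {
  lapply(docs, function(x) rbind(rep(x[1, ], x[2, ]), 1L))
}

# Made-up input: two documents, word ids in row 1 and counts in row 2.
docs <- list(
  rbind(c(0L, 3L), c(2L, 1L)),
  rbind(c(1L, 4L, 5L), c(1L, 1L, 2L))
)
docs <- expand_docs(docs)

# The condition quoted in the error message now holds ...
all(sapply(docs, function(x) all(x[2, ] == 1)))
# ... and every matrix is stored as integer rather than double.
all(sapply(docs, is.integer))

# For comparison: binding the double literal 1 coerces the matrix to double.
typeof(rbind(c(0L, 0L, 3L), 1))
typeof(rbind(c(0L, 0L, 3L), 1L))
```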
