RCALLSTRINGDIST: Call R's stringdist package from Stata using rcall

Current version: 0.3.0 11jul2019
Contents: updates description install usage author

Major updates

0.3.0 11jul2019:
- adds additional options to clean strings before comparison (ignorecase, ascii, whitespace, punctuation)
- data is now exchanged with R through .dta files instead of csv (faster when merging with the original data), using the R package haven (which appears to handle diacritics better than readstata13)
- small speed improvements
0.2.0 16apr2019:
- adds several options: matrix (for one and two variables), duplicates, and sortwords
- (0.2.3) significant increases in speed
0.1.0 15apr2019:
- first version of the command

Description

This command uses rcall to call R's stringdist. It allows the user to obtain various measures of distances between text strings.

I'd like to thank the authors of both packages:

stringdist was written by Mark van der Loo, Jan van der Laan, R Core Team, Nick Logan, and Chris Muir.
rcall was written by E. F. Haghish.

Alternatives

StataStringUtilities by William Buchanan

Install

Install R directly or with RStudio for a graphical interface.
Install this package using the github command by E. F. Haghish. This will also install dependencies automatically.

net install github, from("https://haghish.github.io/github/") replace
github install luispfonseca/stata-rcallstringdist

Dependencies

This Stata package requires R, the stringdist and haven R packages, and the rcall Stata package.

Optional dependencies:

Commands from gtools by Mauricio Caceres Bravo are used to speed up the command when available.

If R is installed on your machine, all of these dependencies are installed automatically when you follow the instructions above: the file dependency.do runs automatically after the rcallstringdist package is installed. Make sure R is installed before you attempt to install these packages in Stata.
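
To confirm that the setup worked, the checks below may help. This is only a sketch and not part of the package itself: it assumes rcall's colon syntax for passing a single line of code to R, and the message text is made up for illustration.

* Stata-side dependencies: which stops with an error if a command is not found
which rcall
which rcallstringdist
* gtools is optional, so only report whether it is available
capture which gtools
if _rc {
    display "gtools not found (optional)"
}
* R-side dependencies: these should report an error if R cannot load the packages
rcall: library(stringdist)
rcall: library(haven)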

Usage

See the help file in Stata for details about each option.

* Comparing two lists of strings
clear
input str30 nameA
"Gates Bill"
"Gates, Bill"
"bill gates"
"William H. Gates III"
end

input str30 nameB
"Bill Gates"
"Bill Gates"
"Bill Gates"
"William Henry Gates III"
end

compress

** Comparing two variables, row by row
*** default method (osa), default arguments, default generated variable name
rcallstringdist nameA nameB
*** specific variable names
rcallstringdist nameA nameB, gen(osa)
rcallstringdist nameA nameB, method(cosine) q(3) gen(cosine)
*** sometimes it's worth sorting words within each string. 
*** the first row will now be a perfect match
rcallstringdist nameA nameB, gen(osa_sortw) sortwords
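*** a quick way to check the results so far (plain Stata, not part of the package):
*** list the strings next to the distances generated above; lower means closer
list nameA nameB osa cosine osa_sortw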
*** it can also be worth cleaning up the strings before comparing them
*** (e.g. lowercase, remove punctuation and diacritics)
gen nameAclean = lower(nameA)
gen nameBclean = lower(nameB)
rcallstringdist nameAclean nameBclean, gen(osa_clean)
rcallstringdist nameAclean nameBclean, gen(osa_clean_sortw) sortwords
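*** version 0.3.0 (see Major updates) lists built-in cleaning options:
*** ignorecase, ascii, whitespace, punctuation
*** a sketch, assuming they are passed as plain options like sortwords;
*** check the help file for the exact syntax before relying on this
rcallstringdist nameA nameB, gen(osa_ignorecase) ignorecase
rcallstringdist nameA nameB, gen(osa_cleanopts) ignorecase punctuation sortwords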

** Comparing two variables, all possible combinations
*** by calling the matrix option, we can compare all possible combinations 
*** of strings from one variable with the other variable
*** be aware: this option will clear your current working dataset from memory
*** see the following example
clear
input str30 nameA
"Gates Bill"
"Gates, Bill"
"bill gates"
"William H. Gates III"
"Bill Gates"
"Bill Gates"
end

input str30 nameB
"Bill Gates"
"William Henry Gates III"
"Bill Gates"
end

compress

save example_dataset, replace

*** each string of nameA will be compared with each string of nameB
*** nameA has 5 unique strings, while nameB has 2
*** 10 pairs will be compared
rcallstringdist nameA nameB, matrix
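*** the pairwise comparisons now replace the data in memory;
*** plain Stata commands (not part of the package) can be used to inspect them
describe
list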

* Comparing one list of strings with itself, all possible combinations
*** if only one variable is passed, compare all pairs of strings within
*** we have 5 unique strings, 5x4/2=10 combinations
use example_dataset, clear
rcallstringdist nameA, matrix
*** to keep all permutations (5x4=20), we can use the keepduplicates option
use example_dataset, clear
rcallstringdist nameA, matrix keepduplicates

Author

Luís Fonseca
London Business School
lfonseca london edu
https://luispfonseca.com
