strcompress shrinks Stata data files by running compress after converting string variables to Stata's "long string" (strL) format.
net install strcompress, from(https://raw.github.com/lukestein/strcompress/master/)
strcompress
strcompress can result in significant space savings, especially for large files containing string variables with different-length contents, repeated contents, or many missing values.
For example, strcompress shrinks this 2.5 GB file containing census data from IPUMS by 41% (versus 25% from running compress alone):
Converting a string variable to strL may initially make it require more space, but in these cases the compress step will convert the variable back to a fixed-length string (str#). (I’m basically certain that your file will always wind up smaller than it started.) For example:
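A minimal sketch of this round trip, using made-up data (the variable name code and its contents are illustrative assumptions, not taken from the package). It promotes a short string to strL by hand and then lets compress decide whether to demote it again:

```stata
* Hypothetical toy data: 1,000 observations of a very short string
clear
set obs 1000
generate str5 code = "ab" + string(mod(_n, 10))

* Promote to strL; for short values like these, strL is likely the larger format
recast strL code
describe code

* compress may demote the variable back to a str# type if that saves space
compress code
describe code
```

Comparing the storage type reported by the two describe calls shows whether compress reverted the conversion.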
With an optional varlist, strcompress will only attempt strL conversion and compression on the named variables.
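As a rough illustration of the varlist form, the following assumes strcompress is already installed and uses Stata's bundled auto dataset, whose only string variable is make:

```stata
* Load the example dataset shipped with Stata
sysuse auto, clear

* Restrict strL conversion and compression to the named variable only
strcompress make

* Inspect the resulting storage type
describe make
```

On a file with many string variables, naming only the largest ones is one way to keep run time down.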
Other notes:
- strL variables cannot be used as key variables in merges.
- strL variables cannot be used with the -fillin- command.
- Running strcompress can take a while on large files with large string variables.
- If you have string variables that take on a reasonable number of distinct values, it may be a good idea to encode them before running strcompress; you may want to use unique (or gunique) to identify good candidates for encoding.
- This code is doing nothing fancy; you can achieve exactly the same results just by running, e.g., recast strL var1 var2 var3 followed by compress.
- To check if there are updates, just run ado update strcompress.
- Inspired by this Twitter thread.
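The encoding and manual-equivalent notes above can be sketched as one small workflow. This is an illustration under stated assumptions rather than the package's actual code: the variable origin is invented for the example, the built-in codebook command stands in for unique/gunique, and selecting string variables with ds is just one plausible way to build the variable list.

```stata
sysuse auto, clear

* Hypothetical low-cardinality string variable, created only for illustration
generate str8 origin = cond(foreign, "Foreign", "Domestic")

* Built-in check of the number of distinct values (swapped in for unique/gunique)
codebook origin, compact

* Few distinct values, so store it as a labeled numeric variable instead
encode origin, generate(origin_code)
drop origin

* Manual equivalent of the remaining steps: convert strings to strL, then compress
ds, has(type string)
recast strL `r(varlist)'
compress
```

Encoding first means the low-cardinality variable never goes through strL conversion at all, which is the point of the note.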