A package for encoding and decoding arbitrary byte data to and from strands of DNA using a robust error-correcting code (ECC).
William H. Press, John A. Hawkins, Stephen Knox Jones Jr, Jeffrey M. Schaub, and Ilya J. Finkelstein
Proc Natl Acad Sci. accepted for publication (June, 2020)
The following instructions should work across platforms, except that installing virtualenv with apt-get is Ubuntu specific. For other platforms, install virtualenv appropriately if desired.
First, clone the repository to a local directory:
git clone https://github.com/whpress/hedges.git
Optionally, you can install into a virtual environment (recommended):
sudo apt-get install -y virtualenv
cd hedges
virtualenv envhedges
. envhedges/bin/activate
Now install required packages:
pip install numpy==1.13.3 && pip install -r requirements.txt && python setup.py install
What is supplied is not a single program, but a kit for a variety of user applications. The kit consists of:
- C++ source code that compiles (in Linux or Windows) to the Python-includable module NRpyDNAcode. Precompiled binaries are supplied for Python 2.7 in Linux and Windows, but recompilation may be necessary if these don't work. This module implements the HEDGES "inner code" as described in the paper.
- C++ source code that compiles (in Linux or Windows) to the Python-includable module NRpyRS. Precompiled binaries are supplied for Python 2.7 in Linux and Windows, but recompilation may be necessary if these don't work. This module implements the Schifra Reed-Solomon Error Correcting Code Library. See http://www.schifra.com for details and license restrictions. This module is not needed for the HEDGES inner code; it is needed only to implement the "outer code" as described in the paper. Some users will instead want to utilize their own outer codes.
- Python program print_module_test_files.py, which verifies that the above modules can be loaded and prints their usage. Most users will not need to use any of the routines in these modules directly, but should instead use the Python functions in the following file:
- Python program test_program.py. This defines various user-level functions for implementing the HEDGES inner and Reed-Solomon outer codes as described in the paper. The example inputs arbitrary bytes from the file WizardOfOzInEsperanto.txt, encodes a specified number of packets (each with 255 DNA strands), corrupts the strands with a specified level of random substitutions, insertions, and deletions, decodes the strands, and verifies the error correction. To better validate the installation, the code rate and corruption level set by default are chosen to be stressful to HEDGES and are greater than those in an intended use case.
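Before running the supplied programs, a quick import check can show whether the precompiled binaries load in your interpreter. The following is a minimal sketch, not part of the kit: the helper name check_modules is hypothetical, and it assumes the module files sit in the current directory or elsewhere on the Python path. It is written to run under both Python 2.7 and Python 3.

```python
import importlib


def check_modules(names):
    """Try to import each named module.

    Returns a dict mapping each name to None on success,
    or to the ImportError message on failure.
    """
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = None
        except ImportError as exc:
            results[name] = str(exc)
    return results


if __name__ == "__main__":
    # NRpyDNAcode and NRpyRS are the two compiled modules named above.
    report = check_modules(["NRpyDNAcode", "NRpyRS"])
    for name, err in sorted(report.items()):
        print("%-12s %s" % (name, "OK" if err is None else "FAILED: " + err))
```

A failure reported here usually means the binary does not match your platform or Python version, in which case recompilation (described below) is the likely remedy.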
Run the program test_program.py. It should produce output comparable (but not identical) to the files sample_linux_test_output.txt and sample_windows_test_output.txt. The output will not be identical, because different random numbers are used to create DNA errors in each run.
If the above works, then try varying some of the parameters. In particular, you can change coderatecode to increase or decrease the code rate, the values (srate,drate,irate) to change the fraction of substitutions, deletions, and insertions generated for the test, and totstrandlen, the total strand length of the DNA (including left and right primers). The many other parameters are either self-explanatory, or else described in the paper. Most users will not initially need to change them.
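To build intuition for what the three rates mean, here is a minimal sketch of per-nucleotide corruption. It is an illustration only and is not the routine used by test_program.py: the function name corrupt_strand, the choice to treat the three error types as mutually exclusive at each position, and the example rate values are all assumptions.

```python
import random


def corrupt_strand(strand, srate, drate, irate, rng=random):
    """Return a copy of strand with random errors applied.

    At each position, independently:
      with probability drate the base is deleted,
      with probability irate a random base is inserted before it,
      with probability srate it is replaced by a different base.
    """
    bases = "ACGT"
    out = []
    for base in strand:
        r = rng.random()
        if r < drate:
            continue  # deletion: drop this base
        if r < drate + irate:
            out.append(rng.choice(bases))  # insertion before this base
            out.append(base)
        elif r < drate + irate + srate:
            # substitution: pick one of the three other bases
            out.append(rng.choice([b for b in bases if b != base]))
        else:
            out.append(base)  # no error at this position
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a run can be repeated
    clean = "ACGT" * 25
    # Hypothetical example rates, not the defaults in test_program.py.
    noisy = corrupt_strand(clean, srate=0.02, drate=0.01, irate=0.005, rng=rng)
    print(len(clean), len(noisy))
```

Because deletions and insertions change the strand length, a corrupted strand generally cannot be compared to the original position by position, which is the situation the inner code is designed to handle.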
The modules are built using the Numerical Recipes C++ class library nr3python.h. This is included here and also freely available for unlimited distribution at http://numerical.recipes/nr3python.h . Generally, you will not need to understand this library, but, if you are curious, a tutorial on its use is at http://numerical.recipes/nr3_python_tutorial.html . You should also consult this tutorial if you have difficulty recompiling the modules. Note that while other Numerical Recipes routines are copyright and require a license, no restricted routines are used in the two modules here supplied.
In Linux, go to the directory LinuxC++Compile containing the source code and run the script compile_all.sh. Then copy the two files produced, NRpyDNAcode.so and NRpyRS.so, to the directory containing test_program.py. The most common source of errors is the compiler's inability to find required Python and Numpy include and library files that are part of your Python installation. Unfortunately, we can't help you with that.
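One general way to locate the include directories is to ask the interpreter you intend to use. The sketch below relies only on the standard sysconfig module and on numpy.get_include(); the helper name include_dirs is hypothetical. How the resulting paths are passed to your compiler depends on your build setup and is not covered here.

```python
import sysconfig


def include_dirs():
    """Return the header directories for this interpreter and for NumPy.

    The "numpy" entry is None if NumPy is not installed.
    """
    dirs = {"python": sysconfig.get_paths()["include"]}
    try:
        import numpy
        dirs["numpy"] = numpy.get_include()
    except ImportError:
        dirs["numpy"] = None
    return dirs


if __name__ == "__main__":
    for name, path in sorted(include_dirs().items()):
        print("%s include dir: %s" % (name, path))
```

Run it with the same Python (and the same virtual environment, if any) that will import the modules, since each installation has its own headers.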
In Windows, go to the directory WindowsC++Compile and open the Visual Studio Community 2019 solution NRpyDNAcode.sln. This should build the two files (in the x64\Release directory) NRpyDNAcode.pyd and NRpyRS.pyd. Copy these to the directory containing test_program.py. If this doesn't work, and you need to build the Windows modules from scratch, then keep these points in mind: You want to compile to produce .dll files (not .exe files), and you want to then simply rename these to .pyd. As in Linux, a common source of errors is the compiler's inability to find required Python and Numpy include and library files that are part of your Python installation. You'll need to locate them and set appropriate include directories.
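If you build from scratch and end up with .dll files, the rename-and-copy step can be scripted. This is a minimal sketch under stated assumptions: the helper name copy_as_pyd is hypothetical, and the directories passed to it are whatever your own build produced.

```python
import os
import shutil


def copy_as_pyd(build_dir, dest_dir, names=("NRpyDNAcode", "NRpyRS")):
    """Copy <name>.dll from build_dir to dest_dir as <name>.pyd.

    Returns the list of destination paths actually written;
    names with no matching .dll are skipped.
    """
    copied = []
    for name in names:
        src = os.path.join(build_dir, name + ".dll")
        if os.path.isfile(src):
            dst = os.path.join(dest_dir, name + ".pyd")
            shutil.copyfile(src, dst)
            copied.append(dst)
    return copied
```

For example, copy_as_pyd(r"x64\Release", ".") would place the renamed modules in the current directory, assuming that is where test_program.py lives.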