nmrsync is a bash script for periodic synchronization of Bruker NMR data to a data server using rsync. It takes an input file as an argument which provides information including paths, SSH aliases, and rsync options. The script searches for files that have been modified recently (< x days/hr/min as specified in the input file), searching in folders up to the level of RemoteDataPath/(user)/nmr/(data set). For example, if a new spectrum 1/ or 2/ appears in the data set folder, the data set is flagged for syncing, but if something deeper (e.g. a proc file in a pdata folder) is changed, it won't be flagged.
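The depth-limited search described above can be illustrated with a minimal sketch. It assumes the default /(user)/nmr/(data set) layout, a `find` that supports `-mindepth`, `-maxdepth`, and `-mmin` (GNU find does), and an age already expressed in minutes. The function name `find_recent_datasets` is hypothetical, and this is not necessarily how nmrsync itself performs the search.

```shell
# Hypothetical helper: list data set folders changed within the last N minutes.
# A directory's mtime changes when an entry (e.g. a new expt folder 1/ or 2/)
# is created directly inside it, but not when a file deeper down is edited.
find_recent_datasets() {
    local data_root=$1   # folder that contains the user folders
    local minutes=$2     # look-back window in minutes
    # Depth 3 below data_root corresponds to (user)/nmr/(data set).
    find "$data_root" -mindepth 3 -maxdepth 3 -type d -mmin "-$minutes" | LC_ALL=C sort
}
```

For example, `find_recent_datasets /path/to/data 4320` would list data sets touched in the last three days.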
Before syncing, the script can search for folders that are identically named except for case differences (which are unique in Linux but indistinguishable on Windows and Mac), as well as for folders that end in a period (which is permissible on Linux and Mac but not on Windows). The names of these spectra are then placed in SkipFileOld and they are not synced. An email is instead sent to the NMR manager who can then manually change the folder names once the data has finished acquiring. This can also be accomplished automatically using nmrfolderfix (https://github.com/greenwoodad/nmrfolderfix/).
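As a rough illustration of the two name checks described above, the sketch below reads folder names on standard input, one per line. The function names are hypothetical and the logic is an assumption about one way to do this, not a copy of what nmrsync does internally.

```shell
# Hypothetical helper: print every name that collides with another name
# once letter case is ignored (e.g. Sample1 and sample1).
find_case_duplicates() {
    awk '{ key = tolower($0); count[key]++; names[key] = names[key] $0 "\n" }
         END { for (k in count) if (count[k] > 1) printf "%s", names[k] }' |
        LC_ALL=C sort
}

# Hypothetical helper: print every name that ends in a period.
find_trailing_period() {
    grep '\.$' || true   # no match is not an error here
}
```

Something like `ls /path/to/user/nmr | find_case_duplicates` would then show the folders that could not coexist on a case-insensitive file system.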
I personally run this as a cron job every five minutes, plus once a week with a second input file so that data is still eventually transferred after network or power outages. The script can also be run occasionally (usually with the -f/--full flag) to back up pp/wave/par folders from the Topspin directories if desired.
This script requires a Linux operating system with rsync. It has been tested on CentOS 6.8, CentOS 7.5, and Ubuntu 20 on the local side and on CentOS 7.5, CentOS 5.1, and RHEL 7.3 on the remote side. I've only tested this with Bruker NMR data, but future releases may be able to handle file structures generated by other instruments.
The email feature requires that the command sendmail is working on the machine running the script.
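A quick way to confirm these requirements before scheduling the script is sketched below. The helper name `missing_commands` is hypothetical. Note that it only checks the current PATH, which may differ from the PATH cron uses.

```shell
# Hypothetical helper: print each named command that is not found on PATH.
missing_commands() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Example: missing_commands rsync ssh sendmail
```

Empty output means every listed command was found.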
git clone https://github.com/greenwoodad/nmrsync
or
git clone https://(your github username)@github.com/greenwoodad/nmrsync.git
followed by:
chmod +x ./nmrsync/nmrsync
Because this script is intended to be run as a cron job, it is necessary to authorize the local machine to access the remote machine(s) with password-less ssh login using ssh keys. Tutorials are available here:
Briefly:
- On the machine that will run the script and send emails, logged in as the user who will run it, run the command:
ssh-keygen -t rsa -b 4096
This will generate files ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub
Press enter at the prompt "Enter passphrase (empty for no passphrase):" to skip passphrase generation.
- Next, run this command (from the local machine) for each remote workstation:
ssh-copy-id remote_username@remote_ip_address
You will be prompted for the password for this remote workstation.
If ssh-copy-id is not available, you should be able to run this instead:
cat ~/.ssh/id_rsa.pub | ssh remote_username@remote_ip_address "mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys"
- Last, add SSH aliases to your hosts file. In /etc/hosts, add entries:
IPAddress DomainName SSHAlias
for each remote workstation.
for example:
198.51.100.50 dmx500.chem.university.edu DMX500
198.51.100.54 av400.chem.university.edu AV400
198.51.100.59 neo400.chem.university.edu NEO400
The SSHAliases here should be the same SSHAliases you enter in the nmrsync input file.
You should now be able to SSH to the remote workstations without entering a password by typing:
ssh remote_username@SSHAlias
in addition to
ssh remote_username@IPAddress
and
ssh remote_username@DomainName
The first time you connect to each workstation, you will need to type "yes" at the prompt "Are you sure you want to continue connecting (yes/no)?". After that, the script can run automatically without manual password entry.
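To check that key-based login really works without any prompt, which is what a cron job needs, a small test along these lines may help. The function name `check_passwordless_ssh` is hypothetical; `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```shell
# Hypothetical helper: succeed only if login works without prompting.
# BatchMode=yes makes ssh fail instead of asking for a password or passphrase.
check_passwordless_ssh() {
    local target=$1   # e.g. remote_username@SSHAlias
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$target" true
}

# Example:
# check_passwordless_ssh remote_username@DMX500 || echo "key login failed for DMX500"
```

With BatchMode enabled, a host key that has not yet been accepted should also cause a failure, so make the first connection by hand as described above.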
In the input file (nmrsync_input) there are a number of parameters and paths to set:
- ScriptsPath: Full path to the location of the main script and the input, emailtxt, and log folders on local machine. Use full path!
- ManagerEmail: Email address of the NMR facility manager.
- Age: How many days/hr/min back to look for recent experiments to sync. Without extended outages, ~3 (days) usually works well.
- Timescale: Units for "Age": use 'day', 'hr', or 'min'.
- RsyncOptions: Rsync options. I use '-quvrltD --modify-window=1 --protect-args'
- SkipFlag: Defines what folders are not synced. 'period' to skip folders ending in a period, 'dup' to skip folders with case-insensitive duplicates, 'both', or 'none'. Default is 'both'. Note that if a different value of SkipFlag is specified with -s when the script is run, it overrules the value specified in the input file.
- Instrument: Name of instrument. Can be anything (no spaces) but make sure it is unique (not entered twice in the table).
- /nmr directory?: Set this to 'y' for the default /(user)/nmr/(data set)/(expt #) data organization on the remote computer. Set it to 'n' for data organized as /(user)/(data set)/(expt #)
- sort data by username: Set this to 'y' if you want to group all of a user's data in a common folder (data stored as username/instrument/data instead of instrument/username/data)
- SourceDataPath: Full path containing NMR data, usually on a remote computer. Topspin/ICON-NMR usernames should be found in this folder. Use full path!
- DestinationDataPath: Full path on local computer to transfer the data to. Can be a mounted directory. Use full path!
- SSHAlias: Alias for password-less SSH to this instrument computer. Optional if the source path is on the local computer.
- RemoteUser: User on the remote computer that you can SSH as. Optional if the source path is on the local computer.
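To make the relationship between Age and Timescale concrete, the sketch below converts the pair into a single look-back window in minutes, the unit `find -mmin` expects. The function name `age_to_minutes` is hypothetical, and the conversion is an assumption about how such values could be combined rather than the script's actual code.

```shell
# Hypothetical helper: convert Age plus Timescale into minutes.
age_to_minutes() {
    local age=$1         # whole number, e.g. 3
    local timescale=$2   # 'day', 'hr', or 'min'
    case $timescale in
        day) echo $((age * 1440)) ;;
        hr)  echo $((age * 60)) ;;
        min) echo "$age" ;;
        *)   echo "unknown Timescale: $timescale" >&2; return 1 ;;
    esac
}
```

With the suggested setting of 3 days, `age_to_minutes 3 day` gives a window of 4320 minutes.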
IMPORTANT: When editing this file, entries should be separated by either a tab or multiple spaces.
Instruments in the instrument table can be commented out with a #.
NOTE: Additional modifications can be made to the variables 'SendMailPath', 'ManualFlag', 'ExcludeFlag', 'FullFlag', and 'VerboseFlag' at top of the script itself. These are generally the default values for options that can be provided when the script is run (see Usage, below).
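The separator rule above matters because some values, such as the rsync options, contain single spaces. The sketch below shows one way a line could be split on a tab or a run of two or more spaces while skipping rows commented out with #. It is only an illustration under those assumptions: the function name `split_fields` and the sample values are hypothetical, and nmrsync's own parsing may differ.

```shell
# Hypothetical helper: read input-file lines on stdin, drop commented rows,
# and print each field on its own numbered line.
split_fields() {
    grep -v '^[[:space:]]*#' |
        awk -F'\t+|  +' '{ for (i = 1; i <= NF; i++) print i ": " $i }'
}

# Example (fields separated by one tab and by two spaces):
# printf 'DMX500\ty  n  /opt/topspin data\n' | split_fields
```

A value containing a single space stays in one field, while a tab or a wider gap starts a new one.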
nmrsync [OPTIONS]... path/to/nmrsync_input
Options
-h, -?, --help
Show help message.
-i, --input
Set input file (flag optional).
-s, --skip (default 'both')
Set to 'period' to skip folders ending in period,
'dup' to skip case-insensitive duplicates, 'both'
to skip both and 'none' to skip none.
-m, --manual
Manual mode: enter password instead of using SSH
keys--not recommended.
-p, --processed
Processed data mode: will ignore excludelist.instrument
list in the input folder and copy processed data.
-f, --full
Full mode: copy over all data instead of just
recently added data.
-b, --verbose
Verbose mode.
The defaults here can be modified at the top of the script itself.
To run this as a cron job, make an entry in your crontab like this:
*/5 * * * * /path/to/nmrsync "/path/to/input/nmrsync_input"
40 6 * * 0 /path/to/nmrsync "/path/to/input/nmrsync_input_weekly"
In the preceding example, there is a second input file which is configured to look for data that has been collected over the last week. The "fast" version is set to run every five minutes (*/5) while the "slow" version is set to run on Sunday (0) at 6:40 AM (40 6).
Pull requests are welcome.
- Alex Greenwood - provided script - Greenwoodad