Apply samtools addreplacerg to a set of BAM files to add a new read group, named after each file, and assign all reads to it.
This script takes a set of BAM files (named *.bam) grouped in a single folder as input.
In each BAM file, the read group ID and SM fields are set to the BAM file name with the .bam extension removed. The command applied to each file is:
samtools addreplacerg -r "@RG\tID:file_name\tPG:samtools addreplacerg\tSM:file_name"
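As a rough illustration of how the read group line is derived from a file name, here is a minimal shell sketch. It is not the pipeline's actual code: the helper names `rg_line` and `add_rg_to_folder` are hypothetical, and the sketch assumes the output folder differs from the input folder.

```shell
# Hypothetical helper: build the @RG line from a BAM file name.
rg_line() {
    # Strip the folder and the .bam extension to get the tag used for ID and SM.
    tag=$(basename "$1" .bam)
    # Keep \t as two literal characters, as in the command shown above.
    printf '%s\n' "@RG\\tID:${tag}\\tPG:samtools addreplacerg\\tSM:${tag}"
}

# Hypothetical helper: apply the command to every BAM file in a folder.
# Assumes the output folder ($2) is different from the input folder ($1).
add_rg_to_folder() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for bam in "$in_dir"/*.bam; do
        [ -e "$bam" ] || continue   # skip when the folder has no BAM files
        samtools addreplacerg -r "$(rg_line "$bam")" \
            -o "$out_dir/$(basename "$bam")" "$bam"
    done
}
```

For example, `rg_line BAM/sample1.bam` prints the read group line for a file whose tag is `sample1`, and `add_rg_to_folder BAM BAM_RG` would process a whole folder when samtools is installed.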
- Install Java JRE if you don't already have it.
- Install nextflow:
  curl -fsSL get.nextflow.io | bash
  Then move it to a location in your $PATH (/usr/local/bin in this example):
  sudo mv nextflow /usr/local/bin
- Install samtools 1.3 or above and add it to your $PATH. Alternatively, you can use the docker image provided (see below).
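Before running the pipeline, it can help to confirm that the tools listed above are reachable. The following is a small sketch, with `check_tool` as a hypothetical helper name; it only tests presence on the PATH, not version numbers.

```shell
# Hypothetical helper: report whether a command is available on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

# Check the three prerequisites; keep going even if one is missing.
for tool in java nextflow samtools; do
    check_tool "$tool" || true
done
```

If samtools is reported as found, `samtools --version` shows whether it is 1.3 or above. If it is missing, the docker image described below is an alternative.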
Simply use:
nextflow run iarcbioinfo/addreplacerg-nf --bam_folder BAM/
By default, the BAM files produced are written to the input folder. To write them elsewhere, add the optional argument --out_folder to the above command line, for example --out_folder BAM_RG.
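After a run, one way to check the result is to look at the @RG lines in the header of an output file. The sketch below assumes a hypothetical helper named `rg_ids` that reads a SAM header on standard input and prints each read group ID; the file path in the comment is also only an example.

```shell
# Hypothetical helper: print the ID of every @RG header line read from stdin.
rg_ids() {
    awk -F'\t' '/^@RG/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^ID:/) print substr($i, 4)   # drop the "ID:" prefix
    }'
}

# Typical use (requires samtools and an existing output file):
#   samtools view -H BAM_RG/sample1.bam | rg_ids
```

For a file processed by this pipeline, the printed ID is expected to match the file name without the .bam extension.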
If you don't have samtools, you can use the docker image we provide, which contains it:
nextflow run iarcbioinfo/addreplacerg-nf -with-docker --bam_folder BAM/
Installing docker is very system specific (but quite easy in most cases); follow the docker documentation. Also follow the optional configuration step called "Create a Docker group" in their documentation.
The exact same pipeline can be run on your computer or on an HPC cluster by adding a nextflow configuration file that selects an appropriate executor. For example, to work on a cluster using the SGE scheduler, add a file named nextflow.config in the current directory (or ~/.nextflow/config to make global changes) containing:
process.executor = 'sge'
Other popular schedulers such as LSF, SLURM, PBS and TORQUE are also supported. See the nextflow documentation for more details. Also have a look at the other executor parameters, in particular queueSize, which defines the number of tasks the executor will handle in parallel.
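As a sketch of how the executor and queueSize settings might be combined in one file, the following writes a small config. The helper name `write_sge_config` and the value 50 are illustrative assumptions, not recommendations from the pipeline authors.

```shell
# Hypothetical helper: write a minimal SGE config.
# $1 = config path (default: nextflow.config in the current directory)
# $2 = queue size (50 is an arbitrary illustrative value)
write_sge_config() {
    path=${1:-nextflow.config}
    queue_size=${2:-50}
    cat > "$path" <<EOF
process.executor = 'sge'
executor.queueSize = $queue_size
EOF
}
```

Calling `write_sge_config` with no arguments overwrites nextflow.config in the current directory, so pass a different path if you already have one.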
The default number of tasks the executor will handle in parallel is 100, which is certainly too high if you are running the pipeline on your local machine. In that case, a good idea is to set it to the number of computing cores your local machine has. The following example creates a config file with this information automatically (works on Linux and Mac OS X):
echo "executor.\$local.queueSize = "`getconf _NPROCESSORS_ONLN` > ~/.nextflow/config
Replace > with >> if you want to append the line to an existing nextflow config file.
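When appending, running the command twice leaves two copies of the setting in the file. A possible way around this, sketched here with a hypothetical helper named `set_local_queue_size`, is to append only when the setting is not already present.

```shell
# Hypothetical helper: add the local queueSize setting once.
# $1 = config path (default: ~/.nextflow/config)
set_local_queue_size() {
    config=${1:-$HOME/.nextflow/config}
    cores=$(getconf _NPROCESSORS_ONLN)
    mkdir -p "$(dirname "$config")"
    # Append only if no local queueSize line exists yet.
    if ! grep -q 'executor\.\$local\.queueSize' "$config" 2>/dev/null; then
        echo "executor.\$local.queueSize = $cores" >> "$config"
    fi
}
```

Note that this sketch does not update an existing value; it leaves any previous queueSize line untouched.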