Bastian Greshake edited this page Mar 9, 2016 · 6 revisions

Task: merge

This task iteratively merges overlapping contigs and, when no more merges can be made, identifies circular sequences. Each circular sequence is arranged so that, on the circle, the base following its final base is its first base. Note that iterative merging is only performed if the option --reads is used.

Usage and options

The general usage is

circlator merge [options] <original_assembly.fasta> <new_assembly.fasta> <outprefix>

The options are:

  • --diagdiff INT: nucmer diagdiff option. Default: 25.
  • --min_id FLOAT: nucmer minimum percent identity. Default: 95.
  • --min_length INT: minimum length of hit for nucmer to report. Default: 500.
  • --min_length_merge INT: minimum length of nucmer hit to use when merging. Default: 4000.
  • --breaklen INT: breaklen option used by nucmer. Default: 500.
  • --min_spades_circ_pc FLOAT: minimum percent of a contig that must be covered by nucmer hits to SPAdes circular contigs. Default: 95.
  • --spades_k k1,k2,k3,...: Comma-separated list of k-mer sizes to use when running SPAdes. The maximum k-mer size is 127 and each must be an odd integer. Default: 127,117,107,97,87,77.
  • --spades_use_first: Use the first successful SPAdes assembly. Default is to try all kmers and use the assembly with the largest N50.
  • --assemble_not_careful: Do not use the --careful option with SPAdes (it is used by default).
  • --assemble_not_only_assembler: Do not use the --only-assembler option with SPAdes (it is used by default).
  • --ref_end INT: maximum distance allowed between nucmer hit and end of input assembly contig. Default: 15000.
  • --reassemble_end INT: max distance allowed between nucmer hit and end of reassembly contig. Default: 1000.
  • --threads INT: number of threads for remapping/assembly (only applies if --reads is used). Default: 1.
  • --reads FILENAME: FASTA file of the corrected reads that were used to make the new assembly. Using this triggers iterative contig pair merging.
  • --verbose: Be verbose.
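
As a rough illustration of how these options fit together, the sketch below checks a --spades_k value against the constraints listed above (odd integers, at most 127) and assembles a command line in the order given under "The general usage". The helper names parse_spades_k and build_merge_command and all file names are hypothetical and are not part of circlator; the sketch only builds the argument list and does not run anything.

```python
import shlex


def parse_spades_k(value):
    """Parse a --spades_k string and check the documented constraints."""
    kmers = [int(k) for k in value.split(",")]
    for k in kmers:
        # Documented constraints: each k-mer size is odd and at most 127.
        if k < 1 or k > 127 or k % 2 == 0:
            raise ValueError(f"invalid k-mer size: {k}")
    return kmers


def build_merge_command(original, new, outprefix, reads=None, threads=1, spades_k=None):
    """Return the argument list for a 'circlator merge' call (hypothetical helper)."""
    cmd = ["circlator", "merge"]
    if reads is not None:
        # --threads only applies when --reads is used, so add them together.
        cmd += ["--reads", reads, "--threads", str(threads)]
    if spades_k is not None:
        parse_spades_k(spades_k)  # raises ValueError on a bad list
        cmd += ["--spades_k", spades_k]
    # Positional arguments come last, in the order shown in the usage line.
    cmd += [original, new, outprefix]
    return cmd


if __name__ == "__main__":
    # File names here are placeholders for illustration only.
    cmd = build_merge_command(
        "original_assembly.fasta",
        "new_assembly.fasta",
        "outprefix",
        reads="corrected_reads.fasta",
        threads=4,
        spades_k="127,117,107",
    )
    print(shlex.join(cmd))
```

The resulting list could be passed to subprocess.run if circlator is installed; building it as a list avoids shell quoting problems with file names.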

Output files

The final output contigs are written to outprefix.fasta. There are four log files written, described below. Intermediate files are also kept from the iterations of contig merging. These files are called outprefix.merge.iter.*.

Merging log files

If the option --reads is used (or --no_pair_merge is not used when running circlator all), then contigs are iteratively merged in pairs using local assemblies. This stage makes two log files. The file outprefix.merge.log summarizes the contig merging. For example:

[merge contig_merge] #new_name               previous_contig1 previous_contig2
[merge contig_merge] contig1.contig2         contig1          contig2
[merge contig_merge] contig1.contig2.contig3 contig1.contig2  contig3
[merge contig_merge] contig10.contig20       contig10         contig20

In this example, contig1 and contig2 were merged, then in a later iteration, the merged contig contig1.contig2 was merged with contig3 to make a new contig called contig1.contig2.contig3. Also, contig10 and contig20 were merged.
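
The chain of merges in this log can be followed programmatically. The sketch below is one possible way to do it, assuming each line carries the "[merge contig_merge]" prefix followed by three whitespace-separated fields, as in the example above. The function names parse_merge_log and original_contigs are hypothetical and are not part of circlator.

```python
PREFIX = "[merge contig_merge]"


def parse_merge_log(lines):
    """Map each merged contig name to the pair of contigs it was made from."""
    merges = {}
    for line in lines:
        if not line.startswith(PREFIX):
            continue
        fields = line[len(PREFIX):].split()
        # Skip the header line (starts with '#') and anything malformed.
        if len(fields) != 3 or fields[0].startswith("#"):
            continue
        new_name, prev1, prev2 = fields
        merges[new_name] = (prev1, prev2)
    return merges


def original_contigs(name, merges):
    """Recursively expand a contig name into the input contigs it contains."""
    if name not in merges:
        return [name]  # not produced by a merge, so it is an input contig
    prev1, prev2 = merges[name]
    return original_contigs(prev1, merges) + original_contigs(prev2, merges)


if __name__ == "__main__":
    # Lines copied from the example log above.
    example = [
        "[merge contig_merge] #new_name               previous_contig1 previous_contig2",
        "[merge contig_merge] contig1.contig2         contig1          contig2",
        "[merge contig_merge] contig1.contig2.contig3 contig1.contig2  contig3",
        "[merge contig_merge] contig10.contig20       contig10         contig20",
    ]
    merges = parse_merge_log(example)
    print(original_contigs("contig1.contig2.contig3", merges))
    # prints ['contig1', 'contig2', 'contig3']
```

Looking names up in the parsed table, instead of splitting the merged name on ".", keeps the result correct even if an input contig name itself contains a dot.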

The second log file, outprefix.merge.iterations.log, contains verbose details of nucmer matches between the input assembly contigs and the reassembled contigs made by SPAdes. At each iteration, it reports which nucmer matches were considered when merging contig pairs, and whether (and why) each match was accepted or rejected for merging. This file is described in the troubleshooting section.

Circularization log files

The file outprefix.circularise.log summarizes whether or not, and why, each contig was circularized. An example with all the possibilities is:

[merge circularised] #Contig repetitive_deleted circl_using_nucmer circl_using_spades circularised
[merge circularised] contig1 0                  0                  0                  0
[merge circularised] contig2 0                  1                  0                  1
[merge circularised] contig3 0                  0                  1                  1
[merge circularised] contig4 1                  0                  0                  0

In this example, contig1 was not changed. Contig2 was circularized using matches at its ends to a contig in the reassembly of reads that mapped to contig ends. Contig3 was circularized because it matched a contig, say spades_contig2, in the reassembly that SPAdes identified as circular. Contig4 was removed because it also matched spades_contig2, so was identified as a redundant contig.
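
A log in this format can be turned into a per-contig summary. The sketch below assumes each line carries the "[merge circularised]" prefix followed by whitespace-separated columns named as in the header above, with 0/1 flags. The function names parse_circularise_log and describe are hypothetical, and the sample lines are written to follow the description in the preceding paragraph, not taken from a real run.

```python
PREFIX = "[merge circularised]"


def parse_circularise_log(lines):
    """Return {contig_name: {column_name: bool}} using the '#Contig' header line."""
    header = None
    rows = {}
    for line in lines:
        if not line.startswith(PREFIX):
            continue
        fields = line[len(PREFIX):].split()
        if not fields:
            continue
        if fields[0].startswith("#"):
            header = fields  # column names; the first one is '#Contig'
            continue
        if header is None or len(fields) != len(header):
            continue  # no header seen yet, or malformed line
        rows[fields[0]] = {col: val == "1" for col, val in zip(header[1:], fields[1:])}
    return rows


def describe(flags):
    """Summarize one contig's flags in words, following the text above."""
    if flags["repetitive_deleted"]:
        return "removed as redundant"
    if flags["circl_using_nucmer"]:
        return "circularized using nucmer matches"
    if flags["circl_using_spades"]:
        return "circularized using a SPAdes circular contig"
    return "unchanged"


if __name__ == "__main__":
    # Illustrative lines matching the four cases described above.
    sample = [
        "[merge circularised] #Contig repetitive_deleted circl_using_nucmer circl_using_spades circularised",
        "[merge circularised] contig1 0 0 0 0",
        "[merge circularised] contig2 0 1 0 1",
        "[merge circularised] contig3 0 0 1 1",
        "[merge circularised] contig4 1 0 0 0",
    ]
    for name, flags in parse_circularise_log(sample).items():
        print(name, "->", describe(flags))
```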

A second log file, called outprefix.circularise_details.log and described in the troubleshooting section, has details of the potential nucmer matches considered during circularization. It shows which matches were accepted or rejected and the reasons why.