%--------------------------------------------------------%
%DIF LATEXDIFF DIFFERENCE FILE
%DIF DEL MiCoNE-pipeline-paper-old/main.tex Fri Sep 2 10:21:30 2022
%DIF ADD MiCoNE-pipeline-paper/main.tex Wed Sep 28 18:31:30 2022
% Journal Article Manuscript Template
%--------------------------------------------------------%
%!TEX root = ../main.tex
%--------------------------------------------------------%
% DOCUMENT CLASS
%--------------------------------------------------------%
% Change "letterpaper" to "a4paper" if you use A4 paper size
\documentclass[letterpaper,12pt]{article}
%--------------------------------------------------------%
% TITLE SECTION
%--------------------------------------------------------%
%Abstract
\usepackage{abstract} % Allows abstract customization
% Set the "Abstract" text to bold
\renewcommand{\abstractnamefont}{\normalfont\bfseries}
% Set the abstract itself to small italic text
\renewcommand{\abstracttextfont}{\normalfont\small\itshape}
%Title
\usepackage{titlesec} % Allows customization of titles
%Authors
\usepackage{authblk} % For multiple authors
%Date
\usepackage{datetime} % allows for including today's date
% These two lines create a new date format ``Month day(th), year''
\newdateformat{usvardate}{
\monthname[\THEMONTH] \ordinal{DAY}, \THEYEAR}
%--------------------------------------------------------%
% HEADERS & FOOTERS
%--------------------------------------------------------%
%Footnotes
%DIF 42c42
%DIF < \usepackage[bottom]{footmisc} % Makes footnotes stick to bottom of the page
%DIF -------
% \usepackage[bottom]{footmisc} % Makes footnotes stick to bottom of the page %DIF >
%DIF -------
%Headers from page 2 on
%DIF 45-47c45-47
%DIF < \usepackage{fancyhdr}
%DIF < \pagestyle{fancy}
%DIF < \fancyheadoffset{0cm}
%DIF -------
% \usepackage{fancyhdr} %DIF >
\pagestyle{plain} %DIF >
% \fancyheadoffset{0cm} %DIF >
%DIF -------
% \setlength{\headheight}{15pt}
%--------------------------------------------------------%
% MACROS
%--------------------------------------------------------%
% Define keywords macro command
\providecommand{\keywords}[1]{\textbf{\textit{Keywords---}} #1}
%--------------------------------------------------------%
% MATH SUPPORT
%--------------------------------------------------------%
% The amssymb package provides various useful mathematical symbols
\usepackage{amssymb}
% The amsthm package provides extended theorem environments
\usepackage{amsthm}
% The newtxmath package provides additional math symbol support
% in Times New Roman symbols, etc.
\usepackage{newtxmath}
%DIF 68a68-69
\usepackage{mathtools} %DIF >
\usepackage{blkarray, bigstrut} %DIF >
%DIF -------
%--------------------------------------------------------%
% FONTS
%--------------------------------------------------------%
\usepackage{microtype} % Slightly tweak font spacing for aesthetics
\usepackage[utf8]{inputenc}
\usepackage{newtxtext} % Sets the default text font to a Times-like typeface
%--------------------------------------------------------%
% LINES
%--------------------------------------------------------%
% Spacing
\usepackage{setspace} % See \doublespacing command at the top of content.tex
% Numbering
%DIF 84c86-87
%DIF < \usepackage{lineno,xcolor} % See \linenumbers at the top of content.tex
%DIF -------
\usepackage{lineno} % See \linenumbers at the top of content.tex %DIF >
\usepackage[table,x11names]{xcolor} %DIF >
%DIF -------
% Lists
\usepackage{enumitem}
\setlist{nosep}
\setlist[itemize]{leftmargin=*}
%--------------------------------------------------------%
% MARGINS
%--------------------------------------------------------%
%NOTE: All spaces in this template are in inches, because it is
% formatted for letterpaper (8.5 x 11 inch) paper. If you use a4
% paper, choose different sizes in millimeters or centimeters.
\usepackage[top=1.5in, bottom=1.5in, left=1in, right=1in]{geometry}
%--------------------------------------------------------%
% COMMENTS
%--------------------------------------------------------%
%DIF 103c106
%DIF < \usepackage[colorinlistoftodos]{todonotes} % allows margin comments
%DIF -------
% \usepackage[colorinlistoftodos]{todonotes} % allows margin comments %DIF >
%DIF -------
% See examples in content.tex, and here for manual:
% http://www.ctan.org/pkg/todonotes
\usepackage{soul} % allows for highlighting
%--------------------------------------------------------%
% ACRONYMS
%--------------------------------------------------------%
\usepackage[nohyperlinks,nolist]{acronym} % Managing acronyms
%--------------------------------------------------------%
% GRAPHICS
%--------------------------------------------------------%
\usepackage{graphicx,caption} % More advanced figure inclusion
\graphicspath{{figures/}} % Set the default folder for images
\usepackage{float} % For specifying table/figure locations, e.g. [ht!]
% The printlen command allows the user to print the exact text width or height.
% This is useful, when trying to create graphics (outside of LaTeX, of course)
% with the optimal dimensions. See here for usage: http://www.ctan.org/pkg/printlen
\usepackage{printlen}
\usepackage[section]{placeins} % Used to ensure that figures do not go into the next section
%--------------------------------------------------------%
% TABLES
%--------------------------------------------------------%
\usepackage{longtable} % For long tables that span multiple pages
\newcommand{\sym}[1]{\rlap{#1}}% For symbols like *** in tables
\usepackage{tabularx} % Allows advanced table features
\newcolumntype{L}[1]{>{\raggedright\arraybackslash}p{#1}}
\newcolumntype{C}[1]{>{\centering\arraybackslash}p{#1}}
\newcolumntype{R}[1]{>{\raggedleft\arraybackslash}p{#1}}
\usepackage{relsize} % Allows precise adjustment of font size,
%useful for fitting tables to page width
\usepackage{multirow}
%DIF 143a146-147
%for horizontal tables %DIF >
\usepackage{lscape} %DIF >
%DIF -------
%--------------------------------------------------------%
% REFERENCES
%--------------------------------------------------------%
\usepackage{hyperref} % For hyperlinks in the PDF
\usepackage{csquotes}
%DIF 150c155
%DIF < \usepackage[style=numeric,backend=biber,sorting=none]{biblatex}
%DIF -------
\usepackage[style=nature,url=false,backend=biber,sorting=none]{biblatex} %DIF >
%DIF -------
\addbibresource{references/references.bib}
% Edit preamble.tex to change the overall layout
% Header from Page Three on: Edit below for left and right headers
%DIF 155-156c160-161
%DIF < \lhead{}
%DIF < \rhead{}
%DIF -------
% \lhead{} %DIF >
% \rhead{} %DIF >
%DIF -------
%--------------------------------------------------------%
% BEGIN DOCUMENT
%--------------------------------------------------------%
%DIF PREAMBLE EXTENSION ADDED BY LATEXDIFF
%DIF UNDERLINE PREAMBLE %DIF PREAMBLE
\RequirePackage[normalem]{ulem} %DIF PREAMBLE
\RequirePackage{color}\definecolor{RED}{rgb}{1,0,0}\definecolor{BLUE}{rgb}{0,0,1} %DIF PREAMBLE
\providecommand{\DIFaddtex}[1]{{\protect\color{blue}\uwave{#1}}} %DIF PREAMBLE
\providecommand{\DIFdeltex}[1]{{\protect\color{red}\sout{#1}}} %DIF PREAMBLE
%DIF SAFE PREAMBLE %DIF PREAMBLE
\providecommand{\DIFaddbegin}{} %DIF PREAMBLE
\providecommand{\DIFaddend}{} %DIF PREAMBLE
\providecommand{\DIFdelbegin}{} %DIF PREAMBLE
\providecommand{\DIFdelend}{} %DIF PREAMBLE
\providecommand{\DIFmodbegin}{} %DIF PREAMBLE
\providecommand{\DIFmodend}{} %DIF PREAMBLE
%DIF FLOATSAFE PREAMBLE %DIF PREAMBLE
\providecommand{\DIFaddFL}[1]{\DIFadd{#1}} %DIF PREAMBLE
\providecommand{\DIFdelFL}[1]{\DIFdel{#1}} %DIF PREAMBLE
\providecommand{\DIFaddbeginFL}{} %DIF PREAMBLE
\providecommand{\DIFaddendFL}{} %DIF PREAMBLE
\providecommand{\DIFdelbeginFL}{} %DIF PREAMBLE
\providecommand{\DIFdelendFL}{} %DIF PREAMBLE
%DIF HYPERREF PREAMBLE %DIF PREAMBLE
\providecommand{\DIFadd}[1]{\texorpdfstring{\DIFaddtex{#1}}{#1}} %DIF PREAMBLE
\providecommand{\DIFdel}[1]{\texorpdfstring{\DIFdeltex{#1}}{}} %DIF PREAMBLE
\newcommand{\DIFscaledelfig}{0.5}
%DIF HIGHLIGHTGRAPHICS PREAMBLE %DIF PREAMBLE
\RequirePackage{settobox} %DIF PREAMBLE
\RequirePackage{letltxmacro} %DIF PREAMBLE
\newsavebox{\DIFdelgraphicsbox} %DIF PREAMBLE
\newlength{\DIFdelgraphicswidth} %DIF PREAMBLE
\newlength{\DIFdelgraphicsheight} %DIF PREAMBLE
% store original definition of \includegraphics %DIF PREAMBLE
\LetLtxMacro{\DIFOincludegraphics}{\includegraphics} %DIF PREAMBLE
\newcommand{\DIFaddincludegraphics}[2][]{{\color{blue}\fbox{\DIFOincludegraphics[#1]{#2}}}} %DIF PREAMBLE
\newcommand{\DIFdelincludegraphics}[2][]{% %DIF PREAMBLE
\sbox{\DIFdelgraphicsbox}{\DIFOincludegraphics[#1]{#2}}% %DIF PREAMBLE
\settoboxwidth{\DIFdelgraphicswidth}{\DIFdelgraphicsbox} %DIF PREAMBLE
\settoboxtotalheight{\DIFdelgraphicsheight}{\DIFdelgraphicsbox} %DIF PREAMBLE
\scalebox{\DIFscaledelfig}{% %DIF PREAMBLE
\parbox[b]{\DIFdelgraphicswidth}{\usebox{\DIFdelgraphicsbox}\\[-\baselineskip] \rule{\DIFdelgraphicswidth}{0em}}\llap{\resizebox{\DIFdelgraphicswidth}{\DIFdelgraphicsheight}{% %DIF PREAMBLE
\setlength{\unitlength}{\DIFdelgraphicswidth}% %DIF PREAMBLE
\begin{picture}(1,1)% %DIF PREAMBLE
\thicklines\linethickness{2pt} %DIF PREAMBLE
{\color[rgb]{1,0,0}\put(0,0){\framebox(1,1){}}}% %DIF PREAMBLE
{\color[rgb]{1,0,0}\put(0,0){\line( 1,1){1}}}% %DIF PREAMBLE
{\color[rgb]{1,0,0}\put(0,1){\line(1,-1){1}}}% %DIF PREAMBLE
\end{picture}% %DIF PREAMBLE
}\hspace*{3pt}}} %DIF PREAMBLE
} %DIF PREAMBLE
\LetLtxMacro{\DIFOaddbegin}{\DIFaddbegin} %DIF PREAMBLE
\LetLtxMacro{\DIFOaddend}{\DIFaddend} %DIF PREAMBLE
\LetLtxMacro{\DIFOdelbegin}{\DIFdelbegin} %DIF PREAMBLE
\LetLtxMacro{\DIFOdelend}{\DIFdelend} %DIF PREAMBLE
\DeclareRobustCommand{\DIFaddbegin}{\DIFOaddbegin \let\includegraphics\DIFaddincludegraphics} %DIF PREAMBLE
\DeclareRobustCommand{\DIFaddend}{\DIFOaddend \let\includegraphics\DIFOincludegraphics} %DIF PREAMBLE
\DeclareRobustCommand{\DIFdelbegin}{\DIFOdelbegin \let\includegraphics\DIFdelincludegraphics} %DIF PREAMBLE
\DeclareRobustCommand{\DIFdelend}{\DIFOdelend \let\includegraphics\DIFOincludegraphics} %DIF PREAMBLE
\LetLtxMacro{\DIFOaddbeginFL}{\DIFaddbeginFL} %DIF PREAMBLE
\LetLtxMacro{\DIFOaddendFL}{\DIFaddendFL} %DIF PREAMBLE
\LetLtxMacro{\DIFOdelbeginFL}{\DIFdelbeginFL} %DIF PREAMBLE
\LetLtxMacro{\DIFOdelendFL}{\DIFdelendFL} %DIF PREAMBLE
\DeclareRobustCommand{\DIFaddbeginFL}{\DIFOaddbeginFL \let\includegraphics\DIFaddincludegraphics} %DIF PREAMBLE
\DeclareRobustCommand{\DIFaddendFL}{\DIFOaddendFL \let\includegraphics\DIFOincludegraphics} %DIF PREAMBLE
\DeclareRobustCommand{\DIFdelbeginFL}{\DIFOdelbeginFL \let\includegraphics\DIFdelincludegraphics} %DIF PREAMBLE
\DeclareRobustCommand{\DIFdelendFL}{\DIFOdelendFL \let\includegraphics\DIFOincludegraphics} %DIF PREAMBLE
%DIF COLORLISTINGS PREAMBLE %DIF PREAMBLE
\RequirePackage{listings} %DIF PREAMBLE
\RequirePackage{color} %DIF PREAMBLE
\lstdefinelanguage{DIFcode}{ %DIF PREAMBLE
%DIF DIFCODE_UNDERLINE %DIF PREAMBLE
moredelim=[il][\color{red}\sout]{\%DIF\ <\ }, %DIF PREAMBLE
moredelim=[il][\color{blue}\uwave]{\%DIF\ >\ } %DIF PREAMBLE
} %DIF PREAMBLE
\lstdefinestyle{DIFverbatimstyle}{ %DIF PREAMBLE
language=DIFcode, %DIF PREAMBLE
basicstyle=\ttfamily, %DIF PREAMBLE
columns=fullflexible, %DIF PREAMBLE
keepspaces=true %DIF PREAMBLE
} %DIF PREAMBLE
\lstnewenvironment{DIFverbatim}{\lstset{style=DIFverbatimstyle}}{} %DIF PREAMBLE
\lstnewenvironment{DIFverbatim*}{\lstset{style=DIFverbatimstyle,showspaces=true}}{} %DIF PREAMBLE
%DIF END PREAMBLE EXTENSION ADDED BY LATEXDIFF
\begin{document}
% COVER PAGE
%!TEX root = ../main.tex
\begin{titlepage}
\newcommand{\HRule}{\rule{\linewidth}{0.5mm}} % Defines a new command for the horizontal lines, change thickness here
\center % Center everything on the page
% HEADING SECTION
%\textsc{\LARGE University Name}\\[1.5cm] % Name of your university/college
% \textsc{\Large Manuscript Submission}\\[0.5cm] % Major heading such as course name
% \textsc{\large The Journal of Blah Blah}\\[0.5cm] % Minor heading such as course title
% TITLE SECTION
\vspace*{\fill}
{\huge Inferring microbial co-occurrence networks from amplicon data: a systematic evaluation}\\[0.4cm]
% {\huge 2. Investigating the best practices for inference of microbial co-occurrence networks from 16S data}\\[0.4cm] % Title of your document
% {\huge 3. Attempting to find the best practice pipeline for inferring co-occurrence networks from 16S data}\\[0.4cm] % Title of your document
% {\huge 4. Deciphering the complexities in co-occurrence network inference from 16S data}\\[0.4cm] % Title of your document
% AUTHOR SECTION
\vspace{1.5 cm}
Dileep Kishore\textsuperscript{\DIFdelbegin \DIFdel{1}\DIFdelend \DIFaddbegin \DIFadd{a}\DIFaddend ,\DIFdelbegin \DIFdel{2}\DIFdelend \DIFaddbegin \DIFadd{b}\DIFaddend },
Gabriel Birzu\textsuperscript{\DIFdelbegin \DIFdel{3}\DIFdelend \DIFaddbegin \DIFadd{c}\DIFaddend ,\DIFdelbegin \DIFdel{6}\DIFdelend \DIFaddbegin \DIFadd{f}\DIFaddend },
Zhenjun Hu\textsuperscript{\DIFdelbegin \DIFdel{1}\DIFdelend \DIFaddbegin \DIFadd{a}\DIFaddend },
Charles DeLisi\textsuperscript{\DIFdelbegin \DIFdel{1}\DIFdelend \DIFaddbegin \DIFadd{a}\DIFaddend ,\DIFdelbegin \DIFdel{3}\DIFdelend \DIFaddbegin \DIFadd{c}\DIFaddend },
Kirill S. Korolev\textsuperscript{\DIFdelbegin \DIFdel{$\dagger$1}\DIFdelend \DIFaddbegin \DIFadd{a}\DIFaddend ,\DIFdelbegin \DIFdel{2}\DIFdelend \DIFaddbegin \DIFadd{b}\DIFaddend ,\DIFdelbegin \DIFdel{3}\DIFdelend \DIFaddbegin \DIFadd{c}\DIFaddend }\DIFaddbegin \DIFadd{\#}\DIFaddend ,\\
Daniel Segr\`{e}\textsuperscript{\DIFdelbegin \DIFdel{$\dagger$1}\DIFdelend \DIFaddbegin \DIFadd{a}\DIFaddend ,\DIFdelbegin \DIFdel{2}\DIFdelend \DIFaddbegin \DIFadd{b}\DIFaddend ,\DIFdelbegin \DIFdel{4}\DIFdelend \DIFaddbegin \DIFadd{d}\DIFaddend ,\DIFdelbegin \DIFdel{5}\DIFdelend \DIFaddbegin \DIFadd{e}\DIFaddend }\DIFaddbegin \DIFadd{\#}\DIFaddend \\
\vspace{1cm}
\textsuperscript{\DIFdelbegin \DIFdel{1}\DIFdelend \DIFaddbegin \DIFadd{a}\DIFaddend }Bioinformatics Program, Boston University, Boston, Massachusetts, USA\\
\textsuperscript{\DIFdelbegin \DIFdel{2}\DIFdelend \DIFaddbegin \DIFadd{b}\DIFaddend }Biological Design Center, Boston University, Boston, Massachusetts, USA\\
\textsuperscript{\DIFdelbegin \DIFdel{3}\DIFdelend \DIFaddbegin \DIFadd{c}\DIFaddend }Department of Physics, Boston University, Boston, Massachusetts, USA\\
\textsuperscript{\DIFdelbegin \DIFdel{4}\DIFdelend \DIFaddbegin \DIFadd{d}\DIFaddend }Department of Biomedical Engineering, Boston University, Boston, Massachusetts, USA\\
\textsuperscript{\DIFdelbegin \DIFdel{5}\DIFdelend \DIFaddbegin \DIFadd{e}\DIFaddend }Department of Biology, Boston University, Boston, Massachusetts, USA\\
\textsuperscript{\DIFdelbegin \DIFdel{6}\DIFdelend \DIFaddbegin \DIFadd{f}\DIFaddend }Department of Applied Physics, Stanford University, Stanford, California, USA\\
\textsuperscript{\DIFdelbegin \DIFdel{$\dagger$}\DIFdelend \DIFaddbegin \DIFadd{$\#$}\DIFaddend }Correspondence should be sent to \href{mailto:korolev@bu.edu}{korolev@bu.edu} or \href{mailto:dsegre@bu.edu}{dsegre@bu.edu}\\
\DIFaddbegin \vspace{2cm}
\DIFaddend % % DATE SECTION
% \vspace{1.5 cm}
% {\large Submitted: \today}\\[3cm] % Date, change the \today to a set date if you want to be precise
\vspace*{\fill} % Fill the rest of the page with whitespace
\end{titlepage}
%-----------------------------------------------------------------
\newpage
% Comment out to remove cover page
\thispagestyle{empty} %DIF < Removes header on page two. Only needed if there is a cover-page
%DIF > Removes header on page two.
% NOTE: Comment out the lines below to remove line numbers
% Running line numbers:
\linenumbers
\setlength\linenumbersep{15pt}
\renewcommand\linenumberfont{\normalfont\footnotesize\sffamily\color{gray}}
%\pagewiselinenumbers % Alternative: line numbers that reset on every page
\modulolinenumbers[1] % Number every line; use a larger value to number fewer lines
%--------------------------------------------------------%
% CONTENT
%--------------------------------------------------------%
% ABSTRACT
%!TEX root = ../main.tex
\begin{abstract}
{
\noindent
Microbes \DIFdelbegin \DIFdel{tend to }\DIFdelend \DIFaddbegin \DIFadd{commonly }\DIFaddend organize into communities consisting of hundreds of species involved in complex interactions with each other.
16S ribosomal RNA (16S rRNA) amplicon profiling provides snapshots that reveal the phylogenies and abundance profiles of these microbial communities.
These snapshots, when collected from multiple samples, \DIFdelbegin \DIFdel{have the potential to reveal which microbes co-occur}\DIFdelend \DIFaddbegin \DIFadd{can reveal the co-occurrence of microbes}\DIFaddend , providing a glimpse into the network of associations in these communities.
\DIFdelbegin \DIFdel{The }\DIFdelend \DIFaddbegin \DIFadd{However, the }\DIFaddend inference of networks from 16S data \DIFdelbegin \DIFdel{is prone to statistical artifacts.
There are many tools for performing each step of the 16S analysis workflow, but }\DIFdelend \DIFaddbegin \DIFadd{involves numerous steps, each requiring specific tools and parameter choices.
Moreover, }\DIFaddend the extent to which these steps affect the final network is still unclear.
In this study, we perform a meticulous analysis of each step of a pipeline that can convert 16S sequencing data into a network of microbial associations.
Through this process, we map how different choices of algorithms and parameters affect the co-occurrence network and \DIFdelbegin \DIFdel{estimate }\DIFdelend \DIFaddbegin \DIFadd{identify the }\DIFaddend steps that contribute \DIFdelbegin \DIFdel{most significantly }\DIFdelend \DIFaddbegin \DIFadd{substantially }\DIFaddend to the variance.
We further determine the tools and parameters that generate \DIFdelbegin \DIFdel{the most accurate and }\DIFdelend robust co-occurrence networks \DIFdelbegin \DIFdel{based on comparison }\DIFdelend \DIFaddbegin \DIFadd{and develop consensus network algorithms based on benchmarks }\DIFaddend with mock and synthetic datasets.
\DIFdelbegin \DIFdel{Ultimately, we develop a standardized pipeline }\DIFdelend \DIFaddbegin \DIFadd{The Microbial Co-occurrence Network Explorer or }\acs{micone} \DIFaddend (available at \href{https://github.com/segrelab/MiCoNE}{https://github.com/segrelab/MiCoNE})\DIFdelbegin \DIFdel{that }\DIFdelend \DIFaddbegin \DIFadd{, }\DIFaddend follows these default tools and parameters \DIFdelbegin \DIFdel{, but that can also }\DIFdelend \DIFaddbegin \DIFadd{and can }\DIFaddend help explore the outcome of \DIFdelbegin \DIFdel{any other combination of choices }\DIFdelend \DIFaddbegin \DIFadd{these combinations of choices on the inferred networks}\DIFaddend .
We envisage that this pipeline could be used for integrating multiple \DIFdelbegin \DIFdel{data-sets}\DIFdelend \DIFaddbegin \DIFadd{datasets}\DIFaddend , and for generating comparative analyses and consensus networks that can \DIFdelbegin \DIFdel{help understand and control }\DIFdelend \DIFaddbegin \DIFadd{guide our understanding of }\DIFaddend microbial community assembly in different biomes.
}
\end{abstract}
% Insert keywords here
\DIFdelbegin \DIFdel{\keywords{Microbiome, 16S rRNA, Pipeline, Interaction, Denoising, Taxonomy, Network Inference, Correlations, Qiime, Co-occurrence, Networks}
}\DIFdelend \DIFaddbegin \DIFadd{\keywords{Microbiome, 16S rRNA, Interaction, Denoising, Taxonomy, Network Inference, Correlations, QIIME2, Co-occurrence, Networks, Consensus algorithm, Pipeline, nextflow}
}\DIFaddend
\DIFdelbegin \section*{\DIFdel{Importance}}
%DIFAUXCMD
\DIFdelend %DIF > \doublespacing
\DIFdelbegin \DIFdel{To understand and control the mechanisms that determine the structure and function of microbial communities, it is important to map the interrelationships between its constituent microbial species }\DIFdelend \DIFaddbegin \section*{\DIFadd{Importance}}
\DIFadd{Mapping the interrelationships between different species in a microbial community is important for understanding and controlling their structure and function}\DIFaddend .
The surge in the high-throughput sequencing of microbial communities has led to the creation of thousands of datasets containing information about microbial abundances.
These abundances can be transformed into \DIFdelbegin \DIFdel{networks of co-occurrences across multiple samples}\DIFdelend \DIFaddbegin \DIFadd{co-occurrence networks}\DIFaddend , providing a glimpse into the \DIFdelbegin \DIFdel{structure of }\DIFdelend \DIFaddbegin \DIFadd{associations within }\DIFaddend microbiomes.
However, processing these datasets to obtain co-occurrence information relies on several complex steps, each of which involves \DIFdelbegin \DIFdel{multiple }\DIFdelend \DIFaddbegin \DIFadd{numerous }\DIFaddend choices of tools and corresponding parameters.
These multiple options pose questions about the \DIFdelbegin \DIFdel{accuracy }\DIFdelend \DIFaddbegin \DIFadd{robustness }\DIFaddend and uniqueness of the inferred networks.
In this study, we address this workflow and provide a systematic analysis of how these choices of tools \DIFdelbegin \DIFdel{and parameters }\DIFdelend affect the final network, and \DIFdelbegin \DIFdel{on how to select those that are most appropriate }\DIFdelend \DIFaddbegin \DIFadd{guidelines on appropriate tool selection }\DIFaddend for a particular dataset.
\DIFaddbegin \DIFadd{We also develop a consensus network algorithm that helps generate more robust co-occurrence networks based on benchmark synthetic datasets.
}\DIFaddend
\doublespacing
% INTRODUCTION
%!TEX root = ../main.tex
\section*{Introduction}
Microbial communities are ubiquitous and play an important role in marine and terrestrial environments, urban ecosystems, \DIFdelbegin \DIFdel{metabolic engineering, }\DIFdelend and human health \DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD
\cite{Ghoul2016,Thompson2017}}\hskip0pt%DIFAUXCMD
}\DIFdelend \DIFaddbegin \DIFadd{\mbox{%DIFAUXCMD
\cite{lima-mendezDeterminantsCommunityStructure2015a,Thompson2017,royo-llonchCompendium530Metagenomeassembled2021,tedersooFungalBiogeographyGlobal2014,dankoGlobalMetagenomicMap2021,mclellanMicrobiomeUrbanWaters2015,HumanMicrobiomeProjectConsortium2012}}\hskip0pt%DIFAUXCMD
}\DIFaddend .
These microbial communities, or microbiomes, often comprise several hundred different microbial strains that interact with each other and their environment, typically through \DIFdelbegin \DIFdel{intricate }\DIFdelend \DIFaddbegin \DIFadd{complex }\DIFaddend metabolic and signaling relationships\DIFaddbegin \DIFadd{~\mbox{%DIFAUXCMD
\cite{zelezniakMetabolicDependenciesDrive2015,Ghoul2016,coyteUnderstandingCompetitionCooperation2019,DSouza2018}}\hskip0pt%DIFAUXCMD
}\DIFaddend .
Understanding how these interconnections shape community structure and \DIFdelbegin \DIFdel{functionalities }\DIFdelend \DIFaddbegin \DIFadd{function }\DIFaddend is a fundamental challenge in microbial ecology, \DIFdelbegin \DIFdel{with }\DIFdelend \DIFaddbegin \DIFadd{and has }\DIFaddend applications in the study of microbial ecosystems across different biomes.
With the advancement in DNA sequencing technologies\DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD
\cite{Narihiro2017} }\hskip0pt%DIFAUXCMD
and data processing methods}\DIFdelend \DIFaddbegin \DIFadd{~\mbox{%DIFAUXCMD
\cite{huNextgenerationSequencingTechnologies2021,buermansNextGenerationSequencing2014,Narihiro2017}}\hskip0pt%DIFAUXCMD
}\DIFaddend , more information can be extracted from these microbial community samples than ever before.
In particular, high-throughput sequencing, including \DIFdelbegin \DIFdel{community }\DIFdelend metagenomic sequencing and sequencing of 16S rRNA gene amplicons \DIFdelbegin \DIFdel{, has the potential to }\DIFdelend \DIFaddbegin \DIFadd{(hereafter referred to as 16S data) of microbial communities, can }\DIFaddend help detect, identify and quantify a large portion of the constituent microorganisms of a microbiome \DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD
\cite{Jovel2016,Lloyd-Price2016}}\hskip0pt%DIFAUXCMD
}\DIFdelend \DIFaddbegin \DIFadd{\mbox{%DIFAUXCMD
\cite{ju16SRRNAGene2015,Jovel2016,quinceShotgunMetagenomicsSampling2017,sedlarBioinformaticsStrategiesTaxonomy2017}}\hskip0pt%DIFAUXCMD
}\DIFaddend .
These advances have led to large-scale data collection efforts involving \DIFdelbegin \DIFdel{environmental (}%DIFDELCMD < \acl{emp}%%%
\DIFdel{) \mbox{%DIFAUXCMD
\cite{Thompson2017}}\hskip0pt%DIFAUXCMD
, marine (Tara Oceans Project) \mbox{%DIFAUXCMD
\cite{Zhang2015} }\hskip0pt%DIFAUXCMD
}\DIFdelend \DIFaddbegin \DIFadd{terrestrial~\mbox{%DIFAUXCMD
\cite{Thompson2017,gilbertMeetingReportTerabase2010,tedersooFungalBiogeographyGlobal2014}}\hskip0pt%DIFAUXCMD
, marine~\mbox{%DIFAUXCMD
\cite{lima-mendezDeterminantsCommunityStructure2015a,royo-llonchCompendium530Metagenomeassembled2021} }\hskip0pt%DIFAUXCMD
}\DIFaddend and human-associated microbiota\DIFdelbegin \DIFdel{(Human Microbiome Project) \mbox{%DIFAUXCMD
\cite{HumanMicrobiomeProjectConsortium2012}}\hskip0pt%DIFAUXCMD
}\DIFdelend \DIFaddbegin \DIFadd{~\mbox{%DIFAUXCMD
\cite{HumanMicrobiomeProjectConsortium2012,proctorIntegrativeHumanMicrobiome2019,Lloyd-Price2016}}\hskip0pt%DIFAUXCMD
}\DIFaddend .
This wealth of information \DIFdelbegin \DIFdel{on the composition and functions of a community at different times and under different environmental conditions }\DIFdelend has the potential to help us understand how communities assemble and operate.
\DIFdelbegin \DIFdel{A }\DIFdelend \DIFaddbegin \DIFadd{In particular, a }\DIFaddend powerful tool for translating microbiome \DIFaddbegin \DIFadd{composition }\DIFaddend data into knowledge is the construction of \DIFdelbegin \DIFdel{possible inter-dependence networks across species.
}\DIFdelend \DIFaddbegin \DIFadd{association (co-occurrence or correlation) networks, in which microbial taxa are represented by nodes, and frequent co-occurrences (or negative correlations) across datasets are encoded as edges between nodes.
While the relationship between directly measured interactions~\mbox{%DIFAUXCMD
\cite{lubbeExometabolomicAnalysisCrossFeeding2017,Jian2020,Hsu2019} }\hskip0pt%DIFAUXCMD
and statistically inferred co-occurrence is still poorly understood \mbox{%DIFAUXCMD
\cite{Zuniga2017,Rottjers2018}}\hskip0pt%DIFAUXCMD
, a significant amount of effort has gone into estimating correlations from large microbiome sequence datasets~\mbox{%DIFAUXCMD
\cite{faustMicrobialCooccurrenceRelationships2012,leeCrosskingdomCooccurrenceNetworks2022,faustMicrobialInteractionsNetworks2012a,maEarthMicrobialCooccurrence2020a}}\hskip0pt%DIFAUXCMD
.
}
\DIFaddend The importance of these networks \DIFdelbegin \DIFdel{of relationships is two fold}\DIFdelend \DIFaddbegin \DIFadd{is two-fold}\DIFaddend : first, \DIFdelbegin \DIFdel{such networks }\DIFdelend \DIFaddbegin \DIFadd{they }\DIFaddend can serve as maps that help identify hubs of keystone species \cite{Menon2018,Rottjers2018}, \DIFdelbegin \DIFdel{or basic microbiome changes that occur as a consequence of }\DIFdelend \DIFaddbegin \DIFadd{and the community response to }\DIFaddend environmental perturbations or underlying host conditions \cite{Gilbert2016}; second, \DIFdelbegin \DIFdel{networks of inter-dependencies }\DIFdelend \DIFaddbegin \DIFadd{they }\DIFaddend can serve as a \DIFdelbegin \DIFdel{key }\DIFdelend bridge towards building mechanistic models of microbial communities, greatly enhancing our capacity to understand and control them.
For example, multiple studies have shown the importance of specific microbial \DIFdelbegin \DIFdel{interactions }\DIFdelend \DIFaddbegin \DIFadd{associations }\DIFaddend in the healthy microbiome \DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD
\cite{Lloyd-Price2016} }\hskip0pt%DIFAUXCMD
and others have shown how changes in these interactions can lead to }\DIFdelend \DIFaddbegin \DIFadd{\mbox{%DIFAUXCMD
\cite{Lloyd-Price2016,Wu2016,HumanMicrobiomeProjectConsortium2012} }\hskip0pt%DIFAUXCMD
and their role in }\DIFaddend dysbiosis \cite{Wang2017,Gilbert2016,Belizario2015}.
In the context of terrestrial \DIFdelbegin \DIFdel{bio-geochemistry}\DIFdelend \DIFaddbegin \DIFadd{biogeochemistry}\DIFaddend , co-occurrence networks \DIFdelbegin \DIFdel{have been proposed as a valuable approach towards reconstructing the processes leading to microbiome assembly \mbox{%DIFAUXCMD
\cite{Fierer2017}}\hskip0pt%DIFAUXCMD
, and understanding }\DIFdelend \DIFaddbegin \DIFadd{were shown to help understand microbiome assembly \mbox{%DIFAUXCMD
\cite{fiererEmbracingUnknownDisentangling2017}}\hskip0pt%DIFAUXCMD
, and }\DIFaddend the response of microbial communities to environmental perturbations \cite{Jiao2019}.
\DIFdelbegin \DIFdel{Direct high-throughput measurement of interactions, e.g. through co-culture micro-droplet experiments \mbox{%DIFAUXCMD
\cite{Hsu2019,Jian2020}}\hskip0pt%DIFAUXCMD
, or spatial visualization of natural communities \mbox{%DIFAUXCMD
\cite{Wilbert2020} }\hskip0pt%DIFAUXCMD
is possible, but it requires specific technological capabilities, and has yet to be extensively used.
In parallel, sequencing data across multiple samples can be used for estimating co-occurrence relationships between taxa.
While the the relationship between directly measured interactions and statistically inferred co-occurrence is still poorly understood \mbox{%DIFAUXCMD
\cite{Zuniga2017}}\hskip0pt%DIFAUXCMD
, a significant amount of effort has gone into estimating correlations from large microbiome sequence datasets.
Co-occurrence networks have microbial taxa as nodes, and edges that represent the frequent co-occurrence (or negative correlations) across different datasets.
}%DIFDELCMD <
%DIFDELCMD < %%%
\DIFdelend One of the most frequently used avenues for inferring co-occurrence networks is the parsing and analysis of 16S sequencing data \cite{Rottjers2018,Friedman2012}.
\DIFdelbegin \DIFdel{A large number of }\DIFdelend \DIFaddbegin \DIFadd{Numerous }\DIFaddend software tools and pipelines have been developed to analyze 16S sequencing data, \DIFdelbegin \DIFdel{often focused on addressing the many }\DIFdelend \DIFaddbegin \DIFadd{with a strong emphasis on the }\DIFaddend known limitations of this \DIFdelbegin \DIFdel{methodology}\DIFdelend \DIFaddbegin \DIFadd{method}\DIFaddend , including resolution, sequencing depth, compositional nature, sequencing errors\DIFaddbegin \DIFadd{, }\DIFaddend and copy number variations \DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD
\cite{Bharti2019,Pollock2018}}\hskip0pt%DIFAUXCMD
}\DIFdelend \DIFaddbegin \DIFadd{\mbox{%DIFAUXCMD
\cite{Bharti2019,pollockMadnessMicrobiomeAttempting2018}}\hskip0pt%DIFAUXCMD
}\DIFaddend .
Popular methods for different phases of the analysis of 16S data include tools for: (i) \DIFdelbegin \DIFdel{denoising and clustering sequencing reads\mbox{%DIFAUXCMD
\cite{Caporaso2010,Callahan2016}}\hskip0pt%DIFAUXCMD
}\DIFdelend \DIFaddbegin \DIFadd{quality checking and trimming the sequencing reads}\DIFaddend ; (ii) \DIFaddbegin \DIFadd{denoising and clustering the trimmed reads \mbox{%DIFAUXCMD
\cite{Caporaso2010,Callahan2016,Amir2017}}\hskip0pt%DIFAUXCMD
; (iii) }\DIFaddend assigning taxonomy to the \DIFdelbegin \DIFdel{reads \mbox{%DIFAUXCMD
\cite{DeSantis2006,Quast2012}}\hskip0pt%DIFAUXCMD
; (iii}\DIFdelend \DIFaddbegin \DIFadd{denoised reads \mbox{%DIFAUXCMD
\cite{bokulichOptimizingTaxonomicClassification2018}}\hskip0pt%DIFAUXCMD
; (iv}\DIFaddend ) processing and transforming the taxonomy count matrices \cite{Weiss2015}; and (\DIFdelbegin \DIFdel{iv}\DIFdelend \DIFaddbegin \DIFadd{v}\DIFaddend ) inferring the co-occurrence network \DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD
\cite{Cougoul2019,Kurtz2015}}\hskip0pt%DIFAUXCMD
}\DIFdelend \DIFaddbegin \DIFadd{\mbox{%DIFAUXCMD
\cite{Watts2018,Kurtz2015,tackmannRapidInferenceDirect2019}}\hskip0pt%DIFAUXCMD
}\DIFaddend .
Different specific algorithms are often aggregated into popular \DIFaddbegin \DIFadd{online }\DIFaddend platforms (like MG-RAST\DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD
\cite{Keegan2016}}\hskip0pt%DIFAUXCMD
, Qiita\mbox{%DIFAUXCMD
\cite{qiita}}\hskip0pt%DIFAUXCMD
) and }\DIFdelend \DIFaddbegin \DIFadd{~\mbox{%DIFAUXCMD
\cite{keeganMGRASTMetagenomicsService2016}}\hskip0pt%DIFAUXCMD
, Qiita~\mbox{%DIFAUXCMD
\cite{gonzalezQiitaRapidWebenabled2018}}\hskip0pt%DIFAUXCMD
) and software }\DIFaddend packages (such as \DIFdelbegin \DIFdel{QIIME \mbox{%DIFAUXCMD
\cite{Caporaso2010}}\hskip0pt%DIFAUXCMD
)that provide pipelines for 16S data analysis}\DIFdelend \DIFaddbegin \ac{qiime2}\DIFadd{~\mbox{%DIFAUXCMD
\cite{bolyenReproducibleInteractiveScalable2019}}\hskip0pt%DIFAUXCMD
)}\DIFaddend .
The different methods and tools \DIFdelbegin \DIFdel{developed to solve issues arising in 16S analysis }\DIFdelend can lead to vastly different inferences of community compositions and co-occurrence networks \cite{Golob2017,Weiss2016}, making it difficult to reliably compare networks across different publications and studies.
This \DIFaddbegin \DIFadd{difference }\DIFaddend is partially due to the \DIFdelbegin \DIFdel{fact that existing platforms are typically focused }\DIFdelend \DIFaddbegin \DIFadd{focus of existing platforms }\DIFaddend on \ac{otu} \DIFaddbegin \DIFadd{or }\ac{esv} \DIFaddend generation and not on the effects of upstream statistical methods on the inferred co-occurrence networks.
Furthermore, no organized framework currently \DIFdelbegin \DIFdel{exist to }\DIFdelend \DIFaddbegin \DIFadd{exists that can }\DIFaddend systematically analyze and compare \DIFdelbegin \DIFdel{existing components of the data analysis from amplicons to networks.
More broadly, given the lack of comprehensive comparisons between directly observed microbial interactions (e.g. from co-culture experiments) and }\DIFdelend \DIFaddbegin \DIFadd{each step in the pipeline for processing amplicons into }\DIFaddend co-occurrence networks\DIFdelbegin \DIFdel{, there is no straightforward way to determine which set of tools or methods generate the most accurate networks}\DIFdelend .
In this study, we present a standardized 16S data analysis pipeline called \ac{micone} that produces robust and reproducible co-occurrence networks from \DIFdelbegin \DIFdel{community }\DIFdelend 16S sequence data \DIFdelbegin \DIFdel{, and allow }\DIFdelend \DIFaddbegin \DIFadd{of microbial communities, and enables }\DIFaddend users to interactively explore how the network would change upon using alternative tools and parameters at each step.
%DIF < TODO: Link or describe MIND here (?)
Our pipeline is coupled to an online integrative tool for the organization, visualization\DIFaddbegin \DIFadd{, }\DIFaddend and analysis of inter-microbial networks \DIFdelbegin \DIFdel{.
In addition to making this tool freely available, we implemented }\DIFdelend \DIFaddbegin \DIFadd{called }\ac{mind}\DIFadd{~\mbox{%DIFAUXCMD
\cite{huResourceComparisonIntegration2022}}\hskip0pt%DIFAUXCMD
, which is available at }\href{http://microbialnet.org/}{\DIFadd{http://microbialnet.org/}}\DIFadd{.
Through }\DIFaddend a systematic comparative analysis\DIFdelbegin \DIFdel{to }\DIFdelend \DIFaddbegin \DIFadd{, we }\DIFaddend determine which steps of the \DIFaddbegin \ac{micone} \DIFaddend pipeline have the largest influence on the final network, and \DIFdelbegin \DIFdel{what }\DIFdelend \DIFaddbegin \DIFadd{which }\DIFaddend choices seem to \DIFdelbegin \DIFdel{provide best }\DIFdelend \DIFaddbegin \DIFadd{give the best }\DIFaddend agreement with the tested mock and synthetic datasets.
\DIFdelbegin \DIFdel{We believe that these steps will }\DIFdelend \DIFaddbegin \DIFadd{These steps together with our default settings }\DIFaddend ensure better reproducibility and easier comparison of co-occurrence networks across datasets.
We expect that our tool will also be useful for benchmarking future alternative methods, and for ensuring a transparent evaluation of the possible biases introduced by the use of specific tools.
% RESULTS
%!TEX root = ../main.tex
\section*{Results}
\subsection*{\acl{micone} (\acs{micone})}
We \DIFdelbegin \DIFdel{have }\DIFdelend developed \ac{micone}, a flexible and modular pipeline for \DIFdelbegin \DIFdel{16S amplicon sequencing rRNA data (hereafter mentioned simply as 16S data) analysis, that allows us to infer microbial }\DIFdelend \DIFaddbegin \DIFadd{the inference of }\DIFaddend co-occurrence networks \DIFdelbegin \DIFdel{.
It }\DIFdelend \DIFaddbegin \DIFadd{from 16S data.
}\ac{micone} \DIFaddend incorporates various popular, publicly available tools as well as custom Python modules \DIFdelbegin \DIFdel{and scripts to facilitate inference of co-occurrence networks from }\DIFdelend \DIFaddbegin \DIFadd{for }\DIFaddend 16S data \DIFdelbegin \DIFdel{(see }\DIFdelend \DIFaddbegin \DIFadd{analysis and network inference (}\DIFaddend Methods).
\DIFdelbegin \DIFdel{Using }\DIFdelend \DIFaddbegin \DIFadd{The different steps that are a part of the }\DIFaddend \ac{micone} \DIFdelbegin \DIFdel{one can obtain }\DIFdelend co-occurrence \DIFdelbegin \DIFdel{networks by applying to 16S data (or to already processed taxonomic count matrices) any combination of the available tools.
The effects of changing any of the intermediate step can be monitored and evaluated in terms of its final network outcome, as well as on any of the intermediate metrics and data outputs.
The }%DIFDELCMD < \ac{micone} %%%
\DIFdel{pipeline workflow is shown in }\DIFdelend \DIFaddbegin \DIFadd{network inference workflow (}\DIFaddend Figure~\ref{fig:figure1}\DIFdelbegin \DIFdel{.
The different steps for going from 16S data to co-occurrence networks }\DIFdelend \DIFaddbegin \DIFadd{) }\DIFaddend can be grouped into \DIFdelbegin \DIFdel{four }\DIFdelend \DIFaddbegin \DIFadd{five }\DIFaddend major modules: (i) \DIFdelbegin \DIFdel{the denoising and clustering (DC) step, which handles denoising of the raw 16S sequencing data into representative sequences}\DIFdelend \DIFaddbegin \ac{sp}\DIFaddend ; (ii) \DIFdelbegin \DIFdel{the taxonomy assignment (TA) step that assigns taxonomic labels to the representative sequences}\DIFdelend \DIFaddbegin \ac{dc}\DIFaddend ; (iii) \DIFdelbegin \DIFdel{the }%DIFDELCMD < \ac{otu} %%%
\DIFdel{processing (OP) step that filters and transforms the taxonomy abundance table; and finally (}\DIFdelend \DIFaddbegin \ac{ta}\DIFadd{; (}\DIFaddend iv) \DIFdelbegin \DIFdel{the network inferences (NI) step which infers the microbial co-occurrence network}\DIFdelend \DIFaddbegin \ac{op}\DIFadd{; and (v) }\ac{ni}\DIFaddend .
Each process in the pipeline \DIFdelbegin \DIFdel{supports alternate tools for performing the same task }\DIFdelend \DIFaddbegin \DIFadd{is implemented through multiple tools }\DIFaddend (see Methods and Figure~\ref{fig:figure1}).
\DIFdelbegin \DIFdel{A centralized configuration file contains all the specifications for what modules are used in the pipeline , and can be modified by the user to choose the desired set of tools .
In what follows, we perform }\DIFdelend \DIFaddbegin \DIFadd{The effects of changing any intermediate step of the pipeline can be evaluated in terms of the final network outcome, as well as on any of the intermediate metrics and data outputs.
The choice of tools and parameters is encoded in a configuration file (with parameters as shown in Tables S2-S6 at }\href{https://github.com/segrelab/MiCoNE-pipeline-paper}{\DIFadd{https://github.com/segrelab/MiCoNE-pipeline-paper}}\DIFadd{).
Through }\DIFaddend a systematic analysis of \DIFaddbegin \DIFadd{tool combinations at }\DIFaddend each step of the pipeline\DIFdelbegin \DIFdel{to estimate }\DIFdelend \DIFaddbegin \DIFadd{, we estimated }\DIFaddend how much the final co-occurrence network depends on the possible choices at each step.
\DIFdelbegin \DIFdel{We also evaluate a large number of tool combinations to determine a set of recommended default options for the pipeline and provide the users with a set of guidelines to facilitate tool selection as appropriate for their data.
}\DIFdelend
Our analysis \DIFdelbegin \DIFdel{involves }\DIFdelend \DIFaddbegin \DIFadd{involved }\DIFaddend two types of data: The first type \DIFdelbegin \DIFdel{consists of sets of }\DIFdelend \DIFaddbegin \DIFadd{consisted of }\DIFaddend 16S sequencing data from \DIFdelbegin \DIFdel{real communities sampled from human Stool and Oral microbiomes }\DIFdelend \DIFaddbegin \DIFadd{samples of human stool microbiomes from a fecal microbiome transplant (FMT) study of autism~\mbox{%DIFAUXCMD
\cite{Kang2017}}\hskip0pt%DIFAUXCMD
}\DIFaddend .
The second \DIFdelbegin \DIFdel{are }\DIFdelend \DIFaddbegin \DIFadd{type was a collection of }\DIFaddend datasets synthetically or artificially created for the specific goal of \DIFdelbegin \DIFdel{helping evaluate }\DIFdelend \DIFaddbegin \DIFadd{evaluating }\DIFaddend computational analysis tools\DIFdelbegin \DIFdel{(see Methods)}\DIFdelend .
In particular, in order to \DIFdelbegin \DIFdel{objectively compare, to the extent possible, how well }\DIFdelend \DIFaddbegin \DIFadd{benchmark }\DIFaddend each step in \ac{micone}\DIFdelbegin \DIFdel{best captures the underlying data, we use }\DIFdelend \DIFaddbegin \DIFadd{, we used }\DIFaddend both mock data (\DIFdelbegin \DIFdel{labelled }\DIFdelend \DIFaddbegin \DIFadd{labeled }\DIFaddend mock4, mock12\DIFaddbegin \DIFadd{, }\DIFaddend and mock16) from mockrobiota~\cite{Bokulich2016} \DIFdelbegin \DIFdel{as well as, synthetically generated reads from an Illumina read simulator called ART~\mbox{%DIFAUXCMD
\cite{Huang2012}}\hskip0pt%DIFAUXCMD
.
These mock datasets consist of fake sequencing reads generated from reads obtained from synthetic microbial isolates mixed in know proportions. They contain the expected compositions along with the reference sequences for the organisms in the mock community.
The synthetic reads were simulated using three different taxonomy distribution profiles, namely soil and water microbiomes obtained }%DIFDELCMD < \ac{emp}%%%
\DIFdel{~\mbox{%DIFAUXCMD
\cite{Thompson2017} }\hskip0pt%DIFAUXCMD
and Stool microbiome that is used in our real community analysis~\mbox{%DIFAUXCMD
\cite{Kang2017}}\hskip0pt%DIFAUXCMD
.
Reference sequences were generated using }%DIFDELCMD < \ac{ncbi} %%%
\DIFdel{and the Decard package~\mbox{%DIFAUXCMD
\cite{Golob2017} }\hskip0pt%DIFAUXCMD
for these taxonomy profiles.
Detailed information on the mock communities and the settings used to generate the synthetic data are provided in the Methodssection}\DIFdelend \DIFaddbegin \DIFadd{and synthetic networks generated using the NorTA~\mbox{%DIFAUXCMD
\cite{Kurtz2015} }\hskip0pt%DIFAUXCMD
and seqtime~\mbox{%DIFAUXCMD
\cite{Rottjers2018} }\hskip0pt%DIFAUXCMD
approaches (see Methods)}\DIFaddend .
\FloatBarrier
\subsection*{\DIFdelbegin \DIFdel{The choice }\DIFdelend \DIFaddbegin \DIFadd{DC: Denoising and clustering methods differ in their identification }\DIFaddend of \DIFdelbegin \DIFdel{reference database has the biggest impact on inferred networks}\DIFdelend \DIFaddbegin \DIFadd{sequences that are low in abundance}\DIFaddend }
\DIFdelbegin \DIFdel{In order to analyze the effect of different statistical methods on the inferred co-occurrence networks, we generated co-occurrence networks using all possible combinations of methods and estimated the variability in the networks due to each choice (Figure \ref{fig:figure1}).
This analysis is performed while keeping the network inference algorithm (NI step) the same throughout the analysis.
The effects of various steps on the final co-occurrence network is estimated by building a linear model of the edges of the network as a function the various step in the analysis pipeline (see Methods).
Figure \ref{fig:figure2}B, shows the fraction of total variation among the co-occurrence networks due to the first three steps of the pipeline. In other words, each point corresponds to a different combination of tools, and captures how much the final network is affected by such choice.
The 16S reference database contributes the most ($\sim25\%$) to variation in the networks. This is also reflected in the fact that the networks can be clearly separated based on the database used (Figure \ref{fig:figure2}B).
This indicates that the taxonomy assigned to the reference sequences drastically alters the co-occurrence network.
In fact the variability induced by taxonomy assignment is much more significant than that due to the variability induced based on how the reference sequences themselves are identified }\DIFdelend \DIFaddbegin \DIFadd{The }\ac{dc} \DIFadd{step is commonly carried out to generate representative sequences }\DIFaddend (in the \DIFdelbegin \DIFdel{DC step).
The grouping of the networks by taxonomy assignment into clusters (Figure~\ref{fig:figure2}B) seems to derive from the mislabelling of constitutive taxa that are present in high abundance in the community, which drastically alter the nodes and hence the underlying network topology.
The residual variation (Figure \ref{fig:figure2}A) can be seen as an artifact that arises when multiple steps are changed at the same time.
Another interesting observation (elaborated in detail in the denoising and clustering section) is that the dissimilarity between the networks decreases when the low abundance }%DIFDELCMD < \ac{otu}%%%
\DIFdel{s are removed from the network.
These results suggest that the most important criterion for accurate comparative analyses of co-occurrence networks is the taxonomy reference database.
}%DIFDELCMD <
%DIFDELCMD < \FloatBarrier
%DIFDELCMD <
%DIFDELCMD < %%%
\subsection*{\DIFdel{Denoising and clustering methods differ in their identification of less common reference sequences}}
%DIFAUXCMD
%DIFDELCMD <
%DIFDELCMD < %%%
\DIFdel{Denoising and clustering are commonly carried out to generate representative sequences from the raw }\DIFdelend \DIFaddbegin \DIFadd{form of the }\acs{otu}\DIFadd{/}\acs{esv} \DIFadd{tables) from the demultiplexed and trimmed }\DIFaddend 16S sequencing data\DIFdelbegin \DIFdel{and to obtain the }%DIFDELCMD < \ac{otu}%%%
\DIFdel{/}%DIFDELCMD < \ac{esv} %%%
\DIFdel{tables (counts of these representative sequences for each sample)}\DIFdelend .
In order to compare the \DIFdelbegin %DIFDELCMD < \ac{otu} %%%
\DIFdelend \DIFaddbegin \DIFadd{count }\DIFaddend tables generated by different tools\DIFaddbegin \DIFadd{, }\DIFaddend we processed the \DIFdelbegin \DIFdel{same }\DIFdelend 16S sequencing reads (\DIFdelbegin \DIFdel{healthy samples from a fecal microbiome transplant }\DIFdelend \DIFaddbegin \DIFadd{from the FMT }\DIFaddend study~\cite{Kang2017}) using five different methods: open-reference clustering, closed-reference clustering, \DIFdelbegin \DIFdel{denovo }\DIFdelend \DIFaddbegin \DIFadd{de novo }\DIFaddend clustering, \ac{dada2}~\cite{Callahan2016} and Deblur~\cite{Amir2017}.
The first three methods are from the \DIFdelbegin %DIFDELCMD < \ac{qiime1}%%%
\DIFdel{~\mbox{%DIFAUXCMD
\cite{Caporaso2010} }\hskip0pt%DIFAUXCMD
package.
We find that there is good agreement in the }%DIFDELCMD < \ac{otu}%%%
\DIFdel{/}%DIFDELCMD < \ac{esv} %%%
\DIFdel{tables when different combinations of methods are used to generate them (Supplementary Figure~\ref{fig:figureS1}).
}\DIFdelend \DIFaddbegin \DIFadd{vsearch plugin from }\ac{qiime2}\DIFadd{~\mbox{%DIFAUXCMD
\cite{bolyenReproducibleInteractiveScalable2019}}\hskip0pt%DIFAUXCMD
.
The closed and open reference methods in this analysis use the }\acl{gg}\DIFadd{~\mbox{%DIFAUXCMD
\cite{DeSantis2006} }\hskip0pt%DIFAUXCMD
database for reference sequence alignment.
}\DIFaddend
\DIFdelbegin \DIFdel{To compare the representative sequences generated by these methods we employ }\DIFdelend \DIFaddbegin \DIFadd{A comparison of the different methods was carried out by calculating the mean UniFrac distances across all samples (Figure~\ref{fig:figure2}).
The analysis was performed using }\DIFaddend both the weighted \DIFaddbegin \DIFadd{UniFrac}\DIFaddend ~\cite{Lozupone2007} (Figure~\DIFdelbegin \DIFdel{\ref{fig:figure3}A) and unweighted UniFrac method~\mbox{%DIFAUXCMD
\cite{Lozupone2005} }\hskip0pt%DIFAUXCMD
(Figure~\ref{fig:figure3}B).
The weighted UniFrac distance metric}\DIFdelend \DIFaddbegin \DIFadd{\ref{fig:figure2}A) distance metric, which }\DIFaddend takes into account the counts of the representative sequences, \DIFdelbegin \DIFdel{whereas }\DIFdelend \DIFaddbegin \DIFadd{and }\DIFaddend the unweighted UniFrac\DIFdelbegin \DIFdel{distance metric does not and hence }\DIFdelend \DIFaddbegin \DIFadd{~\mbox{%DIFAUXCMD
\cite{Lozupone2005} }\hskip0pt%DIFAUXCMD
(Figure~\ref{fig:figure2}B) distance metric, which }\DIFaddend gives equal weights to each sequence.
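For reference, one standard formulation of the two metrics is sketched below. The notation ($b_i$, $p_i^A$, $p_i^B$) is introduced here only for illustration, and this section does not state whether the raw or a normalized variant of the weighted metric was used. For two sets of sequences $A$ and $B$ placed on a phylogenetic tree with branches $i$ of length $b_i$, let $p_i^A$ and $p_i^B$ denote the fraction of sequences in each set that descend from branch $i$. Then
\[
U_{\mathrm{unweighted}} = \frac{\sum_i b_i \left| \mathbf{1}(p_i^A > 0) - \mathbf{1}(p_i^B > 0) \right|}{\sum_i b_i \max\left( \mathbf{1}(p_i^A > 0),\, \mathbf{1}(p_i^B > 0) \right)},
\qquad
U_{\mathrm{weighted}} = \sum_i b_i \left| p_i^A - p_i^B \right| .
\]
Because the unweighted form depends only on presence or absence, a rare sequence recovered in one set but not the other contributes as much as an abundant one, whereas the weighted form scales each branch by the difference in relative abundance.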
\DIFdelbegin \DIFdel{From Figure~\ref{fig:figure3}A one can see }\DIFdelend \DIFaddbegin
\DIFadd{The first main message emerging from this analysis is }\DIFaddend that the representative sequences generated by the different methods\DIFaddbegin \DIFadd{, with the exception of Deblur, }\DIFaddend are similar to each other when weighted by their abundance \DIFdelbegin \DIFdel{.
Figure~\ref{fig:figure3}B on the other hand shows an increase in dissimilarity between each pair of methods suggesting that the methods might differ in the treatment }\DIFdelend \DIFaddbegin \DIFadd{(Figure~\ref{fig:figure2}A).
A second message is that the different methods differ mainly in the assignment }\DIFaddend of sequences of \DIFdelbegin \DIFdel{low abundance.
In order to verify this claim, for each of these methods we use the }%DIFDELCMD < \ac{gg} %%%
\DIFdel{taxonomy database to assign taxonomies to the representative sequences.
We then correlate the abundances of matching taxonomies between a pair of DC methods (Figure\ref{fig:figureS1}A and B) .
The }%DIFDELCMD < \ac{esv} %%%
\DIFdel{tables generated by methods that perform denoising are very similar to each other ($\sim0.91$) and the }%DIFDELCMD < \ac{otu} %%%
\DIFdel{tables generated by the clustering methods are very similar to each other ($\sim0.9$), but results of denoising and clustering are highly uncorrelated with each other ($\sim0.4$) (Figure \ref{fig:figureS1}C}\DIFdelend \DIFaddbegin \DIFadd{lower abundance.
This can be inferred from the unweighted comparison (Figure~\ref{fig:figure2}B), which shows an increase in dissimilarity between each pair of methods (see additional details in Supplementary and Figure~\ref{fig:figure_s2}}\DIFaddend ).
These comparisons only elucidate the \DIFdelbegin \DIFdel{pairwise similarity or dissimilarity of }\DIFdelend \DIFaddbegin \DIFadd{similarity between }\DIFaddend a pair of methods.
\DIFdelbegin \DIFdel{In order to determine the tool that }\DIFdelend \DIFaddbegin \DIFadd{To determine which tool }\DIFaddend most accurately recapitulates the reference sequences in the samples, we \DIFdelbegin \DIFdel{used the 16S sequences from the mock datasets.
In particular, we used the pipeline to process mock community datasets using each of the possible methods included for this step.
We next compared }\DIFdelend \DIFaddbegin \DIFadd{applied the same pipeline step to process the mock datasets (mock4, mock12, and mock16) and compared the }\DIFaddend predicted representative sequences with \DIFdelbegin \DIFdel{expected representative }\DIFdelend \DIFaddbegin \DIFadd{the true }\DIFaddend sequences and their distribution.
The results (Figure~\DIFdelbegin \DIFdel{\ref{fig:figure3}C and }\DIFdelend \DIFaddbegin \DIFadd{\ref{fig:figure2}C and \ref{fig:figure2}}\DIFaddend D) show that \DIFdelbegin \DIFdel{, for the mock datasets, the different methods perform similar to each other, exactly as observed in the case of the real dataset. However, the mock }\DIFdelend \DIFaddbegin \DIFadd{the }\DIFaddend predicted sequence distributions are \DIFdelbegin \DIFdel{substantially }\DIFdelend \DIFaddbegin \DIFadd{overall }\DIFaddend different from the expected \DIFdelbegin \DIFdel{sequence distribution.
This result is more exaggerated in the case of the unweighted UniFrac metric, where some of the datasets show a very high deviation from the expected sequences.
These high deviations are primarily in two of the three datasets that were analyzed and show that }\DIFdelend \DIFaddbegin \DIFadd{ones.
The variation across datasets indicates that }\DIFaddend the datasets themselves play a big role in \DIFdelbegin \DIFdel{the performanceof these methods.
This can be clearly seen in the performance (weighted UniFrac distance) of }%DIFDELCMD < \ac{dada2} %%%
\DIFdel{and Deblur on mock12 and mock16 datasets, where, Deblur outperforms }%DIFDELCMD < \ac{dada2} %%%
\DIFdel{on mock12 but the under-performs on mock16}\DIFdelend \DIFaddbegin \DIFadd{method performance}\DIFaddend .
\DIFdelbegin %DIFDELCMD <
%DIFDELCMD < %%%
\DIFdel{There }\DIFdelend \DIFaddbegin \DIFadd{We note that there }\DIFaddend is no method that \DIFdelbegin \DIFdel{clearly }\DIFdelend outperforms the rest in all datasets \DIFaddbegin \DIFadd{(see Supplementary for an extended discussion)}\DIFaddend .
Based on their slightly better performance on the mock datasets, their \DIFdelbegin \textit{\DIFdel{de novo}} %DIFAUXCMD
\DIFdelend \DIFaddbegin \DIFadd{de novo }\DIFaddend error-correcting nature and \DIFdelbegin \DIFdel{other previous studies}\DIFdelend \DIFaddbegin \DIFadd{a previous independent evaluation}\DIFaddend ~\cite{Nearing2018}, \ac{dada2} and Deblur \DIFdelbegin \DIFdel{seem to be in general }\DIFdelend \DIFaddbegin \DIFadd{appear to be }\DIFaddend the most reliable.
\DIFdelbegin \DIFdel{Given the unexpected poor performance of Deblur on the synthetic data, the default algorithm in the pipeline was chosen to be }%DIFDELCMD < \ac{dada2} %%%
\DIFdel{(Supplementary Figure~\ref{fig:figureS3}).
}\DIFdelend \DIFaddbegin \DIFadd{This is because the open-reference and de novo clustering methods return a much larger number of }\ac{otu}\DIFadd{s than the other methods, which would affect the accuracy of the network inference step if stringent filtering is not performed.
Overall, since }\ac{dada2} \DIFadd{performs better than Deblur on all the mock datasets under the weighted UniFrac metric, we set this tool as the default for the DC step of the pipeline.
However, if comparison across studies that have sequenced different 16S regions is required, closed-reference or open-reference clustering might be a better option.
}\DIFaddend
\DIFaddbegin \DIFadd{After denoising, the sequences are subjected to chimera checking (CC).
The }\ac{micone} \DIFadd{pipeline supports two different chimera checking methods, ``uchime-denovo''~\mbox{%DIFAUXCMD
\cite{bolyenReproducibleInteractiveScalable2019}}\hskip0pt%DIFAUXCMD
, and ``remove bimera''~\mbox{%DIFAUXCMD
\cite{Callahan2016}}\hskip0pt%DIFAUXCMD
.
We did not observe any notable difference between the two methods (Figure~\ref{fig:figure_s3}), implying that they identify and remove mostly the same set of sequences as chimeras.
Since the remove bimera method was originally developed in conjunction with DADA2, we use this method as the default.
The DC step thus results in a reduced set of unique sequences, which will be referred to as representative sequences in the subsequent steps.
}
\DIFaddend \FloatBarrier
\subsection*{\DIFaddbegin \DIFadd{TA: }\DIFaddend Taxonomy databases vary widely in taxonomy \DIFdelbegin \DIFdel{hierarchy and update frequency}\DIFdelend \DIFaddbegin \DIFadd{assignments beyond Order level}\DIFaddend }
Taxonomy databases are used to assign taxonomic identities to the representative sequences obtained after the DC step.
\DIFdelbegin \DIFdel{In order to compare the assigned taxonomies from different databases, we use the same reference sequences and assign taxonomies to them using different taxonomy reference databases.
}\DIFdelend The three 16S taxonomic reference databases used in this study are SILVA~\cite{Quast2012}, \ac{gg}~\cite{DeSantis2006} and \ac{ncbi} RefSeq~\cite{Sayers2009} \DIFdelbegin \DIFdel{.
SILVA and }%DIFDELCMD < \ac{gg} %%%
\DIFdel{are two popular 16S databases used for taxonomy identification.
The }%DIFDELCMD < \ac{ncbi} %%%
\DIFdel{RefSeq nucleotide database contains 16S rRNA sequences as a part of two BioProjects - 33175 and 33317.
The three databases vastly differ in terms of their last update status - }%DIFDELCMD < \ac{gg} %%%
\DIFdel{was last updated on May 2013, SILVA was last updated on December 2017 at the time of writing and }%DIFDELCMD < \ac{ncbi} %%%
\DIFdel{is updated as new sequences are curated.
Since updates to taxonomic classifications are frequent, these databases vary significantly }\DIFdelend \DIFaddbegin \DIFadd{(Methods).
These databases vary substantially }\DIFaddend in terms of taxonomy hierarchies\DIFaddbegin \DIFadd{, }\DIFaddend including species names and phylogenetic relationships~\cite{Balvociute2017}.
\DIFaddbegin \DIFadd{Assignment using a particular database also requires a query tool.
We used the ``Naive Bayes'' classifier from }\ac{qiime2} \DIFadd{for the }\ac{gg} \DIFadd{and SILVA databases and the ``BLAST'' tool (included as a }\ac{qiime2} \DIFadd{plugin) for the }\ac{ncbi} \DIFadd{database.
These tools have been well quantified and optimized~\mbox{%DIFAUXCMD
\cite{bokulichOptimizingTaxonomicClassification2018}}\hskip0pt%DIFAUXCMD
; hence, we used the default parameters in our analyses.
}\DIFaddend
The representative sequences obtained \DIFdelbegin \DIFdel{from the }%DIFDELCMD < \ac{dada2} %%%
\DIFdel{method in }\DIFdelend \DIFaddbegin \DIFadd{using the default settings of the }\DIFaddend DC step were used for taxonomic assignment using the three reference databases.
Figure~\DIFdelbegin \DIFdel{\ref{fig:figure4}}\DIFdelend \DIFaddbegin \DIFadd{\ref{fig:figure3}}\DIFaddend A depicts a flow diagram that shows how the top 50 representative sequences (sorted by abundance) are assigned a Genus according to the three \DIFdelbegin \DIFdel{different databases.
We observe that not only does }\DIFdelend \DIFaddbegin \DIFadd{databases.
The different databases lead to assignments that qualitatively display similar distributions. However, }\DIFaddend the assigned Genus \DIFdelbegin \DIFdel{composition vary significantly, but }\DIFdelend \DIFaddbegin \DIFadd{compositions also display clear differences, as does }\DIFaddend the percentage of unassigned representative sequences (\DIFdelbegin \DIFdel{gray)also differ.
Even the most abundant }\DIFdelend \DIFaddbegin \DIFadd{pink).
Some of the differences in Genus composition have a clear explanation; for example, abundant Genera such as Bacteroides and Escherichia are assigned to different representative sequences.
The large percentage of unassigned sequences is due to the large fraction of the }\DIFaddend representative \DIFdelbegin \DIFdel{sequence is assigned to }\DIFdelend \DIFaddbegin \DIFadd{sequences assigned to }\DIFaddend an ``unknown'' \DIFdelbegin \DIFdel{Genus in two of the three databases.
A representative sequence might be assigned an "unknown" }\DIFdelend Genus \DIFdelbegin \DIFdel{for one of two reasons: the first is if the taxonomy identifier associated with the sequence in the database did not contain a Genus; the second (more likely)reason is that the database contains multiple sequences that are very similar to the query (representative) sequence and the consensus algorithm (from }%DIFDELCMD < \ac{qiime2}%%%
\DIFdel{) is unable to assign one particular Genus at the required confidence.
After assigning all the representative sequences to taxonomies we perform }\DIFdelend \DIFaddbegin \DIFadd{during the assignment process (Methods).
}
\DIFadd{After the assignment, we performed }\DIFaddend a pairwise comparison of the similarity between \DIFdelbegin \DIFdel{assignments }\DIFdelend \DIFaddbegin \DIFadd{the top 100 assignments (by abundance) }\DIFaddend from different databases at every taxonomic level (Figure~\DIFdelbegin \DIFdel{\ref{fig:figure4}}\DIFdelend \DIFaddbegin \DIFadd{\ref{fig:figure3}}\DIFaddend B).
The \DIFdelbegin \DIFdel{assignments beyond Family }\DIFdelend \DIFaddbegin \DIFadd{comparisons of the assignments below the Order }\DIFaddend level (Family, Genus\DIFaddbegin \DIFadd{, }\DIFaddend and Species) \DIFdelbegin \DIFdel{are very dissimilar with $<70\%$ }\DIFdelend \DIFaddbegin \DIFadd{show less than $45\%$ }\DIFaddend similarity between any pair of databases.
\DIFdelbegin \DIFdel{There are no two reference databases that are more similar than the other pairs, with }%DIFDELCMD < \ac{gg} %%%
\DIFdel{and SILVA producing only marginally similar assignments compared to }%DIFDELCMD < \ac{ncbi}%%%
\DIFdel{.
}\DIFdelend This implies that the taxonomy assignments from each reference database are fairly unique\DIFdelbegin \DIFdel{and are largely responsible for the differences observed in the co-occurrence networks generated from different taxonomy databases.
}%DIFDELCMD <
%DIFDELCMD < %%%
\DIFdel{Supplementary Figure~\ref{fig:figureS4} shows that the top 20 most abundant genera in the three resulting taxonomy composition tables are different.
For example, }\DIFdelend \DIFaddbegin \DIFadd{.
The comparison of all assigned genera (Figure~\ref{fig:figure_s4}), instead of just the top 100, contains a higher percentage of mismatches.
This suggests that, comparatively, }\DIFaddend the most abundant \DIFdelbegin \DIFdel{genus in the }%DIFDELCMD < \ac{gg} %%%
\DIFdel{taxonomy table was }\textit{\DIFdel{Escherichia}} %DIFAUXCMD
\DIFdel{whereas in the SILVA taxonomy table it was }\textit{\DIFdel{Escherichia-Shigella}}%DIFAUXCMD
\DIFdel{.
Although these are minor differences, when comparing a large number of taxonomy composition tables these problems are hard to diagnose.
%DIF < The comparison of all assigned genera instead of the just the top 20 contains the same percentage of matches and mismatches, implying that there does not seem to exist a correlation between abundance and mismatch.
%DIF < This suggests that the most abundant sequences are not necessarily the ones that are consistently matched to the same taxonomies in the different reference databases.
}\DIFdelend \DIFaddbegin \DIFadd{sequences are more consistently matched to the same taxonomies, at least for the dataset tested in the current analysis.
}\DIFaddend
\DIFdelbegin \DIFdel{As in the previous section, these comparisons only indicate similarity or dissimilarity between methods.
In order to }\DIFdelend \DIFaddbegin \DIFadd{To }\DIFaddend obtain an absolute measure of \DIFaddbegin \DIFadd{the }\DIFaddend accuracy of the taxonomic assignments\DIFdelbegin \DIFdel{we use the expected reference }\DIFdelend \DIFaddbegin \DIFadd{, we used the representative }\DIFaddend sequences from the \DIFaddbegin \DIFadd{DC step for }\DIFaddend mock datasets as the query sequences \DIFdelbegin \DIFdel{for the databases }\DIFdelend and the expected taxonomic composition as the standard to compare against\DIFaddbegin \DIFadd{.
We used the Bray-Curtis distance metric~\mbox{%DIFAUXCMD
\cite{virtanenSciPyFundamentalAlgorithms2020} }\hskip0pt%DIFAUXCMD
to calculate the distance between the predicted and expected taxonomic distribution }\DIFaddend (Figure~\DIFdelbegin \DIFdel{\ref{fig:figure4}}\DIFdelend \DIFaddbegin \DIFadd{\ref{fig:figure3}}\DIFaddend C).
\DIFdelbegin \DIFdel{Again, we observe }\DIFdelend \DIFaddbegin \DIFadd{We find }\DIFaddend that none of the databases performs better than the others in absolute terms \DIFaddbegin \DIFadd{and that the dissimilarity with the expected composition is high ($>0.5$ for Family and Genus and $>0.9$ for Species), indicating that all the databases have some limitations when trying to recapture the expected taxonomic composition}\DIFaddend .
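For reference, the Bray-Curtis dissimilarity reported here can be written as follows, where $u$ and $v$ denote the predicted and expected abundance vectors over taxa at a given level (the symbols are introduced here for illustration only and are not taken from the Methods):
\[
d_{\mathrm{BC}}(u, v) = \frac{\sum_{i} \lvert u_i - v_i \rvert}{\sum_{i} \lvert u_i + v_i \rvert}.
\]
For non-negative abundances this quantity lies between $0$ (identical compositions) and $1$ (no taxa in common), so values close to $1$ indicate that very little of the predicted abundance falls on the expected taxa.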
\DIFdelbegin \DIFdel{Given that }\DIFdelend \DIFaddbegin \DIFadd{Since }\DIFaddend no database performs better than others against mock datasets, \DIFdelbegin \DIFdel{and that databases are almost equally distant from each other in terms of final output, }\DIFdelend the choice of which database to use \DIFdelbegin \DIFdel{should }\DIFdelend \DIFaddbegin \DIFadd{could }\DIFaddend be driven by other \DIFdelbegin \DIFdel{reason.
One user-specific way to choose, would be based on the known representation of taxa for the microbiome of interest (see also Discussion).
Another reason }\DIFdelend \DIFaddbegin \DIFadd{reasons (see Supplementary discussion).
One reason to choose a particular database }\DIFaddend could be the frequency of updates and the potential for future growth\DIFdelbegin \DIFdel{, which prompted us to set }\DIFdelend \DIFaddbegin \DIFadd{.
Both }\ac{gg}\DIFadd{, due to its frequent use in the literature~\mbox{%DIFAUXCMD
\cite{Balvociute2017}}\hskip0pt%DIFAUXCMD
, and }\DIFaddend \ac{ncbi}\DIFdelbegin \DIFdel{as the }%DIFDELCMD < \ac{micone} %%%
\DIFdel{standard }\DIFdelend \DIFaddbegin \DIFadd{, due to its regular revision and maintenance, could be good choices }\DIFaddend for taxonomy assignment.
In \DIFdelbegin \DIFdel{addition to being regularly maintained and updated the }%DIFDELCMD < \ac{ncbi} %%%
\DIFdel{database already has the advantage that its accuracy of assignments is still comparable to the SILVA and }\DIFdelend \DIFaddbegin \DIFadd{our pipeline, we chose }\DIFaddend \ac{gg} \DIFdelbegin \DIFdel{reference databases that are routinely used as reference databases}\DIFdelend \DIFaddbegin \DIFadd{as the default database.
}
\DIFadd{The TA step results in a taxonomic counts table that is used as input to the subsequent steps of the pipeline.
Note that the count tables at different levels can be obtained through aggregation; for example, Genus count tables were obtained by summing up the counts of the lower taxonomy levels (Species and }\ac{otu}\DIFadd{) that map to the same higher taxonomy level entity}\DIFaddend .
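The aggregation described above can be sketched in symbols, under notation assumed here purely for illustration: let $c_{t,s}$ be the count of a lower-level entity $t$ (Species or OTU) in sample $s$, and let $G(t)$ be the Genus to which $t$ is assigned. The Genus-level count of genus $g$ in sample $s$ is then
\[
C_{g,s} = \sum_{t \,:\, G(t) = g} c_{t,s}.
\]
The same sum, taken over the appropriate mapping, gives the count table at any higher taxonomic level.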
\FloatBarrier
\subsection*{\DIFdelbegin \DIFdel{Networks generated using different }\DIFdelend \DIFaddbegin \DIFadd{NI: Different }\DIFaddend network inference methods \DIFdelbegin \DIFdel{show notable difference in }\DIFdelend \DIFaddbegin \DIFadd{drastically affect }\DIFaddend edge-density and connectivity}
%DIF < TODO: Talk about the difference between correlations and associations
\DIFdelbegin \DIFdel{The six different }\DIFdelend \DIFaddbegin \DIFadd{The ten }\DIFaddend network inference methods \DIFaddbegin \DIFadd{we }\DIFaddend used in this \DIFdelbegin \DIFdel{study are }%DIFDELCMD < \ac{magma}%%%
\DIFdel{~\mbox{%DIFAUXCMD
\cite{Cougoul2019}}\hskip0pt%DIFAUXCMD
, }%DIFDELCMD < \ac{mldm}%%%
\DIFdel{~\mbox{%DIFAUXCMD
\cite{Yang2017}}\hskip0pt%DIFAUXCMD
, }%DIFDELCMD < \ac{spieceasi}%%%
\DIFdel{~\mbox{%DIFAUXCMD
\cite{Kurtz2015}}\hskip0pt%DIFAUXCMD
, }%DIFDELCMD < \ac{sparcc}%%%
\DIFdel{~\mbox{%DIFAUXCMD
\cite{Friedman2012}}\hskip0pt%DIFAUXCMD
, Spearman and Pearson.
These network inference methods }\DIFdelend \DIFaddbegin \DIFadd{step }\DIFaddend fall into two groups\DIFdelbegin \DIFdel{, }\DIFdelend \DIFaddbegin \DIFadd{: }\DIFaddend the first set of methods (Pearson, Spearman, \DIFdelbegin %DIFDELCMD < \ac{sparcc}%%%
\DIFdelend \DIFaddbegin \acs{sparcc}\DIFadd{~\mbox{%DIFAUXCMD
\cite{Friedman2012,Watts2018}}\hskip0pt%DIFAUXCMD
, and propr~\mbox{%DIFAUXCMD
\cite{quinnProprRpackageIdentifying2017}}\hskip0pt%DIFAUXCMD
}\DIFaddend ) infer pairwise correlations while the second set \DIFdelbegin \DIFdel{infer direct associations (}%DIFDELCMD < \ac{spieceasi}%%%
\DIFdel{, }%DIFDELCMD < \ac{mldm}%%%
\DIFdel{, }%DIFDELCMD < \ac{magma}%%%
\DIFdel{) }\DIFdelend \DIFaddbegin \DIFadd{(}\acs{spieceasi}\DIFadd{~\mbox{%DIFAUXCMD
\cite{Kurtz2015}}\hskip0pt%DIFAUXCMD
, FlashWeave~\mbox{%DIFAUXCMD
\cite{tackmannRapidInferenceDirect2019}}\hskip0pt%DIFAUXCMD
, }\acs{cozine}\DIFadd{~\mbox{%DIFAUXCMD
\cite{haCompositionalZeroinflatedNetwork2020a}}\hskip0pt%DIFAUXCMD
, }\acs{harmonies}\DIFadd{~\mbox{%DIFAUXCMD
\cite{jiangHARMONIESHybridApproach2020}}\hskip0pt%DIFAUXCMD
, }\acs{spring}\DIFadd{~\mbox{%DIFAUXCMD
\cite{yoonMicrobialNetworksSPRING2019}}\hskip0pt%DIFAUXCMD
, and }\acs{mldm}\DIFadd{~\mbox{%DIFAUXCMD
\cite{Yang2017}}\hskip0pt%DIFAUXCMD
) infer direct associations.
Note that while Pearson and Spearman methods are included in the pipeline for completeness, they tend to generate a large number of spurious edges as they are not intended for compositional datasets.
Thus, they are not included in subsequent quantitative analyses}\DIFaddend .
\DIFdelbegin \DIFdel{Pairwise correlation methods involve calculating the correlation coefficient between every pair of }%DIFDELCMD < \ac{otu}%%%
\DIFdel{/}%DIFDELCMD < \ac{esv}%%%
\DIFdel{s leading to the detection of spurious indirect connections.
On the other hand, direct association methods use conditional independence to avoid the detection of correlated but indirectly connected }%DIFDELCMD < \ac{otu}%%%
\DIFdel{s~\mbox{%DIFAUXCMD
\cite{Kurtz2015,Menon2018}}\hskip0pt%DIFAUXCMD
.
}\DIFdelend
\DIFdelbegin \DIFdel{For the analysis presented in this section, we used the taxonomy composition }\DIFdelend \DIFaddbegin \DIFadd{Filtered (see }\ac{op} \DIFadd{step in Methods) genus-level counts }\DIFaddend table obtained using the \DIFdelbegin %DIFDELCMD < \ac{ncbi} %%%
\DIFdel{reference database as the input for algorithms that infer co-occurrence associations between the microbes.
Figure~\ref{fig:figure5}Ashows the networks inferred from this dataset using the different inference algorithms.
The different }\DIFdelend \DIFaddbegin \DIFadd{default settings in the previous steps were used as input for the different network inference algorithms (Figure~\ref{fig:figure4}).
Even from a visual inspection (Figure~\ref{fig:figure4}A), one can see that the different }\DIFaddend networks differ vastly in their edge-density and connectivity\DIFdelbegin \DIFdel{; even some of the edges in common to these networks have their signs inverted . Note, however, that some of these comparisons depend on the threshold that has to be applied to the pairwise correlations methods (currently 0.3, based on~\mbox{%DIFAUXCMD
\cite{Friedman2012}}\hskip0pt%DIFAUXCMD
).
To get a more quantitative picture of }\DIFdelend \DIFaddbegin \DIFadd{, with common edges often displaying inverted signs.
}
\DIFadd{To quantify }\DIFaddend the differences between the \DIFdelbegin \DIFdel{inferred }\DIFdelend networks, we \DIFdelbegin \DIFdel{checked }\DIFdelend \DIFaddbegin \DIFadd{analyzed }\DIFaddend the distribution of common nodes and edges (Figure~\DIFdelbegin \DIFdel{\ref{fig:figure5}B }\DIFdelend \DIFaddbegin \DIFadd{\ref{fig:figure4}B and \ref{fig:figure4}C}\DIFaddend ) using UpSet plots~\DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD
\cite{Lex} }\hskip0pt%DIFAUXCMD
(only }%DIFDELCMD < \ac{magma}%%%
\DIFdel{, }%DIFDELCMD < \ac{mldm}%%%
\DIFdel{, }%DIFDELCMD < \ac{spieceasi}%%%
\DIFdel{, }%DIFDELCMD < \ac{sparcc} %%%
\DIFdel{are used in the comparison since Pearson and Spearman add a large number of spurious edges since they are not intended for compositional datasets).
The results for the node intersections show }\DIFdelend \DIFaddbegin \DIFadd{\mbox{%DIFAUXCMD
\cite{lexUpSetVisualizationIntersecting2014}}\hskip0pt%DIFAUXCMD
.
The node intersection analysis shows }\DIFaddend that the networks have \DIFdelbegin \DIFdel{a large number of nodes in common ($63$ out of $67$ nodes in the smallest network - }%DIFDELCMD < \ac{magma}%%%
\DIFdel{) and }\DIFdelend \DIFaddbegin \DIFadd{$33$ out of $68$ total unique nodes in common and that }\DIFaddend no network possesses \DIFdelbegin \DIFdel{any }\DIFdelend \DIFaddbegin \DIFadd{a }\DIFaddend unique node.
\DIFdelbegin \DIFdel{The edge }\DIFdelend \DIFaddbegin \DIFadd{Edge }\DIFaddend intersections in contrast show that only \DIFdelbegin \DIFdel{$19$ }\DIFdelend \DIFaddbegin \DIFadd{$8$ }\DIFaddend edges (out of \DIFdelbegin \DIFdel{$98$ edgesin the smallest network - }%DIFDELCMD < \ac{magma}%%%
\DIFdelend \DIFaddbegin \DIFadd{$202$ total unique edges}\DIFaddend ) are in common between all the methods and each network has \DIFdelbegin \DIFdel{a large number of }\DIFdelend \DIFaddbegin \DIFadd{many }\DIFaddend unique edges.
These results \DIFdelbegin \DIFdel{indicate that there is }\DIFdelend \DIFaddbegin \DIFadd{showed }\DIFaddend a substantial rewiring of connections in \DIFdelbegin \DIFdel{the }\DIFdelend \DIFaddbegin \DIFadd{different inferred networks and prompted us to identify associations robust across methods, through consensus algorithms.
}
\FloatBarrier
\subsection*{\DIFadd{NI: The scaled-sum consensus method shows high precision on benchmark datasets}}
\DIFadd{Inspired by previous approaches~\mbox{%DIFAUXCMD
\cite{bustinceFuzzySetsTheir2008,tsarevApplicationMajorityVoting2018}}\hskip0pt%DIFAUXCMD
, we developed two methods that take into consideration the evidence offered by each network inference algorithm and generate a consensus network that contains the common edges among the }\DIFaddend inferred networks.
\DIFdelbegin \DIFdel{Unlike the }\DIFdelend \DIFaddbegin \DIFadd{Both of our approaches, simple voting (SV) and scaled-sum (SS), combine appropriately filtered networks inferred from correlation-based and direct association methods (see Methods).
We chose the scaled-sum method as the pipeline default since this method takes into account the weights of the associations in the determination of the final consensus.
The pipeline enables the selection of any subset of methods for the consensus calculation. Currently, by default, all direct methods are used, together with }\acs{sparcc} \DIFadd{and propr.
}
\DIFadd{Similar to what was done for the }\DIFaddend previous steps of the pipeline, \DIFdelbegin \DIFdel{where were }\DIFdelend \DIFaddbegin \DIFadd{and in analogy with previous estimations of network inference accuracy~\mbox{%DIFAUXCMD
\cite{Kurtz2015,Weiss2016}}\hskip0pt%DIFAUXCMD
, }\DIFaddend we evaluated the \DIFdelbegin \DIFdel{performance of methods on mock datasets, there is no equivalent dataset that contain a set of known interactions for }\DIFdelend \DIFaddbegin \DIFadd{network inference algorithms and the final consensus network using synthetic interaction data.
For this purpose, we generated synthetic interaction data using the ``NorTA''~\mbox{%DIFAUXCMD
\cite{Kurtz2015} }\hskip0pt%DIFAUXCMD
and ``seqtime''~\mbox{%DIFAUXCMD
\cite{faustSignaturesEcologicalProcesses2018} }\hskip0pt%DIFAUXCMD
methods (see Methods).
For each method, an }\ac{otu} \DIFadd{counts table was generated based on the selected parameters and abundance distributions.
This counts table was used as the input to the }\ac{micone} \DIFadd{pipeline to generate predicted associations.
The interaction network used to generate the counts table was used as the source of true interactions to calculate the precision (Figure \ref{fig:figure5}) and sensitivity (Figure \ref{fig:figure_s5} and Figure \ref{fig:figure_s6}) for each network inference algorithm.
As shown in Figure~\ref{fig:figure5}, the consensus algorithm, especially the scaled-sum method, captures true associations with high precision (through the removal of edges that are either not present in most of the inference methods or whose association strength is low across methods).
Overall, }\DIFaddend the \DIFdelbegin \DIFdel{evaluation of the network inference algorithms.
Therefore, we propose the construction of a consensus network (Figure~\ref{fig:figure5}C)involving }%DIFDELCMD < \ac{magma}%%%
\DIFdel{, }%DIFDELCMD < \ac{mldm}%%%
\DIFdel{, }%DIFDELCMD < \ac{spieceasi} %%%
\DIFdel{and }%DIFDELCMD < \ac{sparcc}%%%
\DIFdel{.
This consensus network is built by merging the p-values generated from bootstraps of the original taxonomy composition table using the Browns p-value combining method~\mbox{%DIFAUXCMD
\cite{Poole} }\hskip0pt%DIFAUXCMD
}\DIFdelend \DIFaddbegin \DIFadd{scaled-sum method for $p=1.000$ performs the best (precision = $1.000$ for both NorTA and seqtime).
The scaled-sum method for $p=0.333$ (default option in the pipeline) shows a high precision ($0.956$ with NorTA; $0.688$ with seqtime), without displaying significant reduction in sensitivity (Figure~\ref{fig:figure_s5} and Figure~\ref{fig:figure_s6}).
However, if higher precision is required, $p>0.5$ can be considered.
Therefore, the consensus networks provide the means to obtain a short list of associations that would have a high likelihood of being present in the real association network.
}
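For reference, precision and sensitivity can be read here in their standard sense. As a sketch, with notation assumed for illustration (the Methods remain the authority on how edges, and in particular their signs, are matched), let $E_{\mathrm{true}}$ be the set of edges in the interaction network used to generate the synthetic counts and $E_{\mathrm{pred}}$ the set of edges predicted by a given algorithm:
\[
\mathrm{precision} = \frac{\lvert E_{\mathrm{pred}} \cap E_{\mathrm{true}} \rvert}{\lvert E_{\mathrm{pred}} \rvert},
\qquad
\mathrm{sensitivity} = \frac{\lvert E_{\mathrm{pred}} \cap E_{\mathrm{true}} \rvert}{\lvert E_{\mathrm{true}} \rvert}.
\]
Under these definitions, a stricter consensus threshold shrinks $E_{\mathrm{pred}}$, which can raise precision while lowering sensitivity.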
\FloatBarrier
\subsection*{\DIFadd{Impact of different pipeline steps on co-occurrence networks}}
\DIFadd{In order to analyze the effect of different processing methods on the inferred co-occurrence networks (before consensus estimation), we generated networks using all possible combinations of methods and quantified the variability due to each choice (Figure \ref{fig:figure6}A).
This was achieved by building a linear model of the edges of the network as a function of the various steps in the pipeline workflow }\DIFaddend (see Methods\DIFdelbegin \DIFdel{section).
Based on this approach, }%DIFDELCMD < \ac{micone} %%%
\DIFdel{reports as default output the consensus network , annotated with weights (correlations for }%DIFDELCMD < \ac{sparcc} %%%
\DIFdel{and direct associations for the other methods) for all four methods}\DIFdelend \DIFaddbegin \DIFadd{)}\DIFaddend .
\DIFaddbegin \DIFadd{Figure~\ref{fig:figure6}A shows the percentage of total variation among the co-occurrence networks due to the different steps of the pipeline.
The }\ac{ta} \DIFadd{step, or more specifically the choice of 16S reference database, contributes the most ($65.4\%$) to the variation in the networks, followed by the }\ac{op} \DIFadd{step ($26.8\%$).
This result highlights the importance of the taxonomy assignment step in the 16S data analysis workflow, implying that a change in the reference database will result in drastically different inferred networks.
This is likely due to the differential assignment of representative sequences to taxonomic entities (Figure~\ref{fig:figure3} and Figure~\ref{fig:figure_s4}), which drastically alters the nodes and hence the underlying network topology.
}\DIFaddend
\DIFaddbegin \DIFadd{The effects of the different steps of the pipeline on the inferred networks can be visualized through dimensionality reduction.
The PCA in Figure \ref{fig:figure6}B shows all the above networks, colored by the tools used in the DC, TA, OP, and NI steps in each subfigure.
The major effect of the TA step choice, shown before in Figure \ref{fig:figure6}A, is also reflected in the PCA plot, where networks segregate based on the database used (Figure~\ref{fig:figure6}B and Figure~\ref{fig:figure_s1}).
Additionally, the plot also shows that the variation between the networks decreases when the low abundance }\ac{otu}\DIFadd{s are removed from the network.
It is also evident that in the NI step, some networks, especially those inferred using the direct association network inference methods, are much closer in the PCA plot regardless of the reference database used.
These results suggest that the most important criterion for accurate comparative analysis of co-occurrence networks is the taxonomy reference database followed by the level of filtering of the taxonomy tables and the network inference algorithm used.
}
\DIFaddend \FloatBarrier
\DIFdelbegin \subsubsection*{\DIFdel{The default pipeline}}
%DIFAUXCMD
\DIFdelend \DIFaddbegin \subsection*{\DIFadd{The default pipeline}}
\DIFaddend
The systematic analyses \DIFdelbegin \DIFdel{performed }\DIFdelend in the previous sections \DIFdelbegin \DIFdel{clearly show }\DIFdelend \DIFaddbegin \DIFadd{illustrate }\DIFaddend that the choice of tools and parameters can have a big impact on the final \DIFaddbegin \DIFadd{consensus }\DIFaddend co-occurrence network.
\DIFdelbegin \DIFdel{For some of these choices (e.g. }%DIFDELCMD < \ac{dada2} %%%
\DIFdel{vs. deblur) there is no clear metric to establish a best protocol.
For other choices}\DIFdelend \DIFaddbegin \DIFadd{However}\DIFaddend , the mock communities \DIFaddbegin \DIFadd{and synthetic data }\DIFaddend provide an opportunity to select \DIFdelbegin \DIFdel{combination of parameters that yield more }\DIFdelend \DIFaddbegin \DIFadd{combinations of tools that yield the most }\DIFaddend accurate and robust results.
\DIFdelbegin \DIFdel{Despite this partial degree of assessment, we wish to suggest a combination }\DIFdelend \DIFaddbegin \DIFadd{As highlighted in the above sections for individual steps, we propose a set }\DIFaddend of tools and parameters \DIFdelbegin \DIFdel{that produce networks that are derived from the combination of tools which performed best on the mock communities, and displayed highest robustness to switching to alternative methods.
These tools and parameters are chosen }\DIFdelend as the defaults for the pipeline \DIFdelbegin \DIFdel{and are given in Table~\ref{tab:default_options}}\DIFdelend \DIFaddbegin \DIFadd{(Table~\ref{tab:micone_tools})}\DIFaddend .
\DIFdelbegin \DIFdel{The recommended tool for the }%DIFDELCMD < \ac{dc} %%%
\DIFdel{step (}%DIFDELCMD < \ac{dada2} %%%
\DIFdel{or Deblur) were chosen based on their accuracy in recapitulating the reference sequences in mock communities and synthetic data.
The choice of }\DIFdelend \DIFaddbegin \DIFadd{Figure~\ref{fig:figure7} shows the co-occurrence networks inferred for the healthy subjects (control) and subjects with autism specific disorder (ASD) in }\DIFaddend the \DIFdelbegin \DIFdel{taxonomy reference database in the }%DIFDELCMD < \ac{ta} %%%
\DIFdel{step is dictated largely by the species expected to be present in the sample as well the database used in similar studies if comparison is a goal.
Nevertheless, we suggest }%DIFDELCMD < \ac{ncbi} %%%
\DIFdel{RefSeq along with blast+ as the query tool since the database is updated regularly and has a broad collection of taxonomies.
The abundance threshold at the }%DIFDELCMD < \ac{op} %%%
\DIFdel{step is determined automatically based on the number of samplesand the required statistical power.
Finally, we use the Browns p-value combining method on the networksgenerated using }%DIFDELCMD < \ac{magma}%%%
\DIFdel{, }%DIFDELCMD < \ac{mldm}%%%
\DIFdel{, }%DIFDELCMD < \ac{spieceasi} %%%
\DIFdel{and }%DIFDELCMD < \ac{sparcc} %%%
\DIFdel{to obtain a final consensus networkin the }%DIFDELCMD < \ac{ni} %%%
\DIFdel{step}\DIFdelend \DIFaddbegin \DIFadd{fecal microbiome transplant study~\mbox{%DIFAUXCMD
\cite{Kang2017} }\hskip0pt%DIFAUXCMD
(constructed using the default tools and parameters from Table~\ref{tab:micone_tools}).
This figure demonstrates a typical use case of comparative analysis of networks using the }\ac{micone} \DIFadd{pipeline.
As a consequence of using the consensus network algorithm, the final co-occurrence networks are sparse and can be visually compared and examined.
}
\DIFadd{The analysis of the rewiring of associations in the ASD samples with respect to the control provides a guide for the identification of key genera that could be linked to dysbiosis.
We observed 22 unique links in the network for control samples, 12 unique links in the network for ASD subjects, and 7 edges in common between the two networks.
Although these unique associations do not imply actual interactions, they can still serve as potential starting points for literature surveys and further experimental exploration of mechanistic processes underlying dysbiosis.
For example, }\textit{\DIFadd{Prevotella}} \DIFadd{and }\textit{\DIFadd{Porphyromonas}}\DIFadd{, genera previously implicated in ASD~\mbox{%DIFAUXCMD
\cite{Kang2017,hoGutMicrobiotaChanges2020} }\hskip0pt%DIFAUXCMD
and cognitive impairment~\mbox{%DIFAUXCMD
\cite{chiPorphyromonasGingivalisInducedCognitive2021} }\hskip0pt%DIFAUXCMD
display modified connectivity in our network, suggesting that the observed associations may be relevant for understanding the role of these bacteria in disease.
Additional visualization and comparison of networks can be performed using the }\acf{mind}\DIFadd{~\mbox{%DIFAUXCMD
\cite{huResourceComparisonIntegration2022}}\hskip0pt%DIFAUXCMD
}\DIFaddend .
Figure~\DIFdelbegin \DIFdel{\ref{fig:figure6}A shows }\DIFdelend \DIFaddbegin \DIFadd{\ref{fig:figure_s7} shows a sensitivity analysis in which we compared }\DIFaddend the default network \DIFdelbegin \DIFdel{compared }\DIFdelend against networks generated by altering one of the steps of the pipeline \DIFdelbegin \DIFdel{from }\DIFdelend \DIFaddbegin \DIFadd{relative to }\DIFaddend the default.
\DIFdelbegin \DIFdel{These results indicate that the biggest differences in networks occur when the reference database or the network inference algorithm are changed.
Furthermore, the L1 distance of networks generated by altering one of the steps of the pipeline from the default against the default network }\DIFdelend \DIFaddbegin \DIFadd{This result, both visually }\DIFaddend (Figure~\DIFdelbegin \DIFdel{\ref{fig:figure6}B) shows that the biggest deviations from the default network }\DIFdelend \DIFaddbegin \DIFadd{\ref{fig:figure_s7} A), and quantitatively (Figure~\ref{fig:figure_s7} B) suggests that the most significant changes }\DIFaddend occur when the \DIFaddbegin \ac{op} \DIFadd{or }\DIFaddend \ac{ta} \DIFdelbegin \DIFdel{and }%DIFDELCMD < \ac{ni} %%%
\DIFdelend steps are changed \DIFdelbegin \DIFdel{, reinforcing the same results observed in Figure~\ref{fig:figure2}. Figure~\ref{fig:figure7} shows the co-occurrence networks inferred for the hard palate for healthy subjects in a periodontal disease study~\mbox{%DIFAUXCMD
\cite{Chen2018} }\hskip0pt%DIFAUXCMD
and the healthy stool microbiome in fecal microbial transplant study~\mbox{%DIFAUXCMD
\cite{Kang2017}}\hskip0pt%DIFAUXCMD
. These consensus networks were generated using the default tools and parameters from Table~\ref{tab:default_options}}\DIFdelend \DIFaddbegin \DIFadd{from the default value}\DIFaddend .
% DISCUSSION
%!TEX root = ../main.tex
\section*{Discussion}
%DIF < General statements
\DIFdelbegin \DIFdel{Co-occurrence associations in microbial communities help identify important interactions that drive microbial community structure and organization.
Our analysis shows }\DIFdelend \DIFaddbegin \subsection*{\DIFadd{Why }\ac{micone}\DIFadd{?}}
\DIFadd{A myriad of tools and methods have been developed for different parts of the workflow for inference of co-occurrence networks from 16S rRNA data.
Our analyses have shown }\DIFaddend that networks generated using different combinations of tools and approaches can \DIFdelbegin \DIFdel{look significantly }\DIFdelend \DIFaddbegin \DIFadd{be substantially }\DIFaddend different from each other, highlighting the \DIFdelbegin \DIFdel{importance of a clear assessment }\DIFdelend \DIFaddbegin \DIFadd{need for a clear evaluation }\DIFaddend of the source of variability and \DIFdelbegin \DIFdel{of }\DIFdelend \DIFaddbegin \DIFadd{for }\DIFaddend tools that provide the most robust and accurate results.
Our newly developed \DIFdelbegin \DIFdel{integrated software}\DIFdelend \DIFaddbegin \DIFadd{software, }\ac{micone}\DIFadd{, is a customizable pipeline }\DIFaddend for the inference of co-occurrence networks from 16S rRNA data \DIFdelbegin \DIFdel{, }%DIFDELCMD < \ac{micone}%%%
\DIFdel{, constitutes a freely customizable and user friendly pipeline that allows users to easily test combinations of tools and to }\DIFdelend \DIFaddbegin \DIFadd{that enables users to }\DIFaddend compare networks generated by multiple possible \DIFdelbegin \DIFdel{choices (see Methods)}\DIFdelend \DIFaddbegin \DIFadd{combinations of tools and parameters}\DIFaddend .
Importantly, in addition to revisiting the test cases presented in this work, users will be able to explore the effect of various tool combinations on their own datasets of interest.
The \ac{micone} pipeline \DIFdelbegin \DIFdel{is }\DIFdelend \DIFaddbegin \DIFadd{has been }\DIFaddend built in a modular fashion\DIFdelbegin \DIFdel{.
Its }\DIFdelend \DIFaddbegin \DIFadd{; its }\DIFaddend plug-and-play architecture \DIFdelbegin \DIFdel{will make it possible for }\DIFdelend \DIFaddbegin \DIFadd{enables }\DIFaddend users to add new tools and steps, either \DIFdelbegin \DIFdel{from existing packages , or from packages that were not }\DIFdelend \DIFaddbegin \DIFadd{using existing packages that have not been }\DIFaddend examined in the present work \DIFdelbegin \DIFdel{, as well as futureones.
}%DIFDELCMD <
%DIFDELCMD < %%%
\DIFdel{The main outcome of this work is thus two-fold: on one hand we transparently reveal }\DIFdelend \DIFaddbegin \DIFadd{or those developed in the future.
The }\ac{micone} \DIFadd{Python package provides functions and methods to perform a detailed analysis of the count matrices and the co-occurrence networks.
The inferred networks are exported to a custom JSON format (see Supplementary) by default, but can also be exported to Cytoscape~\mbox{%DIFAUXCMD
\cite{shannonCytoscapeSoftwareEnvironment2003}}\hskip0pt%DIFAUXCMD
, GML~\mbox{%DIFAUXCMD
\cite{himsoltGMLPortableGraph2010}}\hskip0pt%DIFAUXCMD
, and many other popular formats via the Python package.
}
\DIFadd{While several tools/workflows such as }\ac{qiime2}\DIFadd{~\mbox{%DIFAUXCMD
\cite{bolyenReproducibleInteractiveScalable2019} }\hskip0pt%DIFAUXCMD
and NetCoMi~\mbox{%DIFAUXCMD
\cite{peschelNetCoMiNetworkConstruction2020} }\hskip0pt%DIFAUXCMD
can be used to generate co-occurrence networks from 16S sequencing data, no single tool exists that integrates the complete process of inferring microbial interaction networks from 16S sequencing reads.
}\ac{micone} \DIFadd{is unique in that it offers this functionality packaged in a workflow that can be run locally, on a compute cluster, or in the cloud.
}
\subsection*{\DIFadd{The default pipeline and recommended tools}}
\DIFadd{Through }\ac{micone}\DIFadd{, in addition to transparently revealing }\DIFaddend the dependence of co-occurrence networks on tool and parameter choices \DIFdelbegin \DIFdel{, making it possible to more rigorously assess and compare existing networks.
On the other hand, we take }\DIFdelend \DIFaddbegin \DIFadd{(see Discussion in Supplementary Text for details on the DC, TA and OP steps), we have taken }\DIFaddend advantage of our spectrum of computational options and the availability of mock and synthetic datasets, to suggest a default standard setting\DIFdelbegin \DIFdel{, and }\DIFdelend \DIFaddbegin \DIFadd{.
Additionally, we have developed }\DIFaddend a consensus approach, \DIFdelbegin \DIFdel{likely to yield }\DIFdelend \DIFaddbegin \DIFadd{that can reliably generate }\DIFaddend networks that are \DIFaddbegin \DIFadd{fairly }\DIFaddend robust across multiple tool \DIFdelbegin \DIFdel{/parameter choices.
}%DIFDELCMD <
%DIFDELCMD < %%%
\DIFdelend \DIFaddbegin \DIFadd{choices.
}\DIFaddend An important caveat related to \DIFdelbegin \DIFdel{this last point is the fact that }\DIFdelend \DIFaddbegin \DIFadd{these results is that due to the lack of a universal standard for microbial interaction data, }\DIFaddend our conclusions are based on the specific datasets used in our analysis.
While our \DIFdelbegin \DIFdel{datasets cover a relatively broad spectrum of biomes and sequencing pipelines}\DIFdelend \DIFaddbegin \DIFadd{analysis is based on several mock and synthetic datasets that cover a diverse range of abundance distributions and network topologies}\DIFaddend , datasets that have drastically different distributions may require a re-assessment of the best settings\DIFdelbegin \DIFdel{through our pipeline}\DIFdelend .
\DIFdelbegin \DIFdel{It is worth pointing out some additional more specific conclusions stemming from the individual steps of our analysis.