Merging two V4 runs together #1968

MadsBjornsen · 2024-06-04T12:38:11Z

Hi Ben and team,

I am currently working on two V4-region datasets that I received from elsewhere. For the first dataset I have both forward and reverse reads prior to merging (2x150 bp); the reads of the second dataset are already merged (also 2x150 bp). I have some questions about how to go about combining the two datasets.

From what I have seen, I can handle the already merged reads as single-end reads. Is that correct?
Can I combine the two datasets without problems? Or could I run into trouble because sequences from the same bacteria differ in length between the datasets and therefore do not get combined in the end?
For this I am thinking of using the collapseNoMismatch function, but I am not sure, as the data in the first dataset is not the best and we might run into trouble there.

When I look at the sequence lengths of the datasets prior to combining them, they look as follows:
First dataset:
table(nchar(getSequences(seqtab_UMAMI)))
150 151 152 153 154 155 158 159 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
2 2 2 5 4 2 4 3 3 2 6 4 8 2 5 6 5 2 4 7 2 1 4 1 3 3 3
180 181 183 184 185 186 188 189 190 191 192 193 194 195 196 197 199 200 201 202 203 204 205 206 207 208 209
3 3 4 5 10 3 5 2 2 5 2 4 1 2 5 4 3 7 3 1 5 49 353 309 494 13 1
210 211 212 213 215 216 217 218 219 220 221 222 224 225 228 231 234 240 242 243 246 249 252 253 258 259 260
34 1 42 4 5 3 38 75 3 1 2 10 20 1 2 1 2 3 72 1 3 2 1 2 1 3 1
261 262 265 266 275 276 280 281 287 289 290 291 292 293
1 7 6 1 1 13 1 1 16 4 20 1134 15085 477

Second dataset:
table(nchar(getSequences(seqtab)))
290
12259

As far as I know, the biological length of V4 is around 254 bp, right? So could the reason that the majority of the sequences are 290 and 292 bp long be that the primers were not removed, and could that be fixed with trimLeft?

Thank you.

hjarnek · 2024-07-22T19:22:41Z

Not part of the development team, but here are my two cents.

> From what I have seen, I can handle the already merged reads as single-end reads. Is that correct?

You will mess up the denoising process if you merge first, because DADA2 uses the original quality scores to model the sequencing run error profile. But if the merged reads are all you've got, I guess you've gotta do the best you can with them as single-end data.

> Can I combine the two datasets without problems? Or could I run into trouble because sequences from the same bacteria differ in length between the datasets and therefore do not get combined in the end?

DADA2 denoising is based on modelling the sequencing run error profile, so you should not pool datasets from different sequencing runs before running learnErrors and dada. Since half your data is already merged, it should be treated as a separate sequencing run regardless. Once dada has done its job on each dataset and each one has its own sequence table (with the read pairs of the first dataset merged by mergePairs), you can combine the tables with mergeSequenceTables.
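A minimal sketch of that per-run workflow, assuming the dada2 package is installed and the reads have already been filtered. The helper names `denoise_paired_run`, `denoise_single_run` and `combine_runs` are made up for illustration and are not part of DADA2. Only the dada2 functions called inside them are real.

```r
# Hypothetical helpers, not part of dada2. The dada2:: calls are resolved only
# when a helper is run, so dada2 has to be installed at that point.

# Dataset 1: filtered forward and reverse FASTQ files, one pair per sample.
denoise_paired_run <- function(filtFs, filtRs, multithread = TRUE) {
  # Error models are learned from this run only.
  errF <- dada2::learnErrors(filtFs, multithread = multithread)
  errR <- dada2::learnErrors(filtRs, multithread = multithread)
  dadaFs <- dada2::dada(filtFs, err = errF, multithread = multithread)
  dadaRs <- dada2::dada(filtRs, err = errR, multithread = multithread)
  # Read pairs are merged after denoising, not before.
  mergers <- dada2::mergePairs(dadaFs, filtFs, dadaRs, filtRs)
  dada2::makeSequenceTable(mergers)
}

# Dataset 2: already merged reads, treated as single-end data.
denoise_single_run <- function(filts, multithread = TRUE) {
  err <- dada2::learnErrors(filts, multithread = multithread)
  dd <- dada2::dada(filts, err = err, multithread = multithread)
  dada2::makeSequenceTable(dd)
}

# Combine the per-run tables; optionally collapse sequences that are identical
# apart from length differences.
combine_runs <- function(seqtab1, seqtab2, collapse = FALSE) {
  combined <- dada2::mergeSequenceTables(seqtab1, seqtab2)
  if (collapse) {
    combined <- dada2::collapseNoMismatch(combined)
  }
  combined
}

# Usage (the file vectors are placeholders for your own filtered files):
# seqtab1 <- denoise_paired_run(filtFs_run1, filtRs_run1)
# seqtab2 <- denoise_single_run(filts_run2)
# seqtab_all <- combine_runs(seqtab1, seqtab2)
```

By default mergeSequenceTables stops with an error if the same sample name occurs in both tables, so the two datasets need distinct sample names.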

> As far as I know, the biological length of V4 is around 254 bp, right? So could the reason that the majority of the sequences are 290 and 292 bp long be that the primers were not removed, and could that be fixed with trimLeft?

Could be. You should use a dedicated adapter removal program to remove primers though, such as cutadapt, and not trimLeft. Cutadapt will output statistics on how many reads were adapter-trimmed, giving you an indication whether it had been done before. If your data has not been adapter-trimmed yet, you should use the --discard-untrimmed flag in cutadapt to remove reads that don't contain the primer sequences.
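The length arithmetic behind the primer hypothesis is quick to check. The sketch below assumes the widely used 515F/806R V4 primer pair, which this thread does not confirm, so substitute whichever primers were actually used. It also takes 253 bp as the typical V4 length after primer removal, close to the "around 254" in the question.

```r
# Assumption: the runs used the 515F/806R V4 primers (not stated in the thread).
fwd_primer <- "GTGCCAGCMGCCGCGGTAA"   # 515F
rev_primer <- "GGACTACHVGGGTWTCTAAT"  # 806R

# Assumption: typical V4 amplicon length once both primers are removed.
v4_insert_length <- 253

primer_total <- nchar(fwd_primer) + nchar(rev_primer)

# Expected amplicon length if neither primer has been removed.
expected_with_primers <- v4_insert_length + primer_total
print(expected_with_primers)  # 292

# Lengths left after subtracting both primers from the dominant lengths
# reported in the question (290 and 292).
observed_modes <- c(290, 292)
implied_insert <- observed_modes - primer_total
print(implied_insert)  # 251 253
```

Under these assumptions the 292 bp peak in the first dataset is what an untrimmed V4 amplicon would look like. That is only consistent with the hypothesis, not proof of it; the cutadapt statistics remain the direct check.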
