False negatives - some sort of interaction with gzip / bgzip #26

rmcolq · 2024-08-06T15:40:08Z

Strange bug, and might be more of a user error

I've created "broken.fa.gz" by downloading 10 public genomes and compressing the file with bgzip.
I've created "works.fa.gz" by uncompressing this file and recompressing with gzip.

If I build a COBS index with the broken file then query the index with each of the 10 sequences, it finds the first 6, 95% of the 7th and ~60% of the remaining 3.

All works as expected with the second file and it finds all the kmers for each of the reference sequences.

broken.fa.gz
works.fa.gz

minimal script:

#!/usr/bin/env python

import cobs_index as cobs
from Bio import SeqIO
import gzip

p = cobs.CompactIndexParameters()
p.term_size = 31               # k-mer size
p.clobber = True               # overwrite output and temporary files
p.false_positive_rate = 0.4    # higher false positive rate -> smaller index

broken = "broken.fa.gz"
cobs.compact_construct(broken, "broken_index.cobs_compact", index_params=p)
s = cobs.Search("broken_index.cobs_compact")

with gzip.open(broken, "rt") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        l = len(record) - 30
        print(record.id, l)
        r = s.search(str(record.seq), num_results=2)
        for result in r:
            print(result.score, result.doc_name, float(result.score)/l)

works = "works.fa.gz"
cobs.compact_construct(works, "works_index.cobs_compact", index_params=p)
s = cobs.Search("works_index.cobs_compact")

with gzip.open(works, "rt") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        l = len(record) - 30
        print(record.id, l)
        r = s.search(str(record.seq), num_results=2)
        for result in r:
            print(result.score, result.doc_name, float(result.score)/l)

The text was updated successfully, but these errors were encountered:

shenwei356 · 2024-08-06T15:59:39Z

I tried to index with cobs 0.3.0 with these two files. It funny to see that the numbers of sequences are different.

Works (10 sequences)

> cobs compact-construct works.fa.gz  works.cobs_compact

--- document list (1 entries) ---
FastaFile: loading index works.fa.gz.cobs_cache [10 subsequences]
document[0] size 25796 31-mers 92788 : works.fa.gz : works
--- end of document list (1 entries) ---
documents: 1
minimum 31-mers: 92788
maximum 31-mers: 92788
average 31-mers: 92788
total 31-mers: 92788

Broken (7 sequences).

$ cobs compact-construct broken.fa.gz  broken.cobs_compact

--- document list (1 entries) ---
FastaFile: loading index broken.fa.gz.cobs_cache [7 subsequences]
document[0] size 26790 31-mers 63522 : broken.fa.gz : broken
--- end of document list (1 entries) ---
documents: 1
minimum 31-mers: 63522
maximum 31-mers: 63522
average 31-mers: 63522
total 31-mers: 63522

But the contents of the two files are identical

zcat works.fa.gz | md5sum 
ef23c6df0195e5b4610c22aa30e5b70d  -
$ zcat broken.fa.gz | md5sum 
ef23c6df0195e5b4610c22aa30e5b70d  -

I guess there's something wrong with the zlib or the sequence parser of COBS (kseq from Heng).

rmcolq · 2024-08-06T16:15:30Z

Maybe bgzip defines the first block part way through the 7th sequence. But seems like the sort of error which shouldn't be silent

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

False negatives - some sort of interaction with gzip / bgzip #26

False negatives - some sort of interaction with gzip / bgzip #26

rmcolq commented Aug 6, 2024

shenwei356 commented Aug 6, 2024 •

edited

Loading

rmcolq commented Aug 6, 2024

False negatives - some sort of interaction with gzip / bgzip #26

False negatives - some sort of interaction with gzip / bgzip #26

Comments

rmcolq commented Aug 6, 2024

shenwei356 commented Aug 6, 2024 • edited Loading

rmcolq commented Aug 6, 2024

shenwei356 commented Aug 6, 2024 •

edited

Loading