You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I've created "broken.fa.gz" by downloading 10 public genomes and compressing the file with bgzip.
I've created "works.fa.gz" by uncompressing this file and recompressing with gzip.
If I build a COBS index with the broken file then query the index with each of the 10 sequences, it finds the first 6, 95% of the 7th and ~60% of the remaining 3.
All works as expected with the second file and it finds all the kmers for each of the reference sequences.
#!/usr/bin/env python
import cobs_index as cobs
from Bio import SeqIO
import gzip
p = cobs.CompactIndexParameters()
p.term_size = 31 # k-mer size
p.clobber = True # overwrite output and temporary files
p.false_positive_rate = 0.4 # higher false positive rate -> smaller index
broken = "broken.fa.gz"
cobs.compact_construct(broken, "broken_index.cobs_compact", index_params=p)
s = cobs.Search("broken_index.cobs_compact")
with gzip.open(broken, "rt") as handle:
for record in SeqIO.parse(handle, "fasta"):
l = len(record) - 30
print(record.id, l)
r = s.search(str(record.seq), num_results=2)
for result in r:
print(result.score, result.doc_name, float(result.score)/l)
works = "works.fa.gz"
cobs.compact_construct(works, "works_index.cobs_compact", index_params=p)
s = cobs.Search("works_index.cobs_compact")
with gzip.open(works, "rt") as handle:
for record in SeqIO.parse(handle, "fasta"):
l = len(record) - 30
print(record.id, l)
r = s.search(str(record.seq), num_results=2)
for result in r:
print(result.score, result.doc_name, float(result.score)/l)
The text was updated successfully, but these errors were encountered:
Strange bug, and might be more of a user error
I've created "broken.fa.gz" by downloading 10 public genomes and compressing the file with bgzip.
I've created "works.fa.gz" by uncompressing this file and recompressing with gzip.
If I build a COBS index with the broken file then query the index with each of the 10 sequences, it finds the first 6, 95% of the 7th and ~60% of the remaining 3.
All works as expected with the second file and it finds all the kmers for each of the reference sequences.
broken.fa.gz
works.fa.gz
minimal script:
The text was updated successfully, but these errors were encountered: