Unexpected output with BytesField, question for genomics #370

Open
d-laub opened this issue Mar 15, 2024 · 0 comments

d-laub commented Mar 15, 2024

Hi there! I'm looking into utilizing FFCV for genomics applications. In the process, I tried using the BytesField with a simple dataset to familiarize myself with its behavior. Am I using the API incorrectly?

pip list | grep ffcv reports ffcv 1.0.2, but the package itself reports a different version:

>>> ffcv.__version__
'0.0.3rc1'

Minimal reproducible example (MRE)

import torch as ch
from torch.utils.data import Dataset
import numpy as np
import ffcv
from ffcv.fields import BytesField

class FooDS(Dataset):
    def __init__(self):
        self.data = np.arange(5, dtype=np.uint8)

    def __len__(self):
        return 2
        
    def __getitem__(self, idx: int):
        if idx == 0:
            return self.data[:3]
        else:
            return self.data[3:]

ds = FooDS()
writer = ffcv.DatasetWriter('foo.beton', {'bytes': BytesField()})
writer.from_indexed_dataset(ds)

loader = ffcv.Loader(
    'foo.beton',
    batch_size=1,
    num_workers=1,
    order=ffcv.loader.OrderOption.SEQUENTIAL,
    pipelines={'bytes': [BytesField().get_decoder_class()()]}
)

for batch in loader:
    print(batch)

Expected

(array([[0, 1, 2]], dtype=uint8),)
(array([[3, 4]], dtype=uint8),) # or maybe (array([[3, 4, 0]], dtype=uint8),) if the data is automatically padded

Actual

(array([[0]], dtype=uint8),)
(array([[3]], dtype=uint8),)

For more context, I'm hoping to rapidly process DNA sequences with FFCV. To dramatically reduce the on-disk footprint, I want to store variable-length genotypes with FFCV, since these are sufficient to reconstruct the much larger DNA sequences on the fly. In this setting, each instance from the dataset passed to FFCV would have two fields whose final (length) dimension varies across instances.

  • "genotypes": shape = (2, length) dtype = int8
  • "positions": shape = (length) dtype = uintp

I'm hoping I can do this by implementing a dataset that views the data as uint8 and ravels it, and then adding a transform that decodes the data back to the intended shape and dtype. This transform could also reconstruct the DNA sequences, which have uniform length across instances. Is this possible with FFCV? I would appreciate any recommendations, thank you!
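To make the idea concrete, here is a minimal NumPy-only sketch of the round trip I have in mind. The names encode_instance and decode_instance are hypothetical helpers of my own, not part of FFCV, and the sketch assumes the decoder receives exactly the bytes that were written for one instance, with no padding. If the loader pads variable-length fields, the true length would need to be stored separately.

```python
import numpy as np

# Hypothetical helpers (not FFCV API): pack one instance into flat uint8 bytes
# and unpack it again. Positions are stored first so the wider dtype starts at
# offset 0 of the buffer.

def encode_instance(genotypes: np.ndarray, positions: np.ndarray) -> np.ndarray:
    """Pack genotypes (2, length) int8 and positions (length,) uintp into 1-D uint8."""
    genotypes = np.ascontiguousarray(genotypes, dtype=np.int8)
    positions = np.ascontiguousarray(positions, dtype=np.uintp)
    if genotypes.shape != (2, positions.shape[0]):
        raise ValueError("genotypes must have shape (2, length) matching positions")
    return np.concatenate([positions.view(np.uint8), genotypes.view(np.uint8).ravel()])


def decode_instance(buf: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Recover (genotypes, positions) from an unpadded buffer made by encode_instance."""
    buf = np.ascontiguousarray(buf, dtype=np.uint8)
    pos_itemsize = np.dtype(np.uintp).itemsize
    # Each variant contributes one position plus two int8 genotype entries.
    length, remainder = divmod(buf.size, pos_itemsize + 2)
    if remainder:
        raise ValueError("buffer size is not a whole number of variants (padded?)")
    split = length * pos_itemsize
    positions = buf[:split].view(np.uintp)
    genotypes = buf[split:].view(np.int8).reshape(2, length)
    return genotypes, positions


genotypes = np.array([[0, 1, -1], [1, 0, 1]], dtype=np.int8)
positions = np.array([10, 200, 3000], dtype=np.uintp)
buf = encode_instance(genotypes, positions)
decoded_genotypes, decoded_positions = decode_instance(buf)
```

Because uintp is platform dependent, a buffer written on one machine is only guaranteed to decode correctly on a machine with the same pointer size and byte order; a fixed-width dtype such as uint64 would avoid that, at the cost of departing from the dtypes listed above.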
