This repository provides the code necessary to reproduce the experiments presented in the paper Bio-xLSTM: Generative modeling, representation and in-context learning of biological and chemical sequences. The code is organized across the following repositories:
- DNA-xLSTM (current repository)
- Prot-xLSTM
- Chem-xLSTM
To get started, create a conda environment containing the required dependencies.
conda env create -f xlstm_dna_env.yml
Activate the environment.
conda activate dna_xlstm
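To verify the installation, you can check that PyTorch is available and sees a GPU. This is a quick sanity check, assuming the environment ships PyTorch with CUDA support, which pre-training requires:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"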
Create the following directory to store saved models:
mkdir outputs
(Data downloading instructions are adapted from the HyenaDNA repository.)
First, download the Human Reference Genome data.
It comprises two files: one with all the sequences (the .fasta file) and one with the intervals we use (the .bed file).
The file structure should look like
data
|-- hg38/
    |-- hg38.ml.fa
    |-- human-sequences.bed
Download the fasta file (.fa format) of the entire human genome into ./data/hg38. It contains the ~24 chromosomes of the whole genome merged into one file, where each chromosome is a single continuous sequence.
Then download the .bed file with sequence intervals; it contains the chromosome name, start, end, and split of each interval, which allow you to retrieve the corresponding sequences from the fasta file.
mkdir -p data/hg38/
curl https://ml.jku.at/research/Bio-xLSTM/downloads/DNA-xLSTM/data/hg38/hg38.ml.fa.gz > data/hg38/hg38.ml.fa.gz
curl https://ml.jku.at/research/Bio-xLSTM/downloads/DNA-xLSTM/data/hg38/human-sequences.bed > data/hg38/human-sequences.bed
gunzip data/hg38/hg38.ml.fa.gz # unzip the fasta file
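To sanity-check the download, you can count the sequence records in the fasta file (one per chromosome, so roughly 24) and inspect the first interval records in the .bed file:
grep -c "^>" data/hg38/hg38.ml.fa  # number of sequence records in the fasta file
head -n 3 data/hg38/human-sequences.bed  # chromosome name, start, end, split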
To pre-train a model on the human reference genome, move to the scripts_pretrain directory, then adapt and run the provided shell scripts. The scripts detail the model and training arguments for all supported models: xLSTM, Mamba, Caduceus, Transformer++ (Llama), and Hyena.
scripts_pretrain
|-- run_pretrain_xlstm.sh
|-- run_pretrain_mamba.sh
|-- run_pretrain_caduceus.sh
|-- run_pretrain_hyena.sh
|-- run_pretrain_llama.sh
cd scripts_pretrain
sh run_pretrain_xlstm.sh
Alternatively, you can launch pre-training from the command line. The following command trains a small bidirectional mLSTM with reverse-complement augmentation. For more details on the xLSTM arguments, see scripts_pretrain/run_pretrain_xlstm.sh.
python train.py \
experiment=hg38/hg38 \
callbacks.model_checkpoint_every_n_steps.every_n_train_steps=500 \
dataset.max_length=1024 \
dataset.batch_size=1024 \
dataset.mlm=true \
dataset.mlm_probability=0.15 \
dataset.rc_aug=true \
model=xlstm \
model.config.d_model=128 \
model.config.n_layer=4 \
model.config.max_length=1024 \
model.config.s_lstm_at=[] \
model.config.m_qkv_proj_blocksize=4 \
model.config.m_num_heads=4 \
model.config.m_proj_factor=2.0 \
model.config.m_backend="chunkwise" \
model.config.m_chunk_size=1024 \
model.config.m_backend_bidirectional=false \
model.config.m_position_embeddings=false \
model.config.bidirectional=true \
model.config.bidirectional_alternating=false \
model.config.rcps=false \
optimizer.lr="8e-3" \
train.global_batch_size=8 \
trainer.max_steps=10000 \
trainer.precision=bf16 \
+trainer.val_check_interval=10000 \
wandb=null
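The command above uses a masked-language-modeling objective (dataset.mlm=true) with a bidirectional model. As a sketch of the causal (next-token prediction) setting, the same flags can be flipped; this particular flag combination is an assumption based on the arguments shown above, so consult scripts_pretrain/run_pretrain_xlstm.sh for the exact configurations used in the paper:
python train.py \
experiment=hg38/hg38 \
dataset.max_length=1024 \
dataset.batch_size=1024 \
dataset.mlm=false \
dataset.rc_aug=true \
model=xlstm \
model.config.d_model=128 \
model.config.n_layer=4 \
model.config.max_length=1024 \
model.config.bidirectional=false \
optimizer.lr="8e-3" \
trainer.max_steps=10000 \
trainer.precision=bf16 \
wandb=null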
Pre-trained xLSTM model weights can be downloaded from here. Create a checkpoints directory in the root directory and store the downloaded weights there. For downstream fine-tuning, the following directory structure is expected:
checkpoints
|-- context_1k
|-- context_32k
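For example, create the expected directories before placing the weights (the checkpoint filenames inside them depend on the downloaded files):
mkdir -p checkpoints/context_1k checkpoints/context_32k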
We support two downstream task collections for fine-tuning pre-trained models: Genomic Benchmarks and Nucleotide Transformer datasets.
Genomic Benchmarks, introduced in Grešová et al. (2023), is a set of 8 classification tasks. The Nucleotide Transformer tasks comprise 18 classification datasets that were originally used in Dalla-Torre et al. (2023). The full task collection is hosted on Huggingface: InstaDeepAI/nucleotide_transformer_downstream_tasks.
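The Nucleotide Transformer tasks can also be inspected directly with the Huggingface datasets library. A minimal sketch, assuming datasets is installed; enhancers is one example task configuration, and recent datasets versions may require trust_remote_code=True for this script-based dataset:
python -c "from datasets import load_dataset; print(load_dataset('InstaDeepAI/nucleotide_transformer_downstream_tasks', 'enhancers', trust_remote_code=True))"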
Scripts to fine-tune pre-trained models are provided in the scripts_downstream directory and can be adapted to perform hyperparameter sweeps.
scripts_downstream
|-- run_genomics.sh
|-- run_nucleotide.sh
cd scripts_downstream
For Genomic Benchmarks:
sh run_genomics.sh
For Nucleotide Transformer tasks:
sh run_nucleotide.sh
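One way to adapt a script for a hyperparameter sweep is a simple shell loop. This is a hypothetical sketch: it assumes you expose the learning rate through an environment variable (here LR) inside run_genomics.sh; adapt it to however the script actually passes its arguments:
# hypothetical: assumes run_genomics.sh reads the LR environment variable
for lr in 1e-3 3e-4 1e-4; do
  LR="$lr" sh run_genomics.sh
done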
This repository is adapted from the Caduceus repository and leverages much of the training, data loading, and logging infrastructure defined there. Caduceus itself is derived from the HyenaDNA codebase, which was originally built from the S4 and Safari repositories.
@article{schmidinger2024bio-xlstm,
title={{Bio-xLSTM}: Generative modeling, representation and in-context learning of biological and chemical sequences},
author={Niklas Schmidinger and Lisa Schneckenreiter and Philipp Seidl and Johannes Schimunek and Pieter-Jan Hoedt and Johannes Brandstetter and Andreas Mayr and Sohvi Luukkonen and Sepp Hochreiter and Günter Klambauer},
journal={arXiv},
doi={},
year={2024},
url={}
}