This repository provides the code necessary to reproduce the experiments presented in the paper Bio-xLSTM: Generative modeling, representation and in-context learning of biological and chemical sequences. The code is organized across the following repositories:
- DNA-xLSTM (current repository)
- Prot-xLSTM
- Chem-xLSTM
To get started, create a conda environment containing the required dependencies.
conda env create -f xlstm_dna_env.yml
Activate the environment.
conda activate dna_xlstm
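To verify the installation, you can check that PyTorch is available and sees a GPU. This is a quick sanity check, assuming the environment ships PyTorch with CUDA support, which pre-training requires:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"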
Create the following directory to store saved models:
mkdir outputs
(Data downloading instructions are adapted from the HyenaDNA repository.)
First, download the Human Reference Genome data.
It comprises two files: one with all the sequences (the .fasta file) and one with the intervals we use (the .bed file).
The file structure should look like
data
|-- hg38/
    |-- hg38.ml.fa
    |-- human-sequences.bed
Download the fasta file (.fa format) of the entire human genome into ./data/hg38. It contains the ~24 chromosomes of the whole genome merged into one file, where each chromosome is a single continuous sequence.
Then download the .bed file with sequence intervals; it contains the chromosome name, start, end, and split of each interval, which allow you to retrieve the corresponding sequences from the fasta file.
mkdir -p data/hg38/
curl https://ml.jku.at/research/Bio-xLSTM/downloads/DNA-xLSTM/data/hg38/hg38.ml.fa.gz > data/hg38/hg38.ml.fa.gz
curl https://ml.jku.at/research/Bio-xLSTM/downloads/DNA-xLSTM/data/hg38/human-sequences.bed > data/hg38/human-sequences.bed
gunzip data/hg38/hg38.ml.fa.gz # unzip the fasta file
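To sanity-check the download, you can count the sequence records in the fasta file (one per chromosome, so roughly 24) and inspect the first interval records in the .bed file:
grep -c "^>" data/hg38/hg38.ml.fa  # number of sequence records in the fasta file
head -n 3 data/hg38/human-sequences.bed  # chromosome name, start, end, split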
To pre-train a model on the human reference genome, move to the scripts_pretrain directory, then adapt and run the provided shell scripts. The scripts detail the model and training arguments for all supported models: xLSTM, Mamba, Caduceus, Transformer++ (Llama), and Hyena.
scripts_pretrain
|-- run_pretrain_xlstm.sh
|-- run_pretrain_mamba.sh
|-- run_pretrain_caduceus.sh
|-- run_pretrain_hyena.sh
|-- run_pretrain_llama.sh
cd scripts_pretrain
sh run_pretrain_xlstm.sh
Alternatively, you can launch pre-training from the command line. The following command trains a small bidirectional mLSTM with reverse-complement augmentation. For more details on the xLSTM arguments, see scripts_pretrain/run_pretrain_xlstm.sh.
python train.py \
experiment=hg38/hg38 \
callbacks.model_checkpoint_every_n_steps.every_n_train_steps=500 \
dataset.max_length=1024 \
dataset.batch_size=1024 \
dataset.mlm=true \
dataset.mlm_probability=0.15 \
dataset.rc_aug=true \
model=xlstm \
model.config.d_model=128 \
model.config.n_layer=4 \
model.config.max_length=1024 \
model.config.s_lstm_at=[] \
model.config.m_qkv_proj_blocksize=4 \
model.config.m_num_heads=4 \
model.config.m_proj_factor=2.0 \
model.config.m_backend="chunkwise" \
model.config.m_chunk_size=1024 \
model.config.m_backend_bidirectional=false \
model.config.m_position_embeddings=false \
model.config.bidirectional=true \
model.config.bidirectional_alternating=false \
model.config.rcps=false \
optimizer.lr="8e-3" \
train.global_batch_size=8 \
trainer.max_steps=10000 \
trainer.precision=bf16 \
+trainer.val_check_interval=10000 \
wandb=null
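The command above uses a masked-language-modeling objective (dataset.mlm=true) with a bidirectional model. As a sketch of the causal (next-token prediction) setting, the same flags can be flipped; this particular flag combination is an assumption based on the arguments shown above, so consult scripts_pretrain/run_pretrain_xlstm.sh for the exact configurations used in the paper:
python train.py \
experiment=hg38/hg38 \
dataset.max_length=1024 \
dataset.batch_size=1024 \
dataset.mlm=false \
dataset.rc_aug=true \
model=xlstm \
model.config.d_model=128 \
model.config.n_layer=4 \
model.config.max_length=1024 \
model.config.bidirectional=false \
optimizer.lr="8e-3" \
trainer.max_steps=10000 \
trainer.precision=bf16 \
wandb=null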
Pre-trained xLSTM model weights can be downloaded from here. Create a checkpoints directory in the root directory and store the downloaded weights there. For downstream fine-tuning, the following directory structure is expected:
checkpoints
|-- context_1k
|-- context_32k
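For example, create the expected directories before placing the weights (the checkpoint filenames inside them depend on the downloaded files):
mkdir -p checkpoints/context_1k checkpoints/context_32k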
We support two downstream task collections for fine-tuning pre-trained models: Genomic Benchmarks and Nucleotide Transformer datasets.
Genomic Benchmarks, introduced in Grešová et al. (2023), is a set of 8 classification tasks. The Nucleotide Transformer tasks comprise 18 classification datasets that were originally used in Dalla-Torre et al. (2023). The full task collection is hosted on Huggingface: InstaDeepAI/nucleotide_transformer_downstream_tasks.
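The Nucleotide Transformer tasks can also be inspected directly with the Huggingface datasets library. A minimal sketch, assuming datasets is installed; enhancers is one example task configuration, and recent datasets versions may require trust_remote_code=True for this script-based dataset:
python -c "from datasets import load_dataset; print(load_dataset('InstaDeepAI/nucleotide_transformer_downstream_tasks', 'enhancers', trust_remote_code=True))"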
Scripts to fine-tune pre-trained models are provided in the scripts_downstream directory and can be adapted to perform hyperparameter sweeps.
scripts_downstream
|-- run_genomics.sh
|-- run_nucleotide.sh
cd scripts_downstream
For Genomic Benchmarks:
sh run_genomics.sh
For Nucleotide Transformer tasks:
sh run_nucleotide.sh
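One way to adapt a script for a hyperparameter sweep is a simple shell loop. This is a hypothetical sketch: it assumes you expose the learning rate through an environment variable (here LR) inside run_genomics.sh; adapt it to however the script actually passes its arguments:
# hypothetical: assumes run_genomics.sh reads the LR environment variable
for lr in 1e-3 3e-4 1e-4; do
  LR="$lr" sh run_genomics.sh
done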
This repository is adapted from the Caduceus repository and leverages much of the training, data loading, and logging infrastructure defined there. Caduceus itself is derived from the HyenaDNA codebase, which was originally built from the S4 and Safari repositories.
@article{schmidinger2024bio-xlstm,
title={{Bio-xLSTM}: Generative modeling, representation and in-context learning of biological and chemical sequences},
author={Niklas Schmidinger and Lisa Schneckenreiter and Philipp Seidl and Johannes Schimunek and Pieter-Jan Hoedt and Johannes Brandstetter and Andreas Mayr and Sohvi Luukkonen and Sepp Hochreiter and Günter Klambauer},
journal={arXiv},
doi={},
year={2024},
url={}
}