
# DNA-xLSTM

This repository provides the code necessary to reproduce the experiments presented in the paper *Bio-xLSTM: Generative modeling, representation and in-context learning of biological and chemical sequences*. The code is organized across the following repositories:

- DNA-xLSTM (this repository)

## Installation

To get started, create a conda environment containing the required dependencies:

```bash
conda env create -f xlstm_dna_env.yml
```

Activate the environment:

```bash
conda activate dna_xlstm
```

Create the following directory to store saved models:

```bash
mkdir outputs
```

## Data Preparation

(Data downloading instructions are adapted from the HyenaDNA repository.)

First, download the human reference genome data. It comprises two files: one with all the sequences (the `.fasta` file) and one with the intervals we use (the `.bed` file).

The file structure should look like:

```
data
|-- hg38/
    |-- hg38.ml.fa
    |-- human-sequences.bed
```

Download the fasta (`.fa` format) file of the entire human genome into `./data/hg38`. The whole genome consists of ~24 chromosomes merged into a single file, where each chromosome is essentially one continuous sequence. Then download the `.bed` file with sequence intervals; each entry contains the chromosome name, start, end, and split, which allow the corresponding sequence to be retrieved from the fasta file (a minimal lookup sketch follows the download commands).

```bash
mkdir -p data/hg38/
curl https://ml.jku.at/research/Bio-xLSTM/downloads/DNA-xLSTM/data/hg38/hg38.ml.fa.gz > data/hg38/hg38.ml.fa.gz
curl https://ml.jku.at/research/Bio-xLSTM/downloads/DNA-xLSTM/data/hg38/human-sequences.bed > data/hg38/human-sequences.bed
gunzip data/hg38/hg38.ml.fa.gz  # unzip the fasta file
```
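
The `.bed` file only stores coordinates, so sequences are looked up in the fasta file at load time. Below is a minimal sketch of that lookup; it uses `pandas` and `pyfaidx`, which are assumptions on our side, as the repository's own dataloader may rely on different utilities and column conventions.

```python
# Sketch only: fetch one interval listed in the .bed file from the fasta.
# pandas/pyfaidx and the column names are assumptions, not the repo's loader.
import pandas as pd
from pyfaidx import Fasta

intervals = pd.read_csv(
    "data/hg38/human-sequences.bed",
    sep="\t",
    names=["chr_name", "start", "end", "split"],  # assumed column order
)
genome = Fasta("data/hg38/hg38.ml.fa")

# Pull the first training interval as an example.
row = intervals[intervals["split"] == "train"].iloc[0]
seq = str(genome[row.chr_name][int(row.start):int(row.end)])
print(row.chr_name, row.start, row.end, len(seq))
```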

## Pre-Training on the Human Genome

To pre-train a model on the human reference genome, move to the `scripts_pretrain` directory, then adapt and run the provided shell scripts. Each script details the model and training arguments for one of the supported models: xLSTM, Mamba, Caduceus, Transformer++ (Llama), and Hyena.

```
scripts_pretrain
|-- run_pretrain_xlstm.sh
|-- run_pretrain_mamba.sh
|-- run_pretrain_caduceus.sh
|-- run_pretrain_hyena.sh
|-- run_pretrain_llama.sh
```

```bash
cd scripts_pretrain
sh run_pretrain_xlstm.sh
```

Alternatively, you can launch pre-training from the command line. The following command trains a small bidirectional mLSTM with reverse complement augmentation (a minimal sketch of that augmentation follows the command). For more details on the xLSTM arguments see `scripts_pretrain/run_pretrain_xlstm.sh`.

```bash
python train.py \
  experiment=hg38/hg38 \
  callbacks.model_checkpoint_every_n_steps.every_n_train_steps=500 \
  dataset.max_length=1024 \
  dataset.batch_size=1024 \
  dataset.mlm=true \
  dataset.mlm_probability=0.15 \
  dataset.rc_aug=true \
  model=xlstm \
  model.config.d_model=128 \
  model.config.n_layer=4 \
  model.config.max_length=1024 \
  model.config.s_lstm_at=[] \
  model.config.m_qkv_proj_blocksize=4 \
  model.config.m_num_heads=4 \
  model.config.m_proj_factor=2.0 \
  model.config.m_backend="chunkwise" \
  model.config.m_chunk_size=1024 \
  model.config.m_backend_bidirectional=false \
  model.config.m_position_embeddings=false \
  model.config.bidirectional=true \
  model.config.bidirectional_alternating=false \
  model.config.rcps=false \
  optimizer.lr="8e-3" \
  train.global_batch_size=8 \
  trainer.max_steps=10000 \
  trainer.precision=bf16 \
  +trainer.val_check_interval=10000 \
  wandb=null
```
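
Reverse complement augmentation (`dataset.rc_aug=true`) randomly replaces a training sequence with its reverse complement so that the model sees both DNA strands. A minimal sketch of the idea (not the repository's exact implementation):

```python
import random

# Sketch of reverse complement augmentation; the repo's dataloader applies
# this internally when dataset.rc_aug=true.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    # Complement each base, then reverse the sequence (read the other strand).
    return seq.translate(COMPLEMENT)[::-1]

def rc_augment(seq: str, p: float = 0.5) -> str:
    # With probability p, train on the reverse complement strand instead.
    return reverse_complement(seq) if random.random() < p else seq

print(rc_augment("ACGGTTAN"))
```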

## xLSTM Model Weights

Pre-trained xLSTM model weights can be downloaded from here. Create a `checkpoints` directory in the root directory and store the downloaded weights there. For downstream fine-tuning the following directory structure is expected:

```
checkpoints
|-- context_1k
|-- context_32k
```
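
The downloaded weights are ordinary PyTorch checkpoint files, so their contents can be inspected with `torch.load` if needed; the file name below is a placeholder, and fine-tuning itself should be launched through the provided scripts, which load the checkpoints for you.

```python
import torch

# Sketch only: peek into a downloaded checkpoint. The exact file name inside
# checkpoints/context_1k/ is a placeholder; use whatever file you downloaded.
ckpt = torch.load("checkpoints/context_1k/last.ckpt", map_location="cpu")
print(ckpt.keys())                   # typically includes a state_dict entry
print(list(ckpt["state_dict"])[:5])  # first few parameter names
```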

## Downstream Fine-Tuning

We support two downstream task collections for fine-tuning pre-trained models: Genomic Benchmarks and Nucleotide Transformer datasets.

Genomic Benchmarks, introduced in Grešová et al. (2023), is a set of 8 classification tasks. The Nucleotide Transformer tasks comprise 18 classification datasets originally used in Dalla-Torre et al. (2023). The full task collection is hosted on Huggingface: `InstaDeepAI/nucleotide_transformer_downstream_tasks` (a minimal loading sketch is shown below).
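
The Nucleotide Transformer tasks can also be pulled directly from Huggingface for inspection. The task name `enhancers` and the `trust_remote_code` flag are assumptions here; the fine-tuning scripts handle dataset loading themselves.

```python
from datasets import load_dataset

# Sketch only: fetch one Nucleotide Transformer task from Huggingface.
# "enhancers" is an assumed task name; see the dataset card for all 18 tasks.
ds = load_dataset(
    "InstaDeepAI/nucleotide_transformer_downstream_tasks",
    "enhancers",
    trust_remote_code=True,
)
print(ds["train"][0])  # a dict with the DNA sequence and its class label
```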

Scripts to fine-tune pre-trained models are provided in `scripts_downstream` and can be adapted to perform hyperparameter sweeps.

```
scripts_downstream
|-- run_genomics.sh
|-- run_nucleotide.sh
```

```bash
cd scripts_downstream
```

For Genomic Benchmarks:

```bash
sh run_genomics.sh
```

For the Nucleotide Transformer tasks:

```bash
sh run_nucleotide.sh
```

## Acknowledgements

This repository is adapted from the Caduceus repository and leverages much of the training, data loading, and logging infrastructure defined there. Caduceus itself is derived from the HyenaDNA codebase, which was originally built from the S4 and Safari repositories.

## Citation

```bibtex
@article{schmidinger2024bio-xlstm,
  title={{Bio-xLSTM}: Generative modeling, representation and in-context learning of biological and chemical sequences},
  author={Niklas Schmidinger and Lisa Schneckenreiter and Philipp Seidl and Johannes Schimunek and Pieter-Jan Hoedt and Johannes Brandstetter and Andreas Mayr and Sohvi Luukkonen and Sepp Hochreiter and Günter Klambauer},
  journal={arXiv},
  doi={},
  year={2024},
  url={}
}
```