🧬 Polyadenylation Site Prediction with DNA LMs

Species-aware DNA language models (DNABERT/SpeciesLM) fine-tuned with LoRA to localize 3′-UTR poly(A) sites at near base-pair resolution, benchmarked against BPNet and interpreted with TF-MoDISco.
Categories: genomics, llms, fine-tuning, TUM
Authors

Hanad Abdullahi

Peter Nutter

Måns Rosenbaum

Shirley Zhang

Published

Monday, July 1, 2024

Model Overview
TL;DR
  • We build an end-to-end ML pipeline for regulatory genomics: data prep, tokenizer/species tags, training (BPNet and SpeciesLM), hyperparameter sweeps, evaluation, and motif interpretation with DeepLIFT → TF-MoDISco.
  • LoRA-tuned SpeciesLM/DNABERT outperforms both BPNet and a frozen-LM baseline on loss, Pearson correlation, and AUROC for poly(A) site prediction.
  • The pipeline yields interpretable biology: recovery of AAUAAA/variants, UAUAUA, and CA at the cut site, with expected variability.
  • Efficient and practical: adapter-based fine-tuning keeps compute and memory low while improving accuracy.

📚 Abstract

The genome encodes regulatory signals beyond protein-coding regions. In particular, 3′-UTR polyadenylation relies on upstream/downstream sequence elements whose diversity complicates classical alignment-based analyses. We frame poly(A) site prediction as a sequence modeling task and evaluate:

  • BPNet as a strong, interpretable CNN baseline, and
  • a Species-aware DNA LM (DNABERT/SpeciesLM) with (i) a shallow 1D-conv prediction head or (ii) a BPNet-style head.

We fine-tune the LM using Low-Rank Adaptation (LoRA) to efficiently adapt encoder blocks. Our best model (SpeciesLM + LoRA) achieves lower validation/test loss, higher Pearson correlation, and better AUROC than both the BPNet baseline and a frozen-LM alternative. With TF-MoDISco, we validate that learned features align with literature-reported polyadenylation elements, while noting limitations in motif spacing/order recoverability from local attributions.

🚀 Highlights

  • Models: BPNet baseline vs. SpeciesLM (DNABERT-style encoder, 12 layers, ~90M params).
  • Fine-tuning: LoRA on encoder blocks → ~2.4M trainable params instead of the full ~90M.
  • Heads: (a) shallow 1D-conv; (b) BPNet-like dilated conv stack.
  • Interpretability: DeepLIFT → TF-MoDISco recovers AAUAAA/UAUAAA/AAAAAA, UAUAUA, and CA motifs.
  • Takeaway: LoRA outperforms parameter-heavy alternatives while remaining compute-friendly.

🧪 Datasets & Preparation (Yeast focus)

  • Regions centered on candidate poly(A) sites assembled from genomic counts + TIF-seq.
  • Inputs standardized to 300 bp (see the encoding sketch after this list):
    • BPNet: one-hot nucleotides.
    • SpeciesLM: 6-mer tokenization + species token ("yeast").
  • Chromosome splits:
    • Train: I–XIV, XVI
    • Val: XV
    • Test: VII
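
The two encodings can be sketched roughly as follows (a minimal illustration; the function names and the exact species-token convention are assumptions, not the project's code):

import numpy as np

def one_hot_encode(seq: str) -> np.ndarray:
    # One-hot encode a DNA sequence (A, C, G, T) into an (L, 4) array for BPNet.
    mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
    onehot = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in mapping:  # ambiguous bases (e.g. N) stay all-zero
            onehot[i, mapping[base]] = 1.0
    return onehot

def kmer_tokenize(seq: str, k: int = 6, species: str = "yeast") -> list:
    # Overlapping k-mer tokens prefixed with a species token for SpeciesLM.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return [species] + kmers

seq = "ACGT" * 75                # a 300 bp dummy sequence
x_bpnet = one_hot_encode(seq)    # shape (300, 4)
tokens = kmer_tokenize(seq)      # ["yeast", "ACGTAC", "CGTACG", ...]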

🧠 Models & Training

Baseline – BPNet

  • Dilated CNN with residual connections.
  • Loss = profile (Multinomial NLL) + counts (MSE).
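
A rough PyTorch sketch of this composite loss (the count weighting and the log1p transform on counts are assumptions):

import torch
import torch.nn.functional as F

def bpnet_loss(profile_logits, counts_pred, profile_true, counts_true, count_weight=1.0):
    # Profile term: multinomial NLL over positions, weighted by observed counts.
    log_probs = F.log_softmax(profile_logits, dim=-1)          # (batch, length)
    profile_nll = -(profile_true * log_probs).sum(dim=-1).mean()
    # Count term: MSE on log(1 + total counts); the log transform is an assumption.
    counts_mse = F.mse_loss(counts_pred, torch.log1p(counts_true))
    return profile_nll + count_weight * counts_mse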

Language Model – SpeciesLM (DNABERT-style)

  • 12-layer encoder (~90M params), k-mer tokens, bidirectional self-attention.
  • Heads:
    1. Shallow 1D-conv reduction (768 → 512 → 256 → 128 → prediction)
    2. BPNet-style dilated stack on top of LM embeddings
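
A minimal sketch of the shallow conv head (kernel sizes and the single-channel output are assumptions):

import torch.nn as nn

class ConvHead(nn.Module):
    # Reduces per-token LM embeddings (dim 768) to a per-position poly(A) signal.
    def __init__(self, in_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, 512, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(512, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 1, kernel_size=1),                 # per-position prediction
        )

    def forward(self, embeddings):               # embeddings: (batch, seq_len, 768)
        x = embeddings.transpose(1, 2)           # -> (batch, 768, seq_len)
        return self.net(x).squeeze(1)            # -> (batch, seq_len)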

LoRA Fine-tuning

  • Freeze base weights; train low-rank adapters inside attention/FFN layers.
  • Benefits: small trainable footprint, stable optimization, lower memory.
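
In Hugging Face PEFT terms, the setup looks roughly like this (checkpoint path, rank, and target module names are assumptions that depend on the SpeciesLM checkpoint):

from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base = AutoModel.from_pretrained("path/to/species-lm")  # placeholder checkpoint
lora_cfg = LoraConfig(
    r=8,                                  # low-rank dimension (assumed)
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],    # attention projections to adapt (assumed names)
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()        # only the adapters are trainable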

📊 Results

Test-set performance (best checkpoints by val loss):

Model                 Pearson (Median)   Pearson (Mean)   AUPRC   AUROC   Test Loss
BPNet                 0.730              0.682            0.605   0.920   939.203
SpeciesLM + LoRA      0.809              0.739            0.640   0.931   711.484
SpeciesLM (frozen)    0.771              0.703            0.623   0.926   844.048

Key observation: LoRA delivers the strongest overall metrics and smooth optimization without full-model unfreezing.
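
For reference, per-sequence Pearson and the position-level classification metrics can be computed along these lines (a sketch; the binarization threshold and aggregation are assumptions):

import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate(pred: np.ndarray, true: np.ndarray, site_threshold: float = 0.0) -> dict:
    # pred/true: (n_sequences, seq_len) arrays of predicted and observed poly(A) signal.
    pearsons = [pearsonr(p, t)[0] for p, t in zip(pred, true)]
    labels = (true.ravel() > site_threshold).astype(int)   # binarize positions as sites
    return {
        "pearson_median": float(np.median(pearsons)),
        "pearson_mean": float(np.mean(pearsons)),
        "auroc": roc_auc_score(labels, pred.ravel()),
        "auprc": average_precision_score(labels, pred.ravel()),
    }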

πŸ” Motif Discovery (TF-MoDISco)

  • Poly(A) site signals: clear CA dinucleotide with downstream A-stretch and flanking T's.
  • Positioning elements: AAUAAA, UAUAAA, occasionally AAAAAA variants.
  • Efficiency element (yeast): TA-rich patterns consistent with UAUAUA.
  • Caveat: Local attributions limit conclusions about global spacing/order between motifs.
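
The attributions fed into TF-MoDISco can be produced roughly as follows with Captum's DeepLIFT (a sketch; it assumes the model outputs a scalar score per sequence, and the helper name is illustrative):

import numpy as np
import torch
from captum.attr import DeepLift

def attribute_sequences(model: torch.nn.Module, inputs: torch.Tensor, out_path: str):
    # inputs: one-hot sequences (batch, 4, 300); model output: one scalar per sequence.
    attributer = DeepLift(model)
    baseline = torch.zeros_like(inputs)                 # all-zero reference sequence
    attributions = attributer.attribute(inputs, baselines=baseline)
    # Save contributions alongside the one-hot sequences for TF-MoDISco motif discovery.
    np.savez(out_path,
             contribs=attributions.detach().cpu().numpy(),
             one_hot=inputs.cpu().numpy())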

🧰 Reproducibility

Clone the repository and install dependencies:

git clone https://github.com/<your-org>/<your-repo>.git
cd <your-repo>

1) Environment

# create and activate conda env
conda env create -f environment.yml
conda activate regulate-me

If using GPU, install a PyTorch build matching your CUDA version.

2) Data

# place your dataset here
mv saccharomyces_cerevisiae ./data

# one-time preprocessing
python scripts/preprocess_data.py

3) Training (Hydra configs)

# BPNet baseline
python scripts/train_model.py --config-name config

# SpeciesLM + shallow conv head (LoRA on by default)
python scripts/train_model.py --config-name lora

# SpeciesLM + BPNet-style head
python scripts/train_model.py --config-name llm_bpnet

Toggle LoRA (example):

python scripts/train_model.py --config-name lora model.use_lora=False
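
Dotted overrides like model.use_lora=False map onto fields of the Hydra config; the entry point looks roughly like this (a sketch, not the repository's actual script; field names are assumptions):

import hydra
from omegaconf import DictConfig

@hydra.main(config_path="../configs", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # cfg mirrors the YAML config; dotted CLI overrides change these fields.
    print(f"use_lora: {cfg.model.use_lora}")
    # ... build the model and run training here ...

if __name__ == "__main__":
    main()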

4) Sweeps (W&B)

wandb login
python scripts/run_sweep.py --config-name llm_bpnet

🧑‍🤝‍🧑 Contributions

  • MR – Biological research & interpretation; wrote polyadenylation background.
  • PN – Model design/implementation/training; attribution & MoDISco pipeline; methods & results.
  • HA – Results interpretation; model background; parts of discussion; feedback.
  • SZ – Interpretation pipeline (DeepLIFT & TF-MoDISco) implementation.

🙌 Acknowledgements

This project was conducted in ML4RG (SS24) at TUM. We thank our instructors and peers for feedback and support.