- We build an end-to-end ML pipeline for regulatory genomics: data prep, tokenizer/species tags, training (BPNet and SpeciesLM), hyperparameter sweeps, evaluation, and motif interpretation with DeepLIFT → TF-MoDISco.
- A LoRA-tuned SpeciesLM/DNABERT outperforms BPNet and a frozen-LM baseline on loss, Pearson correlation, and AUROC for poly(A) site prediction.
- The pipeline yields interpretable biology: recovery of AAUAAA and its variants, UAUAUA, and CA at the cut site, with expected variability.
- Efficient and practical: adapter-based fine-tuning keeps compute and memory low while improving accuracy.
## Abstract
The genome encodes regulatory signals beyond protein-coding regions. In particular, 3′-UTR polyadenylation relies on upstream/downstream sequence elements whose diversity complicates classical alignment-based analyses. We frame poly(A) site prediction as a sequence modeling task and evaluate:
- BPNet as a strong, interpretable CNN baseline, and
- a Species-aware DNA LM (DNABERT/SpeciesLM) with (i) a shallow 1D-conv prediction head or (ii) a BPNet-style head.
We fine-tune the LM using Low-Rank Adaptation (LoRA) to efficiently adapt encoder blocks. Our best model (SpeciesLM + LoRA) achieves lower validation/test loss, higher Pearson correlation, and better AUROC than both the BPNet baseline and a frozen-LM alternative. With TF-MoDISco, we validate that learned features align with literature-reported polyadenylation elements, while noting limitations in motif spacing/order recoverability from local attributions.
## Highlights
- Models: BPNet baseline vs. SpeciesLM (DNABERT-style encoder, 12 layers, ~90M params).
- Fine-tuning: LoRA on encoder blocks → ~2.4M trainable params instead of the full 90M.
- Heads: (a) shallow 1D-conv; (b) BPNet-like dilated conv stack.
- Interpretability: DeepLIFT → TF-MoDISco recovers AAUAAA/UAUAAA/AAAAAA, UAUAUA, and CA motifs.
- Takeaway: LoRA outperforms parameter-heavy alternatives while remaining compute-friendly.
## Datasets & Preparation (Yeast focus)
- Regions centered on candidate poly(A) sites assembled from genomic counts + TIF-seq.
- Inputs standardized to 300 bp:
  - BPNet: one-hot nucleotides.
  - SpeciesLM: 6-mer tokenization + a species token ("yeast"); see the encoding sketch after this list.
- Chromosome splits:
  - Train: I–XIV, XVI
  - Val: XV
  - Test: VII
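For concreteness, here is a minimal Python sketch of the two input encodings, assuming plain string sequences; the function names and the all-zero handling of ambiguous bases are illustrative, not the repository's actual preprocessing API.

```python
# Illustrative encodings only; not the repository's actual preprocessing API.
import numpy as np

def one_hot_encode(seq: str) -> np.ndarray:
    """One-hot encode a DNA sequence for BPNet; shape (len(seq), 4)."""
    mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in mapping:              # ambiguous bases (e.g. N) stay all-zero
            out[i, mapping[base]] = 1.0
    return out

def kmer_tokenize(seq: str, k: int = 6, species: str = "yeast") -> list[str]:
    """Overlapping k-mers prefixed with a species token, as SpeciesLM expects."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return [species] + kmers

seq = "ACGT" * 75                        # inputs are standardized to 300 bp
assert one_hot_encode(seq).shape == (300, 4)
assert len(kmer_tokenize(seq)) == 1 + (300 - 6 + 1)   # species token + 295 k-mers
```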
## Models & Training
### Baseline: BPNet
- Dilated CNN with residual connections.
- Loss = profile term (multinomial NLL) + counts term (MSE); a sketch follows.
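A minimal sketch of that composite objective in PyTorch, assuming a profile head emitting per-position logits and a counts head predicting log totals; the `count_weight` term and the signatures are assumptions, not the exact training code.

```python
# Sketch of the BPNet-style objective; signatures and weighting are assumptions.
import torch
import torch.nn.functional as F

def bpnet_loss(profile_logits: torch.Tensor,   # (batch, seq_len) profile head
               pred_log_counts: torch.Tensor,  # (batch,) counts head
               true_counts: torch.Tensor,      # (batch, seq_len) observed counts
               count_weight: float = 1.0) -> torch.Tensor:
    # Profile term: multinomial NLL of observed counts under the predicted profile.
    log_probs = F.log_softmax(profile_logits, dim=-1)
    profile_nll = -(true_counts * log_probs).sum(dim=-1).mean()
    # Counts term: MSE on log(1 + total counts per region).
    true_log_counts = torch.log1p(true_counts.sum(dim=-1))
    counts_mse = F.mse_loss(pred_log_counts, true_log_counts)
    return profile_nll + count_weight * counts_mse
```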
### Language Model: SpeciesLM (DNABERT-style)
- 12-layer encoder (~90M params), k-mer tokens, bidirectional self-attention.
- Heads:
  - Shallow 1D-conv reduction (768 → 512 → 256 → 128 → prediction); see the sketch after this list.
  - BPNet-style dilated stack on top of LM embeddings.
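One plausible shape for the shallow head, sketched under assumptions: kernel size 3, ReLU activations, and a 1×1 convolution as the final per-position projection (none of these details are confirmed by the source).

```python
# Sketch of the shallow 1D-conv reduction head; kernel sizes, activations,
# and the output projection are assumptions.
import torch
import torch.nn as nn

class ShallowConvHead(nn.Module):
    def __init__(self, in_dim: int = 768, kernel_size: int = 3):
        super().__init__()
        dims = [in_dim, 512, 256, 128]               # 768 -> 512 -> 256 -> 128
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Conv1d(d_in, d_out, kernel_size, padding="same"), nn.ReLU()]
        self.convs = nn.Sequential(*layers)
        self.out = nn.Conv1d(128, 1, kernel_size=1)  # per-position prediction

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: LM embeddings, (batch, seq_len, 768) -> (batch, seq_len)
        h = self.convs(x.transpose(1, 2))            # Conv1d expects channels first
        return self.out(h).squeeze(1)
```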
### LoRA Fine-tuning
- Freeze base weights; train low-rank adapters inside attention/FFN layers.
- Benefits: small trainable footprint, stable optimization, lower memory. A minimal sketch follows.
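A minimal sketch of how the adapters could be attached with Hugging Face `peft`; the DNABERT checkpoint name, rank, and target module names are assumptions rather than the project's exact configuration.

```python
# Sketch only: checkpoint, rank, and target modules are assumptions.
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base = AutoModel.from_pretrained("zhihan1996/DNA_bert_6")  # DNABERT-style encoder (assumed)
lora_cfg = LoraConfig(
    r=8,                                 # low-rank adapter dimension
    lora_alpha=16,
    target_modules=["query", "value"],   # attention projections to adapt
    lora_dropout=0.1,
)
model = get_peft_model(base, lora_cfg)   # base weights frozen, adapters trainable
model.print_trainable_parameters()       # on the order of millions, not ~90M
```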
## Results
Test-set performance (best checkpoints by val loss):
| Model | Pearson (median) | Pearson (mean) | AUPRC | AUROC | Test loss |
|---|---|---|---|---|---|
| BPNet | 0.730 | 0.682 | 0.605 | 0.920 | 939.203 |
| SpeciesLM + LoRA | 0.809 | 0.739 | 0.640 | 0.931 | 711.484 |
| SpeciesLM (frozen) | 0.771 | 0.703 | 0.623 | 0.926 | 844.048 |
Key observation: LoRA delivers the strongest metrics across the board and smooth optimization without unfreezing the full model. (The metric computation is sketched below.)
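For reference, a sketch of how the table's metrics could be computed; array names are illustrative, and the aggregation (per-example Pearson, pooled AUROC/AUPRC) is an assumption rather than the repository's exact evaluation code.

```python
# Metric computation sketch; input names and aggregation are assumptions.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import average_precision_score, roc_auc_score

def profile_pearson(preds: np.ndarray, trues: np.ndarray) -> tuple[float, float]:
    """Median and mean per-example Pearson r over (n_examples, seq_len) arrays."""
    rs = np.array([pearsonr(p, t)[0] for p, t in zip(preds, trues)])
    return float(np.median(rs)), float(np.mean(rs))

def site_classification(scores: np.ndarray, labels: np.ndarray) -> tuple[float, float]:
    """AUROC and AUPRC for binary poly(A)-site calls."""
    return roc_auc_score(labels, scores), average_precision_score(labels, scores)
```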
## Motif Discovery (TF-MoDISco)
- Poly(A) site signals: a clear CA dinucleotide with a downstream A-stretch and flanking Ts.
- Positioning elements: AAUAAA, UAUAAA, occasionally AAAAAA variants.
- Efficiency element (yeast): TA-rich patterns consistent with UAUAUA.
- Caveat: local attributions limit conclusions about global spacing/order between motifs; the attribution step itself is sketched below.
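A hedged sketch of the attribution step feeding TF-MoDISco, using Captum's `DeepLift` on one-hot inputs. It assumes the model exposes a scalar output per sequence (e.g., total predicted counts), and the tfmodisco-lite handoff in the trailing comments is one possible wiring, not necessarily the project's.

```python
# Attribution sketch; assumes a scalar model output and illustrative file names.
import numpy as np
import torch
from captum.attr import DeepLift

def contribution_scores(model: torch.nn.Module, one_hot: torch.Tensor) -> np.ndarray:
    """one_hot: (batch, seq_len, 4) float tensor; returns per-base contributions."""
    dl = DeepLift(model)
    baseline = torch.zeros_like(one_hot)            # all-zero reference sequence
    attr = dl.attribute(one_hot, baselines=baseline)
    return (attr * one_hot).detach().cpu().numpy()  # keep only observed bases

# Hand-off to tfmodisco-lite (assumed CLI usage):
#   np.savez("sequences.npz", ohe);  np.savez("attributions.npz", scores)
#   modisco motifs -s sequences.npz -a attributions.npz -o modisco_results.h5
```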
## Reproducibility
Clone the repository and install dependencies:
```bash
git clone https://github.com/<your-org>/<your-repo>.git
cd <your-repo>
```
### 1) Environment
```bash
# create and activate the conda env
conda env create -f environment.yml
conda activate regulate-me
```
If using GPU, install a PyTorch build matching your CUDA version.
### 2) Data
```bash
# place your dataset here
mv saccharomyces_cerevisiae ./data
# one-time preprocessing
python scripts/preprocess_data.py
```
### 3) Training (Hydra configs)
```bash
# BPNet baseline
python scripts/train_model.py --config-name config
# SpeciesLM + shallow conv head (LoRA on by default)
python scripts/train_model.py --config-name lora
# SpeciesLM + BPNet-style head
python scripts/train_model.py --config-name llm_bpnet
```
Toggle LoRA (example):
```bash
python scripts/train_model.py --config-name lora model.use_lora=False
```
### 4) Sweeps (W&B)
```bash
wandb login
python scripts/run_sweep.py --config-name llm_bpnet
```
## Contributions
- MR: biological research & interpretation; wrote the polyadenylation background.
- PN: model design/implementation/training; attribution & MoDISco pipeline; methods & results.
- HA: results interpretation; model background; parts of the discussion; feedback.
- SZ: interpretation pipeline (DeepLIFT & TF-MoDISco) implementation.
## Acknowledgements
This project was conducted in ML4RG (SS24) at TUM. We thank our instructors and peers for feedback and support.