🔬 Predicting Kidney Transplant Survival with Machine Learning

Czech Technical University in Prague — Faculty of Nuclear Sciences and Physical Engineering

Note

Author: Peter Nutter
Supervisor: Ing. Tomáš Kouřim (Mild Blue, s.r.o.)
Consultant: Ing. Pavel Strachota, Ph.D. (Department of Mathematics, FJFI ČVUT)
Academic year: 2022/2023
Submitted: Prague, August 2, 2023

TL;DR

I built and compared several machine-learning models to predict kidney graft survival after transplantation. Using large-scale UNOS data (USA) and supporting analyses with IKEM data (Czech Republic), my best model, a DeepSurv neural network, reached a C-index of about 0.665. That is on par with the current literature and better than classical approaches such as linear Cox regression and Random Survival Forests. The results point toward better donor–recipient matching and the possibility of richer scoring systems than those used today.

Warning

Looking back, this now seems like a straightforward project that I could probably finish in a tenth of the time. Still, going from studying mathematics to teaching myself data science, ML, and survival analysis from scratch was both rewarding and, at times, frustrating.

Why this matters

Chronic kidney disease affects more than 10% of the global population. Kidney transplantation saves lives, but getting the right organ to the right recipient at the right time is hard. Allocation currently relies on rule-based scores (e.g., KDPI/EPTS in the U.S.). With modern data, we can learn from past outcomes and predict graft longevity more directly, which can inform both policy and bedside decisions.

Data at a glance

UNOS (USA): 1M+ historical transplant records; filtered to 326,440 kidney transplants (2000–2022) after strict inclusion/cleaning.
IKEM (CZ): smaller living-donor dataset used for descriptive comparison and context.
Engineered features included dialysis duration (DIAL_LEN) and a clinically informed primary diagnosis grouping (DIAG_KI).
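
As a rough illustration of a derived feature like DIAL_LEN, the sketch below computes days on dialysis before transplant from two dates. It is not the thesis code: the function name, the argument names, and the choice to return None for missing or inconsistent dates are assumptions made for this example.

```python
from datetime import date
from typing import Optional


def dialysis_length_days(dialysis_start: Optional[date],
                         transplant: Optional[date]) -> Optional[int]:
    """Days between dialysis start and transplant (hypothetical helper).

    Returns None when either date is missing or when dialysis is recorded
    as starting after the transplant. Both rules are assumptions here;
    a real pipeline would need a clinically agreed convention.
    """
    if dialysis_start is None or transplant is None:
        return None
    days = (transplant - dialysis_start).days
    if days < 0:
        return None  # inconsistent record, left for downstream handling
    return days


# Example: one year on dialysis before transplant (2010 is not a leap year).
example = dialysis_length_days(date(2010, 1, 1), date(2011, 1, 1))
```

How missing values are treated matters: a missing dialysis date may mean a pre-emptive transplant or simply an incomplete record, and the two should not be conflated without checking the data dictionary.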

Methods

I framed graft failure as a time-to-event problem with censoring and evaluated models using survival-analysis metrics.
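
To make the role of censoring concrete, here is a minimal Kaplan-Meier sketch in plain Python. It is an illustration of the time-to-event framing, not the thesis pipeline (which relied on the libraries listed under Tooling). The function name and the toy data are made up for the example.

```python
def kaplan_meier(durations, events):
    """Kaplan-Meier survival estimate.

    durations: follow-up time per transplant
    events:    1 if graft failure was observed, 0 if censored
    Returns a list of (time, survival_probability) at each failure time.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        failures = 0
        leaving = 0
        # Group all records that share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            failures += int(data[i][1])
            leaving += 1
            i += 1
        if failures > 0:
            survival *= 1.0 - failures / at_risk
            curve.append((t, survival))
        # Censored records leave the risk set without lowering the curve.
        at_risk -= leaving
    return curve


# Toy data: five grafts, two of them censored (event = 0).
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```

Censored grafts still contribute information: they count as "at risk" up to their last follow-up, which is why simply dropping them or treating them as failures would bias the estimate.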

Models

Parametric: Exponential, Weibull, Gompertz
Semi-parametric: Cox PH (regularized and classical)
Neural: DeepSurv (non-linear Cox via MLP)
Ensemble: Random Survival Forests (RSF)
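
Cox PH and DeepSurv share the same training objective, the Cox partial likelihood; they differ in how the risk score is produced (a linear combination of covariates versus the output of an MLP). The sketch below computes the negative log partial likelihood for given risk scores in plain Python. It is a readable O(n²) illustration with Breslow-style handling of tied times, not the implementation used in the thesis, and the function name is an assumption.

```python
import math


def cox_neg_log_partial_likelihood(risk_scores, durations, events):
    """Negative Cox log partial likelihood.

    risk_scores: log-risk per subject (linear predictor or network output)
    durations:   follow-up times
    events:      1 if failure observed, 0 if censored
    """
    loss = 0.0
    for eta_i, t_i, e_i in zip(risk_scores, durations, events):
        if not e_i:
            continue  # censored subjects only appear inside risk sets
        # Risk set: everyone still under observation at time t_i.
        log_risk_set = math.log(sum(
            math.exp(eta_j)
            for eta_j, t_j in zip(risk_scores, durations)
            if t_j >= t_i
        ))
        loss -= eta_i - log_risk_set
    return loss


# Scores that rank early failures as high risk give a lower loss.
good = cox_neg_log_partial_likelihood([2.0, 1.0, 0.0], [1, 2, 3], [1, 1, 1])
bad = cox_neg_log_partial_likelihood([0.0, 1.0, 2.0], [1, 2, 3], [1, 1, 1])
```

In a neural setting the same quantity would be written with tensor operations so that gradients flow back through the network; the arithmetic is unchanged.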

Tooling

Python stack: scikit-survival, lifelines, PySurvival, PyTorch
Training & tuning on the Helios cluster (CTU)

Metrics

Harrell’s C-index (primary), IPCW C-index, Integrated Brier Score (IBS), and time-dependent AUC
Evaluation on a held-out test set only
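
For readers unfamiliar with the primary metric, the following is a minimal sketch of Harrell's C-index in plain Python. It is a simplified illustration: pairs with tied follow-up times are skipped, and the function name is an assumption. Library implementations handle ties and large datasets more carefully.

```python
def harrell_c_index(risk_scores, durations, events):
    """Fraction of comparable pairs that the model ranks correctly.

    A pair (i, j) is comparable when subject i had an observed failure
    and a strictly shorter follow-up time than subject j. The pair is
    concordant when subject i also has the higher predicted risk.
    Tied risk scores count as half a concordant pair.
    """
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier failure
        for j in range(n):
            if durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: the third subject is censored.
times = [1, 2, 3, 4]
observed = [1, 1, 0, 1]
perfect = harrell_c_index([4, 3, 2, 1], times, observed)
random_like = harrell_c_index([1, 1, 1, 1], times, observed)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the scores reported below in context.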

Key results

DeepSurv was best overall: C-index ≈ 0.665 for graft survival; also the top performer for patient-mortality prediction.
Cox PH (well-tuned) and Gompertz followed close behind (C-index ≈ 0.65).
RSF performed competitively but needed far more compute and much longer training for limited gains.

Takeaway: With careful feature engineering and tuning, neural survival models provide a measurable edge (C-index ≈ 0.665 versus ≈ 0.65 for the best classical models) without giving up interpretability entirely.

What the models learned (feature insights)

Using linear Cox for interpretability:

Recipient age (↑) and donor age (↑) increase hazard (shorter graft survival).
Kidney Donor Risk Index (KDRI) (↑) behaves as expected: higher risk, worse survival.
Dialysis duration before transplant emerged as a strong predictor, stronger than raw waiting time.
Certain diagnosis groups (my 8-category consolidation) and diabetes status carry meaningful signal.
Some ethnicity categories showed differences in predicted survival; this is predictively relevant but ethically sensitive for deployment and requires policy guidance.
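
The reason a linear Cox model is easy to read is that each coefficient maps directly to a hazard ratio: a covariate change of Δ multiplies the hazard by exp(β·Δ). The sketch below shows the arithmetic. The coefficient value is purely hypothetical and is not a result from the thesis.

```python
import math


def hazard_ratio(beta, delta=1.0):
    """Hazard ratio implied by Cox coefficient beta for a change of delta."""
    return math.exp(beta * delta)


# Hypothetical coefficient of 0.02 per year of recipient age:
# a 10-year age difference multiplies the hazard by exp(0.2), about 1.22.
hr_10_years = hazard_ratio(0.02, delta=10)
```

A ratio above 1 means higher hazard (shorter expected graft survival), a ratio below 1 means lower hazard, and exactly 1 means no effect under the model.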

Practical implications (CZ & beyond)

Augment current scoring (e.g., KDPI/EPTS) with learned risk from additional covariates.
Prototype a clinical web tool (à la IChooseKidney) to communicate individualized survival curves and risk comparisons to clinicians and patients.
With better local data access, the pipeline can be adapted to the Czech Republic to reflect local practice and demography (e.g., IKEM/KST context).

(Technical) skills I learned

Survival ML stack: Worked with specialized libraries (scikit-survival, lifelines, PySurvival, PyTorch) for handling censored data and partial likelihoods
Proper evaluation: Used multiple metrics (C-index, IBS, time-dependent AUC) to assess both discrimination and calibration
Efficient data processing: Managed large datasets (hundreds of GB) with optimized storage formats and memory-safe operations
High-performance computing: Executed model training on the Helios compute cluster with job scheduling and checkpointing
Clinical communication: Translated technical findings into clinically relevant insights and clarified domain-specific definitions
Research methodology: Independently scoped the project, reproduced baseline results from literature, and validated design choices

Limitations & future work

External validation on independent cohorts is crucial before clinical use.
RSF and some deep models are computationally heavy; more efficient implementations and hardware would help.
Expand to treatment-aware modeling (e.g., immunosuppression changes, time-varying covariates).
Co-design with clinicians to ensure the tool is useful, fair, and trustworthy in practice.

Acknowledgments

I’m grateful to Ing. Tomáš Kouřim for expert guidance, and to Ing. Pavel Strachota, Ph.D. for exceptional editorial support. Thanks to UNOS/OPTN for data access and to Mild Blue and IKEM collaborators for context and inspiration.

Open code & reproducibility

Code: https://github.com/peterstran/ml-unos2
Thesis text: https://dspace.cvut.cz/handle/10467/111520
Data note: UNOS/OPTN data as of September 2022; interpretations are mine and do not represent official policy of OPTN or the U.S. Government.