Comparison of digital pathology AI models, genomics classifier, and clinical variables in predicting progression free survival in TCGA prostate data set.

DOI: 10.1200/jco.2025.43.5_suppl.335 Publication Date: 2025-02-18T14:33:57Z
ABSTRACT
Background: Recent advances in AI have brought innovation to digital pathology problems involving H&E whole slide images (WSI). Although some digital pathology AI (DPAI) models have been developed for survival analysis with TCGA data, it is not well understood how well DPAI models predict progression free survival (PFS) in the TCGA prostate dataset (PRAD), and there has been no comprehensive comparison of DPAI models with genomic or clinical models. Prior work also ignored batch effects.

Methods: PRAD contains 392 patients with WSI and clinical data. WSI batch effects were first evaluated with HistoQC (1). We trained four different DPAI models to predict PFS in five-fold cross-validation (CV), controlling for batch effects in the data splits. The CV was repeated five times with different random splits, giving 25 splits in total. Each model’s performance was evaluated by the mean C-index, with a confidence interval (CI) approximated by the range from the second smallest to the second largest value across the 25 splits. The same evaluation was applied to the clinical and genomic models, which were trained with regularized Cox regression. The clinical-only model used age, Gleason group, and adverse pathology features. The genomic model was trained on the expression of five prognostic genes from Mou et al. 2022. Multi-modal models combining clinical variables with expression and/or DPAI features were evaluated in the same way.

Results: Substantial WSI batch effects were found across the PRAD dataset; we excluded 9 patients whose WSIs were generated with a distinct scanner. In the remaining 383 patients, the clinical-only model achieved a mean C-index of 0.73 (0.63, 0.84) in cross-validation, compared with 0.72 (0.58, 0.88) for the expression-only model. The clinical-genomic model performed better, with a mean C-index of 0.79 (0.65, 0.90). Among the DPAI models, the CLAM attention model (2) initialized with the UNI foundation model (3) achieved the best mean C-index of 0.68 (0.57, 0.82). Combining this DPAI model with clinical variables gave a modestly larger C-index of 0.71 (0.59, 0.83). Interestingly, adding gene expression to this multi-modal model did not further improve performance. We also observed that ignoring batch effects during data splitting consistently worsened performance.

Conclusions: To the best of our knowledge, this is the first in-depth study of PFS prediction on PRAD that compares DPAI with clinical or expression-based models. All data modalities yielded similar C-indices with overlapping CIs, and the clinical-genomic model performed best. Controlling batch effects is important for DPAI models; future research is needed to explore whether batch effect correction on WSI can bring DPAI performance closer to that of clinical-genomic models. A larger sample size is needed to adequately compare data modalities or to build proper multi-modality models.

1. Janowczyk et al. 2019. 2. Lu et al. 2020. 3. Chen et al. 2023.
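The evaluation summary described in Methods can be illustrated with a minimal sketch. It is not the authors' code: the function names `harrell_c_index` and `summarize_c_indices` are hypothetical, Harrell's estimator is assumed as the C-index definition (the abstract does not say which variant was used), tied follow-up times are simply skipped, and the numbers in the usage lines are toy values, not results from the study. With 25 values, dropping one value at each end leaves an interval that spans 23 of the 25 splits.

```python
from itertools import combinations
from statistics import mean


def harrell_c_index(times, events, risks):
    """Harrell's concordance index for right-censored data (assumed variant).

    times  : observed follow-up times
    events : 1 if progression was observed, 0 if censored
    risks  : predicted risk scores (higher means earlier progression expected)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the patient with the shorter time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the shorter time is an observed event.
        # Simplification: pairs with tied times are skipped.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # higher risk progressed first: concordant
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def summarize_c_indices(values):
    """Return (mean, lower, upper), where the interval runs from the
    second smallest to the second largest per-split C-index."""
    ordered = sorted(values)
    if len(ordered) < 3:
        raise ValueError("need at least three values")
    return mean(ordered), ordered[1], ordered[-2]


# Toy usage: four patients, the third one censored.
c = harrell_c_index(times=[1, 2, 3, 4], events=[1, 1, 0, 1], risks=[4, 3, 2, 1])

# Toy per-split C-indices (the study would have 25, one per CV split).
avg, lower, upper = summarize_c_indices([0.60, 0.70, 0.65, 0.80, 0.75])
```

In practice each of the 25 values would come from scoring one held-out fold of one CV repeat, and the same summary would be applied to every model so that the intervals are comparable.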