
A modular protein language modelling approach to immunogenicity prediction.

Authors :: O'Brien, Hugh
Salm, Max
Morton, Laura T.
Szukszto, Maciej
O'Farrell, Felix
Boulton, Charlotte
King, Laurence
Bola, Supreet Kaur
Becker, Pablo D.
Craig, Andrew
Nielsen, Morten
Samuels, Yardena
Swanton, Charles
Mansour, Marc R.
Hadrup, Sine Reker
Quezada, Sergio A.
Source :: PLoS Computational Biology. 11/11/2024, Vol. 20 Issue 11, p1-23. 23p.
Publication Year :: 2024
Abstract: Neoantigen immunogenicity prediction is a highly challenging problem in the development of personalised medicines. Low reactivity rates in called neoantigens result in a difficult prediction scenario with limited training datasets. Here we describe ImmugenX, a modular protein language modelling approach to immunogenicity prediction for CD8+ reactive epitopes. ImmugenX comprises a pMHC encoding module trained on three pMHC prediction tasks, an optional TCR encoding module and a set of context-specific immunogenicity prediction head modules. Compared with state-of-the-art models for each task, ImmugenX's encoding module performs comparably or better on pMHC binding affinity, eluted ligand prediction and stability tasks. ImmugenX outperforms all compared models on pMHC immunogenicity prediction (area under the receiver operating characteristic curve: 0.619; average precision: 0.514), with a 7% increase in average precision compared to the next best model. ImmugenX shows further improved performance on immunogenicity prediction with the integration of TCR context information. ImmugenX's performance is further analysed for interpretability; this analysis locates areas of weakness found across existing immunogenicity models and highlights possible biases in public datasets.

Author summary: Accurate prediction of neoantigen immunogenicity has the potential to greatly improve the effectiveness of targeted therapies for cancer. While a number of associated tasks, such as peptide-HLA elution prediction, can now be predicted with high accuracy by several published models, direct prediction of which epitopes will produce an immune response has proven more difficult. In this paper we demonstrate a modular protein language model approach which is trained iteratively to include data from related sub-tasks and can be extended to include information such as candidate TCRs of interest when available.
There is a relatively small amount of immunogenicity data available, with even less data available with paired TCRs. This makes directly training a model to predict immunogenicity challenging. Our approach has the advantage of utilising data from sub-tasks and masked language modelling to allow for training a highly performant model with a small dataset. Using a cancer-specific benchmarking dataset we show this approach improves on existing state-of-the-art models and can be improved further with the addition of TCR context. This provides a framework that can serve as the basis for utilising additional information sources and datasets as they become available. [ABSTRACT FROM AUTHOR]
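The abstract reports two ranking metrics, area under the receiver operating characteristic curve and average precision. As a reading aid, the minimal sketch below shows how these two quantities are commonly computed from binary labels and model scores. It is not the authors' evaluation code: the function names `roc_auc` and `average_precision` and the toy labels and scores are illustrative assumptions, and tie handling in `average_precision` is simplified compared with library implementations.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision@k over the ranks k at which a positive is retrieved.

    Simplification: tied scores are ordered arbitrarily by the sort.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / n_pos


# Toy data invented for illustration only (1 = immunogenic, 0 = non-immunogenic).
toy_labels = [1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

print(f"AUROC: {roc_auc(toy_labels, toy_scores):.3f}")
print(f"Average precision: {average_precision(toy_labels, toy_scores):.3f}")
```

Average precision depends on the fraction of positives in the dataset, so when reactivity rates are low, as the abstract describes, it is usually read alongside that base rate rather than against a fixed 0.5 reference as with AUROC.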