DeepLC can predict retention times for peptides that carry as-yet unseen modifications

The inclusion of peptide retention time prediction promises to remove peptide identification ambiguity in complex liquid chromatography–mass spectrometry identification workflows. However, due to the way peptides are encoded in current prediction models, accurate retention times cannot be predicted for modified peptides. This is especially problematic for fledgling open searches, which will benefit from accurate retention time prediction for modified peptides to reduce identification ambiguity. We present DeepLC, a deep learning peptide retention time predictor using peptide encoding based on atomic composition that allows the retention time of (previously unseen) modified peptides to be predicted accurately. We show that DeepLC performs similarly to current state-of-the-art approaches for unmodified peptides and, more importantly, accurately predicts retention times for modifications not seen during training. Moreover, we show that DeepLC’s ability to predict retention times for any modification enables potentially incorrect identifications to be flagged in an open search of a wide variety of proteome data. DeepLC, a deep learning-based peptide retention time predictor, can predict retention times for unmodified peptides as well as peptides with previously unseen modifications.

Liquid chromatography plays a critical role in mass spectrometry (MS) analysis of bottom-up proteomics 1 . By separating peptides based on their physicochemical properties in the liquid chromatography step, the complexity of the sample presented to the MS instrument is greatly reduced. This reduction means that there is less ionization competition, improved sensitivity for data-dependent or -independent analysis and reduced chimericity in fragmentation spectra (MS2) 2,3 . In addition to these benefits, the retention time measurement itself provides an additional dimension of information to interpret the signals generated by a peptide 4,5 . To interpret these acquired signals, they need to be matched with earlier observations of the same peptides or with a prediction of the signal.
The process by which a peptide is retained or eluted is not fully understood yet 6 , which means that libraries with previously observed retention times are often used to match newly acquired signals 7 . However, these libraries are often incomplete and nontransferable between experimental setups without calibration. To fill this knowledge gap, researchers have therefore used models to predict retention times for previously unobserved peptides 5 .
Many of the first methods for peptide retention time prediction relied on simulation models based on physicochemical knowledge 8 . In 1980, the first linear regression model that solely used total amino acid composition for peptide retention time was published 9 . In 2002, a method was proposed for incorporation of these predictions for increasing identification rates of proteins 10 . Improvements were then made to this modeling process, for example, by taking the positional peptide context into account 11 . Most modern approaches now use data-driven methods such as machine learning or deep learning algorithms to train a predictive model 10,12-15 . In these models, the mapping between the peptide sequence (or features derived from this sequence) and the liquid chromatography retention time apex is learned from empirical examples. After training, any of the aforementioned models can be used to generate predictions for unobserved peptides.
Such retention time prediction models have already been successfully applied for various tasks in proteomics analysis workflows, for example to improve identification confidence 16,17 , to design more efficient experiments 18 and to identify chimeric fragmentation spectra 19 . Most recently, these retention time prediction models have been used in combination with fragment peak intensity prediction models to generate comprehensive, in silico chromatogram libraries for data-independent acquisition (DIA) identification, effectively replacing and surpassing more limited, empirically derived data-dependent acquisition spectral libraries 20-22 .
In keeping with the general trend in machine learning, there has been a switch from classical machine learning to deep learning in newly developed retention time predictors. This switch was mainly driven by recent innovations in the field of deep learning and the large amount of peptide retention time data that has become available. Because a deep learning network learns its own peptide representation, these models usually allow for more accurate predictions 23 . The types of architecture proposed by state-of-the-art deep learning retention time models include capsule convolutional neural networks (CNNs) in DeepRT(+) 15 , neural networks with long short-term memory layers as used by Guan et al. 13 and an encoder-decoder principle with gated recurrent units in Prosit 14 . The architectures of these models either work with a CNN or recurrent architecture (for example, long short-term memory or gated recurrent units). CNN architectures slide a filter with a specified kernel size over the encoded peptide. In contrast, recurrent neural networks thread the sequence through a sequence of units.
All of these models share the same peptide encoding method, in which amino acids and their corresponding positions are transformed into a one-hot amino acid encoding. This encoding takes the form of a matrix in which the presence or absence of each amino acid for each position in the peptide is represented by a one or a zero, respectively. This use of one-hot encoding of amino acids restricts the models' applicability in some of the most interesting data analysis workflows, most notably in open searches where the goal is to elucidate the modification landscape of the proteome 24-27 .
These open search workflows are gaining popularity in the field of proteomics as they make it possible to search for a large variety of peptide modifications simultaneously. Unfortunately, current retention time prediction methods cannot be directly applied in open searches because of the vast number of potential modifications 28 . With one-hot amino acid encoding, each potential modification must be represented by a binary feature indicating the presence of this modification. Additionally, sufficient training examples are required for each modification for the machine learning algorithm to learn the hidden impact of every one of these modifications on the peptide retention time.
Here, we solve this fundamental issue with DeepLC, our retention time predictor that is able to accurately predict the retention time for all peptides and their modifications, even when these modifications have not been seen during training. DeepLC achieves this by encoding peptides and modifications at the atomic composition level, allowing generalization of the patterns learned from the modifications seen during training.

Results
The results section is split into two main parts. We first evaluate the performance of DeepLC on retention time prediction for unmodified peptides. We then proceed to evaluate DeepLC's unique ability to predict retention times for modified peptides. We rely on two distinct ways of evaluating DeepLC's performance on these modified peptides: (1) evaluating performance on unseen modifications, and (2) evaluating performance by treating unmodified amino acids as modified glycines. These evaluations show that DeepLC is not only competitive with state-of-the-art retention time prediction algorithms for unmodified peptides, but can also achieve similar performance for unseen modified peptides. Finally, we illustrate the unique capability of DeepLC by flagging potential false positive identifications in open searches of a variety of human tissue data sets.
Evaluation on unmodified peptides. Our approach to model amino acids by their atomic composition provides accurate predictions of liquid chromatography retention times for unmodified peptides (including carbamidomethylation of cysteine and oxidation of methionine), with similar performance to current state-of-the-art retention time prediction models DeepRT 15 , Prosit 14 and Guan et al. 13 that model amino acids directly.
DeepLC test set predictions for the three selected data sets are plotted in Fig. 1. We observe very high prediction accuracy, with Pearson correlations larger than 0.98 for all three data sets. HeLa HF data show a slightly worse performance, but here the liquid chromatography gradient is substantially shorter, which makes retention times less predictable. Indeed, this negative effect of shorter gradients on resolution and peak capacity is well known 29 , making apex peptide elution times less predictable. Figure 1 also reveals a small but notable number of peptides with high prediction errors. These are potentially wrong identifications or wrongly determined elution apexes. Most of these outliers fall within the worst 1% of predictions (Supplementary Fig. 1), and up to 1% incorrect identifications are expected given the 1% false discovery rate (FDR) applied to these data sets. The same plots for the other 17 data sets can be found in Supplementary Figs. 2 and 3, where we make very similar observations to those in Fig. 1.
Supplementary Table 1 summarizes the test set performance for all 20 data sets described in the data sets and evaluation section of the Methods. The atomic composition encoding approach of DeepLC is able to learn accurate prediction models with high R values for all data sets. Correlation provides a measure of how much variance is, and is not, explained by these predictions and allows for comparison between different liquid chromatography setups. For most data sets, DeepLC achieves an R above 0.98, with four data sets even achieving correlations above 0.995. These R values are highly comparable to those of the other models. Nearly all data sets were obtained with reverse phase columns; nevertheless, even though there are fewer data sets with hydrophilic interaction chromatography and strong ion exchange chromatography, DeepLC also performs very well on these, with relative mean absolute errors (MAE) below 1.5%.
For the Δt95% metric the differences are more pronounced. This metric describes the error for a retention time window that contains 95% of the peptides in the error distribution and is thus very sensitive to outliers. Here we observe that DeepLC performs consistently worse than the other models. It is, however, unclear whether these differences should be attributed to the atomic composition encoding, a different deep learning architecture, a difference in train-validation-test split (note that the publications for the other prediction models do not mention the use of a validation set) or a combination of these. As we want to focus on the capability of DeepLC to predict retention times for modified peptides, we leave this question open for further research.
The trained models are also highly transferable between different data sets. This transferability is especially useful when applying models trained on larger data sets to smaller ones, or when applying models without retraining. Only a simple calibration is required to transfer the predictions between liquid chromatography setups. Supplementary Fig. 4 shows that models that achieve high performance on a given data set also show high Spearman correlation when applied to different data sets. The only exception is when different stationary phases are used, for example hydrophilic interaction liquid chromatography, which retains hydrophilic peptides rather than the hydrophobic peptides retained by reverse phase columns.
DeepLC builds on a deep learning approach that greatly benefits from a large number of training peptides, and we can show that large data sets indeed have a positive influence on the performance of DeepLC. The performance on each individual data set in relation to the number of training peptides is shown in Fig. 2 and Supplementary Table 1. Data sets with a very small number of training peptides (<10,000) tend to have a performance between 2 and 4.5% relative MAE. For medium sized data sets (>10,000 and <75,000 peptides), the performance can vary widely, with relative MAEs ranging from 0.9 to 4.5%. For larger data sets (>75,000 peptides) the performance tends to be below 2% relative MAE, and these larger data sets start to converge in terms of performance.
To further evaluate the relationship between the number of training peptides and prediction performance we computed learning curves for the three selected data sets (Fig. 3). These curves show a sharp improvement for the first four to five steps (comprising up to 50% of the total number of training peptides). Beyond these steps, prediction performance improves only linearly for the SWATH Library and HeLa HF, while showing smaller improvements in the last step for DIA HF. For two of these data sets, the performance continues to improve right to the last step of the learning curve. This ability to continuously improve performance suggests that DeepLC, like most other deep learning approaches, is capable of fitting even more complex relations than classical machine learning when provided with sufficient data. The same observation of increasing performance for larger training sets can be made for 15 of the remaining 17 data sets (Supplementary Figs. 5 and 6).
Evaluation on modified peptides. DeepLC is able to generalize effectively for unmodified peptides as well as extend its retention time predictions to modifications that were not included in the training set. We can thus show that the DeepLC models have not just learned the general shift in retention time caused by modifications, but also how this shift depends on the context of the modification in the peptide.
Prediction performance for modified peptides would ideally be evaluated on a large data set with a variety of modifications. Indeed, as shown in Fig. 3, the full performance potential of DeepLC is achieved by the largest possible data set size. However, such large data sets with many modifications are currently not available in the public domain.
Instead, we show DeepLC's prediction performance here for modified peptides on a recently published smaller data set (ProteomeTools PTM 30 ). Furthermore, we introduce an evaluation procedure that allows the use of larger data sets based on the fact that any amino acid can be considered a modified glycine.
We first evaluate DeepLC on all 14 modifications in the ProteomeTools PTM data set. We trained and optimized 14 DeepLC models, where each model only saw peptides that did not contain a specific modification. Each model was then evaluated on the remaining peptides, which all contained the modification that was excluded during training. We created two test sets from these remaining peptides to evaluate predictions: one where the excluded modification was encoded and one where it was not. Prediction performance for both test sets was then evaluated and compared. This comparison thus allows performance to be assessed on a modification that is not included in training in terms of the improvement that DeepLC offers over a baseline of simply ignoring the presence of the modification. Figure 4 and Supplementary Fig. 7 show the prediction errors for each of the modifications that were left out during training. Figure 4 shows the performance when a given modification was not present in the training set for the model and afterward was either not encoded (red boxplots, baseline) or encoded (blue boxplots) during the predictions. It should be noted that many modifications did not cause a substantial change in terms of predicted retention time, as was also observed in the original paper for this data set 30 . For the modifications that did shift retention time, however, we observed a clear performance increase when these modifications were encoded during the predictions. For instance, Supplementary Fig. 7 shows that the MAE was improved by 700% (from 462 to 66 s) for propionyl. These improvements were mainly due to the correct prediction by DeepLC of the shift in retention time caused by the modification. Most importantly, besides a substantially decreased MAE, the correlation R also showed a substantial improvement. This is shown in Fig. 4 through the substantially smaller variance for the blue boxplots. For crotonyl, for instance, Supplementary Fig. 7 presents an increase of R from 0.975 to 0.990 when encoding the modification in the test set. This means that the DeepLC models did not just learn the general shift in retention time caused by modifications, but also how this shift depended on the context of the modification in the peptide.
Only nitrotyrosine and phosphorylation modifications show a substantially lower performance when encoded, but these modifications can be classified as physicochemically very different from the other modifications. This inability of DeepLC to accurately predict retention times for modifications that are chemically very different from anything encountered in the training set indicates that even DeepLC requires some relevant training data for a given class of modifications.
In the second evaluation procedure, we used the larger DIA HF and the smaller HeLa HF data sets to train and optimize 19 DeepLC models. As above, each model was trained and optimized on peptides that did not contain a specific amino acid, and was then evaluated on the peptides that contained the amino acid excluded from training. For this, we again created two test sets from these remaining peptides: one where the excluded amino acid was encoded with the composition of glycine only and one with its actual composition. We show that encoding an amino acid as itself instead of as glycine improves the MAE for most amino acids (Fig. 5). DeepLC performed very well when modeling large hydrophobic residues as modified glycines, and slightly less well when modeling polar uncharged and negatively charged residues. Finally, for the positively charged amino acids only arginine showed an improvement, while lysine and histidine decreased in performance. The poor performance for lysine can be explained by the difference between the amino acid and the closest atomic composition. For lysine, the closest atomic compositions are those of arginine and leucine (or isoleucine), which are substantially less hydrophobic and substantially more hydrophobic, respectively. As shown previously, DeepLC was unable to extrapolate to unseen modifications that were very different in composition.
This nonmodified amino acid evaluation shows that performance is slightly worse in comparison to including the amino acid in the training set, with DIA HF and HeLa DeepRT having MAEs of 2.37 and 3.2 min, respectively; the MAEs shown in Fig. 5 are about 1.5 to 2.5 times higher.
It is important to note that this evaluation is harsh, because the trained model has never seen a given amino acid and, moreover, because peptides that are similar to each other are likely to all be excluded from training, as such peptides have a higher likelihood of also containing the removed amino acid. This can create biased training sets, especially for lysine and arginine, as most peptides are tryptic. However, the model is still able to predict retention times very accurately for amino acids that were not used in training.
Evaluation of open identifications. Predicted retention times have the potential to overcome the identification ambiguity issue 23,31 .
Because of DeepLC's unique capability to accurately predict retention times of (unseen) modifications, these predictions can be applied to open searches, where identification ambiguity is a key problem 25 . Indeed, open searches introduce considerable ambiguity through the very large number of possible modifications considered, which can be reduced through orthogonal measurements such as retention time.
DeepLC was applied to the results of an open search of human tissue data 32 using Open-pFind 26 . Figure 6a shows observed retention time plotted against predicted retention time for the resulting peptide spectrum matches (PSMs). While the retention time was accurately predicted for PSMs with a Q value <0.01, much higher retention time errors were observed for PSMs with Q ≥ 0.01. The group of peptides with Q ≥ 0.01 also showed clustering around a predicted retention time of 1,000 s; these are PSMs for very hydrophilic peptides that were predicted to be nonretained. Figure 6b shows that PSMs with Q ≥ 0.01 did have a higher error, but the mode was still around zero. No substantial difference was observed in the error distributions for unmodified and modified peptides (Supplementary Fig. 8). This indicates that false identifications are mostly expected to have an error distribution with its mode around zero, but with a large deviation from this mode.
The error distribution of filtered PSMs (Q ≥ 0.01) can then be compared to the distributions of selected modifications to flag suspect modification groups, as the error distribution of suspect modifications is expected to be similar to that of filtered PSMs. Boxplots of the error distributions for four subsets of modifications are shown in Fig. 6c (see Supplementary Figs. 9-13 for all modifications). The subset containing the ten most frequently found modifications shows a low error spread that is within 5% of the maximum elution time (300 s) and is generally centered around zero. The subset of modifications with the largest errors is in the range of 25% of the maximum elution time (1,500 s) and is widely spread around 0 s. A notable exception is dethiomethyl, with a shifted median retention time error of 1,000 s. This shift can be explained by in-source fragmentation, which causes oxidized methionine to lose its side chain 33 . If in-source fragmentation is the cause, then the observed retention times are expected to match the oxidized methionine equivalent. To verify whether this is the case, PSMs with dethiomethyl were replaced with their oxidized methionine precursors. Figure 6d shows that replacing dethiomethyl with oxidized methionine reduced the predicted retention time error to around the same level as expected for oxidized methionine peptides in Fig. 6c. The subset of modifications with large errors shows very similar patterns to the next subset, which contains modifications that are not expected to occur in the sample, as these are experimentally induced and thus should not be found in the untreated biological sample. In effect, the similarity between these two subsets indicates that modifications with large retention time errors according to DeepLC can be flagged as highly suspect. These PSMs with unexpected modifications were then inspected for their associated second-ranked PSM in Fig. 6e. When all PSMs are considered, the first-ranked PSMs have the narrowest error distribution. In contrast, when considering only PSMs with unexpected modifications, the second-ranked PSMs display the narrower error distribution. This difference indicates that DeepLC might well be able to select better alternatives for these unexpected modifications, as judged by the generally better fit of these alternatives' retention times.
Finally, the last subset singles out presumed detection of mutations that show similar or worse error distributions than the previous two subsets. The observation that presumed mutations are among the most problematic corresponds to the known nontrivial nature of reliably identifying such sequence changes 31,34 .
These results thus demonstrate that the unique capabilities of DeepLC allow it to be used as an orthogonal measure to flag suspect identifications in open searches. As a proof-of-concept, we here show that a comparison of error distributions between expected and potentially falsely identified modifications can select those modification distributions that are likely the result of incorrect identifications. This method allows the most suspect modifications to be selected, and can be configured to be more conservative or more lenient based on the needs of the analysis.

[Fig. 6 caption fragment: modifications are followed by the number of identified peptides in brackets; boxplots show the median, Q1 (25%), Q3 (75%) and whiskers at 1.5 x (Q3 - Q1) beyond the quartiles; the 25% of error distributions with the largest distance are marked as substantially different. Panel d: PSMs identified with a dethiomethyl modification in CD8 T cell data, and predictions for the same PSMs with dethiomethyl replaced by oxidation. Panel e: error distributions as violin plots for all rank one PSMs (Q < 0.01) and their corresponding rank two PSMs, shown for all identifications and for rank one identifications with unexpected modifications.]

Discussion
Our evaluation shows that DeepLC performed similarly to current state-of-the-art models for unmodified peptides, while also accurately predicting the retention time of modified peptides, even for modifications that were not included in the training set. This ability to predict retention times for unseen modifications was evaluated with a two-pronged strategy using both unmodified peptides and synthetic, modified peptides. For both evaluations, encoding modifications for prediction improved performance, while performance was reduced only for specific modifications that were very different from any other structure in the data set. Finally, the potential of this unique capability of DeepLC was illustrated through its ability to flag suspect PSMs in an open search. Crucially, DeepLC showed much larger prediction errors for PSMs that carried modifications that were certain to be absent from the sample. Future development of models that can predict the retention time for unseen modifications could focus on the structural aspects of modifications, as DeepLC is currently limited in differentiating between isomeric structures that are physicochemically different. Indeed, it has already been observed for small molecules that structure, and not just atomic composition, determines the physicochemical properties of molecules 35,36 . Here, the decision was made to work with atomic composition because of the ready availability of the composition in databases such as Unimod, and the greater ease of integration when compared to more complex structural descriptors.
DeepLC enables the field to generate predictions for a wide landscape of modifications. To make DeepLC readily available to researchers and their use cases, it is freely available online and has a user-friendly graphical user interface (GUI). Furthermore, the tool is available in code repositories that enable easy incorporation in workflows and pipelines for automated predictions.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41592-021-01301-5.

Methods
Architecture. DeepLC uses a convolutional deep learning architecture with four different paths for a given encoded peptide. The same peptide acts as the input for the four paths, which consist of multiple separate layers, as shown in Supplementary Fig. 14. Three of the initial paths use a combination of convolutional 37 and maximum pooling layers 38 . The paths with convolutional layers use a sliding window filter approach to encode local structure in the peptide. The maximum pooling layers further generalize the encoding of the convolutional filters by only propagating the maximum activations. The remaining path, which propagates global features, consists of densely connected layers. These densely connected layers do not take local structures into account, but this is not required as this path is meant to encode global structure only. The outputs of all four initial paths are flattened and concatenated to provide the input for the final combined path, which consists of six connected dense layers. A detailed visualization of the architecture is available in Supplementary Fig. 15.
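The exact layer configuration is given in Supplementary Figs. 14 and 15; purely as a rough illustration of the four-path layout described above, a minimal Keras sketch could look as follows. Filter counts, block counts and dense-layer sizes are placeholders and not the published hyperparameters.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Input dimensions follow the Methods text; all other hyperparameters are placeholders.
aa_comp_in = tf.keras.Input(shape=(60, 6), name="aa_composition")      # per-residue atom counts
diaa_comp_in = tf.keras.Input(shape=(30, 6), name="diaa_composition")  # non-overlapping residue pairs
onehot_in = tf.keras.Input(shape=(60, 20), name="onehot")              # one-hot amino acids
global_in = tf.keras.Input(shape=(55,), name="global_features")        # length, totals, termini

def conv_path(x, kernel_size, filters=64, blocks=2):
    # Convolution + maximum pooling blocks that summarize local structure.
    for _ in range(blocks):
        x = layers.Conv1D(filters, kernel_size, padding="same", activation="relu")(x)
        x = layers.MaxPooling1D(pool_size=2)(x)
    return layers.Flatten()(x)

path_aa = conv_path(aa_comp_in, kernel_size=4)
path_diaa = conv_path(diaa_comp_in, kernel_size=2)
path_onehot = conv_path(onehot_in, kernel_size=4, filters=2, blocks=1)  # deliberately small path
path_global = layers.Dense(32, activation="relu")(global_in)            # global, position-agnostic path

merged = layers.Concatenate()([path_aa, path_diaa, path_onehot, path_global])
x = merged
for units in (128, 64, 32, 16, 8):  # five dense layers plus the output layer below
    x = layers.Dense(units, activation="relu")(x)
out = layers.Dense(1, name="retention_time")(x)

model = Model(inputs=[aa_comp_in, diaa_comp_in, onehot_in, global_in], outputs=out)
model.compile(optimizer="adam", loss="mae")
```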
The input matrix for the amino acid composition path has a dimension of 60 for the peptide sequence by six for the atom counts (C, H, N, O, P and S). Not every peptide is 60 amino acids long; shorter peptides are therefore padded with 'X' characters that carry no atomic composition. With this encoding, modified amino acids become straightforward to encode, as computing their atomic composition is trivial. Note that for modified amino acids, the atomic composition of the modification is added to the atomic composition of the unmodified residue. This encoding allows the model to learn patterns that generalize to unseen modifications.
The diamino acid path is added to further improve the generalization capability of the model. In this path, the peptide is divided into diamino acids without overlap. This improves the generalization capability, as the input values for each position are more thoroughly represented; otherwise there would only be 20 unmodified amino acid representations, combined with a limited number of modifications. Apart from interpreting the amino acids in pairs, the diamino acid path uses the same logic as the amino acid composition path, leading to an input matrix of 30 paired positions by six atoms.
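To make the two composition-based inputs concrete, a minimal sketch is shown below. It assumes per-residue atom counts for residues as they occur in a peptide chain and Unimod-style composition deltas for modifications; only a handful of residues are listed and the helper names are illustrative.

```python
import numpy as np

# Per-residue atom counts (C, H, N, O, P, S) for residues in a peptide chain;
# only a few residues are listed here, the full table is used in practice.
RESIDUE_ATOMS = {
    "G": (2, 3, 1, 1, 0, 0), "A": (3, 5, 1, 1, 0, 0), "S": (3, 5, 1, 2, 0, 0),
    "K": (6, 12, 2, 1, 0, 0), "M": (5, 9, 1, 1, 0, 1),
}
# Atomic composition deltas of modifications, e.g. from Unimod (oxidation adds one O).
MOD_ATOMS = {"Oxidation": (0, 0, 0, 1, 0, 0)}

def encode_composition(peptide, mods=None, max_len=60):
    """Return the 60 x 6 per-residue atom count matrix; padded positions ('X') stay all-zero."""
    mat = np.zeros((max_len, 6))
    for i, aa in enumerate(peptide):
        mat[i] = RESIDUE_ATOMS[aa]
    for pos, mod in (mods or {}).items():
        mat[pos] += MOD_ATOMS[mod]  # modification composition is added to the residue composition
    return mat

def encode_diamino(comp_matrix):
    """Return the 30 x 6 matrix obtained by summing non-overlapping residue pairs."""
    return comp_matrix.reshape(30, 2, 6).sum(axis=1)

aa_matrix = encode_composition("GSKAM", mods={4: "Oxidation"})  # oxidized methionine at position 4
diaa_matrix = encode_diamino(aa_matrix)
```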
Encoding amino acids and their modifications strictly by their atomic composition does not, however, allow all molecular information to be captured comprehensively. Indeed, the structure of isomers can play an important role in the physicochemical properties of amino acids, as exemplified by the structural isomers isoleucine and leucine 39 . This is the reason that one-hot encoding of unmodified amino acids is still used in DeepLC as an input for the One-hot encoding path. However, to reduce the impact of this path, the number of filters for this path is limited to two. The dimensions of this input matrix are 60 positions by 20 amino acids.
In addition to the paths that encode position-specific information, the Global features path takes global information of the peptide into account. These global features include the length and total atomic composition of the peptide. In addition to these global counts and the length, the atomic composition of the first and last four positions of the peptide is encoded. This adds a 6 × 8 feature matrix, or a flattened feature vector of 48. Together with the peptide length and the six total atom counts, the dimension of this input vector is 55.
Three versions of the model were trained, solely differing in kernel sizes (of 2, 4 and 8) for the amino acids composition path. These three models were combined in an ensemble by averaging their predictions. This strategy is similar to the ensemble used in DeepRT+ (ref. 15 ) and ensures adaptability to different data sets that might require encoding of longer local peptide structures.
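A sketch of how such an ensemble could be applied, assuming the three kernel-size variants have already been trained; the variable names are illustrative.

```python
import numpy as np

def ensemble_predict(models, encoded_inputs):
    # Average the predictions of the kernel-size 2, 4 and 8 variants.
    return np.mean([m.predict(encoded_inputs).ravel() for m in models], axis=0)

# retention_times = ensemble_predict([model_k2, model_k4, model_k8], encoded_inputs)
```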
The paths were optimized on the validation set of the SWATH Library data set. This optimization consisted of selecting the number of convolutional and maximum pooling blocks for the amino and diamino acids composition paths that yielded a lower MAE. For the diamino acids composition path, we chose to not encode redundant information and thus the encoding was nonoverlapping. The rationale was to limit the already redundant information within and between the diamino and amino acids composition paths. However, as with many of the architecture decisions, there is no guarantee that the chosen hyperparameters of the architecture provide a global or even local optimum. Inspection of the weights of the dense layer after concatenation shows that all paths propagate activations and thus contribute to the predictions (Supplementary Fig. 16).
Finally, the other hyperparameters of each layer in DeepLC are consistent for all versions with different kernel sizes. All layers, except the output layer and the one-hot encoding path, use L1 regularization with α = 2.5 × 10 −7 and a leaky ReLU 40 with a maximum activation value of 20. The one-hot encoding path uses the tanh activation function, as within this path we are only interested in the ability to separate unmodified amino acid isomers.
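In Keras terms, these settings could be expressed roughly as below; the negative slope of the leaky ReLU is an assumption, as only the maximum activation value of 20 is stated above.

```python
import tensorflow as tf

# L1-regularized dense layer as described above (alpha = 2.5e-7).
dense = tf.keras.layers.Dense(64, kernel_regularizer=tf.keras.regularizers.l1(2.5e-7))
# Leaky ReLU capped at a maximum activation of 20; the negative slope value is an assumption.
capped_leaky_relu = tf.keras.layers.ReLU(max_value=20.0, negative_slope=0.01)
# The one-hot encoding path uses tanh instead.
onehot_activation = tf.keras.layers.Activation("tanh")
```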
Data sets and evaluation. To evaluate the generalization performance of DeepLC, we selected 20 data sets from a wide variety of organisms and experimental setups (Supplementary Table 2). We further selected three data sets (SWATH Library 41 , HeLa HF 42 and DIA HF 43 ) for detailed result reporting, with the results for the other 17 data sets described in the Supplementary Information. The data sets SWATH Library and DIA HF were selected based on their previous use by Ma et al. for DeepRT 15 and by Guan et al. 13 , respectively. A third data set, HeLa HF, was selected because of its short 15 min gradient (compared to the other data sets used) and its large number of training peptides. Only unique peptidoforms (peptide and modification combinations) were used for training or, where indicated by the reference, the previously published data set was used.
The variety in experimental setups and protocols means that the acquired and predicted retention times had to be calibrated. The ProteomeTools library 44 , SWATH Library and DIA HF data sets were normalized to the indexed retention time (iRT) peptides 45,46 . DeepLC itself supports linear calibration similar to iRT calibration 45 , for which users can supply their own high-confidence identifications. This calibration procedure is further explained in the online DeepLC documentation.
The data sets marked 'custom workflow' in Supplementary Table 2 were processed as follows. Raw MS files were downloaded from PRIDE Archive 47 and converted to MGF format with the ThermoRawFileParser 48 . These were then searched using the MS-GF+ search engine 49 with a concatenated target-decoy sequence database containing the respective species' UniProtKB proteome and the common Repository of Adventitious Proteins (https://www.thegpm.org/crap/). Carbamidomethylation of cysteine was set as a fixed modification; oxidation of methionine and acetylation of protein N-termini were set as variable modifications. The full MS-GF+ configuration files for each data set are available on Zenodo. The MS-GF+ search results were postprocessed with Percolator 50 to an FDR of 0.01. Retention times were parsed from the MGF files for all confidently identified peptides. Within each liquid chromatography-mass spectrometry (LC-MS) run, the median retention time for each peptidoform (peptide and modification combination) was calculated. All median retention times were then linearly calibrated across all LC-MS runs for each data set, using the shared peptidoforms as anchor points. Finally, the median calibrated retention time was calculated for each peptidoform across all runs for each data set. These median calibrated retention times were then used to train, validate and test DeepLC. The full custom workflow, including this calibration step, is available as a Snakemake workflow 51 .
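As an illustration of the calibration step only (not the published Snakemake workflow itself), a simple variant that calibrates each run linearly against a chosen reference run using shared peptidoforms as anchors could look like this; the column names and the use of a single reference run are assumptions.

```python
import numpy as np
import pandas as pd

# psms: DataFrame with columns ["run", "peptidoform", "rt"] of confident identifications
# (assumed layout; column names are illustrative).
def calibrate_runs(psms, reference_run):
    # Median retention time per peptidoform within each run.
    per_run = psms.groupby(["run", "peptidoform"], as_index=False)["rt"].median()
    ref = per_run[per_run["run"] == reference_run].set_index("peptidoform")["rt"]
    calibrated = []
    for run, grp in per_run.groupby("run"):
        shared = grp[grp["peptidoform"].isin(ref.index)]
        # Linear fit to the reference run using the shared peptidoforms as anchor points.
        slope, intercept = np.polyfit(shared["rt"], ref[shared["peptidoform"]].values, deg=1)
        calibrated.append(grp.assign(rt=grp["rt"] * slope + intercept))
    calibrated = pd.concat(calibrated)
    # Median calibrated retention time per peptidoform across all runs.
    return calibrated.groupby("peptidoform", as_index=False)["rt"].median()
```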
The data sets marked 'custom workflow ProteomeTools' in Supplementary Table 2 were processed as follows. MaxQuant 52 identification files were filtered on posterior error probabilities <0.01 and scores >90. The retention times were calibrated with the peptides in Supplementary Table 3. Within a run, the median retention time per peptidoform was used for further analysis. Then, across runs, the median retention time per peptidoform was taken as the final retention time.
The data sets marked 'custom workflow ProteomeTools PTM' 30 in Supplementary Table 2 were processed in the same way as the 'custom workflow ProteomeTools' data sets, with the only exception that the retention times were calibrated with the peptides in Supplementary Table 4. A few modifications from the original publication were either grouped or ignored. The modifications 'hydroxyproline' and 'hydroxyisobutyrylation' were grouped under 'oxidation'. The modifications 'monomethylation' and 'dimethylation' on both arginine and lysine were grouped. The modifications 'glutarylation' and 'glyglycylation' were excluded because they shared the same naming scheme, which did not allow them to be discriminated. Finally, 'biotinylation' was excluded due to its uniquely large size.
Each data set was randomly split into a test set (10%), validation set (5%) and training set (85%). The complete set of peptides for all data sets, and the split each peptide was part of, are listed in Supplementary Table 5. The validation set was used for model selection only, while all performance results presented here were computed on the test set. Prediction performance is measured using three commonly used metrics: MAE, Pearson correlation and Δt95%. The last describes the error for a retention time window that contains 95% of the peptides in the error distribution. To make the MAE and Δt95% comparable between experiments, we divided them by the difference between the retention times of the first and last detected peptides in the respective data set. These metrics are further referred to as relative MAE and relative Δt95%.
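A small sketch of these metrics, assuming Δt95% is computed as twice the 95th percentile of the absolute error and that the elution window is taken as the span between the first and last observed retention times; the exact conventions used in the paper may differ slightly.

```python
import numpy as np
from scipy.stats import pearsonr

def evaluate_predictions(observed, predicted):
    observed, predicted = np.asarray(observed), np.asarray(predicted)
    abs_err = np.abs(observed - predicted)
    mae = abs_err.mean()
    r = pearsonr(observed, predicted)[0]
    delta_t95 = 2 * np.percentile(abs_err, 95)  # width of the window holding 95% of the errors
    window = observed.max() - observed.min()    # first to last detected peptide
    return {"MAE": mae, "R": r, "Dt95": delta_t95,
            "relative MAE (%)": 100 * mae / window,
            "relative Dt95 (%)": 100 * delta_t95 / window}
```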
Training procedure. All trained models were initialized with random weights drawn from a normal distribution (μ = 0.0 and σ = 1.0). Two NVIDIA GeForce RTX 2080 Ti graphics cards were used for training. The training consisted of at most 100 epochs with early stopping on a validation set. Most data sets triggered early stopping around 30 epochs, while larger data sets (>75,000 peptidoforms) triggered early stopping at around 50 epochs.
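Continuing the hypothetical Keras sketch from the Architecture section (the model and the encoded training and validation arrays are assumed to exist; the patience value is an assumption), the training call could look like this.

```python
import tensorflow as tf

# 'model', 'train_inputs', 'train_retention_times', 'val_inputs' and 'val_retention_times'
# are assumed to come from the earlier sketches; patience is an assumed value.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)
model.fit(train_inputs, train_retention_times,
          validation_data=(val_inputs, val_retention_times),
          epochs=100, callbacks=[early_stopping])
```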

Open search. Results from an open search by Open-pFind 26 were used to evaluate the ability of DeepLC to identify suspect identifications. Open-pFind allows for open searches by combining an MS2 tag search and a two-stage (modification restrictive and open) search that is optimized with a Percolator equivalent. Even though Open-pFind uses this sophisticated search strategy, open searches are still particularly prone to falsely identified modified peptides. This high false identification rate is due to the larger search space and the resulting identification ambiguity 46 . The search was performed on a data set of 17 adult and seven fetal tissues by Kim et al. 32 , obtained from the PRIDE repository 53 with identifier PXD000561. The search was run with Open-pFind v.3.1.5. Search parameters were set to: 20 ppm precursor mass error tolerance, peptide length from seven to 30 amino acids, modifications limited to mass deltas from −150 to 500 Da, a maximum of two miscleavages, and oxidation of M and carbamidomethylation of C as variable modifications.
Observed and predicted retention time error distributions that are substantially different from that of carbamidomethyl are marked with an asterisk ('*'). This difference is calculated by taking 20 equidistant percentiles (from 5% to 100%) of both distributions and summing the absolute differences between them. The 25% of error distributions with the largest distance are marked as substantially different.
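A minimal sketch of this distance calculation as described above:

```python
import numpy as np

def percentile_distance(errors_a, errors_b):
    """Summed absolute difference over 20 equidistant percentiles (5% to 100%) of two error distributions."""
    percentiles = np.linspace(5, 100, 20)
    return np.abs(np.percentile(errors_a, percentiles) - np.percentile(errors_b, percentiles)).sum()
```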
Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
Data from the following projects were used to train and evaluate DeepLC: HeLa hf 42