Constructing large mass spectrometric data sets usually involves some combination of normalisation, scaling, and transformation of the individual mass spectra to correct for technical (and sometimes biological) variation. Many approaches to data normalisation have been reported, and there is no consensus on which is best. The present study systematically evaluates a set of 24 normalisation, scaling, and transformation methods, and their 420 possible combinations, in the context of atmospheric solids analysis probe (ASAP) mass spectra of human blood plasma. The plasma samples came from two separate cohorts of patients, enrolled in the Oxford Acute Myocardial Infarction (OxAMI) and Oxford Abdominal Aortic Aneurysm (OxAAA) clinical studies, respectively. Within each cohort, patients are classified according to a number of clinical variables. We investigated the effect of the normalisation, scaling, and transformation method on subsequent clustering of the data into the classes of interest, and on machine-learning-based classification of the data into those same classes. The choice of method was found to have a substantial effect on data clustering, measured via the clustering ratio C_R, but a much smaller effect on machine-learning-based classification, quantified via Cohen’s κ statistic. New intensity-histogram-based normalisation methods were found to have the greatest effect on clustering, while mean, median, vector, and AUC normalisation yielded the best machine-learning classification performance across multiple algorithms. High clustering ratios do not necessarily correlate with improved supervised classification outcomes, underscoring the need to consider the subsequent data analysis methodology carefully when optimising data preprocessing pipelines.
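To make the quantities named above concrete, the following is a minimal sketch of the four single-factor normalisations (mean, median, vector, AUC) and of Cohen's κ, using their common textbook definitions. It is not the study's implementation: the function names `normalise` and `cohen_kappa`, the unit m/z spacing used when no axis is supplied, and the trapezoidal rule for the AUC are all assumptions made for illustration, and the study's own definitions may differ in detail.

```python
from collections import Counter
from math import sqrt
from statistics import mean, median


def normalise(intensities, method="mean", mz=None):
    """Divide one spectrum by a single factor chosen by `method` (hypothetical helper)."""
    if method == "mean":
        factor = mean(intensities)
    elif method == "median":
        factor = median(intensities)
    elif method == "vector":
        # Euclidean (L2) norm of the intensity vector.
        factor = sqrt(sum(x * x for x in intensities))
    elif method == "auc":
        # Trapezoidal area under the spectrum; assumes unit spacing if no m/z axis is given.
        if mz is None:
            mz = list(range(len(intensities)))
        factor = sum(
            (mz[i + 1] - mz[i]) * (intensities[i] + intensities[i + 1]) / 2
            for i in range(len(intensities) - 1)
        )
    else:
        raise ValueError(f"unknown method: {method}")
    if factor == 0:
        raise ValueError("normalisation factor is zero")
    return [x / factor for x in intensities]


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), i.e. agreement corrected for chance."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("label sequences must be non-empty and of equal length")
    # Observed agreement: fraction of samples where the two labelings match.
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement from the marginal class frequencies of each labeling.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n**2
    if p_expected == 1:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (p_observed - p_expected) / (1 - p_expected)
```

For example, `normalise([1.0, 2.0, 3.0, 4.0], "vector")` rescales the spectrum to unit Euclidean length, and `cohen_kappa` can be applied to the true and predicted class labels of any classifier. A κ of 0 indicates chance-level agreement and a κ of 1 indicates perfect agreement.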
3401 Analytical Chemistry, 34 Chemical Sciences, Machine Learning and Artificial Intelligence