A moment kernel machine for clinical data mining to inform medical decision making

Scientific Reports volume 13, Article number: 10459 (2023)

Machine learning-aided medical decision making presents three major challenges: achieving model parsimony, ensuring credible predictions, and providing real-time recommendations with high computational efficiency. In this paper, we formulate medical decision making as a classification problem and develop a moment kernel machine (MKM) to tackle these challenges. The main idea of our approach is to treat the clinical data of each patient as a probability distribution and leverage moment representations of these distributions to build the MKM, which transforms the high-dimensional clinical data to low-dimensional representations while retaining essential information. We then apply this machine to various pre-surgical clinical datasets to predict surgical outcomes and inform medical decision making, which requires significantly less computational power and time for classification while yielding favorable performance compared to existing methods. Moreover, we utilize synthetic datasets to demonstrate that the developed moment-based data mining framework is robust to noise and missing data and achieves model parsimony, providing an efficient way to generate satisfactory predictions that aid personalized medical decision making.

Surgery, as a major medical intervention, is usually considered when other treatments result in unsatisfactory outcomes. Predicting adverse events following surgery based on patients’ presurgical clinical data, such as electronic health record (EHR) data, is of crucial importance to inform both physicians and patients for decision making1,2. In recent years, the increased availability of clinical data and computing power has greatly stimulated the development of machine learning (ML) techniques to extract information from clinical data. In particular, ML algorithms have made significant strides in AI-assisted medical procedures for preoperative prediction of postsurgical outcomes through EHRs3,4. The general ML problem focuses on finding an appropriate function f mapping each input data point \({\textbf{X}}\) to the desired output \({\textbf{y}}\), i.e.,

\[ {\textbf{y}} = f({\textbf{X}}). \tag{1} \]
This task is particularly challenging for datasets containing clinical records of large size and mixed data types, including diagnoses, treatments, vital signs, and laboratory values5.

In the past decade, numerous ML-aided methods have been proposed to assist medical decision making through the prediction of postsurgical events. For example, for weight-loss surgery, notable contributions include the application of logistic regression (LR) and Poisson regression (PR) to estimate the readmission rate6, the utilization of neural networks (NNs) and gradient-boosting machines (GBMs) to predict gastrointestinal leak and venous thromboembolism7,8, and the development of the super learner algorithm to predict the risk of 30-day readmission after bariatric surgery9,10. In addition to assessing possible postsurgical events, ML methods have been widely applied to identify abnormalities in medical images such as precancerous or premalignant lesions11,12,13,14. Primary examples range from a deep learning approach to mortality prediction for patients with coronary heart disease and heart failure15 to quantitative image feature extraction methods for the prognosis of early revascularization in patients with suspected coronary artery disease16. Algorithmically, deep neural networks have been attractive to medical researchers and practitioners due to their ability to discover hidden structures in large datasets, leading to a high probability of achieving satisfactory results under suitable conditions17. Among these works, the integration of ML techniques into medical research, although successful in many ways, usually suffers from low computational efficiency due to the heterogeneous structure (e.g., sparsity and irregularity) and large size of clinical data18. In general, the complexity of ML algorithms grows exponentially in time and memory usage as a function of data size. Moreover, to produce better performance, deep neural networks further sacrifice robustness to noise and model parsimony, in addition to computational efficiency19.

Aiming to construct a parsimonious and computationally efficient model to assist medical decision making, particularly for surgical treatments, we develop a moment kernel machine for clinical data mining. The main idea is to introduce the notion of moments for clinical data to efficiently characterize patients’ overall health status. We further integrate the Hilbert Schmidt Independence Criterion (HSIC) Lasso method into the data preprocessing procedure. This leads to two major advantages: (1) the moment representation can quantitatively identify the crucial predictors impacting surgery outcomes; and (2) the dimension of the EHR data is significantly reduced, which facilitates high computational efficiency in data analytic tasks. We then formulate medical decision making problems as ML classification problems, in which the moment representations extracted from EHR data are used as features for ML classifiers. To demonstrate, we choose three ML classifiers (LR, NNs, and GBMs) and use three clinical datasets to illustrate the generalizability of the developed moment kernel machine, that is, making medical decisions based on moments is valid for different clinical data regardless of the choice of ML classifier. We compare the classification performance resulting from our method with that from existing feature extraction methods, highlighting the model parsimony and high computational efficiency of the developed moment kernel machine. Furthermore, we demonstrate the robustness of the moment kernel machine to noise and data loss using synthetic data.

In this section, we first illustrate how clinical decision making can be assisted through machine learning algorithms to give informed predictions based on the patient’s clinical data. In particular, we formulate this task as a classification problem. Next, as the core of this section, we develop a novel moment kernel to extract features from the clinical data for the classification problem. Our method uniquely integrates the HSIC Lasso algorithm to select informative features, thereby improving computational efficiency without sacrificing classification performance. To demonstrate the applicability of the developed medical decision making machine, we also present case studies using both synthetic data and real clinical data.

Making decisions regarding a major medical intervention for a patient, e.g., deciding whether the patient should have surgery, generally requires (1) collecting sufficient data on possible post-intervention outcomes; (2) monitoring the current health condition and reviewing the medical history of the patient; and (3) assessing the significant factors influencing the possible outcomes based on the patient’s current health condition and medical history. This pipeline closely follows the formulation of a classification task in machine learning, where each class represents one possible post-intervention outcome. Then, the classification outcome obtained by the ML procedure using the patient’s clinical data can inform the medical decision.

In ML, features play a critical role in performance. To effectively assist the medical decision making process, it is crucial to extract features from clinical data that both reflect the patients’ medical conditions and affect post-intervention outcomes, which is the main focus of the following sections.

When working with datasets comprising both numerical and categorical values, we apply one-hot encoding20, a technique that converts each categorical predictor to a binary feature vector: each category is given a unique numerical representation as a vector consisting of entries of 0 and 1. We then pre-normalize each feature (predictor) within the training and testing datasets to ensure that all of them lie in the interval [0, 1] before further normalizing the predictor vector of each patient to a probability distribution. Let \({\textbf{x}}_i=(x_{i1},\dots ,x_{iM})\) denote the predictor vector of the \(i^{\textrm{th}}\) patient for \(i=1,\dots ,N\), where N is the total number of patients and M is the number of predictors for each patient; then we normalize each \({\textbf{x}}_i\) as

\[ p_{ij} = \frac{x_{ij}}{\sum _{k=1}^{M}x_{ik}}, \quad j=1,\dots ,M, \tag{2} \]
which yields a vector \({\textbf{p}}_i=(p_{i1},\dots ,p_{iM})\) satisfying \(\sum _{j=1}^Mp_{ij}=1\). In addition, every component \(p_{ij}\) of \({\textbf{p}}_i\) takes values in the same interval [0, 1], which resolves the heterogeneity in the data, that is, the fact that different components of the predictor vector \({\textbf{x}}_i\) are drawn from different ranges. The property \(\sum _{j=1}^Mp_{ij}=1\) then reveals that \({\textbf{p}}_i\) is a probability vector, and hence each \({\textbf{p}}_i\) represents the probability distribution of some random variable \({\textbf{A}}\). In particular, if \({\textbf{A}}\) takes values on the set \(\Omega =\{\alpha _1,\dots ,\alpha _M\}\) containing M distinct elements, then the probability of the event \(\{{\textbf{A}}=\alpha _j\}\) is given by \({\mathbb {P}}({\textbf{A}}=\alpha _j)=p_{ij}\) for each \(j=1,\dots ,M\).
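
To make the preprocessing concrete, the following is a minimal Python/NumPy sketch of this step (the paper’s experiments were run in MATLAB; the function name and the min-max rescaling convention are our own illustrative choices):

```python
import numpy as np

def to_probability_vectors(X):
    """Rescale each feature to [0, 1], then normalize each patient's
    predictor vector to a probability vector (cf. Eq. (2)).

    Illustrative sketch; assumes categorical predictors are already
    one-hot encoded so that X is an N x M numeric array.
    """
    X = np.asarray(X, dtype=float)
    # Rescale each feature (column) to [0, 1] using its observed range.
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    denom = np.where(col_max > col_min, col_max - col_min, 1.0)
    X01 = (X - col_min) / denom
    # Normalize each patient's predictor vector to sum to 1.
    row_sums = X01.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0          # guard against all-zero rows
    return X01 / row_sums                  # rows are probability vectors
```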

By the Hausdorff moment problem21, the probability distribution \({\textbf{p}}_i\) of \({\textbf{A}}\) is uniquely determined by the moment vector \({\textbf{m}}_i=(m_{i0},\dots ,m_{i,M-1})\in {\mathbb {R}}^M\), whose \(k^{\textrm{th}}\)-component is given by

\[ m_{ik} = \sum _{j=1}^{M}\alpha _j^{k}\,p_{ij}, \quad k=0,\dots ,M-1, \tag{3} \]
and referred to as the \(k^{\textrm{th}}\)-moment of the random variable \({\textbf{A}}\) with respect to the probability distribution \({\textbf{p}}_i\). Computationally, the moment vector \({\textbf{m}}_i\) can be easily obtained by \({\textbf{m}}_i={\textbf{p}}_i{\mathbb {M}}\), where

\[ {\mathbb {M}} = \begin{pmatrix} 1 & \alpha _1 & \alpha _1^2 & \cdots & \alpha _1^{M-1} \\ 1 & \alpha _2 & \alpha _2^2 & \cdots & \alpha _2^{M-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \alpha _M & \alpha _M^2 & \cdots & \alpha _M^{M-1} \end{pmatrix} \tag{4} \]
is the \(M\times M\) Vandermonde matrix generated by the vector \(\alpha =(\alpha _1,\dots ,\alpha _M)\) consisting of the possible values of the random variable \({\textbf{A}}\). The assumption that the \(\alpha _i\) are distinct guarantees \(\det ({\mathbb {M}})=\prod _{1\le k< l\le M}(\alpha _l-\alpha _k)\ne 0\), equivalently, the invertibility of \({\mathbb {M}}\). Therefore, as a map from \({\mathbb {R}}^M\) to \({\mathbb {R}}^M\) assigning to each probability distribution a moment vector, \({\mathbb {M}}\) is bijective, i.e., different probability distributions must be associated with different moment vectors; from the perspective of linear algebra, this also verifies the statement of the Hausdorff moment problem that \({\textbf{p}}_i\) is uniquely determined by \({\textbf{m}}_i\) and vice versa. Together with the fact that \({\textbf{p}}_i\) is the normalization of the predictor vector \({\textbf{x}}_i\) containing the medical records of the \(i^{\textrm{th}}\) patient, this observation establishes the moment vector \({\textbf{m}}_i\) as a natural feature for medical decision making tasks.

Moreover, because the normalization procedure in (2) endows each \({\textbf{p}}_i\) with the property \(\sum _{j=1}^Mp_{ij}=1\), if \(M-1\) components of \({\textbf{p}}_i\) are known, say the first \(M-1\) components \(p_{i1}\), \(\dots\), \(p_{i,M-1}\), then the remaining component can be explicitly calculated as \(p_{iM}=1-\sum _{j=1}^{M-1}p_{ij}\); i.e., an M-dimensional probability vector has only \(M-1\) degrees of freedom. Consequently, the normalized clinical dataset \(\{{\textbf{p}}_1,\dots ,{\textbf{p}}_N\}\) lies in a subset of \({\mathbb {R}}^M\) of dimension at most \(M-1\). In practice, then, the use of moments up to some order \(M'<M-1\) may be sufficient to make a strategic medical decision. In this case, the moment kernel defined in (4) becomes an M-by-\(M'\) matrix, which transforms high-dimensional predictor vectors to low-dimensional moment vectors while retaining all the information required for making a medical decision at a lower computational cost.
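
As a sketch of the (truncated) moment kernel operation \({\textbf{m}}_i={\textbf{p}}_i{\mathbb {M}}\), NumPy’s vander builds the Vandermonde matrix of (4) directly; the helper below, including its truncation argument, is an illustrative rendering rather than the authors’ implementation:

```python
import numpy as np

def moment_kernel(alpha, order=None):
    """M x order matrix whose (j, k) entry is alpha_j**k, k = 0, ..., order-1.

    order=None gives the full M x M Vandermonde matrix of Eq. (4);
    order=M' < M gives the truncated moment kernel described above.
    """
    alpha = np.asarray(alpha, dtype=float)
    order = alpha.size if order is None else order
    # Columns are increasing powers of alpha: alpha^0, alpha^1, ...
    return np.vander(alpha, N=order, increasing=True)

# Usage: with P an N x M matrix of probability vectors (one row per patient),
# moments up to order 5 give an N x 5 feature matrix:
#   moments = P @ moment_kernel(alpha, order=5)
```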

In addition, the moment kernel in (4) is independent of the data, and hence \({\textbf{A}}\) is a dummy random variable so that the sample space \(\Omega\), containing the M outcomes of \({\textbf{A}}\), is completely free to choose. To inform strategic medical decisions, we seek a construction of \(\Omega\) suitable for ranking the relative contribution of feature vectors, which we discuss in the next section.

Because each moment \(m_{ik}\) in (3) is a weighted sum of the normalized predictors \(p_{ij}\) with the weights \(\alpha _j^k\), the choice of the sample space \(\Omega =\{\alpha _1,\dots ,\alpha _M\}\) boils down to the determination of the weights. Naturally, predictors with larger weights carry greater importance in the decision making process.

To this end, we formulate the task of searching for \(\Omega\) as a feature importance ranking (FIR) problem22, assigning larger weights to more informative predictors. In particular, we apply the Hilbert Schmidt Independence Criterion (HSIC) Lasso algorithm, formulated as23

\[ \min _{\alpha \in {\mathbb {R}}^M} \ \frac{1}{2}\Big \Vert \bar{{\textbf{L}}}-\sum _{j=1}^{M}\alpha _j\bar{{\textbf{K}}}^{(j)}\Big \Vert _{\text {Frob}}^2 + \lambda \Vert \alpha \Vert _1, \quad \text {subject to } \alpha _1,\dots ,\alpha _M\ge 0, \tag{5} \]
where \(\Vert \cdot \Vert _{\text {Frob}}\) is the Frobenius norm of matrices, i.e., \(\Vert {\textbf{A}}\Vert _{\text {Frob}}=\sqrt{\sum _{i=1}^m\sum _{j=1}^nA_{ij}^2}\) for any \({\textbf{A}}\in {\mathbb {R}}^{m\times n}\) with (i, j)-entry \(A_{ij}\); \(\Vert \alpha \Vert _1=\sum _{i=1}^M|\alpha _i|\) is the \(\ell ^1\)-norm of the vector \(\alpha =(\alpha _1,\dots ,\alpha _M)\); and \(\lambda >0\) is a constant controlling the sparsity of the solution. Moreover, \(\bar{{\textbf{K}}}^{(j)} = \Gamma {\textbf{K}}^{(j)} \Gamma\) and \(\bar{{\textbf{L}}} = \Gamma {\textbf{L}} \Gamma\) are centered Gram matrices with the entries \({\textbf{K}}_{m,n}^{(j)} = k(p_{j,m},p_{j,n})\) and \({\textbf{L}}_{m,n} = l(y_m,y_n)\) defined by kernel functions k and l, where \(y_i\) denotes the class label of the \(i^{\textrm{th}}\) patient and \(\Gamma = {\textbf{I}}_N - \frac{1}{N} {\textbf{1}}_N {\textbf{1}}^{\top }_N\) is the centering matrix. Finally, for memory and computational efficiency, we use Block HSIC Lasso24 in our experiments.
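
A minimal sketch of the vanilla (non-block) HSIC Lasso in (5) follows, assuming Gaussian kernels for both k and l and using scikit-learn’s non-negative Lasso; note that scikit-learn scales the quadratic term by the number of rows, so its regularization parameter corresponds to a rescaled \(\lambda\), and the Block HSIC Lasso used in our experiments is far more memory-efficient than this dense version:

```python
import numpy as np
from sklearn.linear_model import Lasso

def gaussian_gram(v, sigma=1.0):
    """Gaussian-kernel Gram matrix of a 1-D sample vector (assumed kernel)."""
    d = np.subtract.outer(v, v)
    return np.exp(-d**2 / (2.0 * sigma**2))

def hsic_lasso_weights(P, y, lam=1e-3):
    """Vanilla HSIC Lasso (Eq. (5)) as a non-negative Lasso on vectorized,
    centered Gram matrices. P is N x M (rows: patients); y has length N.
    Returns the sparse non-negative weight vector alpha (length M).
    """
    N, M = P.shape
    Gamma = np.eye(N) - np.ones((N, N)) / N              # centering matrix
    L_bar = Gamma @ gaussian_gram(np.asarray(y, float)) @ Gamma
    # One vectorized, centered Gram matrix per feature.
    K_cols = np.column_stack(
        [(Gamma @ gaussian_gram(P[:, j]) @ Gamma).ravel() for j in range(M)]
    )
    model = Lasso(alpha=lam, positive=True, fit_intercept=False,
                  max_iter=10_000)
    model.fit(K_cols, L_bar.ravel())
    return model.coef_
```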

The pipeline of our feature extraction framework through moments is summarized in Fig. 1. The following case studies further show that classification using the moment vectors \({\textbf{m}}_i\) generated by the moment kernel in (4) performs comparably to classification using other features, with reduced computation time and increased robustness.

The pipeline for our feature extraction method through moments. HSIC Lasso is applied to the clinical data to obtain a feature importance score (weight) for each feature. The weights are then used to form the moment kernel defined in (4). The efficient representation \({\textbf{M}}\) of the original clinical data is then generated through the moment kernel operation. \({\textbf{M}}\) is then used in three machine learning algorithms: logistic regression (LR), neural networks (NNs), and gradient boosting machines (GBMs). The predictions of the machine learning algorithms in turn inform medical decision making.

We present three real-world case studies: informing decisions about breast cancer surgery, weight loss surgery, and liver transplant surgery using patients’ pre-surgical clinical data. Specifically, we use the breast cancer dataset from the UCI Machine Learning Repository25, a publicly available database with open access, for classification of breast cancer recurrence events. The Metabolic and Bariatric Surgery Accreditation and Quality Improvement Program (MBSAQIP) dataset26 is used for classification of the incidence of catastrophic events, including death, unplanned admission to the ICU, and at least one re-operation within 30 days after weight loss surgery, and the Organ Procurement and Transplantation Network (OPTN) dataset27 is used for classification of graft failure following liver transplant surgery. Both the MBSAQIP and OPTN datasets are publicly available and accessible upon request. The three datasets are briefly summarized in Fig. 2, and a more detailed summary of each dataset is included in the Supplemental Material.

All procedures included in this work were performed in accordance with the relevant regulations and guidelines, and all informed consents were obtained before admission to the respective medical institutes. The breast cancer dataset was collected from the University Medical Center, Institute of Oncology, Ljubljana, Yugoslavia. The MBSAQIP dataset is a Health Insurance Portability and Accountability Act (HIPAA)-compliant data file containing cases submitted to the MBSAQIP Data Registry; it contains patient-level, aggregate data and does not identify hospitals, health care providers, or patients. The OPTN dataset is collected via an online Web application. Transplant professionals from hospitals, histocompatibility (tissue typing) laboratories, and organ procurement organizations located across the United States use the application to manage their lists of waiting transplant candidates, access and complete electronic data collection forms, add donor information and run donor-recipient matching lists, and access various transplant data reports and policies. No organs/tissues were procured from prisoners.

For each dataset, we use an 80% training and 20% testing split and explore three feature engineering schemes, \({\textbf{M}}\), \({\textbf{X}}\), and \({\textbf{X}}(\alpha )\), for classification, where \({\textbf{M}}\) contains the features generated by the moment kernel in (4), \({\textbf{X}}\) denotes the preprocessed data (obtained by normalization and one-hot encoding), and \({\textbf{X}}(\alpha )\) consists of the features in \({\textbf{X}}\) after feature selection. Each set of features is used to train three classifiers, logistic regression (LR), artificial neural networks (NNs), and gradient-boosting machines (GBMs), and we examine the computation time and the area under the receiver operating characteristic (ROC) curve (AUC) for the testing data.
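
Putting the pieces together, one run of this pipeline (moment features \({\textbf{M}}\) with an LR classifier, AUC on the held-out 20%) can be sketched as follows, reusing the illustrative helpers from the earlier sketches; the NN and GBM classifiers are swapped in analogously:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate_moment_features(X, y, order=5, lam=1e-3):
    """Train and test the moment-feature scheme M (illustrative sketch)."""
    P = to_probability_vectors(X)                      # preprocessing
    P_tr, P_te, y_tr, y_te = train_test_split(
        P, y, test_size=0.2, stratify=y, random_state=0)
    alpha = hsic_lasso_weights(P_tr, y_tr, lam=lam)    # weights on train only
    K = moment_kernel(alpha, order=order)              # truncated kernel
    clf = LogisticRegression(max_iter=1000).fit(P_tr @ K, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(P_te @ K)[:, 1])
```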

As shown in Fig. 2, all three datasets are imbalanced, i.e., the class labels are significantly unevenly distributed. To address this issue, we adopt the following accommodations in the training phase (an illustrative Python sketch follows below). For LR, observation weights are added according to the class-imbalance ratio; for NNs, an error weight based on the class distribution is added to penalize misclassification of the minority label; for GBMs, we use RUSBoost28, a boosting method well known for its robustness to class imbalance, to learn from the skewed training data. In the testing phase, the AUC is naturally immune to class imbalance, so the reported numbers accurately reflect the performance of the classifiers.
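
In Python terms, these accommodations look roughly as follows; RUSBoostClassifier comes from the separate imbalanced-learn package, and the exact weighting schemes are illustrative rather than the paper’s MATLAB settings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight
from imblearn.ensemble import RUSBoostClassifier   # imbalanced-learn package

# LR: weight observations by inverse class frequency.
lr = LogisticRegression(class_weight="balanced", max_iter=1000)

# NN: penalize misclassifying the minority class via per-class loss weights;
# the inverse-frequency weights can be computed once and passed to the loss:
#   w = compute_class_weight("balanced", classes=np.unique(y), y=y)

# GBM: RUSBoost couples random under-sampling with boosting.
gbm = RUSBoostClassifier(n_estimators=100, random_state=0)
```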

Finally, we also test robustness to the number of samples, the number of features, noise, and missing values for each preprocessing scheme using five experiments on synthetically generated data: (a) noise-free data, (b) data with signal-to-noise ratio \(SNR = 20\), and (c)-(e) increasing amounts of missing data in the significant features. In each of these experiments, we have 10,000 samples (\(N=10{,}000\)) and 2,000 predictors (\(M=2{,}000\)), and among the 2,000 predictors only five are causal; that is, only five actually contribute to the output labels, while the other 1,995 features are randomly generated, independent of the output labels. The causal features are generated by two Gaussian distributions with means \((\mu _{i1},\mu _{i2}) = (0.3i, 0.7i)\), \(i = 1, \ldots , 5\), and the same variance \(\sigma ^2 = 1\), representing data from the two classes.
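
A sketch of this synthetic-data generation, as an illustrative NumPy rendering of the description above:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, n_causal = 10_000, 2_000, 5

y = rng.integers(0, 2, size=N)            # binary class labels
X = rng.random((N, M))                    # 1,995 label-independent predictors
for i in range(1, n_causal + 1):          # 5 causal predictors
    mu = np.where(y == 0, 0.3 * i, 0.7 * i)       # class-dependent means
    X[:, i - 1] = rng.normal(loc=mu, scale=1.0)   # variance sigma^2 = 1
```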

The order of moments is a tunable hyper-parameter, which, in this work, is chosen to optimize the AUC via cross-validation for each of the ML algorithms. Note that increasing the number of moments does not necessarily improve classification performance, as using more moments (features) may lead to a higher chance of overfitting. To illustrate this idea, in Fig. 5, we use another synthetic dataset with 100 observations and 100 predictors. The AUC results are given using \({\textbf{M}}\) with moment orders taken up to 20.
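
The order selection itself can be sketched as a small cross-validated sweep, again reusing the illustrative moment_kernel helper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def select_moment_order(P, y, alpha, max_order=20, cv=10):
    """Pick the moment order that maximizes cross-validated AUC (sketch)."""
    scores = [
        cross_val_score(LogisticRegression(max_iter=1000),
                        P @ moment_kernel(alpha, order=k), y,
                        cv=cv, scoring="roc_auc").mean()
        for k in range(1, max_order + 1)
    ]
    return int(np.argmax(scores)) + 1, scores
```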

All the case studies were executed in MATLAB on a Windows 10 operating system with an Intel i5-7600K 3.80 GHz CPU and 16 GB of RAM.

Summary description of the datasets.

The AUC results for the three real-world datasets are shown in Fig. 3, along with a summary of the results and a comparison to other published works in Table 1. We determine the number of features required for \({\textbf{M}}\) by optimizing the AUC via cross-validation. For the breast cancer dataset, the percentage of training time saved using \({\textbf{M}}\) compared to \({\textbf{X}}\) is highest for LR (64%), followed by NN (51%) and GBM (19%). Our preprocessing scheme \({\textbf{M}}\) is also more efficient than \({\textbf{X}}(\alpha )\). We observe that across all three algorithms, \({\textbf{M}}\) generated AUC scores of 0.75, 0.65, and 0.70 for the LR, NN, and GBM models, as high as or higher than the next-best-performing method \({\textbf{X}}(\alpha )\). Using \({\textbf{M}}\) also required only 5 features, compared to 38 and 25 for \({\textbf{X}}\) and \({\textbf{X}}(\alpha )\). Compared to a published work29 that reports a best accuracy of 72.7% using C4.5 decision trees, which split on the information gain of the raw features, our method gives an improved accuracy of 75%.

For the MBSAQIP dataset, the time required to train using \({\textbf{M}}\) is substantially lower than with both \({\textbf{X}}\) and \({\textbf{X}}(\alpha )\). The percentage of time saved with \({\textbf{M}}\) compared to \({\textbf{X}}\) is 99% for LR, 91% for NN, and 31.4% for GBM. We also observe comparable AUC testing performance using \({\textbf{M}}\). Using LR as the ML model, the AUC when using \({\textbf{M}}\) is second-highest, compared to 0.75 when using \({\textbf{X}}(\alpha )\). For NN and GBM, the AUC when using \({\textbf{M}}\) is slightly lower than with \({\textbf{X}}(\alpha )\) (0.63 and 0.70 compared to 0.68 and 0.75), but the testing accuracy is nearly the same (75% and 74% compared to 82% and 84%). We also observe that the preprocessed data \({\textbf{M}}\) used only 30 features, compared to 163 and 131 for \({\textbf{X}}\) and \({\textbf{X}}(\alpha )\). In previous work30, the authors integrated multiple machine learning models for the classification task; in essence, this ensemble is also a decision-tree-based algorithm, in which enhanced features produced by the constituent machine learning models are combined.

Lastly, for the liver transplant dataset, the percentage of time saved using \({\textbf{M}}\) is largest for LR (98%), followed by NN (59%) and then GBM (17%). Across all three ML algorithms, we observe similar AUCs for each preprocessing scheme: \({\textbf{M}}\) generates AUC scores of 0.63, 0.59, and 0.63 for LR, NN, and GBM, compared to the highest AUCs of 0.66, 0.60, and 0.65 found with the other preprocessing schemes for the same models. As with the other datasets, using \({\textbf{M}}\) as the preprocessing scheme required only 20 features, compared to 99 and 91 features for \({\textbf{X}}\) and \({\textbf{X}}(\alpha )\). In a published work31 using a deep neural network on pre-surgical data, the authors included 202 features, even after feature selection.

To further compare the classification performance of the proposed MKM framework with other feature selection methods, in Table 2 we report the AUC performance obtained by applying several widely used feature selection methods, including the Chi-Square test32, minimum redundancy maximum relevance (MRMR)33, neighborhood component analysis (NCA)34, correlation-based feature selection (CFS), and BorutaShap35. In the table, the M features with the highest importance weights are retained to construct \({\textbf{X}}(\alpha )\) under each feature selection method. We observe that \({\textbf{M}}\) remains competitive with all other feature selection methods in terms of both classification performance and model parsimony.

In addition to \({\textbf{X}}(\alpha )\), we also compare the classification results obtained by using \({\textbf{M}}\) generated from the aforementioned feature selection methods. As before, we keep 5, 20, and 30 moments for the breast cancer, liver transplant, and MBSAQIP datasets, respectively, and perform 10-fold cross-validation for testing. The results are shown in Fig. 1 of the Supplemental Material, from which we observe that in most cases, different feature selection methods for generating \({\textbf{M}}\) result in similar performance, with the proposed HSIC Lasso holding a slight advantage over the other methods. This illustrates the robustness of the feature extraction framework based on the notion of moments: the extracted \({\textbf{M}}\) remains competitive in both model parsimony and classification performance regardless of the feature selection method used.

ROC curve for the classification task using (1) \({\textbf{X}}\): the original dataset, (2) \({\textbf{M}}\): the reduced dataset, and (3) \({\textbf{X}}(\alpha )\): the dataset that contains only features identified by HSIC Lasso. The proposed method is capable of selecting non-redundant features that offset the influence of noise.

In all of the analyses using the synthetic dataset, we chose HSIC Lasso as the feature selection method, given its robust performance across the different real-world datasets, to investigate the computational efficiency of the MKM framework. As summarized in Table 3, for our synthetic dataset with \(N=10{,}000\) observations and \(M=2{,}000\) predictors, in all of the scenarios studied, using \({\textbf{M}}\) consistently outperformed \({\textbf{X}}\) and \({\textbf{X}}(\alpha )\) in both the AUC score and the time required for training. The confidence intervals are also tighter in the \({\textbf{M}}\) cases.

The runtime performance on the synthetic dataset is summarized in Fig. 4, whose panel (a) shows the memory usage (in MB) for each preprocessing scheme. Overall, using \({\textbf{M}}\) performs as well as or better than \({\textbf{X}}\) and \({\textbf{X}}(\alpha )\) in both runtime and memory usage. Moreover, as shown in Fig. 4b, as the number of samples increases from 100 to 10,000, using \({\textbf{M}}\) not only guarantees the shortest running time but also keeps the running time nearly constant. The situation remains the same when the number of features increases, as illustrated in Fig. 4c.

The effect of moment order on the classification performance is summarized in Fig. 5. We observe that moments up to order 5 generate the best classification performance while adding orders higher than 5 gives sub-optimal performances. In this example, higher orders are redundant for approximating the original distribution of the datasets and result in overfitting.

(a) Computational resources required by \({\textbf{X}}\), \({\textbf{X}}(\alpha )\) and \({\textbf{M}}\) for 10 bootstraps. (Left) Time elapsed during the training and test process. (Right) Memory used under a single-core scenario. (b) Runtime versus the number of samples using \({\textbf{X}}\), \({\textbf{X}}(\alpha )\) and \({\textbf{M}}\). The mean AUC score is calculated by averaging all of the bootstrap cases. The zoomed-in figure shows only \({\textbf{M}}\) and \({\textbf{X}}(\alpha )\), from which we can still observe that the runtime of \({\textbf{M}}\) does not increase with the sample size. (c) Runtime versus the number of features using \({\textbf{X}}\), \({\textbf{X}}(\alpha )\) and \({\textbf{M}}\). The mean AUC score is calculated by averaging all of the bootstrap cases. The zoomed-in figure shows \({\textbf{M}}\) and \({\textbf{X}}(\alpha )\); both are insensitive to the increasing number of features, but \({\textbf{M}}\) clearly has an edge over \({\textbf{X}}(\alpha )\) in runtime.

Number of moment terms vs. AUC performance with 95% confidence intervals for the synthetic dataset. In particular, order-5 moments give the optimal AUC performance of 0.9997, and increasing the moment order beyond 5 compromises the AUC performance due to overfitting.

We present a feature extraction framework utilizing the notion of moments to construct a low-dimensional representation of the original high-dimensional dataset. The strength of this framework lies in its time and memory efficiency over the traditional HSIC Lasso method that is widely used for feature selection from large datasets. Through the moment kernel representation, in which zero-weighting the unimportant features and amplifying the useful ones are combined in one simple operation, we enhance the resistance of ML algorithms to noise and missing data, both common problems in real-world datasets. The training time for an ML model can be reduced by up to 99%, depending on the model used, without compromising model performance.

It is common to treat each medical predictor as a random variable for leveraging its statistical properties, e.g., the probability distribution, analyzed over the sample space of the target population, e.g., patients diagnosed as suffering from some specific diseases, to assist medical decision making36,37. In our framework, the roles of patients and predictors are reversed: we formulate each patient as a probability distribution over the sample space consisting of the medical predictors. As a result, the analysis becomes patient-centered, and more importantly, our framework then gives rise to a patient-first and personalized medical decision making workflow.

The decisive factors affecting medical decisions in our model are the moment vectors output from the moment kernel in (4). As defined in (3), moment vectors are obtained by taking expectations, i.e., averages, with respect to the probability distributions representing patients’ medical records. This indicates that decisions made by our machine are based on patients’ overall health conditions, evaluated comprehensively across all the medical predictors. Moreover, the integration of FIR into the feature extraction (moment vector computation) procedure further increases the influence of a small number of crucial predictors on the medical decisions. From the perspective of ML theory, this effectively avoids overfitting, so our algorithms are expected to be more generalizable, as we explain in more detail in the following section.

Our results demonstrate that using the moment kernel requires significantly fewer features to achieve comparable or better performance than other feature extraction methods. This demonstrates the effectiveness of moments in representing the structure of the original datasets, the main reason for which is their averaging nature. Recall from the definition in (3) that each moment is a weighted sum of the normalized predictors of a patient, and hence depends on all the data collected for that patient, reflecting the overall structure of the data. On the other hand, as validated by the Hausdorff moment problem mentioned in the Methods section21, the collection of all the moments also characterizes the data completely, not merely on average. As a result, a small number of moments documenting enough information for the classification task is already sufficient. In addition, HSIC Lasso outputs a sparse vector \(\alpha\), which further reduces the dimension of the moment kernel. These characteristics make the developed moment kernel-based learning framework a parsimonious model with great computational efficiency and generalizability, as discussed below.

Across all the case studies, classification using moment features is either comparable to or outperforms classification using other features, regardless of the choice of classifier. In part, the robustness of our method across different datasets is the manifestation of its generalizability in ML terminology. In particular, deep NNs are famous for their extraordinary power to derive conclusions from complex datasets even without any feature selection, i.e., the preprocessing scheme \({\textbf{X}}\) in our notation. However, for all three datasets, vanilla LR with \({\textbf{M}}\) can outperform NN. Together with the robustness to different classification algorithms, the developed moment method has the potential to serve as a universal ML-assisted medical decision making workflow.

Another significant advantage of our method is its robustness to noise and data loss. As illustrated in Table 3, as the degree of data loss increases and noise is introduced, the performance drops, yet using \({\textbf{M}}\) consistently produces the best results compared to \({\textbf{X}}\) and \({\textbf{X}}(\alpha )\). This, too, is due to the averaging operation in the computation of moments, which smooths the data and is hence able to neutralize the effect of noise and compensate for missing values.

Unarguably, the most significant advantage of utilizing moment features is the high computational efficiency, which is mainly due to the parsimony of the moment-based model. In most of the case studies, using moments as features dramatically reduces the model training time; e.g., for the MBSAQIP and liver transplant datasets, the time to train LR is reduced by 99% and 98%, respectively.

On the other hand, for the synthetic data, in addition to a short running time comparable with that of \({\textbf{X}}(\alpha )\), \({\textbf{M}}\) also achieves the lowest memory consumption. Moreover, we observe that the rates of increase in running time for \({\textbf{M}}\), as well as for \({\textbf{X}}(\alpha )\), are remarkably small as the numbers of samples and features increase, and \({\textbf{M}}\) remains the model with the shortest running time. In particular, for \({\textbf{M}}\), the increase in runtime with respect to the number of dependent features is almost zero, even with the number of causal features increased by a factor of 100. In summary, using moment features for learning tasks does not incur a computational burden, which further supports the suitability of \({\textbf{M}}\) for tackling large-scale complex medical datasets.

Note that because each moment mixes all the predictors through the weights \(\alpha _i\) output from the importance ranking algorithm HSIC Lasso, the importance of individual predictors to the medical decisions may not be directly recognizable from the moments, which constitutes a possible limitation of the proposed moment kernel machine.

We develop a moment kernel machine to extract features for predicting surgical outcomes using existing clinical data to inform decision making. The kernel is constructed through the notion of moments, which is capable of transforming complicated clinical data to compact and meaningful representations while retaining information crucial to medical decision making. In particular, the developed machine not only provides informative predictions for medical decision-making, but also is preferable to existing methods in terms of computational efficiency, model parsimony, and robustness to noise. Finally, this moment kernel machine has the potential to be personalized based on the specific requirements of patients and physicians, which is a significant development in ML-aided decision making methods in medicine.

1. O’Donnell, F. T. Preoperative evaluation of the surgical patient (2016).

2. King, M. S. Preoperative evaluation (2000).

3. Xue, B. et al. Use of machine learning to develop and evaluate models using preoperative and intraoperative data to identify risks of postoperative complications. JAMA Netw. Open 4, e212240. https://doi.org/10.1001/jamanetworkopen.2021.2240 (2021).

4. Chiew, C. J., Liu, N., Wong, T. H., Sim, Y. E. & Abdullah, H. R. Utilizing machine learning methods for preoperative prediction of postsurgical mortality and intensive care unit admission (2020).

5. Wu, J., Roy, J. & Stewart, W. F. Prediction modeling using EHR data. Medical Care 48, S106–S113. https://doi.org/10.1097/mlr.0b013e3181de9e17 (2010).

6. Abraham, C. R. et al. Predictors of hospital readmission after bariatric surgery. J. Am. Coll. Surg. 221, 220–227. https://doi.org/10.1016/j.jamcollsurg.2015.02.018 (2015).

7. Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, 785–794. https://doi.org/10.1145/2939672.2939785 (ACM, New York, NY, USA, 2016).

8. Nudel, J. et al. Development and validation of machine learning models to predict gastrointestinal leak and venous thromboembolism after weight loss surgery: An analysis of the MBSAQIP database. Surg. Endosc. 35, 182–191. https://doi.org/10.1007/s00464-020-07378-x (2021).

9. van der Laan, M. J., Polley, E. C. & Hubbard, A. E. Super learner. Stat. Appl. Genet. Mol. Biol. https://doi.org/10.2202/1544-6115.1309 (2007).

10. Torquati, M. et al. Using the super learner algorithm to predict risk of 30-day readmission after bariatric surgery in the United States. Surgery https://doi.org/10.1016/j.surg.2021.06.019 (2021).

11. Litjens, G. et al. A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88. https://doi.org/10.1016/j.media.2017.07.005 (2017).

12. Schwyzer, M. et al. Automated detection of lung cancer at ultralow dose PET/CT by deep neural networks—Initial results. Lung Cancer 126, 170–173. https://doi.org/10.1016/j.lungcan.2018.11.001 (2018).

13. Tajbakhsh, N. et al. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE Trans. Med. Imaging 35, 1299–1312. https://doi.org/10.1109/tmi.2016.2535302 (2016).

14. Liu, Y. et al. A deep learning system for differential diagnosis of skin diseases. Nat. Med. 26, 900–908. https://doi.org/10.1038/s41591-020-0842-3 (2020).

15. Kwon, J.-M., Kim, K.-H., Jeon, K.-H. & Park, J. Deep learning for predicting in-hospital mortality among heart disease patients based on echocardiography. Echocardiography 36, 213–218. https://doi.org/10.1111/echo.14220 (2019).

16. Arsanjani, R. et al. Prediction of revascularization after myocardial perfusion SPECT by machine learning in a large population. J. Nucl. Cardiol. 22, 877–884. https://doi.org/10.1007/s12350-014-0027-x (2015).

17. Xue, Y., Du, N., Mottram, A., Seneviratne, M. & Dai, A. M. Learning to select best forecast tasks for clinical outcome prediction. In Advances in Neural Information Processing Systems Vol. 33 (eds Larochelle, H. et al.) 15031–15041 (Curran Associates, Inc., 2020).

18. Ross, M. K., Wei, W. & Ohno-Machado, L. “Big data” and the electronic health record. Yearb. Med. Inform. 23, 97–104. https://doi.org/10.15265/iy-2014-0003 (2014).

19. Xiao, C., Choi, E. & Sun, J. Opportunities and challenges in developing deep learning models using electronic health records data: A systematic review. J. Am. Med. Inform. Assoc. 25, 1419–1428. https://doi.org/10.1093/jamia/ocy068 (2018).

20. Guo, C. & Berkhahn, F. Entity embeddings of categorical variables. Preprint at arXiv:1604.06737 (2016).

21. Hausdorff, F. Momentprobleme für ein endliches Intervall. Math. Z. 16, 220–248 (1923).

22. Guyon, I. & Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003).

23. Yamada, M., Jitkrittum, W., Sigal, L., Xing, E. P. & Sugiyama, M. High-dimensional feature selection by feature-wise kernelized lasso. Neural Comput. 26, 185–207. https://doi.org/10.1162/neco_a_00537 (2014).

24. Climente-González, H., Azencott, C.-A., Kaski, S. & Yamada, M. Block HSIC Lasso: Model-free biomarker detection for ultra-high dimensional data. Bioinformatics 35, i427–i435. https://doi.org/10.1093/bioinformatics/btz333 (2019).

25. Dua, D. & Graff, C. UCI machine learning repository (2017).

26. The Metabolic and Bariatric Surgery Accreditation and Quality Improvement Program (2017).

27. Organ Procurement and Transplantation Network. Simultaneous liver-kidney allocation 2016 (2016).

28. Seiffert, C., Khoshgoftaar, T. M., Van Hulse, J. & Napolitano, A. RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 40, 185–197. https://doi.org/10.1109/TSMCA.2009.2029559 (2010).

29. Esmeir, S. & Markovitch, S. Lookahead-based algorithms for anytime induction of decision trees. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML ’04, 33. https://doi.org/10.1145/1015330.1015373 (Association for Computing Machinery, New York, NY, USA, 2004).

30. Torquati, M. et al. Using the super learner algorithm to predict risk of 30-day readmission after bariatric surgery in the United States. Surgery https://doi.org/10.1016/j.surg.2021.06.019 (2021).

31. Ershoff, B. D. et al. Training and validation of deep neural networks for the prediction of 90-day post-liver transplant mortality using UNOS registry data. Transpl. Proc. 52, 246–258. https://doi.org/10.1016/j.transproceed.2019.10.019 (2020).

32. Pearson, K. X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Lond. Edinb. Dublin Philos. Mag. J. Sci. 50, 157–175. https://doi.org/10.1080/14786440009463897 (1900).

33. Ding, C. & Peng, H. Minimum redundancy feature selection from microarray gene expression data. J. Bioinform. Comput. Biol. 03, 185–205. https://doi.org/10.1142/S0219720005001004 (2005).

34. Yang, W., Wang, K. & Zuo, W. Neighborhood component feature selection for high-dimensional data. JCP 7, 161–168. https://doi.org/10.4304/jcp.7.1.161-168 (2012).

35. Keany, E. BorutaShap: A wrapper feature selection method which combines the Boruta feature selection algorithm with Shapley values. Zenodo https://doi.org/10.5281/zenodo.4247618 (2020).

36. Vellido, A., Lisboa, P. J. & Vicente, D. Robust analysis of MRS brain tumour data using t-GTM. Neurocomputing 69, 754–768. https://doi.org/10.1016/j.neucom.2005.12.005 (2006). New Issues in Neurocomputing: 13th European Symposium on Artificial Neural Networks.

37. Christopher, J. J., Nehemiah, H. K., Arputharaj, K. & Moses, G. L. Computer-assisted medical decision-making system for diagnosis of urticaria. MDM Policy & Practice 1, 2381468316677752. https://doi.org/10.1177/2381468316677752 (2016). PMID: 30288410.

This collaboration was supported by the NIH grants R01 CA253475, U01 CA265735, and R21 DK110530.

Department of Electrical and Systems Engineering, Washington University in St. Louis, St. Louis, MO, 63130, USA

Yao-Chi Yu, Wei Zhang & Jr-Shin Li

Division of Computational and Data Sciences, Washington University in St. Louis, St. Louis, MO, 63130, USA

David O’Gara & Jr-Shin Li

Division of Biology and Biomedical Sciences, Washington University in St. Louis, St. Louis, MO, 63130, USA

Jr-Shin Li

Division of Public Health Sciences, Department of Surgery, Washington University School of Medicine, St. Louis, MO, 63110, USA

Su-Hsin Chang

J.-S.L. and S.-H.C. designed the project. Y.-C.Y., W.Z., and J.-S.L. developed the Moment Kernel Machine methods; Y.-C.Y., D.O.G., and W.Z. analyzed the data; Y.-C.Y. conducted the numerical experiments and created figures; Y.-C.Y., W.Z., D.O.G., J.-S.L., and S.-H.C. wrote the main manuscript text. S.-H.C. obtained funding and provided statistical expertise; S.-H.C. and J.-S.L. critically reviewed and revised the manuscript, and supervised the study.

Correspondence to Jr-Shin Li or Su-Hsin Chang.

The authors declare no competing interests.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Yu, YC., Zhang, W., O’Gara, D. et al. A moment kernel machine for clinical data mining to inform medical decision making. Sci Rep 13, 10459 (2023). https://doi.org/10.1038/s41598-023-36752-7

Received: 29 October 2022

Accepted: 09 June 2023

Published: 28 June 2023

DOI: https://doi.org/10.1038/s41598-023-36752-7
