Recruitment of participants
From the electronic medical records system of the Rheumatology and Immunology Department at the Second Hospital of Shanxi Medical University, we retrieved data on patients with lupus nephritis (LN) who were treated between July 2015 and November 2016. Basic information about the study subjects and the initial laboratory test results for all patients were entered into the medical record system within 72 h of admission. Experienced rheumatologists screened eligible patients according to the study criteria and reviewed their medical records in detail. A total of 183 patients were analyzed, including 160 females and 23 males. During data collection, physicians documented patients’ demographics, clinical symptoms, laboratory test results, sites of infection, medication history, and related information (Supplementary Table 1). If any examination results were missing from a patient’s record, research assistants contacted the hospital laboratory to retrieve them, so that the data remained complete and accurate. All patients met the American College of Rheumatology (ACR) and European League Against Rheumatism/European Renal Association-European Dialysis and Transplant Association (EULAR/ERA-EDTA) guideline criteria18,19. Additionally, we recruited 206 healthy individuals, matched by age and sex, from the Physical Examination Center of the Second Hospital to serve as the healthy control (HC) group.
The exclusion criteria were as follows: incomplete clinical data (missing data for characteristics accounting for more than 20% of the total sample); age under 18 years; a diagnosis of another connective tissue disease; malignant tumors, immunodeficiency, or severe cardiopulmonary insufficiency; a history of drug allergies or mental illness; recent digestive endoscopy or surgery; and pregnancy or lactation. All participants provided informed consent, and the Clinical Research Ethics Committee of the Second Hospital of Shanxi Medical College (Taiyuan, China, 2017-KY-004) approved the study. All methods were performed in accordance with the relevant guidelines.
Definition of infection
We used several methods to determine whether an infectious disease caused by bacteria or viruses was present, including a review of the patient’s medical history, a physical examination, and ancillary examinations. Infection was confirmed by a positive pathogen test on specimens such as blood, sputum, pus, stool, or urine, or by conclusive evidence of infection, such as an abscess found on a computed tomography scan. Furthermore, a fever (body temperature exceeding 38.0 °C) was considered an infection if it lasted for at least 3 days and was effectively reversed by anti-infective treatment. However, infection was not recorded if there was no evidence to support it or if the cause of the current symptoms was in doubt.
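The definition above can be read as a simple decision rule. The following is a minimal sketch of that rule, not the authors' procedure: the function name and all parameter names are hypothetical, and the choice to let doubt about the cause override every other finding is an assumption about how the criteria combine.

```python
def meets_infection_definition(
    positive_pathogen_test=False,       # positive test on blood, sputum, pus, stool, urine, etc.
    conclusive_evidence=False,          # e.g. an abscess found on a CT scan
    max_temperature_c=None,             # highest recorded body temperature in deg C
    fever_days=0,                       # duration of the fever in days
    resolved_by_anti_infectives=False,  # fever reversed after anti-infective treatment
    cause_in_doubt=False,               # cause of current symptoms is uncertain
):
    """Return True if the findings satisfy the infection definition (illustrative)."""
    # Assumption: doubt about the cause of symptoms means no infection is recorded.
    if cause_in_doubt:
        return False
    # A positive pathogen test or conclusive evidence confirms infection.
    if positive_pathogen_test or conclusive_evidence:
        return True
    # Otherwise, fever above 38.0 deg C for at least 3 days that responds
    # to anti-infective treatment is counted as infection.
    has_fever = max_temperature_c is not None and max_temperature_c > 38.0
    return has_fever and fever_days >= 3 and resolved_by_anti_infectives
```

For example, a patient with no positive culture but a temperature of 38.6 °C for 3 days that resolved under anti-infective treatment would be counted as infected under this reading, while a temperature of exactly 38.0 °C would not.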
The method of model establishment
We used the Python programming language (Python Software Foundation, version 3.6) for data analysis. The samples were randomly divided into a training group and a test group; models were built on the training group and then verified on the test group (Supplementary Tables 3 and 4). Variables for predicting infection in LN were selected in the training group, and eight machine learning (ML) models were trained: logistic regression (LR), decision tree (DT), k-nearest neighbor (KNN), support vector machine (SVM), multilayer perceptron (MLP), random forest (RF), adaptive boosting (Ada), and extreme gradient boosting (XGB) (Supplementary Table 2). Before modeling, the independent variables were standardized so that they were measured on a consistent scale, and missing data were imputed by multiple imputation. The parameters of each model were tuned manually. To select the smallest feature subset with optimal performance, we employed a random forest-based sequential forward selection algorithm. The algorithm adds one feature at a time to the feature subset, iteratively generates a new model, and evaluates its performance by the F1 score. The F1 score is a combined index of precision and recall (their harmonic mean); a higher F1 score indicates better model performance on both. When the F1 score of the feature subset reached its optimal value, the iteration was stopped and the smallest feature subset with optimal performance was selected. The scikit-learn package (https://github.com/scikit-learn/scikit-learn) was used for ML20. The data processing and model establishment workflow is presented in Supplementary Fig. 1.
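The greedy search behind sequential forward selection can be sketched in a few lines. The block below is an illustration under stated assumptions, not the authors' code: `sequential_forward_selection`, `score_fn`, `toy_score`, and the feature names are hypothetical. In the setting described above, `score_fn` would train a random forest on the candidate subset and return its F1 score; here a toy scoring function with made-up per-feature gains stands in so that the example runs without any dependencies.

```python
def sequential_forward_selection(features, score_fn, tol=1e-9):
    """Greedily add the feature that most improves score_fn; stop when nothing helps.

    features: list of candidate feature names.
    score_fn: callable taking a list of feature names and returning a score
              (higher is better), e.g. the F1 score of a model trained on them.
    Returns (selected_features, best_score).
    """
    selected = []
    remaining = list(features)
    best_score = float("-inf")
    while remaining:
        # Score every subset formed by adding one unused feature.
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        cand_score, cand = max(scored, key=lambda pair: pair[0])
        # Stop as soon as the best candidate no longer improves the score.
        if cand_score <= best_score + tol:
            break
        selected.append(cand)
        remaining.remove(cand)
        best_score = cand_score
    return selected, best_score


# Toy stand-in for "train a model on this subset and return its F1 score".
# The gains and the size penalty are invented purely for illustration.
TOY_GAINS = {"feat_a": 0.30, "feat_b": 0.20, "feat_c": 0.005, "feat_d": 0.0}

def toy_score(subset):
    return 0.40 + sum(TOY_GAINS[f] for f in subset) - 0.01 * len(subset)


chosen, score = sequential_forward_selection(list(TOY_GAINS), toy_score)
print(chosen, round(score, 3))
```

With the toy scorer, the search keeps the two informative features and stops when adding a third would lower the score. scikit-learn (version 0.24 and later) provides `SequentialFeatureSelector` for a comparable greedy search; whether the authors used it is not stated in the text.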
Statistical analysis and model evaluation
To evaluate the prediction models, we computed the confusion matrix for each model and visualized it with the Matplotlib package. We then compared multiple evaluation metrics, including the area under the receiver operating characteristic (ROC) curve (AUC), accuracy, recall, precision, and F1 score. The formulas for these metrics are given in Supplementary Table 5. An effective model should perform well on both the training group and the test group. The closer the ROC curve is to the upper left corner, that is, the closer the AUC is to 1, the better the model discriminates between classes. Finally, through a comprehensive comparison of these evaluation metrics, we identified the best model for predicting whether LN was complicated by infection. Statistical analysis was performed with SPSS 26.0 software. Categorical demographic characteristics were compared using the χ² test. Continuous data that satisfied normality and homogeneity of variance were expressed as mean ± standard deviation; the independent-samples t-test was used to compare two groups, and one-way analysis of variance (ANOVA) was used to compare multiple groups. Data that did not meet normality or homogeneity of variance were expressed as median (range), and the Mann–Whitney U test was used for comparison between groups. Correlation analysis used the Spearman correlation test. All statistical tests were two-sided, and P < 0.05 was considered statistically significant.
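For readers without access to Supplementary Table 5, the evaluation metrics named above can be sketched from their standard definitions. This is an illustrative, dependency-free sketch: the function names are hypothetical, labels are assumed to be coded 1 (infection) and 0 (no infection), and the AUC is computed by the pairwise-ranking formulation rather than by tracing the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 1 and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 score from the confusion matrix."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """AUC as the probability that a positive case outranks a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Calling `classification_metrics` on the training-group and test-group predictions separately makes it easy to check that a model performs comparably on both, which is the criterion stated above for an effective model.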
Ethics statement
The studies involving human participants were reviewed and approved by the Clinical Research Ethics Committee at the Second Hospital of Shanxi Medical College and West China Hospital College. All methods were performed in accordance with the relevant guidelines.
- Source: https://www.nature.com/articles/s41598-024-59717-w