Aim
This study aimed to develop an efficient, low-cost approach to diagnosing common respiratory viruses using natural language processing and explainable machine learning models.
Subject and methods
We collected 11,863 influenza-like illness cases from Huzhou City, China, covering four common respiratory viruses: SARS-CoV-2, influenza, respiratory syncytial virus (RSV), and adenovirus. Natural language processing techniques were used to extract and normalize symptom features from unstructured clinical text. Five machine learning algorithms were compared on area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, and specificity, and the best-performing model was selected. Subgroup analyses by age, sex, and fever status assessed model robustness, and SHAP (SHapley Additive exPlanations) values were calculated for interpretability.
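As an illustration of the four evaluation metrics named above, the following is a minimal sketch rather than the study's actual pipeline. The function names `auc` and `threshold_metrics`, the 0.5 decision threshold, and the toy labels and scores are all assumptions introduced here for illustration; they are not taken from the study.

```python
from typing import Dict, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def threshold_metrics(
    labels: Sequence[int], scores: Sequence[float], threshold: float = 0.5
) -> Dict[str, float]:
    """Accuracy, sensitivity, and specificity at a fixed score threshold.
    The 0.5 default is an illustrative assumption, not the study's cutoff."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("Both classes must be present")
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Toy data (hypothetical): 1 = virus detected, 0 = not detected.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]

print(auc(labels, scores))
print(threshold_metrics(labels, scores))
```

AUC is threshold-independent, whereas accuracy, sensitivity, and specificity depend on the chosen cutoff, which is one reason the four metrics are typically reported together when comparing candidate models.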
Results
Compared with existing diagnostic tools, our model demonstrated higher accuracy and better predictive performance, with AUCs of 0.856 (95% CI: 0.830-0.881) for SARS-CoV-2, 0.737 (95% CI: 0.713-0.760) for influenza, 0.801 (95% CI: 0.744-0.857) for RSV, and 0.782 (95% CI: 0.748-0.816) for adenovirus. Discrimination was strongest for SARS-CoV-2 and RSV. In subgroup analyses, discriminative accuracy was especially high in pediatric patients and in afebrile patients.
Conclusions
This study demonstrates the feasibility of integrating natural language processing and machine learning to identify respiratory viruses from symptoms alone. The approach offers a low-cost, efficient alternative to PCR testing that can reduce reliance on resource-intensive testing and support early detection, screening, and resource allocation in both clinical and public health settings.