AI’s ability to detect COVID-19 from coughs faces real-world challenges



A recent study published in Nature Machine Intelligence investigated the performance of an audio-based artificial intelligence (AI) classifier that predicts severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection status. SARS-CoV-2 is the cause of the coronavirus disease 2019 (COVID-19) pandemic.

Study: Audio-based AI classifiers show no evidence of improved COVID-19 screening over simple symptom checkers. Image credit: Aliaksandra Post / Shutterstock

Background

Since SARS-CoV-2 infection can be either symptomatic or asymptomatic, accurate tests are needed to avoid unnecessarily quarantining the general population. Previous studies have reported that AI-based classifiers trained on respiratory audio data can detect SARS-CoV-2 infection status.

Although these studies suggest that AI-based classifiers are effective, many challenges arise when applying them in real-world settings. Factors that have held back their application include sampling bias, unverified data on participants' COVID-19 status, and the delay between testing and audio recording. It is therefore imperative to determine whether audio biomarkers of COVID-19 are genuinely specific to SARS-CoV-2 infection or merely reflect confounding signals.

About the study

The current study focused on determining whether audio-based classifiers can be accurately used for COVID-19 screening. A large-scale dataset of polymerase chain reaction (PCR) results linked to recordings collected for audio-based COVID-19 screening (ABCS) was used. Participants were recruited through the Real-Time Assessment of Community Transmission (REACT) programme and the National Health Service (NHS) Test-and-Trace (T+T) service. All relevant demographic data were extracted from T+T/REACT records.

Participants were asked to complete survey questions and record four audio clips: a specific read sentence, three successive exhalations producing a "ha" sound, a single forced cough, and three forced coughs in succession. All recordings were saved in .wav format. The quality of the recordings was assessed, and 5,157 recordings were removed for quality-related problems.
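The article does not describe how recording quality was judged, so the sketch below is purely hypothetical: a minimal quality gate that rejects clips that are too short or effectively silent. The thresholds and the assumption of 16-bit PCM .wav input are illustrative, not the study's criteria.

```python
import wave
import numpy as np

def passes_quality_check(path: str,
                         min_seconds: float = 0.5,
                         min_rms: float = 50.0) -> bool:
    """Hypothetical gate: reject clips that are too short or near-silent.
    Assumes 16-bit PCM .wav input; thresholds are illustrative."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        channels = w.getnchannels()
        samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    if samples.size == 0:
        return False
    duration = samples.size / (rate * channels)
    rms = np.sqrt(np.mean(samples.astype(np.float64) ** 2))
    return duration >= min_seconds and rms >= min_rms
```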

Human figures represent study participants and their COVID-19 infection status, with different colors depicting demographic or symptom characteristics. When participants are randomly divided into training and test sets, the randomized models appear to detect COVID-19 well, achieving an AUC greater than 0.8. However, performance on the matched test set drops to an estimated AUC between 0.60 and 0.65 (an AUC of 0.5 corresponds to random classification). Inflated classification performance is also seen in designed test sets, in which selected demographic groups appear entirely in the test set, and in longitudinal test sets, in which there is no overlap in submission time between training and test instances. The 95% confidence intervals, calculated by the normal approximation method, are shown with the corresponding numbers n of training and test instances.
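The gap between the randomized and matched evaluations can be reproduced with a toy simulation. In the sketch below, every number is invented for illustration: a hypothetical "audio feature" tracks a confounder (being symptomatic) rather than infection itself, so a random split rewards it with an inflated AUC, while scoring within confounder strata, a crude stand-in for the study's matched test set, exposes near-chance performance.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 20_000

# Recruitment bias: positives are far more often symptomatic than negatives.
y = rng.binomial(1, 0.3, n)                                # PCR status
symptomatic = rng.binomial(1, np.where(y == 1, 0.9, 0.1))  # confounder

# The "audio feature" reflects symptoms, not infection itself.
x = symptomatic + rng.normal(0.0, 0.5, n)

print("naive AUC:", roc_auc_score(y, x))                   # ~0.83, inflated
for s in (0, 1):                                           # matched view
    m = symptomatic == s
    print(f"AUC | symptomatic={s}:", roc_auc_score(y[m], x[m]))  # ~0.5
```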

Study results

In this study, a respiratory acoustic dataset covering 67,842 individuals was collected, of whom 23,514 tested positive for COVID-19. All data were linked to PCR test results. Notably, most COVID-19-negative participants were recruited through the six REACT rounds rather than through the T+T channel.

The dataset demonstrated good coverage across England, and no significant association was noted between geographic location and COVID-19 status; the highest COVID-19 imbalance was found in Cornwall. A previous study indicated recruitment bias in ABCS, particularly with respect to age, language, and gender, in both the training and test sets. Despite this bias, the training dataset was balanced by age and gender across the COVID-positive and COVID-negative subgroups.
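One way such balancing can be done is exact matching within demographic strata. The sketch below is a minimal illustration of the idea, not the study's actual procedure; the column names (age_band, gender, covid_positive) are assumptions.

```python
import pandas as pd

def balance_by_strata(df: pd.DataFrame,
                      strata=("age_band", "gender"),
                      label="covid_positive") -> pd.DataFrame:
    """Downsample so each demographic stratum has equal numbers of
    positives and negatives (hypothetical column names)."""
    matched = []
    for _, group in df.groupby(list(strata)):
        pos = group[group[label] == 1]
        neg = group[group[label] == 0]
        k = min(len(pos), len(neg))   # matching shrinks the usable sample
        matched.append(pd.concat([pos.sample(k, random_state=0),
                                  neg.sample(k, random_state=0)]))
    return pd.concat(matched, ignore_index=True)
```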

Consistent with previous research, the univariate analysis conducted in this study showed that AI classifiers can predict COVID-19 status with high apparent accuracy. However, once measured confounders were matched, the classifiers performed poorly at identifying SARS-CoV-2 status.

Based on the findings, the current study suggests some guidelines for future studies to correct the effects of recruitment bias. Some recommendations are listed below:

  1. Audio samples stored in a repository must include details of the study recruitment criteria. In addition, relevant information about individuals, including their gender, age, time of COVID-19 testing, SARS-CoV-2 symptoms, and location, must be documented with the audio recording.
  2. All confounding factors should be identified and matched to help control for recruitment bias.
  3. Experimental designs must account for potential bias. In most cases, matching reduces the sample size, so observational studies should recruit participants in a way that maximizes the likelihood of matching on measured confounders.
  4. The predictive value of the classifiers must be compared with standard protocol results.
  5. The predictive accuracy of the AI classifier must be evaluated, bearing in mind that predictive accuracy, sensitivity, and specificity vary with the target population (see the worked example after this list).
  6. The utility of the classifier must be evaluated for each test result.
  7. Replication studies should be conducted in randomized cohorts. Furthermore, pilot studies in real-world settings based on domain-specific utility are essential.
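On point 5, a short worked example makes the population dependence concrete. The sensitivity, specificity, and prevalence figures below are illustrative, not taken from the study.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# The same test (80% sensitivity, 80% specificity) at different prevalences:
for prev in (0.01, 0.10, 0.35):
    print(f"prevalence {prev:.0%}: PPV = {ppv(0.80, 0.80, prev):.2f}")
# prevalence 1%:  PPV = 0.04
# prevalence 10%: PPV = 0.31
# prevalence 35%: PPV = 0.68
```

At 1% prevalence, even a test with 80% sensitivity and specificity flags mostly false positives, which is why the same classifier can be useful in one population and nearly useless in another.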

Conclusion

The present study has limitations, including the potential for unmeasured confounding across the REACT and T+T recruitment channels. For example, PCR tests in the T+T channel were performed several days after self-reporting of symptoms, whereas PCR tests in REACT were conducted on pre-determined dates, regardless of symptom onset. Although most of the confounders were matched, residual predictive variables may remain.

Despite these limitations, this study highlights the need to develop rigorous machine-learning evaluation methods that yield unbiased estimates of performance. Furthermore, it shows that confounding factors are difficult to identify and control across many AI applications.

Journal Reference:

  • Coppock, H. et al. (2024) Audio-based AI classifiers show no evidence of improved COVID-19 screening over simple symptom checkers. Nature Machine Intelligence, 1-14. DOI: 10.1038/s42256-023-00773-8, https://www.nature.com/articles/s42256-023-00773-8


