In a recent study published in the journal npj Precision Oncology, researchers conducted a systematic review to examine the accuracy of deep learning (DL) in breast cancer diagnosis using ultrasound (US) compared to human readers in clinical settings.
They found that there was insufficient evidence to determine whether DL outperforms human readers or increases the accuracy of diagnostic breast US in clinical settings.
Study: Diagnostic performance of deep learning in ultrasound diagnosis of breast cancer: a systematic review.
Breast cancer, the most common cancer worldwide, killed 685,000 people in 2020, which makes early and accurate diagnosis essential.
US serves as a low-cost, radiation-free, and effective diagnostic tool, providing guidance for biopsy procedures, especially in cases of dense breast tissue or occult lesions. However, its diagnostic efficacy and reproducibility are hampered by operator-dependent factors.
DL is a powerful artificial intelligence technology that has been shown to perform well in image-related tasks, increasing the efficiency and accuracy of medical imaging workflows, especially in diagnosing diseases such as cancer.
Recent reports suggest that DL-based analysis of breast US may be equivalent or superior to that of human radiologists, but its clinical application remains controversial.
Therefore, the researchers in the current review examined the overall diagnostic performance of DL in breast US, compared stand-alone DL systems with radiologists, and evaluated the assistive role of DL alongside human readers.
About the study
In the present review, a database search followed by the application of strict inclusion and exclusion criteria yielded 16 studies involving 9,238 women from different countries.
These studies were selected based on the PICO (Population, Intervention, Comparison, Outcome) framework. All used DL convolutional neural networks, and 14 employed commercial DL systems.
Most of the included studies were conducted in a diagnostic setting, and pathology served as the reference standard in all of them. Study quality was assessed using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) and QUADAS-C tools.
DL can be used as a standalone tool or employed to assist radiologists with the goal of enhancing diagnostic capabilities.
Four studies evaluated DL as a stand-alone tool, two as an adjunct, and ten explored both roles. Human readers with different levels of clinical experience in breast US served as the comparators for DL performance.
Results and discussion
Fourteen studies compared DL as a stand-alone system in breast US with human readers. One study found that DL had a lower area under the curve (AUC) than human readers, two showed an equivalent AUC, and one reported a higher AUC for DL.
In three studies, DL achieved a higher AUC than less experienced human readers but was comparable to experienced readers. Regarding accuracy, DL outperformed all human readers in two studies; in another study, it outperformed less experienced readers but was comparable to experienced readers.
DL showed lower sensitivity than human readers in five studies and higher specificity in five studies, with mixed results in the remaining studies.
Of the 12 studies evaluating DL as an adjunct in breast US, three reported an improved AUC when DL was combined with human readers, and one found the AUC to be comparable to that of human readers alone. DL assistance raised the AUC for less experienced readers but had no positive effect for experienced readers.
In terms of accuracy, DL-assisted reading was more accurate than unassisted human reading in three studies. However, no improvement in overall sensitivity was observed when combining DL with human readers.
Seven studies found that DL assistance increased the specificity of human readers, although the effect differed between experienced and less experienced readers.
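For readers less familiar with the metrics compared above (AUC, accuracy, sensitivity, and specificity), the following minimal Python sketch shows how they are commonly computed for a binary benign/malignant task. The labels, scores, and the 0.5 threshold are invented purely for illustration and are not taken from the review or from any included study; the function names are hypothetical.

```python
# Illustrative only: toy data, not results from the review.
# Label 1 = malignant (confirmed by the reference standard), 0 = benign.

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def sensitivity(tp, fn):
    """Share of malignant cases that were flagged as malignant."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of benign cases that were correctly left unflagged."""
    return tn / (tn + fp)

def accuracy(tp, fp, tn, fn):
    """Share of all cases that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def auc(y_true, scores):
    """Rank-based AUC: the probability that a randomly chosen malignant
    case receives a higher score than a randomly chosen benign case
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical reader or model output for ten lesions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1]
threshold = 0.5  # assumed operating point
y_pred = [1 if s >= threshold else 0 for s in scores]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print("sensitivity:", sensitivity(tp, fn))
print("specificity:", specificity(tn, fp))
print("accuracy:", accuracy(tp, fp, tn, fn))
print("AUC:", auc(y_true, scores))
```

Sensitivity, specificity, and accuracy depend on the chosen threshold, whereas AUC summarizes performance across all thresholds. This is one reason a system can look better on one metric and worse on another, as in the mixed findings described above.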
In the quality assessment, the studies included in the current review showed a high risk of bias in several domains. Most studies had a high risk of bias in patient selection because cancer prevalence was substantially higher than in real-world settings.
Additionally, study designs did not fully replicate clinical pathways, as DL systems were used to read images but were not incorporated into final clinical decisions. Human readers lacked access to patients' clinical information, and test pathways and reference standards differed between studies.
Notably, some studies had a shorter follow-up time for women with negative tests, potentially affecting the assessment of missed cancers and overall diagnostic accuracy.
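One general reason why an enriched cancer prevalence limits how well results transfer to practice is that predictive values depend on prevalence. The sketch below applies Bayes' rule to show this; it is a textbook property rather than a calculation from the review, and the sensitivity, specificity, and prevalence values are hypothetical.

```python
# Illustrative only: hypothetical test characteristics, not from the review.

def positive_predictive_value(sens, spec, prevalence):
    """Probability that a positive result is a true cancer (Bayes' rule)."""
    true_pos = sens * prevalence
    false_pos = (1.0 - spec) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

sens, spec = 0.90, 0.80  # assumed values for a hypothetical test

# Same test, two populations: an enriched study cohort versus a
# lower-prevalence setting closer to routine practice (both assumed).
ppv_enriched = positive_predictive_value(sens, spec, prevalence=0.50)
ppv_routine = positive_predictive_value(sens, spec, prevalence=0.05)

print(f"PPV at 50% prevalence: {ppv_enriched:.3f}")
print(f"PPV at 5% prevalence:  {ppv_routine:.3f}")
```

With identical sensitivity and specificity, a positive result is far less likely to indicate cancer in the lower-prevalence population, so performance observed in enriched cohorts may overstate what clinicians would see in routine use.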
In conclusion, this comprehensive review evaluating the diagnostic performance of DL systems in breast US revealed significant variability in results.
Although DL systems have demonstrated potential specificity benefits, no consensus has emerged regarding AUC, accuracy, or sensitivity, whether DL is used as a stand-alone tool or as an aid to human readers.
Concerns were raised about bias, study heterogeneity, and limited generalizability, particularly for Asia-focused studies. The review emphasizes the need for standardized DL research guidelines, consistent benchmarks, and multicenter trials to ensure reproducibility and clinical applicability.
Current evidence does not support a broad clinical recommendation for DL systems in breast US, and further research and development are needed in this area.