In the literature, four main types of metrics have been adopted to assess the similarity between free-form texts in medical scenarios, as shown in the figure below. These include:
(i) Metrics based on word overlaps, such as BLEU and ROUGE. Although intuitive, these metrics fail to capture negation or synonyms in sentences, thereby neglecting the assessment of semantic factuality (see the short sketch after this list);
(ii) Metrics based on embedding similarities, like BERTScore. While achieving better semantic awareness, they do not focus on key medical terms, thus severely overlooking the local correctness of crucial conclusions;
(iii) Metrics based on Named Entity Recognition (NER), such as RadGraph F1 and MEDCON. Although developed specifically for the medical domain, these metrics often fail to merge synonyms and predominantly focus on Chest X-ray reports;
(iv) Metrics relying on large language models (LLMs). While these metrics are better aligned with human preferences, they suffer from potential subjective biases and are prohibitively expensive for large-scale evaluation.
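To make point (i) concrete, here is a small sketch (ours, not part of the RaTEScore codebase) that scores a reference sentence against a candidate whose clinical meaning is flipped by dropping a negation; both overlap metrics still report high similarity. It assumes the `nltk` and `rouge-score` packages are installed.

```python
# Toy illustration: word-overlap metrics ignore negation.
# Requires: pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "there is no evidence of pneumothorax or pleural effusion"
candidate = "there is evidence of pneumothorax or pleural effusion"  # opposite meaning

# BLEU over token lists (smoothing avoids zero scores on short texts)
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1 / ROUGE-L F-measures
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU:    {bleu:.3f}")
print(f"ROUGE-1: {rouge['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L: {rouge['rougeL'].fmeasure:.3f}")
# Both metrics remain high despite the flipped meaning, because they only
# count overlapping n-grams.
```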
R1: Results on the RaTE-Eval Benchmark -- Task 1.
Our metric exhibits the highest Pearson correlation coefficient with the radiologists' scoring. Note that the scores on the horizontal axis are obtained by having experts count the errors of each type, normalizing this count by the number of error types that could occur in the given sentence, and subtracting the normalized value from 1, so that the human scores correlate positively with the automatic metrics.
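In other words, under our reading of the description above, each human score is computed as
\[
s_{\text{human}} \;=\; 1 \;-\; \frac{\#\,\text{errors counted by the radiologists}}{\#\,\text{error types that could occur in the given sentence}},
\]
so that higher values indicate better reports and correlate positively with the automatic metrics.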
R2: Results on the RaTE-Eval Benchmark -- Task 2 & Task 3.
Correlation coefficients with radiologists, and accuracy on Synthetic Reports, i.e., how often the synonymous sentence receives a higher score than the antonymous one. In Task 2, RaTEScore shows a significantly higher correlation with radiology experts than other non-composite metrics, across various measures of correlation. In Task 3, our model excels at managing synonym and antonym challenges, affirming its robustness in nuanced language processing within a medical context.
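As an illustration of the Task 3 style of check, the sketch below (ours; `metric_fn`, the `token_overlap` helper, and the example triple are placeholders rather than the official protocol) measures how often a metric ranks a synonymous rewrite above an antonymous one.

```python
# Toy robustness check: does a metric rank a synonymous rewrite above an
# antonymous one for the same reference? (Illustrative only.)
from typing import Callable, List, Tuple

def synonym_antonym_accuracy(
    metric_fn: Callable[[str, str], float],   # metric_fn(reference, candidate) -> score
    triples: List[Tuple[str, str, str]],      # (reference, synonym_rewrite, antonym_rewrite)
) -> float:
    hits = 0
    for ref, syn, ant in triples:
        # A triple counts as a hit if the meaning-preserving rewrite scores
        # strictly higher than the meaning-flipped one.
        if metric_fn(ref, syn) > metric_fn(ref, ant):
            hits += 1
    return hits / len(triples)

# Example usage with a deliberately naive metric (token overlap), which tends
# to prefer the antonym here because it shares more surface tokens:
def token_overlap(ref: str, cand: str) -> float:
    r, c = set(ref.lower().split()), set(cand.lower().split())
    return len(r & c) / max(len(r | c), 1)

triples = [(
    "No focal consolidation is seen.",
    "There is no evidence of focal consolidation.",  # synonym
    "Focal consolidation is seen.",                  # antonym
)]
print(synonym_antonym_accuracy(token_overlap, triples))
```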
R3: Results on the ReXVal dataset.
RaTEScore demonstrated a Kendall correlation coefficient of 0.527 with the error counts, surpassing all existing metrics.
For more detailed ablation studies, please refer to our paper.
The key intuition behind our proposed RaTEScore is to compare two radiological reports at the entity level. Given two radiological reports, one is the ground-truth reference, denoted as \(x\), and the other is the candidate for evaluation, denoted as \(\hat{x}\). We aim to define a new similarity metric \(S(x, \hat{x})\) that better reflects the clinical consistency between the two.
As shown in Figure 1, our pipeline contains three major components: a medical entity recognition module \(\Phi_{\text{NER}}(\cdot)\), a synonym disambiguation encoding module \(\Phi_{\text{ENC}}(\cdot)\), and a final scoring module \(\Phi_{\text{SIM}}(\cdot)\).
First, we extract the medical entities from each piece of radiological text, then encode each entity into an embedding that is aware of medical synonyms, and finally score the similarity between the two resulting entity sets, formulated as:
\[
S(x, \hat{x}) = \Phi_{\text{SIM}}\big(\Phi_{\text{ENC}}(\Phi_{\text{NER}}(x)),\ \Phi_{\text{ENC}}(\Phi_{\text{NER}}(\hat{x}))\big).
\]
For details of each component, please refer to our paper.
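For intuition only, the following sketch mirrors this three-stage pipeline with off-the-shelf components. It is not the released RaTEScore implementation: the NER and encoder model names are illustrative stand-ins, and the simple cosine maximum-matching average replaces the weighted, type-aware scoring described in the paper.

```python
# Conceptual sketch of the RaTEScore pipeline: entity extraction -> synonym-aware
# encoding -> entity-level similarity. Model names are illustrative stand-ins.
# Requires: pip install transformers sentence-transformers torch
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

# (1) Medical entity recognition module Phi_NER: extract entity mentions.
ner = pipeline(
    "token-classification",
    model="d4data/biomedical-ner-all",   # placeholder biomedical NER model
    aggregation_strategy="simple",
)

# (2) Synonym disambiguation encoding module Phi_ENC: embed each entity mention.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder entity encoder

def entity_embeddings(report: str):
    entities = [e["word"] for e in ner(report)] or [report]  # fall back to the full text
    return entities, encoder.encode(entities, convert_to_tensor=True)

# (3) Scoring module Phi_SIM: match each candidate entity to its closest
# reference entity by cosine similarity and average the matches
# (a simplification of the weighted, type-aware scoring in the paper).
def toy_ratescore(reference: str, candidate: str) -> float:
    _, ref_emb = entity_embeddings(reference)
    _, cand_emb = entity_embeddings(candidate)
    sim = util.cos_sim(cand_emb, ref_emb)  # shape: [n_candidate, n_reference]
    return sim.max(dim=1).values.mean().item()

print(toy_ratescore("No evidence of pneumothorax.", "Pneumothorax is absent."))
```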
To facilitate training our medical entity recognition module, we constructed RaTE-NER, a large-scale radiological named entity recognition (NER) dataset. This dataset comprises 13,235 manually annotated sentences from 1,816 reports within the MIMIC-IV database, adhering to our predefined entity-labeling framework, which spans 9 imaging modalities and 23 anatomical regions, ensuring broad coverage.
Given that reports in MIMIC-IV are more likely to cover common diseases and may not represent rarer conditions well, we further enriched the dataset with 33,605 sentences from the 17,432 reports available on Radiopaedia, by leveraging GPT-4 and other medical knowledge libraries to capture the intricacies and nuances of less common diseases and abnormalities. More details can be found in the Appendix. We manually labeled 3,529 sentences to create a test set. As shown in the table, the RaTE-NER dataset offers a level of granularity not seen in previous datasets, with comprehensive entity annotations within sentences. This enhanced granularity enables training models for medical entity recognition within our analytical pipeline.
To effectively evaluate the alignment between automatic evaluation metrics and radiologists' assessments in medical text generation tasks, we have established a comprehensive benchmark, RaTE-Eval, that encompasses three tasks, each with its official test set for fair comparison, as detailed below. The comparison between the RaTE-Eval benchmark and existing radiology report evaluation benchmarks is listed in the table.
For more details about the subtasks in the RaTE-Eval Benchmark, please refer to our paper.
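For orientation, here is a minimal sketch of the kind of comparison the benchmark supports (ours; the CSV path and column names are hypothetical placeholders, and the benchmark's official evaluation scripts should be preferred).

```python
# Toy evaluation loop: correlate an automatic metric with radiologists' scores.
# The file path and column names below are hypothetical placeholders.
import csv
from scipy.stats import pearsonr, spearmanr, kendalltau

human, metric = [], []
with open("rate_eval_task1_scores.csv", newline="") as f:   # hypothetical file
    for row in csv.DictReader(f):
        human.append(float(row["radiologist_score"]))       # hypothetical column
        metric.append(float(row["metric_score"]))           # hypothetical column

print("Pearson :", pearsonr(human, metric)[0])
print("Spearman:", spearmanr(human, metric)[0])
print("Kendall :", kendalltau(human, metric)[0])
```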
@inproceedings{zhao2024ratescore,
title={RaTEScore: A Metric for Radiology Report Generation},
author={Zhao, Weike and Wu, Chaoyi and Zhang, Xiaoman and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
pages={15004--15019},
year={2024}
}