RaTEScore: A Metric for Radiology Report Generation


Weike Zhao1,2
Chaoyi Wu1,2
Xiaoman Zhang1,2

Ya Zhang1,2
Yanfeng Wang1,2
Weidi Xie1,2

1Shanghai Jiao Tong University
2Shanghai AI Laboratory



Abstract

This paper proposes a new entity-aware, lightweight metric for assessing the accuracy of medical free-form text generated by AI models. Our metric, termed Radiological Report Text Evaluation (RaTEScore), is designed to focus on key medical entities, such as diagnostic outcomes and anatomies, while remaining robust to complex medical synonyms and sensitive to negation expressions. Technically, we construct a new large-scale medical NER dataset, RaTE-NER, and train an NER model on it. Leveraging this model, we decompose complex radiological reports into medical entities. The final metric is defined by comparing entity embeddings, computed from a language model, together with their corresponding entity types, forcing the metric to focus on clinically critical statements. In experiments, our score aligns with human preference better than other metrics, both on existing public benchmarks and on our newly proposed RaTE-Eval benchmark.



Motivation

In the literature, four main types of metrics have been adopted to assess the similarity between free-form texts in medical scenarios, as shown in the figure below. These include:

(i) Metrics based on word overlaps, such as BLEU and ROUGE. Although intuitive, these metrics fail to capture negation or synonyms in sentences, thereby neglecting the assessment of semantic factuality;

(ii) Metrics based on embedding similarities, like BERTScore. While achieving better semantic awareness, they do not focus on key medical terms, thus severely overlooking the local correctness of crucial conclusions;

(iii) Metrics based on Named Entity Recognition (NER), such as RadGraph F1 and MEDCON. Although developed specifically for the medical domain, these metrics often fail to merge synonyms and predominantly focus on Chest X-ray reports;

(iv) Metrics relying on large language models (LLMs). While these metrics are better aligned with human preferences, they suffer from potential subjective biases and are prohibitively expensive for large-scale evaluation.

Existing evaluation metrics. We illustrate the limitations of current metrics. Blue boxes represent ground-truth reports; red and yellow boxes indicate correct and incorrect generated reports, respectively. The examples show that these metrics fail to identify opposite meanings and synonyms in the reports and are often disturbed by unrelated information.



Results

R1: Results on the RaTE-Eval Benchmark -- Task 1.

Our metric exhibits the highest Pearson correlation coefficient with the radiologists' scoring. Note that the scores on the horizontal axis are obtained by counting the various types of errors identified by the experts, normalizing this count by the number of potential error types that could occur in the given sentence, and subtracting the normalized value from 1 so that a positive correlation is achieved.
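As a rough, illustrative sketch of this normalization (the variable names and toy numbers below are ours, not taken from the paper), the horizontal-axis score and the reported Pearson correlation could be computed as follows:

```python
# Illustrative sketch: turn a radiologist's error count into the normalized,
# positively-correlating score described above, then correlate it with a
# metric's outputs. Toy numbers only; not data from the paper.
from scipy.stats import pearsonr

def normalized_human_score(num_errors: int, num_potential_error_types: int) -> float:
    """1 - (errors counted / number of error types possible in the sentence)."""
    return 1.0 - num_errors / num_potential_error_types

# hypothetical expert annotations: (errors found, potential error types)
human = [normalized_human_score(e, t) for e, t in [(0, 5), (2, 5), (1, 4), (3, 6)]]
metric = [0.92, 0.55, 0.70, 0.40]          # the metric's scores for the same reports

r, _ = pearsonr(human, metric)             # Pearson correlation, as reported in Task 1
print(f"Pearson r = {r:.3f}")
```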

R2: Results on the RaTE-Eval Benchmark -- Tasks 2 & 3.

We report correlation coefficients with radiologists (Task 2) and, on synthetic reports, the accuracy with which the synonym sentence is scored higher than the antonymous one (Task 3). In Task 2, RaTEScore shows a significantly higher correlation with radiology experts than other non-composite metrics, across various correlation measures. In Task 3, our metric excels at handling synonym and antonym challenges, affirming its robustness to nuanced language in a medical context.
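For the Task 3 accuracy, a minimal sketch is shown below, assuming each synthetic test case provides a reference report together with a synonym rewrite and an antonym rewrite; the data layout and the `score_fn` callable are illustrative assumptions, not the paper's interface:

```python
# Sketch of the synonym-vs-antonym check on synthetic reports: count how often
# a metric ranks the synonym rewrite above the antonym rewrite.
def pairwise_accuracy(triples, score_fn):
    """triples: list of (reference, synonym_report, antonym_report) strings.
    score_fn(reference, candidate) -> float, e.g. any report-evaluation metric."""
    wins = sum(score_fn(ref, syn) > score_fn(ref, ant) for ref, syn, ant in triples)
    return wins / len(triples)

# usage (hypothetical): acc = pairwise_accuracy(test_triples, ratescore_fn)
```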

R3: Results on the ReXVal dataset.

RaTEScore achieves a Kendall correlation coefficient of 0.527 with the error counts, surpassing all existing metrics.


For more detailed ablation studies, please refer to our paper.



General Pipeline

Illustration of the Computation of RaTEScore. Given a reference radiology report \(x\) and a candidate radiology report \(\hat{x}\), we first extract the medical entities and their corresponding entity types. Then, we compute the entity embeddings and find the maximum cosine similarity. The RaTEScore is computed from the weighted similarity scores, which take the pairwise entity types into account.

The key intuition of our proposed RaTEScore is to compare two radiological reports at the entity level. Given two radiological reports, one is the ground truth for reference, denoted as \(x\), and the other is the candidate for evaluation, denoted as \(\hat{x}\). We aim to define a new similarity metric \(S(x, \hat{x})\) that better reflects the clinical consistency between the two.

As shown in the figure above, our pipeline contains three major components: a medical entity recognition module \(\Phi_{\text{NER}}(\cdot)\), a synonym disambiguation encoding module \(\Phi_{\text{ENC}}(\cdot)\), and a final scoring module \(\Phi_{\text{SIM}}(\cdot)\).

First, we extract the medical entities from each piece of radiological text, then encode each entity into an embedding that is aware of medical synonyms, formulated as:

\[ \mathbf{F} = \Phi_{\text{ENC}}(\Phi_{\text{NER}}(x)), \] where \(\mathbf{F}\) contains a set of entity embeddings.
Similarly, we can obtain \(\mathbf{\hat{F}}\) for \(\hat{x}\). Then, we calculate the final similarity on the entity embeddings as: \[ S(x, \hat{x}) = \Phi_{\text{SIM}}(\mathbf{F}, \mathbf{\hat{F}}). \]

For the details of each component, please refer to our paper.
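As a rough illustration of the three-stage computation above, here is a minimal sketch; the encoder interface, the entity-type weight table, and the symmetric averaging of the two matching directions are simplifying assumptions on our part, so please consult the paper for the exact formulation:

```python
# Minimal sketch of a RaTEScore-style computation (simplified; not the
# official implementation).
import numpy as np

def rate_score_sketch(ref_entities, cand_entities, encode, type_weight):
    """
    ref_entities / cand_entities: lists of (entity_text, entity_type),
        as produced by the NER module Phi_NER.
    encode: Phi_ENC, maps entity text to a unit-norm embedding (numpy array).
    type_weight: dict mapping a pair of entity types to a weight.
    """
    def directed(src, tgt):
        if not src or not tgt:
            return 0.0
        tgt_emb = [(encode(text), etype) for text, etype in tgt]
        sims, weights = [], []
        for text, etype in src:
            v = encode(text)
            # maximum cosine similarity against all entities on the other side
            scores = [float(v @ u) for u, _ in tgt_emb]
            j = int(np.argmax(scores))
            sims.append(scores[j])
            weights.append(type_weight[(etype, tgt_emb[j][1])])
        # weighted average of the best-match similarities
        return float(np.average(sims, weights=weights))

    # average both matching directions so the score is symmetric (an assumption here)
    return 0.5 * (directed(ref_entities, cand_entities)
                  + directed(cand_entities, ref_entities))
```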



RaTE-NER

RaTE-NER Dataset Statistics: The dataset consists of two data sources, MIMIC-IV and Radiopaedia. "#" denotes specific types of medical entities. In the "Reports" row, the numbers in parentheses are the numbers of source reports; in the "Entities" and "#" rows, the numbers in parentheses are counts of non-redundant entities.

To facilitate the training of our medical entity recognition module, we constructed RaTE-NER, a large-scale radiological named entity recognition (NER) dataset. It comprises 13,235 manually annotated sentences from 1,816 reports in the MIMIC-IV database, annotated under our predefined entity-labeling framework, which spans 9 imaging modalities and 23 anatomical regions to ensure broad coverage.

Given that reports in MIMIC-IV are more likely to cover common diseases and may not represent rarer conditions well, we further enriched the dataset with 33,605 sentences from the 17,432 reports available on Radiopaedia, leveraging GPT-4 and other medical knowledge libraries to capture the intricacies and nuances of less common diseases and abnormalities. More details can be found in the Appendix. We manually labeled 3,529 sentences to create a test set. As shown in the table, the RaTE-NER dataset offers a level of granularity not seen in previous datasets, with comprehensive entity annotations within sentences. This granularity enables training models for medical entity recognition within our analytical pipeline.

Auto-annotation Part (Radiopaedia) in RaTE-NER Dataset.
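For illustration, once an NER model has been trained on RaTE-NER, the entity recognition module could be queried with a standard Hugging Face token-classification pipeline; the checkpoint path below is a placeholder, not the released model ID:

```python
# Sketch: run the entity-recognition step with a trained checkpoint.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="path/to/rate-ner-checkpoint",   # placeholder for your trained RaTE-NER model
    aggregation_strategy="simple",         # merge sub-word pieces into whole entities
)

sentence = "No evidence of pleural effusion or pneumothorax."
for ent in ner(sentence):
    print(ent["word"], ent["entity_group"], round(float(ent["score"]), 3))
```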



RaTE-Eval Benchmark

To effectively evaluate the alignment between automatic evaluation metrics and radiologists' assessments in medical text generation tasks, we have established a comprehensive benchmark, RaTE-Eval, that encompasses three tasks, each with an official test set for fair comparison, as detailed below. A comparison between the RaTE-Eval benchmark and existing radiology report evaluation benchmarks is listed in the table.


For more details about the subtasks in the RaTE-Eval benchmark, please refer to our paper.



BibTeX


	@article{zhao2024ratescore,
	  title={RaTEScore: A Metric for Radiology Report Generation},
	  author={Zhao, Weike and Wu, Chaoyi and Zhang, Xiaoman and Zhang, Ya and Wang, Yanfeng 
		  and Xie, Weidi},
	  journal={arXiv preprint arXiv:2406.16845},
	  year={2024}
	}