RaTEScore: A Metric for Radiology Report Generation


Weike Zhao1,2
Chaoyi Wu1,2
Xiaoman Zhang1,2

Ya Zhang1,2
Yanfeng Wang1,2
Weidi Xie1,2

1Shanghai Jiao Tong University
2Shanghai AI Laboratory



Abstract

This paper proposes a new entity-aware, lightweight metric for assessing the accuracy of medical free-form text generated by AI models. Our metric, termed Radiological Report Text Evaluation (RaTEScore), is designed to focus on key medical entities, such as diagnostic outcomes and anatomies, while remaining robust to complex medical synonyms and sensitive to negation expressions. Technically, we construct a new large-scale medical NER dataset, RaTE-NER, and train an NER model on it. Leveraging this model, we decompose complex radiological reports into medical entities. The final metric is defined by comparing entity embeddings, computed from a language model, together with their corresponding entity types, forcing the metric to focus on clinically critical statements. In experiments, our score aligns with human preference better than other metrics, both on existing public benchmarks and on our newly proposed RaTE-Eval benchmark.



Motivation

In the literature, four main types of metrics have been adopted to assess the similarity between free-form texts in medical scenarios, as shown in the figure below. These include:

(i) Metrics based on word overlaps, such as BLEU and ROUGE. Although intuitive, these metrics fail to capture negation or synonyms in sentences, thereby neglecting the assessment of semantic factuality;

(ii) Metrics based on embedding similarities, like BERTScore. While achieving better semantic awareness, they do not focus on key medical terms, thus severely overlooking the local correctness of crucial conclusions;

(iii) Metrics based on Named Entity Recognition (NER), such as RadGraph F1 and MEDCON. Although developed specifically for the medical domain, these metrics often fail to merge synonyms and predominantly focus on Chest X-ray reports;

(iv) Metrics relying on large language models (LLMs). While these metrics are better aligned with human preferences, they suffer from potential subjective biases and are prohibitively expensive for large-scale evaluation.

Existing evaluation metrics. We illustrate the limitations of current metrics. Blue boxes represent ground-truth reports; red and yellow boxes indicate correct and incorrect generated reports, respectively. The examples show that these metrics fail to identify opposite meanings and synonyms in the reports and are often disturbed by unrelated information.



Results

R1: Results on the RaTE-Eval Benchmark -- Task 1.

Our metric exhibits the highest Pearson correlation coefficient with the radiologists' scoring. Note that the scores on the horizontal axis are obtained by counting the various types of errors identified by the experts, normalizing this count by the number of potential error types that could occur in the given sentence, and subtracting the normalized value from 1 so that a positive correlation is achieved.
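As a rough, illustrative sketch of this normalization (the variable names and toy numbers below are ours, not taken from the paper), the horizontal-axis score and the reported Pearson correlation could be computed as follows:

```python
# Illustrative sketch: turn a radiologist's error count into the normalized,
# positively-correlating score described above, then correlate it with a
# metric's outputs. Toy numbers only; not data from the paper.
from scipy.stats import pearsonr

def normalized_human_score(num_errors: int, num_potential_error_types: int) -> float:
    """1 - (errors counted / number of error types possible in the sentence)."""
    return 1.0 - num_errors / num_potential_error_types

# hypothetical expert annotations: (errors found, potential error types)
human = [normalized_human_score(e, t) for e, t in [(0, 5), (2, 5), (1, 4), (3, 6)]]
metric = [0.92, 0.55, 0.70, 0.40]          # the metric's scores for the same reports

r, _ = pearsonr(human, metric)             # Pearson correlation, as reported in Task 1
print(f"Pearson r = {r:.3f}")
```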

R2: Results on the RaTE-Eval Benchmark -- Tasks 2 & 3.

We report correlation coefficients with radiologists (Task 2) and, on synthetic reports, the accuracy with which the synonym sentence is scored higher than the antonymous one (Task 3). In Task 2, RaTEScore shows a significantly higher correlation with radiology experts than other non-composite metrics, across various correlation measures. In Task 3, our metric excels at handling synonym and antonym challenges, affirming its robustness to nuanced language in a medical context.
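For the Task 3 accuracy, a minimal sketch is shown below, assuming each synthetic test case provides a reference report together with a synonym rewrite and an antonym rewrite; the data layout and the `score_fn` callable are illustrative assumptions, not the paper's interface:

```python
# Sketch of the synonym-vs-antonym check on synthetic reports: count how often
# a metric ranks the synonym rewrite above the antonym rewrite.
def pairwise_accuracy(triples, score_fn):
    """triples: list of (reference, synonym_report, antonym_report) strings.
    score_fn(reference, candidate) -> float, e.g. any report-evaluation metric."""
    wins = sum(score_fn(ref, syn) > score_fn(ref, ant) for ref, syn, ant in triples)
    return wins / len(triples)

# usage (hypothetical): acc = pairwise_accuracy(test_triples, ratescore_fn)
```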

R3: Results on the ReXVal dataset.

RaTEScore achieves a Kendall correlation coefficient of 0.527 with the error counts, surpassing all existing metrics.


For more detailed ablation studies, please refer to our paper.



General Pipeline

Illustration of the Computation of RaTEScore. Given a reference radiology report \(x\) and a candidate radiology report \(\hat{x}\), we first extract the medical entities and their corresponding entity types. Then, we compute the entity embeddings and find the maximum cosine similarity. The RaTEScore is computed from the weighted similarity scores, which take the pairwise entity types into account.

The key intuition of our proposed RaTEScore is to compare two radiological reports at the entity level. Given two radiological reports, one is the ground truth for reference, denoted as \(x\), and the other is the candidate for evaluation, denoted as \(\hat{x}\). We aim to define a new similarity metric \(S(x, \hat{x})\) that better reflects the clinical consistency between the two.

As shown in the figure above, our pipeline contains three major components: a medical entity recognition module \(\Phi_{\text{NER}}(\cdot)\), a synonym disambiguation encoding module \(\Phi_{\text{ENC}}(\cdot)\), and a final scoring module \(\Phi_{\text{SIM}}(\cdot)\).

First, we extract the medical entities from each piece of radiological text, then encode each entity into an embedding that is aware of medical synonyms, formulated as:

\[ \mathbf{F} = \Phi_{\text{ENC}}(\Phi_{\text{NER}}(x)), \] where \(\mathbf{F}\) contains a set of entity embeddings.
Similarly, we can obtain \(\mathbf{\hat{F}}\) for \(\hat{x}\). Then, we calculate the final similarity on the entity embeddings as: \[ S(x, \hat{x}) = \Phi_{\text{SIM}}(\mathbf{F}, \mathbf{\hat{F}}). \]

For the details of each component, please refer to our paper.
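As a rough illustration of the three-stage computation above, here is a minimal sketch; the encoder interface, the entity-type weight table, and the symmetric averaging of the two matching directions are simplifying assumptions on our part, so please consult the paper for the exact formulation:

```python
# Minimal sketch of a RaTEScore-style computation (simplified; not the
# official implementation).
import numpy as np

def rate_score_sketch(ref_entities, cand_entities, encode, type_weight):
    """
    ref_entities / cand_entities: lists of (entity_text, entity_type),
        as produced by the NER module Phi_NER.
    encode: Phi_ENC, maps entity text to a unit-norm embedding (numpy array).
    type_weight: dict mapping a pair of entity types to a weight.
    """
    def directed(src, tgt):
        if not src or not tgt:
            return 0.0
        tgt_emb = [(encode(text), etype) for text, etype in tgt]
        sims, weights = [], []
        for text, etype in src:
            v = encode(text)
            # maximum cosine similarity against all entities on the other side
            scores = [float(v @ u) for u, _ in tgt_emb]
            j = int(np.argmax(scores))
            sims.append(scores[j])
            weights.append(type_weight[(etype, tgt_emb[j][1])])
        # weighted average of the best-match similarities
        return float(np.average(sims, weights=weights))

    # average both matching directions so the score is symmetric (an assumption here)
    return 0.5 * (directed(ref_entities, cand_entities)
                  + directed(cand_entities, ref_entities))
```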



RaTE-NER

RaTE-NER Dataset Statistics: The dataset consists of two data sources, MIMIC-IV and Radiopaedia. "#" denotes specific types of medical entities. In the "Reports" row, the numbers in parentheses are the numbers of source reports; in the "Entities" and "#" rows, the numbers in parentheses are counts of non-redundant entities.

To facilitate the training of our medical entity recognition module, we constructed RaTE-NER, a large-scale radiological named entity recognition (NER) dataset. It comprises 13,235 manually annotated sentences from 1,816 reports in the MIMIC-IV database, annotated under our predefined entity-labeling framework, which spans 9 imaging modalities and 23 anatomical regions to ensure broad coverage.

Given that reports in MIMIC-IV are more likely to cover common diseases and may not represent rarer conditions well, we further enriched the dataset with 33,605 sentences from the 17,432 reports available on Radiopaedia, leveraging GPT-4 and other medical knowledge libraries to capture the intricacies and nuances of less common diseases and abnormalities. More details can be found in the Appendix. We manually labeled 3,529 sentences to create a test set. As shown in the table, the RaTE-NER dataset offers a level of granularity not seen in previous datasets, with comprehensive entity annotations within sentences. This granularity enables training models for medical entity recognition within our analytical pipeline.

Auto-annotation Part (Radiopaedia) in RaTE-NER Dataset.
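For illustration, once an NER model has been trained on RaTE-NER, the entity recognition module could be queried with a standard Hugging Face token-classification pipeline; the checkpoint path below is a placeholder, not the released model ID:

```python
# Sketch: run the entity-recognition step with a trained checkpoint.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="path/to/rate-ner-checkpoint",   # placeholder for your trained RaTE-NER model
    aggregation_strategy="simple",         # merge sub-word pieces into whole entities
)

sentence = "No evidence of pleural effusion or pneumothorax."
for ent in ner(sentence):
    print(ent["word"], ent["entity_group"], round(float(ent["score"]), 3))
```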



RaTE-Eval Benchmark

To effectively evaluate the alignment between automatic evaluation metrics and radiologists' assessments in medical text generation tasks, we have established a comprehensive benchmark, RaTE-Eval, that encompasses three tasks, each with an official test set for fair comparison, as detailed below. A comparison between the RaTE-Eval benchmark and existing radiology report evaluation benchmarks is listed in the table.


For more details about the subtasks in the RaTE-Eval benchmark, please refer to our paper.



BibTeX


	@article{zhao2024ratescore,
	  title={RaTEScore: A Metric for Radiology Report Generation},
	  author={Zhao, Weike and Wu, Chaoyi and Zhang, Xiaoman and Zhang, Ya and Wang, Yanfeng 
		  and Xie, Weidi},
	  journal={arXiv preprint arXiv:2406.16845},
	  year={2024}
	}