Weike Zhao

赵唯珂 · PhD Candidate

I'm a PhD candidate at Shanghai Jiao Tong University (SJTU), advised by Prof. Weidi Xie and Prof. Ya Zhang.

My research focuses on Artificial Intelligence for Medicine (AI4Med), with a primary interest in developing AI diagnostic systems. I explore the use of large language models and agentic frameworks to create more reliable and interpretable clinical tools, with broader applications in multimodal and multi-omics analysis across medical domains.

Portrait of Weike Zhao

Research

* equal contribution  ·  † corresponding author
DeepRare system overview
Nature 2026
An Agentic System for Rare Disease Diagnosis with Traceable Reasoning
Weike Zhao*, Chaoyi Wu*, Yanjie Fan*, Xiaoman Zhang, Pengcheng Qiu, Yuze Sun, Xiao Zhou, Yanfeng Wang, Ya Zhang†, Yongguo Yu†, Kun Sun†, Weidi Xie†

We introduce DeepRare, the first agentic system for rare disease diagnosis powered by a large language model (LLM), capable of processing heterogeneous clinical inputs.

Deep-DxSearch framework
In Submission 2026
End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning
Qiaoyu Zheng, Yuze Sun, Chaoyi Wu, Weike Zhao, Pengcheng Qiu, Yongguo Yu, Kun Sun, Yanfeng Wang, Ya Zhang†, Weidi Xie†

Deep-DxSearch is an end-to-end agentic RAG system trained with reinforcement learning for traceable diagnostic reasoning. Built on a large-scale medical retrieval corpus with tailored rewards, it consistently outperforms prompt-engineering and training-free RAG approaches — achieving substantial gains over GPT-4o, DeepSeek-R1, and other medical frameworks on both common and rare disease diagnosis.

PhenoLIP model
In Submission 2026
PhenoLIP: Integrating Phenotype Ontology Knowledge into Medical Vision-Language Pretraining
Cheng Liang, Chaoyi Wu, Weike Zhao, Ya Zhang, Yanfeng Wang, Weidi Xie†

PhenoLIP is a medical vision-language model that integrates structured phenotype knowledge to improve medical image analysis, leveraging PhenoKG — a new large-scale knowledge graph of 520K+ image–text pairs linked to 3,000+ phenotypes. On the PhenoBench benchmark, PhenoLIP significantly outperforms existing models.

MedRBench evaluation
Nature Communications 2025
Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases

We quantitatively evaluate the free-text reasoning abilities of state-of-the-art LLMs, such as DeepSeek-R1 and OpenAI o3-mini, on three tasks: assessment recommendation, diagnostic decision-making, and treatment planning.

RaTEScore metric
EMNLP Main 2024
RaTEScore: A Metric for Radiology Report Generation

RaTEScore is an entity-aware metric for assessing AI-generated medical reports. It emphasizes crucial medical entities such as diagnostic outcomes and anatomical details, is robust to complex medical synonyms, and is sensitive to negation — aligning more closely with human preference than existing metrics.

RP3D-Diag architecture
Nature Communications 2024
Large-scale Long-tailed Disease Diagnosis on Radiology Images

We build an academically accessible, large-scale diagnostic dataset of 39,026 cases and 192,675 scans, covering 5,568 disorders linked to 930 unique ICD-10-CM codes. We also present a novel architecture that processes an arbitrary number of input scans across imaging modalities, establishing a new benchmark for multi-modal, multi-anatomy long-tailed diagnosis.

GPT-4V medical evaluation
Technical Report 2023
Can GPT-4V(ision) Serve Medical Applications? Case Studies on GPT-4V for Multimodal Medical Diagnosis
Chaoyi Wu*, Jiayu Lei*, Qiaoyu Zheng*, Weike Zhao*, Weixiong Lin*, Xiaoman Zhang*, Xiao Zhou*, Ziheng Zhao*, Ya Zhang, Yanfeng Wang, Weidi Xie†

We evaluate GPT-4V for multimodal medical diagnosis through case studies covering 17 human body systems across 8 clinical imaging modalities. The cases indicate that GPT-4V remains far from ready for clinical use.

Beyond Research

When I'm not training models, you'll probably find me here:

🏂🎬🎯🎵🎻🏃🏊🏓🏸📸🤸🏔🏑🎾