Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations

DOI: 10.48550/arxiv.2303.18027 Publication Date: 2023-01-01
ABSTRACT
As large language models (LLMs) gain popularity among speakers of diverse languages, we believe that it is crucial to benchmark them to better understand model behaviors, failures, and limitations in languages beyond English. In this work, we evaluate LLM APIs (ChatGPT, GPT-3, and GPT-4) on the Japanese national medical licensing examinations from the past five years, including the current year. Our team comprises native Japanese-speaking NLP researchers and a practicing cardiologist based in Japan. Our experiments show that GPT-4 outperforms ChatGPT and GPT-3 and passes all six years of the exams, highlighting LLMs' potential in a language that is typologically distant from English. However, our evaluation also exposes critical limitations of the current LLM APIs. First, LLMs sometimes select prohibited choices that should be strictly avoided in medical practice in Japan, such as suggesting euthanasia. Further, our analysis shows that API costs are generally higher and the maximum context size is smaller for Japanese because of the way non-Latin scripts are currently tokenized in the pipeline. We release our benchmark, Igaku QA, as well as all model outputs and exam metadata. We hope our results and benchmark will spur progress on more diverse applications of LLMs. The benchmark is available at https://github.com/jungokasai/IgakuQA.
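The tokenization point in the abstract can be checked directly. Below is a minimal sketch, not taken from the paper's repository, that compares token counts for an English and a Japanese sentence using the tiktoken library; the model name and the sample sentences are illustrative assumptions, but the general pattern (more tokens per character for Japanese, hence higher API cost and faster context exhaustion) is what the paper describes.

```python
# Minimal sketch (assumptions: sample sentences and model choice are ours,
# not the paper's). Requires `pip install tiktoken`.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

english = "The patient presents with chest pain and shortness of breath."
japanese = "患者は胸痛と息切れを訴えて来院した。"

for label, text in [("English", english), ("Japanese", japanese)]:
    tokens = enc.encode(text)
    # Tokens per character is a rough proxy for how "expensive" a script is
    # under a BPE vocabulary trained predominantly on Latin-script text.
    print(f"{label}: {len(text)} chars -> {len(tokens)} tokens "
          f"({len(tokens) / len(text):.2f} tokens/char)")
```

Running this typically shows the Japanese sentence consuming noticeably more tokens per character than the English one, which is the mechanism behind the higher costs and smaller effective context size reported in the abstract.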