While OpenAI's models have made their way into almost every domain, there is one field where LLMs, if used correctly, could have the greatest impact by directly affecting lives: the medical field. Earlier this year, ChatGPT cleared all three parts of the United States Medical Licensing Examination (USMLE), and we also saw how ChatGPT helped save a dog's life through an accurate medical diagnosis. However, we have not seen many practical applications in the medical field. Do GPT-4's capabilities make it a suitable player in the medical field?
Massive Potential
In March this year, researchers from OpenAI and Microsoft released a paper on the Capabilities of GPT-4 on Medical Challenge Problems. In this research, GPT-4 shows impressive language understanding and generation abilities in medicine. The study evaluates GPT-4's performance on medical competency exams and benchmark datasets, even though the model was not specialised for medicine.
The researchers assess GPT-4's performance on official USMLE practice materials and the MultiMedQA datasets. GPT-4 surpasses the USMLE passing score by over 20 points, outperforming previous models (including GPT-3.5) and even models fine-tuned for medical knowledge. Additionally, GPT-4 demonstrates improved probability calibration, meaning it is better at predicting the likelihood that its own answers are correct. The study also explores how GPT-4 can explain medical reasoning, customise explanations, and create hypothetical scenarios, showcasing its potential for medical education and practice. The findings highlight GPT-4's capabilities while acknowledging challenges related to accuracy and safety in real-world applications.
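To make the calibration claim concrete: a model is well calibrated when, among the answers it gives with roughly 90% confidence, about 90% turn out to be correct. The following is a minimal Python sketch of one common way to measure this, expected calibration error. It is not the paper's evaluation code, and the function name, the bin count and the toy numbers are illustrative assumptions rather than figures from the study.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between stated confidence and observed accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp the index so a confidence of exactly 1.0 lands in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin is weighted by the share of answers it holds.
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical toy data, not results from the study.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
print(f"calibrated: {calibrated:.2f}, overconfident: {overconfident:.2f}")
```

A score near zero means stated confidence tracks accuracy. In the toy data, the second set of answers claims 90% confidence but is right only half the time, so its error is much larger.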
Compared with its predecessors, GPT-4 performs much better on official medical exams such as the USMLE, improving on GPT-3.5 by more than 30 percentage points. While GPT-3.5 was only approaching the passing score (roughly 60% of multiple-choice questions answered correctly), GPT-4 cleared it by a wide margin.
Alignment and Safety In Place
When an earlier version of GPT-4, referred to as the base model, was compared with the released GPT-4, the former performed slightly better, by about 3-5%, on some of the tests. This suggests that when the model was made safer and better at following instructions, it might have lost a bit of its raw performance. The researchers suggested that future work could focus on balancing accuracy and safety more effectively, either by refining the training process or by using specialised medical data.
Where does Med-PaLM fit in?
The above research did not directly test GPT-4 against models such as Med-PaLM and Flan-PaLM 540B, as those models were not publicly available at the time of the study.
Google recently launched its multimodal healthcare LLM, Med-PaLM M, a large multimodal generative model that encodes and interprets biomedical data. Its capabilities are broader than GPT-4's, as it can handle various types of medical data, such as clinical language, medical images and genomics, and can perform a wide range of tasks. The model can generalise to new medical tasks and perform multimodal reasoning without specific training. It is able to precisely recognise and explain medical conditions in images using only instructions and prompts given in language.
Never Fool-Proof
GPT-4's applications, however, are not as diverse as those Med-PaLM offers. Though GPT-4 was announced with multimodal features, they are not yet available to users. Furthermore, there have been negative observations about GPT-4's capabilities in medical diagnosis. Problematic and biased results were among the outcomes, raising concerns that GPT-4's inclination to embed societal biases may hamper its suitability for aiding clinical decisions.
The problem of hallucinations also persists, with GPT-4 still producing incorrect information. This extends to medical citations, where the model's error rate was over 20%.
While GPT-4 might not be completely reliable as a medical assistant for diagnosis at its current level of performance, there are other functions the model can assist with. Hospitals are looking at AI to help relieve doctor burnout. With applications that can write notes for electronic health records and draft empathetic notes to patients, AI can help smooth the process. Transcribing doctor and patient comments and then turning them into a physician's summary for electronic health records is one of the best use cases in the medical field. With its current limitations, GPT-4 still has a long way to go before it can be fully adopted in the medical field.
With a rare blend of engineering, MBA, and journalism degrees, Vandana Nair brings a unique combination of technical know-how, business acumen, and storytelling skills to the table. Her insatiable curiosity for all things startups, businesses, and AI technologies ensures that there's always a fresh and insightful perspective to her reporting.