
Healthcare LLM Evaluations

Every day, nearly 50 women die in India from preventable pregnancy complications. These deaths are often driven by three critical delays: recognizing danger signs and deciding to seek care, reaching a facility, and receiving treatment. The first delay is driven by limited health literacy and claims lives that could be saved with timely, accurate information. This is the gap Sakhi was built to close.

Can AI models deliver safe maternal health guidance in Hindi and Marathi? We are finding out.

We are currently benchmarking leading AI models using the Sakhi Dataset, a parallel set of maternal health questions undergoing expert validation in English, Hindi, and Marathi. Our evaluation framework measures clinical accuracy, semantic consistency, and linguistic clarity through automated metrics and expert-designed rubrics.
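The rubric-based part of an evaluation like this can be sketched in a few lines. The snippet below is illustrative only, not the actual Sakhi evaluation code: it assumes each expert-designed rubric is a list of required key points, scores an answer by the fraction of points it covers, and compares parallel answers to the same question across languages to flag cross-lingual inconsistency. All names and sample data are hypothetical.

```python
# Hypothetical sketch of rubric-based scoring for cross-lingual consistency.
# A rubric lists key points an answer must cover; an answer's score is the
# fraction of points it mentions (here, naive case-insensitive substring
# matching; a real pipeline would use semantic matching instead).

def rubric_coverage(answer: str, key_points: list[str]) -> float:
    """Fraction of rubric key points present in the answer."""
    answer_lower = answer.lower()
    hits = sum(1 for point in key_points if point.lower() in answer_lower)
    return hits / len(key_points) if key_points else 0.0

# Parallel answers to the same question in two languages can then be
# compared: a large coverage gap flags inconsistent guidance across languages.
rubric = ["severe headache", "blurred vision", "seek care immediately"]
english_answer = (
    "Severe headache and blurred vision are danger signs; seek care immediately."
)
hindi_answer_translated = "A headache can be normal in pregnancy; rest and drink water."

gap = abs(
    rubric_coverage(english_answer, rubric)
    - rubric_coverage(hindi_answer_translated, rubric)
)
print(f"consistency gap: {gap:.2f}")  # here 1.00: every key point missed in one language
```

In practice, substring matching would be replaced by multilingual semantic similarity, but the consistency-gap idea is the same: score each language's answer against the same rubric and compare.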

Early findings reveal a critical safety concern: AI models perform inconsistently across languages. As millions worldwide turn to Large Language Models for health advice, this language gap could deliver incomplete or inaccurate maternal health guidance at scale, especially for communities that don't speak English.

Systems must work reliably in high-stakes situations like pregnancy. Language equity isn't just about access; it is a safety imperative.

The Story Behind Sakhi


Sakhi is a WhatsApp-based conversational agent for maternal health literacy, providing reliable, verified information in local languages. Research confirms that maternal mortality is driven by the three critical delays described above; the first, driven by limited health literacy, is a major factor behind low antenatal-care utilization in India.

Over recent months, we surveyed women in Jalgaon, Maharashtra, and piloted Sakhi using a human-in-the-loop approach, where healthcare professionals reviewed content and handled queries beyond our curated, expert-verified knowledge base.

Why This Matters

AI healthcare tools reach millions in India, but safety evaluations focus overwhelmingly on English. We don't know how models perform in Hindi and Marathi, risking misleading guidance at scale for the populations who need it most. Our goal is to help ensure that future health AI systems provide safe and equitable guidance across all languages and communities.

Learn more about the project:
ai4health.simppl.org/projects/medical-ai-evaluation

Join us as a reviewer. Medical professionals can contribute here:
healthreview.simppl.org


Recent Highlights

Splice Beta 2025 Conference: We attended the Splice Beta 2025 conference in Chiang Mai, Thailand, connecting with the global community working at the intersection of AI, digital discourse, and social impact.

We attended the #GTSInnovationalogue, co-hosted by Carnegie India and the Ministry of External Affairs, India, where we discussed AI adoption frameworks and shared lessons from Sakhi's rollout.

Sakhi is featured as a use case on People+AI's Use Case Adoption Framework (peopleplus.ai/ucaf), which maps AI adoption challenges and horizontal enablers like multilingual AI, safety, and data democratization.


Builds, bugs, and breakthroughs

AI Scientist Thinks: Watching Artificial Intelligence Reason

An interface that makes AI Scientist-v2's reasoning process interpretable, allowing users to observe how the model thinks. It also extends the model's context length to support more complex reasoning tasks.

Learn More

Distributed GPU Training

A comprehensive breakdown of distributed GPU training for large-scale models, covering silicon-level details (GPU vs. CPU architecture, CUDA, memory hierarchy), parallelism paradigms (data, pipeline, tensor, and 3D parallelism), and infrastructure challenges.

Learn More

C++ Design Patterns for Low-Latency Applications

Advanced C++ design patterns tailored for low-latency applications, with a focus on high-frequency trading systems. Provides practical guidance and benchmarks for academics and industry professionals.

Learn More

Articles we are obsessed with

Unpacking Deceptive Design

This Google Public Policy article explores how deceptive designs (dark patterns) affect user trust online. Based on a survey of 12,000 users across six European countries, it reveals that people consistently identify manipulative designs.

Learn More

Law and Policy: Digital Services Act

The legal implications of the Digital Services Act (DSA), focusing on how it addresses digital platform regulation and the challenges of ensuring fair and transparent online services.

Learn More

DSA Platforms: Digital Services Act

The practical implementation and legal framework of the Digital Services Act, emphasizing its role in regulating platforms and safeguarding fundamental rights in the digital space.

Learn More

Cool research to follow

How Conversational Structure and Style Shape Online Community Experiences

This study investigates how the structure and linguistic style of conversations in online communities predict a sense of virtual community (SOVC) using data from over 2,800 Reddit users across diverse subreddits.

Learn More

ChatGPT Does Not Replicate Human Moral Judgments

This research finds that while ChatGPT ratings of moral scenarios correlate strongly with average human judgments, the AI systematically deviates in predictable ways, producing more extreme ratings.

Learn More

Polarization Is Increasing Online

A recent PNAS paper highlights the growing polarization in online spaces, showing that both ideological divides and emotional intensity are on the rise across major social media platforms.

Learn More

Useful data drops

The Emerging Market for Intelligence: Pricing, Supply, and Demand for LLMs

This paper analyzes the rapidly evolving LLM market using API usage data from OpenRouter and Microsoft Azure, documenting six key facts, including rapid growth in the number of models and providers and significant price declines, with open-source models substantially cheaper than proprietary ones.

Learn More

Global Claims: A Multilingual Dataset of Fact-Checked Claims

Global Claims is a large-scale dataset of 67,000 fact-checked claims collected from over 200 fact-checking websites in 50 languages.

Learn More

Global YouTube Trending Dataset (2022-2025)

This dataset captures three years of YouTube Trending videos from July 1, 2022, to June 30, 2025, with four daily snapshots for each of 104 countries, totaling 446,971 snapshots and 78.4 million video entries.

Learn More