
// Posted by Umur Inan

// Category Thinking

// Posted on March 28, 2026

Healthcare AI Has an Engineering Problem

Healthcare AI fails not because the models are bad, but because the surrounding software is. A software engineer's take on why it is harder than it looks.

Every few months, a new paper drops showing that an AI model matches or beats doctors at diagnosing something. Skin cancer from photos. Diabetic retinopathy from retinal scans. Pneumonia from chest X-rays. The headlines write themselves. The benchmarks are impressive. And almost none of it matters in a real hospital.

I'm not a doctor. I'm a software engineer. And from where I sit, the problem with healthcare AI isn't the models. The models are fine. Some of them are genuinely good. The problem is everything else: integration, reliability, failure modes, workflows. All the parts that don't make it into the paper but determine whether the thing actually works when a doctor is standing in front of a patient at 2 AM.

This is a software engineering problem. And we're not treating it like one.

The Demo vs. The Deployment

I've spent enough time in software to know the gap between a working demo and a production system. In most domains, that gap is large. In healthcare, it's a canyon.

Here's what a typical healthcare AI demo looks like. You have a curated dataset. The images are high quality. The labels were reviewed by specialists. You train a model. It hits 95% accuracy on the test set. You publish the paper. Everyone is excited.

Now try deploying that in an actual hospital. The images come from five different scanner models, each with slightly different color profiles and resolutions. Some scans are taken by experienced technicians, others by someone covering the night shift for the first time. Lighting varies. A patient moved during the scan. Metadata is incomplete or formatted differently across departments.

Your 95%-accurate model is now running on data that looks nothing like what it trained on. And you might not even know it's failing, because in production, there's no labeled test set telling you when you're wrong.

This is called distribution shift, and it's not a new concept. But in healthcare, the consequences of ignoring it are measured in misdiagnoses, not just bad recommendations on a shopping site.
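One cheap way to notice shift without labels is to compare the distribution of an input feature in production against the same feature in the training data. Here is a minimal sketch using the Population Stability Index on a single numeric feature. The function name `psi`, the choice of feature (something like mean image brightness), the bin count, and the 0.25 alert level are all assumptions for illustration. The 0.25 cutoff is a commonly quoted rule of thumb, not a clinical standard.

```python
import math


def psi(reference, production, bins=10):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    # Fall back to a width of 1.0 if every reference value is identical.
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range production values into the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p = proportions(reference)
    prod_p = proportions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref_p, prod_p))


# Synthetic example: production values drift toward the bright end.
training_brightness = [i / 100 for i in range(100)]
production_brightness = [0.5 + i / 200 for i in range(100)]

if psi(training_brightness, production_brightness) > 0.25:
    print("Input distribution has shifted; review before trusting outputs")
```

A check like this says nothing about whether the predictions are right. It only tells you the inputs no longer look like what the model was validated on, which is often the earliest warning you can get.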

Integration Is Where Good Models Go to Die

Let's say you've built a model that actually works well on real clinical data. Great. Now you need to put it somewhere a doctor can use it. This is where it gets ugly.

Hospital IT systems are a world of their own. Most of them exchange data using HL7 standards, either the older pipe-delimited v2 messages or the newer FHIR APIs, except when they don't. Electronic Health Record systems from different vendors store data in different formats, use different coding systems, and expose different APIs. Some of them expose no APIs at all and require custom integration through middleware that was written in 2008 and hasn't been updated since.

I've talked to engineers who've spent months just getting read access to patient data in a format their model can consume. Not building the model. Not improving accuracy. Just parsing the data correctly and making sure it arrives in the right shape at the right time.
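To give a sense of what "just parsing the data" involves, here is a rough sketch that pulls a few patient fields out of an HL7 v2 message. The message is synthetic, the function names are made up for this example, and the code assumes the default delimiters (`|` for fields, `^` for components). Real messages declare their delimiters in the MSH segment and also use repetition and escape characters, so production code would normally lean on a maintained HL7 library rather than string splitting.

```python
def parse_hl7_v2(message):
    """Split an HL7 v2 message into {segment_id: [field lists]}."""
    segments = {}
    # Segments are separated by carriage returns; tolerate newlines too.
    for line in message.replace("\n", "\r").split("\r"):
        if not line.strip():
            continue
        fields = line.split("|")
        segments.setdefault(fields[0], []).append(fields)
    return segments


def patient_summary(message):
    """Extract identifier, name, and birth date from the first PID segment."""
    pid_segments = parse_hl7_v2(message).get("PID")
    if not pid_segments:
        return None  # Let the caller decide how to handle a missing PID.
    fields = pid_segments[0]

    def field(n):
        # Trailing empty fields are often omitted, so guard the index.
        return fields[n] if len(fields) > n else ""

    family, _, given = field(5).partition("^")
    return {
        "mrn": field(3).split("^")[0],      # PID-3: patient identifier list
        "family_name": family,              # PID-5: patient name
        "given_name": given.split("^")[0],
        "birth_date": field(7),             # PID-7: date of birth
    }


# Synthetic message for illustration only.
sample = (
    "MSH|^~\\&|LAB|HOSP|AI|HOSP|202603280200||ORU^R01|123|P|2.5\r"
    "PID|1||12345^^^HOSP^MR||DOE^JANE||19800101|F"
)
print(patient_summary(sample))
```

Even this toy version has to make decisions about missing segments and truncated fields. Multiply that by every vendor's interpretation of the standard and the months start to make sense.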

And then there's the question of where the AI output goes. Does it show up as an alert in the EHR? A separate dashboard? A notification on a pager? Each option has workflow implications. Doctors already deal with alert fatigue from dozens of existing systems. Another notification source that fires false positives will be ignored within a week, no matter how good the underlying model is.

The integration problem isn't glamorous. It doesn't get published in Nature. But it's the reason most healthcare AI prototypes never reach a patient.

LLMs in the Clinic: Useful and Dangerous

Large language models have added a new dimension to this. The potential applications are obvious and genuinely useful: generating clinical notes, summarizing patient histories, translating medical jargon into language patients can understand, helping doctors write referral letters in half the time.

Some of these are already in use. Ambient clinical documentation tools that listen to doctor-patient conversations and generate structured notes are rolling out at major health systems. Doctors who use them report spending less time on paperwork. That's a real win. Documentation burden is one of the top drivers of physician burnout, and anything that reduces it without sacrificing quality is valuable.

But LLMs also hallucinate. And in medicine, a hallucination isn't a funny wrong answer. It's a fabricated lab result in a patient summary. It's a drug interaction that doesn't exist presented as fact. It's a confident, well-written paragraph that happens to be clinically wrong.

As engineers, we know that LLM outputs need verification. But the whole point of these tools in healthcare is to save doctors time. If a doctor has to carefully verify every line of an AI-generated note, the time savings disappear. You end up with a tool that's fast but untrustworthy, which is worse than a slow tool that's reliable.

The engineering challenge here is building the right guardrails. Not just prompt engineering, but actual systems. Cross-referencing generated text against the patient's actual records. Flagging when the model mentions a medication the patient isn't on. Highlighting claims that can't be traced back to source data. These are retrieval, validation, and citation problems. They're solvable, but they require serious engineering effort that goes way beyond fine-tuning a model.
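As one concrete example of such a guardrail, the medication check can be sketched in a few lines. Everything here is an assumption for illustration: the function name, the idea that a drug vocabulary is available as a plain list, and the exact-word matching. A real system would need a proper drug terminology, brand and generic name mapping, and handling for negations like "patient is not on warfarin."

```python
import re


def unsupported_medications(note_text, active_medications, known_drug_names):
    """Return drug names mentioned in the note that are absent from the record."""
    active = {m.lower() for m in active_medications}
    flagged = []
    for drug in sorted({d.lower() for d in known_drug_names}):
        # Whole-word, case-insensitive match against the generated text.
        mentioned = re.search(rf"\b{re.escape(drug)}\b", note_text, re.IGNORECASE)
        if mentioned and drug not in active:
            flagged.append(drug)
    return flagged


# Synthetic example: the generated note mentions a drug the record doesn't list.
note = "Patient continues metformin. Started warfarin last week."
flags = unsupported_medications(
    note,
    active_medications=["Metformin"],
    known_drug_names=["metformin", "warfarin", "lisinopril"],
)
print(flags)
```

The point is not that this catches everything. It's that the check is deterministic and runs against source data, so the doctor's attention goes to the one sentence that can't be traced instead of the whole note.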

The Failure Mode Problem

In most software I build, failure is annoying but manageable. An API returns a 500 error. A page doesn't load. A notification arrives late. Users are frustrated, we fix it, life goes on.

In healthcare, the failure modes are different. A model that misclassifies a benign mole as malignant triggers an unnecessary biopsy. Stressful and costly, but the patient is fine. A model that misclassifies a malignant mole as benign means cancer goes undetected. That's a fundamentally different category of failure.

This asymmetry changes how you need to build the system. You can't just optimize for overall accuracy. You need to think about which direction of failure is worse and calibrate accordingly. In some cases, you want the model to be overly cautious, flagging anything remotely suspicious even at the cost of more false positives. In other cases, that approach would swamp clinicians and make the tool useless.
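Calibrating for asymmetric failure can be made explicit by assigning a cost to each kind of error and picking the decision threshold that minimizes total cost on a validation set. The sketch below does exactly that. The function name, the 10-to-1 cost ratio, and the scores are invented for illustration; choosing the real ratio is the clinical and product decision, not an engineering one.

```python
def pick_threshold(scores, labels, cost_fn=10.0, cost_fp=1.0):
    """Pick the score cutoff that minimizes total misclassification cost.

    scores: model probabilities of malignancy.
    labels: 1 = malignant, 0 = benign.
    A case is flagged when its score is >= the threshold.
    """
    best_threshold, best_cost = None, float("inf")
    # The extra infinite threshold covers "never flag anything".
    for threshold in sorted(set(scores)) + [float("inf")]:
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
        cost = cost_fn * fn + cost_fp * fp
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost


# Synthetic validation data.
scores = [0.1, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 0, 1, 1]

print(pick_threshold(scores, labels, cost_fn=1.0, cost_fp=1.0))
print(pick_threshold(scores, labels, cost_fn=10.0, cost_fp=1.0))
```

On this toy data, treating both errors as equal picks a cutoff of 0.8 and misses one malignant case. Weighting a miss ten times heavier drops the cutoff to 0.4, which catches all three malignant cases at the price of two extra false alarms.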

These are product decisions as much as technical ones. But they need to be made by people who understand both the clinical context and the engineering trade-offs. And right now, that intersection is pretty thin.

What worries me more is silent failure. A model that confidently gives a wrong answer and no one catches it. In a recommendation system, silent failure means someone sees a weird product suggestion. In healthcare, silent failure means a missed diagnosis that might not surface for months or years. By the time anyone realizes the model was wrong, the connection to the AI output is long gone.

Building systems that know when they don't know is hard. Uncertainty quantification in ML is an active research area, and most production models don't implement it well. For healthcare, this isn't optional. A model that says "I'm 52% confident this is benign" is far more useful than one that just says "benign."
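The simplest form of "knowing when you don't know" is an explicit uncertain band: instead of forcing every prediction into one of two labels, the system hands ambiguous cases to a human. This is a sketch under a big assumption, namely that the model's probabilities are reasonably calibrated. The function name and both cutoffs are placeholders, not recommended values.

```python
def triage(probability_malignant, rule_out_below=0.05, flag_above=0.60):
    """Map a calibrated probability to an action, with an explicit unsure band."""
    if not 0.0 <= probability_malignant <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability_malignant < rule_out_below:
        return "likely benign"
    if probability_malignant > flag_above:
        return "flag for urgent review"
    # Everything in between is surfaced as uncertain instead of guessed.
    return "uncertain: route to clinician"


# "52% confident this is benign" is a 0.48 probability of malignancy.
print(triage(0.48))
```

A plain "benign" label would have hidden that this case was close to a coin flip. The band makes the ambiguity visible to the person who has to act on it.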

Regulation Isn't the Enemy

Software engineers often view regulation as friction. In most contexts, I'd agree. But in healthcare AI, regulation exists for reasons that make sense if you think about them for more than five minutes.

The FDA now regulates AI-based medical devices through pathways like 510(k) and De Novo. This means if your model makes clinical decisions, it needs to go through a review process. You need to document your training data, your validation methodology, your intended use population, and your performance characteristics. You need to show that the model works not just on average, but across demographic subgroups.

That last part matters a lot. There's well-documented evidence that AI models trained primarily on data from certain populations perform worse on others. A dermatology model trained mostly on lighter skin tones is more likely to miss melanoma on darker skin. An NLP model trained on clinical notes written in American English is more likely to misparse notes from non-native English speakers. These aren't hypothetical concerns. They've been measured and published.

As engineers, we should be building with these requirements in mind from the start, not treating them as a box to check after development is done. That means diverse training data. That means stratified evaluation. That means monitoring model performance across subgroups in production, not just at launch.
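Stratified evaluation is mostly a matter of refusing to average. A minimal sketch, assuming each evaluated case carries a subgroup label alongside the true and predicted outcome (the function name, record layout, and group labels below are invented for illustration):

```python
from collections import defaultdict


def sensitivity_by_subgroup(records):
    """Compute sensitivity (true positive rate) separately for each subgroup.

    records: iterable of (subgroup, true_label, predicted_label), 1 = disease.
    Subgroups with no positive cases are omitted from the result.
    """
    hits = defaultdict(int)
    positives = defaultdict(int)
    for subgroup, truth, predicted in records:
        if truth == 1:
            positives[subgroup] += 1
            hits[subgroup] += int(predicted == 1)
    return {group: hits[group] / positives[group] for group in positives}


# Synthetic results: the model looks fine overall but misses more in one group.
results = [
    ("lighter", 1, 1), ("lighter", 1, 1), ("lighter", 0, 0),
    ("darker", 1, 0), ("darker", 1, 1),
]
print(sensitivity_by_subgroup(results))
```

Pooled together, this toy model catches three of four true cases. Split out, it catches all of them in one group and half in the other. The same function can run on production data whenever confirmed outcomes come back, which is what monitoring across subgroups amounts to in practice.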

The regulatory process is slow and sometimes frustrating. But the alternative, deploying clinical AI with no oversight, is how you get models that work great in the demo and fail on the patients who need them most.

What Doctors Actually Need

I've talked to a handful of doctors about how they see AI tools. The pattern is consistent. They're not impressed by benchmark numbers. They don't care that your model beats a specialist on a curated dataset. What they want is much simpler.

They want tools that save them time without adding risk. The AI should handle the paperwork so they can focus on the patient. Better search over patient records matters too: right now, finding a relevant note from three years ago means scrolling through 200 pages. Administrative burden eats half their day, and that is what they want help with.

Notice what's not on that list. They're not asking for AI to diagnose patients. Most experienced doctors trust their clinical judgment, and honestly, they should. What they're drowning in isn't diagnostic uncertainty. It's paperwork, prior authorizations, coding for billing, updating records, writing referral letters, documenting every conversation for legal compliance.

The highest-impact applications of AI in medicine might be the least exciting ones from a research perspective. They're not about replacing the doctor's brain. They're about replacing the doctor's keyboard.

An LLM that generates a first draft of a discharge summary from structured data? That saves 15 minutes per patient. Multiply by 20 patients a day, and on paper you just gave a doctor back five hours a day. Even if the real savings are a fraction of that, it's time with patients, time thinking, time not burning out.

An ML model that pre-fills prior authorization forms by matching patient records to insurance requirements? Boring as it gets. Also probably the single most impactful thing you could build for a primary care physician right now.

The Accountability Gap

Here's the question nobody has a great answer to yet: when an AI system contributes to a medical error, who's responsible?

The doctor who followed the AI's recommendation? Or the hospital that deployed the system? Maybe the company that built the model. Possibly the engineering team that chose the training data. Or the product manager who decided the confidence threshold.

In traditional software, liability is complicated enough. In healthcare AI, it's a mess. Current malpractice frameworks assume a human decision-maker. If a doctor misdiagnoses something, there's a clear chain of accountability. If a doctor follows an AI recommendation that turns out to be wrong, the situation gets murky fast.

This isn't just a legal question. It affects how doctors interact with AI tools. If they're potentially liable for AI errors, they'll either over-rely on the tool ("the AI said it was fine") or refuse to use it at all. Neither outcome is good.

From an engineering perspective, this means we need to build systems that support human decision-making, not stand in for it. The AI should present information, not make decisions. It should show its reasoning, not just its conclusion. It should make it easy for the doctor to agree or disagree, with both options being natural parts of the workflow.
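In code, "present information, not make decisions" can be as simple as a data model where the AI output is only ever a suggestion with evidence attached, and nothing happens until a clinician records a decision. The class and field names below are hypothetical, a sketch of the shape, not a reference to any real EHR API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Suggestion:
    finding: str
    confidence: float
    evidence: list  # pointers back to the source records behind the finding
    decision: str = "pending"
    decided_by: str = ""
    reason: str = ""
    decided_at: str = ""

    def review(self, clinician_id, accept, reason=""):
        """Record the clinician's call; the suggestion never applies itself."""
        self.decision = "accepted" if accept else "rejected"
        self.decided_by = clinician_id
        self.reason = reason
        self.decided_at = datetime.now(timezone.utc).isoformat()
        return self


# Synthetic example: the doctor disagrees, and that is a first-class outcome.
suggestion = Suggestion(
    finding="possible drug interaction",
    confidence=0.7,
    evidence=["medication-list entry 3", "lab result from last visit"],
)
suggestion.review("clinician-42", accept=False, reason="dose already adjusted")
print(suggestion.decision, suggestion.decided_by)
```

Rejecting takes the same single call as accepting, and both leave a record of who decided and why. That record is also the start of an answer to the accountability question.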

This is a UX problem, an architecture problem, and a policy problem all at once. And it needs engineers, doctors, and policymakers in the same room to solve it.

Where I Think This Is Going

I'm cautiously optimistic about AI in healthcare, but for different reasons than the people writing the headlines. I don't think we're five years away from AI replacing doctors. I don't think we should be. Medicine involves empathy, judgment under uncertainty, ethical reasoning, and the kind of contextual understanding that current AI systems genuinely lack.

What I do think is that AI can make doctors' lives meaningfully better. Not by being a replacement, but by being good software. Software that handles the tedious parts. Software that surfaces relevant information at the right time. Software that works reliably within the messy reality of hospital IT systems.

Getting there requires treating healthcare AI as an engineering discipline, not just a research one. It means spending as much time on integration, reliability, monitoring, and failure handling as on model accuracy. It means building for the doctor's workflow, not for the benchmark leaderboard.

The models are ready. The infrastructure mostly isn't. And that's our problem to solve.

AI Products Software Thinking

Umur Inan

Principal Software Engineer

Backend engineer focused on JVM systems, distributed architecture, and the failure modes that only show up in production. I write about what I learn building and breaking things at scale.
