01 — The ask
"Can you use AI to find our high-cost patients?"
That was the brief, more or less verbatim. The clinic runs on Practice Fusion, sees thousands of patients, and could feel that some of them — the ones cycling through the ER, falling out of follow-up, drifting off their care plans — were both the most at risk and the most expensive. They wanted a model to point at them.
"Use AI" is a solution wearing the costume of a problem. The useful move wasn't to nod and reach for a model. It was to ask what decision this was supposed to make easier, and what they'd actually need in place before any model could earn its keep.
02 — The real problem
They didn't have a model problem. They had a data problem.
Their patient history lived inside the EHR, in a shape built for running a clinic visit-by-visit — not for asking "who, across the whole panel, is rising in risk right now?" There was no local copy they could query, no way to look across encounters, labs, and appointments at once. You cannot train or trust a model on data you can't even pull together and inspect.
What they asked for
An ML model to flag high-cost patients.
What they needed first
Their own records pulled into a database they could question — and a transparent way to score high-risk patients that a clinician could read, trust, and push back on.
So the plan inverted the request. Build the data foundation and a rule-based, fully explainable risk score now — something that delivers a usable worklist this month and that a skeptical clinician can audit line by line. Treat machine learning as a later improvement layered on that foundation, once there's a baseline to beat and clean history to learn from. Not AI from the get-go; AI once it has somewhere meaningful to stand.
03 — Constraints that shaped everything
The non-negotiables came from the room, not the README.
- PHI never leaves the practice. The whole tool runs locally against their own export. No cloud, no third party holding patient data.
- Every score has to explain itself. A black box that says "this patient is a 92" is useless to a clinician who has to act on it. The "why" is part of the output, not an afterthought.
- It has to run every week, by a normal person. A handful of commands, repeatable, hard to get wrong.
- It has to survive me. Documented, dependency-light, legible to whoever maintains it next.
04 — The system
A small pipeline, each stage inspectable.
Practice Fusion exposes a Bulk FHIR export. That goes into a single-file DuckDB warehouse — fast analytical SQL over the whole panel, no server to run. A configurable rubric scores every patient; each run is versioned and audited so you can always answer "why did this number change?"
- 01 Bulk FHIR export
- 02 DuckDB warehouse
- 03 Rubric scoring engine
- 04 Versioned, audited run
- 05 Weekly worklist
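To make stage 01 concrete: a Bulk FHIR export arrives as NDJSON, one resource per line, each carrying a `resourceType`. A minimal sketch of the kind of sanity tally you'd want before loading anything into the warehouse might look like this. The function name `count_resources` and the sample records are made up for illustration; this is not pfbulk's actual ingest code.

```python
import json
from collections import Counter
from typing import Iterable


def count_resources(lines: Iterable[str]) -> Counter:
    """Tally FHIR resources by type from NDJSON lines (one JSON object per line)."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines between records
            continue
        resource = json.loads(line)
        counts[resource.get("resourceType", "Unknown")] += 1
    return counts


# Synthetic sample, not real export content.
sample = [
    '{"resourceType": "Patient", "id": "p1"}',
    '{"resourceType": "Encounter", "id": "e1", "subject": {"reference": "Patient/p1"}}',
    '{"resourceType": "Encounter", "id": "e2", "subject": {"reference": "Patient/p1"}}',
    '{"resourceType": "Observation", "id": "o1", "subject": {"reference": "Patient/p1"}}',
    "",
]
counts = count_resources(sample)
print(dict(counts))
```

A tally like this is cheap to run on every export and gives the operator a first "does this week look like last week?" check before any scoring happens.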
The rubric lives in a YAML file — weights and thresholds a human can read and adjust, not magic numbers buried in code. Change the rubric and the CLI can diff the new run against the old one before anything reaches the people using it.
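The run-to-run diff can be pictured as a set comparison over each run's top of the list. The sketch below assumes a run is just a mapping of patient id to score; `top_n` and `diff_runs` are hypothetical names, and the real CLI's internals are not shown here.

```python
from typing import Dict, Set, Tuple


def top_n(scores: Dict[str, int], n: int) -> Set[str]:
    """Return the ids of the n highest-scoring patients (ties broken by id)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return {patient_id for patient_id, _ in ranked[:n]}


def diff_runs(old: Dict[str, int], new: Dict[str, int], n: int) -> Tuple[Set[str], Set[str]]:
    """Compare two runs: who rose into the top n, and who fell out of it."""
    old_top, new_top = top_n(old, n), top_n(new, n)
    return new_top - old_top, old_top - new_top


# Synthetic scores keyed by made-up patient ids.
run_old = {"A": 90, "B": 80, "C": 70, "D": 40}
run_new = {"A": 88, "B": 55, "C": 72, "D": 75}
rose, fell = diff_runs(run_old, run_new, n=3)
print("rose into top 3:", sorted(rose), "fell out:", sorted(fell))
```

Running the same comparison with the rubric held constant separates "the patients changed" from "the rules changed", which is the question a reviewer needs answered before a new list goes out.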
On top of the engine sits the cockpit — the screen a nurse actually uses to run the week and read the list. It started as a Streamlit app; as the workflow grew (preflight checks, import safety, outreach logging) it began outgrowing what Streamlit can hold, so it's being rebuilt on a FastAPI backend with a real JS front end.
05 — Working proof
What a Monday actually looks like.
Under the cockpit, the engine is a CLI, pfbulk, and it stays the source of truth: ingest the week's export, score the panel, get a ranked worklist where every name carries its reasons. (Synthetic data shown here; real runs never leave the clinic.)
$ pfbulk ingest-downloaded --run-dir data/raw/pf/2025-05-26
✓ 4,812 patients · 38,902 encounters · 9,140 lab results
$ pfbulk score-vip --snapshot-date 2025-05-26
scoring 4,812 patients · rubric: vip.yaml (v1)
──────────────────────────────────────────────
92 Pt #4821 ER visit (12d) · post-hospital, no follow-up · A1c overdue
87 Pt #3307 3 missed appts (90d) · 5 chronic conditions · no future visit
81 Pt #5092 post-hospital (5d) · no follow-up scheduled
63 Pt #2210 A1c overdue · 1 missed appt (90d)
… 4,808 more, ranked
$ pfbulk score-vip diff run_0518 run_0526
↑ 14 patients rose into the top 100 · 9 fell out · rubric unchanged
And because every score is built from named components, any number on that list opens up into its reasons.
No weight is hidden. A clinician who disagrees can point at the exact line — and the rubric is a config file, so disagreement becomes a one-line change, not a rewrite.
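One way to picture a score that opens up into its reasons: keep the total and its named components together, so the number is never separated from what produced it. This is a sketch under stated assumptions. The rule names and weights below are invented, and the rubric is written as a Python dict standing in for what the YAML file would hold; it is not the clinic's actual rubric.

```python
from typing import Dict, List, Tuple

# Hypothetical rubric: rule name -> points. The real one lives in YAML;
# these names and weights are invented for illustration.
RUBRIC: Dict[str, int] = {
    "recent_er_visit": 40,
    "post_hospital_no_follow_up": 35,
    "a1c_overdue": 15,
    "missed_appointments": 10,
}


def score_patient(flags: Dict[str, bool], rubric: Dict[str, int]) -> Tuple[int, List[Tuple[str, int]]]:
    """Return (total, components) so every point in the total is attributable."""
    components = [(rule, points) for rule, points in rubric.items() if flags.get(rule, False)]
    total = sum(points for _, points in components)
    return total, components


# Synthetic patient: three rules fire, one does not.
flags = {"recent_er_visit": True, "post_hospital_no_follow_up": True, "a1c_overdue": True}
total, components = score_patient(flags, RUBRIC)
print(total)
for rule, points in components:
    print(f"  +{points:>3}  {rule}")
```

Because the total is only ever the sum of the listed components, changing one weight in the rubric changes exactly one line of every affected breakdown.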
06 — What it proved
The boring foundation was the whole win.
The clinic gets a defensible worklist every week, and can ask any score "why?" and get a real answer — which is exactly what makes people trust it enough to act. Just as importantly, the thing they originally asked for is now possible: there's a clean, queryable history and a transparent baseline to measure against. The unglamorous data work didn't postpone the AI goal. It's the only thing that ever made it reachable.
One thing worth being honest about: every weight in that rubric is still a careful guess. There's no ground truth yet for who should have been called. So the clinic started logging what happens after each outreach attempt (reached, voicemail, appointment booked, declined), captured by the nurses doing the outreach. That data can't be backfilled, so it's being collected now, before it's strictly needed: in a year, it's what turns hand-tuned scores into validated ones.
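For a sense of what that log needs to hold, here is a rough sketch of one outreach record. The four outcome values come straight from the list above; everything else (the field names, the JSON-lines format, the `OutreachEvent` type) is an assumption for illustration, not the tool's actual schema.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import List


class Outcome(Enum):
    REACHED = "reached"
    VOICEMAIL = "voicemail"
    APPOINTMENT_BOOKED = "appointment_booked"
    DECLINED = "declined"


@dataclass(frozen=True)
class OutreachEvent:
    patient_id: str          # hypothetical field names throughout
    run_id: str              # which scoring run put the patient on the list
    score_at_outreach: int   # the score the nurse saw when calling
    outcome: Outcome
    logged_on: str           # ISO date, e.g. "2025-05-27"


def to_jsonl(events: List[OutreachEvent]) -> str:
    """Serialize events as JSON lines, one append-friendly record per line."""
    rows = []
    for event in events:
        record = asdict(event)
        record["outcome"] = event.outcome.value  # store the plain string
        rows.append(json.dumps(record, sort_keys=True))
    return "\n".join(rows)


# Synthetic events, not real patients.
events = [
    OutreachEvent("pt-001", "run_0526", 92, Outcome.VOICEMAIL, "2025-05-27"),
    OutreachEvent("pt-002", "run_0526", 87, Outcome.APPOINTMENT_BOOKED, "2025-05-27"),
]
log_text = to_jsonl(events)
print(log_text)
```

Recording the run id and the score alongside the outcome is the part that matters later: it ties each result back to the exact list that prompted the call.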
07 — What's next
Now the model has somewhere to stand.
A v2 scoring engine is in progress — it runs in parallel with v1, explains every rule component, and can diff two runs so a rubric change is never a leap of faith. The cockpit is moving onto a FastAPI + JS foundation built for that growth. And once enough outreach outcomes are logged, the original ask finally has somewhere to stand: machine learning judged against the rule-based baseline and the real engagement data — not replacing the explainability that earned the clinic's trust, but earning its own place on top of it.