Patients Like Me

Treatments that helped
Patient Experience Ranker
Related conditions
Recomended Doctors
GIF
Spectrum of data and general treatment signals

slide deck at https://patients-like-me-080boq1.gamma.site/

Inspiration

Long COVID is treated as one disease. It isn't.

The same diagnosis is given to someone with racing-heart orthostatic
intolerance (POTS), someone with histamine flares and food reactions
(MCAS), and someone bedbound by post-exertional crashes (ME/CFS).
They get lumped into one bucket, handed the same generic advice, and
told research is "still early." Meanwhile, the people living it have
already run thousands of n-of-1 experiments and written the results
down in patient forums.

We were inspired by a simple observation: the evidence patients need
already exists; it's just trapped in unstructured stories. A
newly-sick person asking "will low-dose naltrexone help me?" has to
read 400 Reddit threads to guess. We wanted to turn that lived
experience into something you can actually query: what happened to
patients like me?

What it does

You tell Patients Like Me about yourself — your conditions (POTS, MCAS, ME/CFS, PEM, EDS, dysautonomia, small-fiber neuropathy, fibromyalgia) and your functional status — and it pulls a real cohort of patients who share your phenotype, then answers three questions:

- Treatments: which drug classes actually gave people like you a
positive experience, ranked, each with a 95% confidence interval and
a real-time, profile-specific explanation grounded in the actual
patient quotes (not generic mechanism text).
- Diagnoses: which other conditions are enriched in your cohort —
co-occurrence patterns that are prompts to raise with a clinician,
never a diagnosis.
- Doctors: which clinicians and researchers patients like you
actually mention.

Every number is backed by real reports, real denominators, and a
visible "what's missing." It is decision support from lived
experience, explicitly not medical advice.

How we built it

A pipeline from raw forum text to calibrated, queryable evidence:

1. Extraction. ~35k posts from ~4,300 patients, run through an LLM +
spaCy NER to pull structured conditions, treatment_reports (with
sentiment), and practitioner mentions into a SQLite corpus.
2. Phenotype matrix. We multi-hot encoded ~95 fields into a patient ×
feature matrix and asked the obvious question: are there clean
subtypes?
3. Per-drug-class outcome models. For each treatment class we fit a
logistic model on corrected, self-reported outcomes:
$$
\text{logit} = \mathbf{x}^\top \boldsymbol\beta, \qquad p =
\sigma(\mathbf{x}^\top\boldsymbol\beta) =
\frac{1}{1+e^{-\mathbf{x}^\top\boldsymbol\beta}}
$$
with confidence intervals via the delta method on the coefficient
covariance $\Sigma$:
$$
\text{CI}_{95%} = \sigma!\left(\mathbf{x}^\top\boldsymbol\beta \pm
1.96\sqrt{\mathbf{x}^\top \Sigma, \mathbf{x}}\right)
$$
4. Comorbidity enrichment. For the "diagnoses" view we rank
co-conditions by lift:
$$
\text{lift}(c) = \frac{P(c \mid \text{cohort})}{P(c \mid
\text{population})}
$$
5. Serving + explanations. A FastAPI backend
(/api/treatment-evidence, /api/comorbidity, /api/explain) and a
"Patients Like Me" frontend. Explanations are generated by an LLM
(xAI Grok / OpenRouter) grounded in the retrieved cohort quotes, with
a deterministic fallback when no key is set.

Challenges we ran into

- There were no clean clusters. We expected to find 3–4 crisp Long
COVID subtypes. Instead the data is a continuum — conditions overlap,
patients sit on a spectrum. So we pivoted from hard clustering to
$k$-NN similarity ("patients like you") and per-axis modeling, which
is the honest representation.
- The verbosity confound. Presence-encoding meant a 0 could mean
"doesn't have it" or "didn't mention it." Patients who wrote more
looked similar to each other regardless of phenotype — PC1 correlated
with fields-filled at $r \approx +0.95$. We fixed it with IDF
weighting and $L_2$/cosine normalization, collapsing the confound to
$r \approx +0.02$ while preserving real clinical signal.
- The LLM mislabeled sentiment. Our first sentiment pass over-called
positive (~20–25% false positives, asymmetric — negatives were
reliable), inflating how well everything "worked." We diagnosed it
with a manual audit, rewrote the prompt (attribution required, added
a no-effect class), and re-ran — then rebuilt all downstream models
on the corrected labels.
- Keeping patient data controlled while still shipping a demo: DBs
live in controlled storage, pulled at runtime, never committed to the
public repo.

Accomplishments that we're proud of

- A working, end-to-end product that turns messy forum text into
calibrated, uncertainty-aware treatment predictions — confidence
intervals, not false precision.
- Honesty as a feature: we measured and corrected our own confounds
(verbosity, sentiment) instead of shipping a prettier-but-wrong
number.
- Quote-grounded explanations — the "why might this help?" text
paraphrases what real patients actually reported, good and bad,
rather than hallucinating a mechanism.
- Reframing the whole experience around "patients like me," so a sick
person sees their own journey reflected back, not a dashboard.

What we learned

Distrabution seems to be less in extreme clusters and more along a spectrum. We went into this expecting to use kNN and being able to make some sort of clear treatment tree: we found that normal assumptions and regression were much more explanatory.