Inspiration

Aella's post, "The Vision of Slutcon," if you ignore the sexual context, has an interesting point that can be generalized. "If you’re learning piano, you immediately hear when you press a wrong key; if you’re rock climbing, you know how far up you are on the wall. You experiment, adjust, until you figure out how to get closer to your goal. But when it comes to being attractive to women, men are collectively shit out of luck; you get feedback sometimes, ... if a woman says yes to you asking her on a date. But most of the time your behavior is causing concrete, visceral impacts in women that they will never tell you about. It’s like trying to learn how to play a piano with earplugs in. All your moves fall into some great void."

There are a lot of things we want to get better at but can't, because we get no immediate signal telling us whether we're doing well or poorly. We're monkeys throwing darts with blindfolds on.

The classical way to fix this problem is to hire a "coach" for whatever task we want to do. This tends to be expensive and difficult. People don't like admitting that they're bad at something and need help; they'd rather keep blindly throwing darts and imagine that they are getting better at something.

Thus, the premise for this project: what if we leveraged coaches' implicit knowledge about a field to build an automatic feedback detector for obscure-feedback tasks?

What it does

Our goal is to approximate some function f that maps continuous (in the future, multimodal) input to a "fitness score." When we detect a confident, dramatic change in this fitness score (a large swing in either the positive or negative direction), we alert the user and choose a feedback message that tells them what triggered it.

Since the set of inputs is ill-defined and inconsistent between topics, we want to approximate this function implicitly. The problem is so general that there isn't a way to get a good labeled dataset for it, so we would crowdsource experts: we give them a choice between 2 similar inputs, ask them to choose one (with some weight w), and ask them for the reason they rank one over the other. Due to the 24-hour limit, we set this function manually, but we provide a way for experts to enter their choices if we had access to them.

The Bradley-Terry model is great for deriving a ranking from pairwise choices (think of a chess Elo system). It estimates the probability that A beats B with the function:

$$ P(A \succ B) \;=\; \sigma(s_A - s_B) \;=\; \frac{1}{1 + e^{-(s_A - s_B)}}, $$

where \(\sigma(z)=\frac{1}{1+e^{-z}}\) is the logistic (sigmoid) function.

(In our code, we weight each person's input by multiplying their score weighting by \(\alpha_r\).)

We don't really care about the probability that A beats B, since we already have a pair (A, B) and a winner. We really want each of the latent scores, the \(s_i\) for both A and B.

To do this, we start each score at zero and use our Bradley-Terry model to forecast the win probability. We compute the error between that forecast and the observed result, weight it by our \(\alpha_r\), and update both \(s_i\).
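The update described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `bt_update`, the learning rate `lr`, and the item labels are all hypothetical, and the step is a plain gradient-style update on the Bradley-Terry log-likelihood.

```python
import math
from collections import defaultdict


def sigmoid(z: float) -> float:
    """Logistic function used by the Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(-z))


def bt_update(scores, winner, loser, alpha_r=1.0, lr=0.1):
    """Apply one online Bradley-Terry step for a single expert judgment.

    alpha_r is the rater weight from the text; lr is an assumed learning rate.
    Returns the win probability forecast before the update.
    """
    p_win = sigmoid(scores[winner] - scores[loser])  # forecast P(winner beats loser)
    error = 1.0 - p_win                              # observed outcome is 1: the winner won
    step = lr * alpha_r * error
    scores[winner] += step                           # push the winner's latent score up
    scores[loser] -= step                            # and the loser's down by the same amount
    return p_win


# Every latent score starts at zero.
scores = defaultdict(float)

# Hypothetical expert judgments: (winner, loser, rater weight alpha_r).
judgments = [("A", "B", 1.0), ("A", "C", 0.5), ("B", "C", 1.0)]
for winner, loser, alpha_r in judgments:
    bt_update(scores, winner, loser, alpha_r=alpha_r)
```

Because each step adds to one score exactly what it subtracts from the other, the scores stay centered around zero, which keeps them comparable across items.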

We then turn each of the scores into probabilities that roughly reflect how "good" a given action is. When our general system sees an action with a probability of 20%, we know that roughly 80% of the time, that action is bad, and we should provide feedback about it.

To turn our scores into probabilities, we use Platt scaling, which gives a probability (a number between 0 and 1). (If we wanted something not normalized to the range 0 to 1, we would set d = max * d + min.)

$$ \hat p \;=\; \sigma(d) \;=\; \frac{1}{1+e^{-d}}. $$
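As a rough sketch of this step: the function below maps a score d to a probability with the logistic function. The slope `a` and offset `b` are assumptions added for illustration (standard Platt scaling fits them on held-out labels); with the defaults `a=1.0`, `b=0.0` it reduces to the formula above. The name `platt_probability` is hypothetical.

```python
import math


def platt_probability(d: float, a: float = 1.0, b: float = 0.0) -> float:
    """Map a latent score d to a probability in (0, 1) via sigma(a * d + b).

    a and b are assumed calibration parameters; the defaults give plain sigma(d).
    """
    return 1.0 / (1.0 + math.exp(-(a * d + b)))


# A score of zero (an item that has never won or lost) maps to 0.5.
neutral = platt_probability(0.0)

# Larger positive scores approach 1, larger negative scores approach 0.
good = platt_probability(2.0)
bad = platt_probability(-2.0)
```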

Now that we have a list of probabilities for each element, we want to turn that into an "action potential" to fire off and alert the user in real time (thus, immediate, unambiguous feedback!).

Action potentials in the brain work by "spiking" after a certain threshold has been met. We set thresholds modeling this for both positive and negative responses (customizable in config.py, \(\tau_{\text{pos}} = 0.8\), \(\tau_{\text{neg}} = 0.2\)).

We have three states: positive, negative, and neutral.

We also add features here like persistence (we wait to report positive until we've seen N consecutive checks above the threshold; we use N = 2) and hysteresis (setting exit boundaries to be lower than the entries, so we exit positive only if the probability dips below 0.7).

When we see the flags, we alert the user with an associated reason (from our experts)!
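The thresholding, persistence, and hysteresis logic could look roughly like the sketch below. The class name `SpikeDetector` and the constant names are hypothetical, not taken from config.py. The negative exit boundary of 0.3 and the use of persistence for negative spikes are assumptions that mirror the positive side described above.

```python
from dataclasses import dataclass
from typing import Optional

TAU_POS, TAU_NEG = 0.8, 0.2    # entry thresholds from the text
EXIT_POS, EXIT_NEG = 0.7, 0.3  # exit thresholds; EXIT_NEG is an assumed mirror of 0.7
PERSISTENCE = 2                # consecutive checks required before firing


@dataclass
class SpikeDetector:
    state: str = "neutral"
    streak_pos: int = 0
    streak_neg: int = 0

    def step(self, p: float) -> Optional[str]:
        """Feed one probability; return "positive"/"negative" when a spike fires, else None."""
        # Hysteresis: leave an active state only after crossing its exit boundary.
        if self.state == "positive" and p < EXIT_POS:
            self.state = "neutral"
        elif self.state == "negative" and p > EXIT_NEG:
            self.state = "neutral"

        # Persistence: count consecutive checks beyond each entry threshold.
        self.streak_pos = self.streak_pos + 1 if p >= TAU_POS else 0
        self.streak_neg = self.streak_neg + 1 if p <= TAU_NEG else 0

        if self.state != "positive" and self.streak_pos >= PERSISTENCE:
            self.state = "positive"
            return "positive"
        if self.state != "negative" and self.streak_neg >= PERSISTENCE:
            self.state = "negative"
            return "negative"
        return None


detector = SpikeDetector()
stream = [0.5, 0.85, 0.9, 0.75, 0.65, 0.1, 0.15]
events = [detector.step(p) for p in stream]
```

In this stream, the dip to 0.75 does not end the positive state because it stays above the 0.7 exit boundary; only the drop to 0.65 returns the detector to neutral. Each fired event is where the alert and the expert's reason would be shown to the user.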

How we built it

Tragically, I am typically very anti-LLM for coding. During this hackathon, since my partners who knew web development dropped out, I decided to give vibe-coding a shot for building the frontend/backend interactions so I could supervise and direct at a high level. (The "we" that I keep using in this refers to me and chat.)

I used Python for the Bradley-Terry model training, the math, and the streaming. Since I plan to scale this app after HackTX, I synced Cloudflare to my frontend.

I have no idea how the frontend works (at the time of writing this), but by the end, hopefully, I will gain a good enough understanding to make it moderately visually appealing.

Challenges we ran into

Cloudflare isn't working.

Accomplishments that we're proud of

What we learned

What's next for EVDojo

We want to add a classifier layer to bucket our A/B tests so we can generalize the website to be multimodal. (This should be easy to scale: multiply the classifier's confidence by our internal probability for each element to score a response. If I had to guess, the biggest challenge here would probably be coming up with discrete buckets for each input label.)

Get people to train on the platform! I'm sure Aella would promote this specifically for the "finding love" task, and maybe we can ride that popularity to something more general!
