Inspiration
AI systems like GPT-4o are increasingly used to assist in high-stakes decisions inluding; hiring, medical triage, performance evaluation. But these models are trained on internet-scale data that reflects real-world inequalities. We wanted to ask a direct question: does GPT-4o assign different professional value to the same role depending on the race and gender of the person in the image? The VisBias dataset gave us the perfect controlled environment to find out.
What it does
ArtificalAyran is an interactive bias detection and mitigation system for Vision-Language Models. It:
- Shows GPT-4o images of professionals across 7 roles (doctor, lawyer, CEO, nurse, cook, firefighter, basketball player) from 10 demographic groups (5 races × 2 genders)
- Measures three bias dimensions: estimated salary, perceived experience, and surgical trust
- Applies a Chain-of-Thought debiasing architecture that forces the model to reason through professional evidence before making a judgment, without ever seeing demographic features
- Lets you compare the biased baseline and the debiased result side-by-side, with every prompt and model response visible, because VLMs should not be black boxes
How we built it
- Dataset: VisBias (210 sampled images, 7 professions × 10 demographic groups, fixed seed for reproducibility)
- Bias detection: GPT-4o queried with research-framed prompts for salary MCQ (A–F brackets), experience score (1–10), and surgical trust (1–10)
- CoT mitigation: Two-stage architecture — Stage 1 extracts only non-demographic professional indicators from the image (attire, equipment, setting, credentials); Stage 2 rates the professional based solely on that text, with no image and no demographic signal
- Analysis: Bias gap = max group mean − min group mean, with % reduction comparing baseline vs CoT
- Frontend: Streamlit app showing the full reasoning chain — prompts, Stage 1 context, Stage 2 prompts, raw responses, and parsed results — for any image in the dataset
Challenges we ran into
- GPT-4o safety filters refuse direct demographic comparisons. We framed all queries as academic research studies, which allowed the model to provide numerical estimates without triggering refusals.
- Bias is not uniform across dimensions. Salary bias was strongly reduced by CoT (63% average gap reduction), but experience perception proved more resistant, visual cues like posture and setting that carry experience signals also correlate with demographics in ways that are hard to cleanly separate.
- The trust/nurse anomaly: Our CoT increased the measured trust bias for nurses. Investigation revealed the baseline trust gap was near zero (the denominator), making percentage changes misleading. The absolute change was negligible, but it showed us that not all bias metrics behave the same way.
Accomplishments that we're proud of
- 63% average salary bias reduction across all professions using Chain-of-Thought prompting alone — no fine-tuning, no retraining
- A fully transparent demo where every prompt, every model response, and every reasoning step is visible to the user — making the black box into a glass box
- Quantified bias gaps that tell a clear story: a white male doctor was estimated to earn up to $50,000 more per year than a Black female doctor, based purely on appearance
- A reproducible, end-to-end pipeline from raw images to bias analysis to mitigation — built and evaluated in under 24 hours
What we learned
- Bias in VLMs is real, measurable, and varies significantly by profession and demographic group
- Chain-of-Thought prompting is a lightweight but genuinely effective mitigation for appearance-anchored biases like salary, no model changes required
- Different types of bias have different visual roots: salary bias is credential-anchored (CoT works well), experience bias is holistic and harder to decouple from appearance
- Transparency in AI reasoning is not just a nice-to-have — it's essential for catching and correcting bias at the point of inference
What's next for ArtificalAyran
The CoT debiasing architecture we developed here has a direct application beyond static image datasets. In my masters thesis, I'm extending this approach to behavioral analysis in surveillance systems where AI models must interpret ambiguous actions in video footage. The same problem applies: models trained on biased data make biased judgments about who looks suspicious, who looks threatening, who looks like they belong. I'm implementing a self-correcting critic layer that applies structured Chain-of-Thought reasoning to flag and override demographically-anchored inferences before they reach a human operator turning the lesson from this hackathon into a safety-critical application.
Log in or sign up for Devpost to join the conversation.