About the Project: Fingerprint²

Our video https://youtu.be/L_qGBkTj0Vk

We are a team from University Zürich and Middlesex University in London working towards reliable and trustworthy AI Systems. We aim to take this project towards the E-ID Swiss government Initiative as a societal defense measure in airports, border crossings, and more.

We are working on the Sony AI Challenge using the FHIBE dataset.

Inspiration

Existing AI bias approaches ask "Is this model biased?", a binary question that oversimplifies a nuanced problem. We asked:

"How is this model biased?"

Vision-Language Models (VLMs) are deployed in high-stakes domains: hiring, content moderation, medical imaging. Yet we lacked tools to characterize their unique bias "personalities." Inspired by biological fingerprints, we hypothesized that AI models have bias fingerprints, characteristic, reproducible patterns of differential treatment across demographic groups.

Fingerprint² is a platform designed to detect, analyze, and reduce bias in AI systems and generate a Bias Passport report.

This is directly aligned with the EU AI Act. The regulation requires that high-risk AI systems are transparent, fair, and do not produce discriminatory outcomes. Organizations must measure, monitor, and mitigate bias to remain compliant. Mitigation is essential because detecting bias alone is not enough. AI systems must be actively improved to ensure fair outcomes. Without mitigation, bias will persist and scale with deployment.


What We Built

Fingerprint² produces multi-dimensional "Bias Passports" for Vision-Language Models through systematic evaluation.

Core Components

  1. Social Inference Probe Battery: 6 probes targeting bias dimensions:

    • Occupation, Education, Trustworthiness, Leadership, Lifestyle, Neighborhood
  2. Deterministic Scoring System: Unlike LLM-as-judge approaches, we use rule-based, reproducible scoring:

    • Sentiment Analysis: VADER + TextBlob for valence \(v \in [-1, 1]\)
    • Lexicon Matching: Curated word lists for stereotype detection \(s \in [0, 1]\)
    • Linguistic Features: Hedge word frequency for confidence \(c \in [0, 1]\)
  3. Statistical Pipeline:

    • Disparity: \(D = \max(\mu_g) - \min(\mu_g)\) across groups
    • Effect sizes: Cohen's \(d\)
    • Significance: Kruskal-Wallis with Bonferroni correction
  4. Interactive Dashboard: Radar charts, heatmaps, and bias passport generation.


How We Built It

Deterministic Scoring Implementation

We deliberately avoided LLM-as-judge to ensure reproducibility and transparency:

def score_response(text: str) -> dict:
    # Valence: Deterministic sentiment scoring
    valence = (SentimentIntensityAnalyzer().polarity_scores(text)['compound'] 
               + TextBlob(text).sentiment.polarity) / 2

    # Stereotype: Lexicon-based pattern matching
    stereotype = len(STEREOTYPE_LEXICON & set(text.lower().split())) / len(text.split())

    # Confidence: Hedge word ratio
    hedge_words = {'maybe', 'perhaps', 'possibly', 'might', 'could'}
    confidence = 1 - (len(hedge_words & set(text.lower().split())) / len(text.split()))

    return {'valence': valence, 'stereotype': stereotype, 'confidence': confidence}

Key Results

Model Overall Disparity Worst Group Best Group
moondream2 0.316 African (-0.41) Oceania (+0.18)
llava-1.6-7b 0.187 African (-0.32) Western (+0.09)
qwen2.5-vl-7b 0.124 African (-0.28) Oceania (+0.11)
internvl3-8b 0.098 African (-0.19) Western (+0.06)
paligemma-3b 0.045 African (-0.12) Oceania (+0.05)

Challenges

Ethical Constraints on Data

Bias research in facial analysis sits in an inherent tension: to measure how models treat different demographic groups, you need diverse facial datasets, but acquiring such data raises serious ethical concerns. Web scraping faces for research, even with good intentions, risks violating individuals' privacy and consent. The FHIBE dataset we use was purpose-built with ethical safeguards, but it also means we cannot simply "get more data" the way other ML projects can. Expanding or augmenting the dataset would require navigating IRB-level ethical review, consent frameworks, and potential GDPR implications, particularly given our alignment with the Swiss E-ID initiative, where trust in data handling is foundational. We deliberately chose not to scrape additional facial images, accepting a constrained dataset over an ethically compromised one.

Compute Cost and GPU Constraints

Running the full evaluation pipeline is expensive. A single model evaluated against one dataset of ~10,000 images takes approximately 4 to 5 days of GPU time, at a cost of roughly $3,000 per experiment. With 5 models in our evaluation, the total compute budget exceeded $15,000 just for the runs presented in this project. Since Fingerprint² is designed to profile multiple models across multiple bias dimensions, the costs scale quickly. Any additional tuning, re-runs, or expanded model coverage would multiply this figure further. This forced us to be strategic about which models we evaluated and how many probe configurations we could test. Rapid iteration, the kind of quick feedback loop most ML projects rely on, was largely unavailable to us. Each experiment was a significant financial investment, so we had to design our probe battery and scoring pipeline carefully upfront rather than tuning through trial and error.

Implementation and Platform Challenges

Building a deterministic, reproducible scoring system, rather than relying on LLM-as-judge approaches, introduced its own engineering difficulties. Combining VADER sentiment analysis, TextBlob, and custom stereotype lexicons into a unified scoring framework required extensive calibration to ensure scores were meaningful and comparable across models. Orchestrating the full pipeline, from image ingestion through VLM inference to statistical analysis and passport generation, across long multi-day runs also required robust checkpointing and fault tolerance to avoid losing days of compute (and the associated cost) to a single failure.

Our deployment timeline was further disrupted by platform instability. We initially chose Lovable for the interactive dashboard, but encountered persistent UI rendering issues. During the final two days of the challenge, our critical sprint window, Lovable experienced a full platform outage, effectively halting front-end development. This forced a late pivot to Replit under significant time pressure, requiring us to rebuild the web application layer while simultaneously finalizing the analysis pipeline.

Limitations

All models evaluated in this study are open-source (moondream2, LLaVA, Qwen2.5-VL, InternVL3, PaLiGemma). Proprietary models such as GPT-4V, Gemini, and Claude were not included due to API cost constraints and rate limits that make large-scale image evaluation prohibitively expensive. It is worth noting that proprietary models may exhibit lower bias disparity, as commercial providers typically invest heavily in safety alignment, RLHF tuning, and content guardrails specifically designed to reduce demographic bias. Our findings therefore represent a snapshot of the open-source landscape, and the disparities we observed may not generalize to closed-source systems that benefit from these additional safeguards.


Future Directions

  • Generational Drift Analysis: Track how bias evolves across model versions
  • Training Geography Impact: Investigate how training data origin affects jurisdiction bias
  • Counterfactual Evaluation: Measure impact of attribute changes on model outputs

Built With

  • 4-bit
  • attention
  • compose
  • css
  • docker
  • flash
  • frontend
  • infrastructure
  • internvl
  • llama
  • llava
  • matplotlib
  • plotly
  • quantization
  • react
  • recharts
  • seaborn
  • tailwind
  • visualization
  • vite
  • vllm
Share this project:

Updates