Inspiration

Every machine learning system that impacts people, whether in lending, hiring, healthcare, or criminal justice, ultimately depends on a single hidden decision: which model gets deployed. That decision is usually made based on overall accuracy, even though overall accuracy hides how differently a model may treat different groups. We built NEXORA because fairness is still treated as something that is checked after deployment, when harm has already occurred, rather than at the moment when the decision is actually made.

What it does

NEXORA is a fairness-aware model selection system that evaluates multiple machine learning models on the same dataset and determines which one is the fairest and most reliable to deploy. A user uploads a dataset and specifies the prediction target along with one or more sensitive attributes. Before any model is trained, the system checks whether the dataset itself already contains bias. It then trains multiple models under identical conditions and evaluates how each one performs across demographic groups. Fairness is measured with group-based metrics such as error rates, selection rates, and calibration, and statistical methods determine whether observed differences are meaningful rather than due to random variation. The system then compares all models in a fairness versus accuracy tradeoff analysis and identifies the best model among those that no other model beats on both performance and fairness. Finally, it generates plain-language explanations of the results so that both technical and non-technical users can understand the impact of each model.
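To make the group-based metrics concrete, here is a minimal sketch of how per-group selection rates and error rates could be computed for a binary classifier. It is an illustration, not NEXORA's actual code: the function names `group_rates` and `max_gap`, the 0/1 label encoding, and the toy data are all assumptions made for this example.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Return per-group selection rate, false positive rate, and false negative rate.

    Assumes binary labels and predictions encoded as 0 or 1 (hypothetical convention).
    """
    counts = defaultdict(lambda: {"n": 0, "selected": 0, "pos": 0, "neg": 0, "fp": 0, "fn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts[group]
        c["n"] += 1
        c["selected"] += pred
        if truth == 1:
            c["pos"] += 1
            c["fn"] += 1 if pred == 0 else 0
        else:
            c["neg"] += 1
            c["fp"] += 1 if pred == 1 else 0

    rates = {}
    for group, c in counts.items():
        rates[group] = {
            "selection_rate": c["selected"] / c["n"],
            # A rate is undefined when the group has no negatives or no positives.
            "fpr": c["fp"] / c["neg"] if c["neg"] else float("nan"),
            "fnr": c["fn"] / c["pos"] if c["pos"] else float("nan"),
        }
    return rates


def max_gap(rates, metric):
    """Largest difference in one metric between any two groups."""
    values = [r[metric] for r in rates.values()]
    return max(values) - min(values)


# Toy data: two groups of four examples each.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = group_rates(y_true, y_pred, groups)
selection_gap = max_gap(rates, "selection_rate")
```

In this toy example both groups have the same share of true positives, yet group "a" is selected three times as often as group "b", which is exactly the kind of disparity that overall accuracy does not reveal.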

How we built it

We built NEXORA as a multi-stage machine learning pipeline that begins with data analysis and continues through model training, fairness evaluation, statistical validation, and explanation generation. The system first analyzes the dataset to detect potential bias by measuring imbalance, missingness differences, and statistical relationships between sensitive attributes and outcomes. It then trains multiple machine learning models, including logistic regression, random forest, gradient boosting methods, and support vector machines, using identical data splits and cross-validation procedures to ensure a fair comparison. After training, each model is evaluated using fairness metrics that are computed separately for each group, including error rates, selection rates, and calibration measures. The system then applies statistical techniques such as bootstrapping to estimate uncertainty in fairness gaps and hypothesis testing to determine whether differences between models are statistically significant. It compares all models using a fairness versus accuracy tradeoff analysis and removes any model that is clearly worse than another in both dimensions. Finally, it generates structured explanations that translate the results into clear insights for engineers, regulators, and general users.
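The first stage, checking the dataset for bias before training, could look roughly like the sketch below. It covers only group imbalance, base-rate differences, and missingness differences. The function name `dataset_bias_summary`, the column names, and the convention that missing values are stored as `None` are assumptions for illustration, not a description of our implementation.

```python
def dataset_bias_summary(rows, target, sensitive):
    """Summarize, per sensitive group: share of the dataset, positive-label rate,
    and fraction of rows with at least one missing value (None)."""
    raw = {}
    for row in rows:
        group = row[sensitive]
        stats = raw.setdefault(group, {"n": 0, "positives": 0, "rows_with_missing": 0})
        stats["n"] += 1
        stats["positives"] += 1 if row[target] == 1 else 0
        stats["rows_with_missing"] += 1 if any(v is None for v in row.values()) else 0

    total = sum(stats["n"] for stats in raw.values())
    return {
        group: {
            "share": stats["n"] / total,
            "base_rate": stats["positives"] / stats["n"],
            "missing_rate": stats["rows_with_missing"] / stats["n"],
        }
        for group, stats in raw.items()
    }


# Hypothetical lending-style rows; "group" is the sensitive attribute.
rows = (
    [{"group": "a", "income": 50, "approved": 1}] * 3
    + [{"group": "a", "income": 40, "approved": 0}] * 3
    + [
        {"group": "b", "income": None, "approved": 0},
        {"group": "b", "income": 30, "approved": 0},
    ]
)

summary = dataset_bias_summary(rows, target="approved", sensitive="group")
```

Here group "b" is underrepresented, has a lower positive-label rate, and has more missing data, so any model trained on these rows inherits those imbalances before a single parameter is fit.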

Challenges we ran into

One of the main challenges we faced was ensuring that fairness comparisons were statistically valid rather than purely descriptive, since simple fairness metrics can be misleading when viewed without uncertainty. We addressed this by incorporating bootstrap confidence intervals and statistical hypothesis testing so that fairness differences could be evaluated with measurable confidence. Another challenge was ensuring that models trained from different families could be compared fairly, which required strict control over training conditions, data splits, and evaluation procedures. We also faced difficulty in translating complex statistical outputs into explanations that non-technical users could understand without oversimplifying the underlying meaning.
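As an illustration of the bootstrap idea, the sketch below estimates a percentile confidence interval for the selection-rate gap between two groups by resampling examples with replacement. The function names, the 2000 resamples, the 95% level, and the synthetic predictions are assumptions chosen for the example; a real pipeline might use a different interval method or resampling scheme.

```python
import random


def selection_rate_gap(y_pred, groups, group_a, group_b):
    """Selection rate of group_a minus that of group_b, or None if a group is empty."""
    a = [p for p, g in zip(y_pred, groups) if g == group_a]
    b = [p for p, g in zip(y_pred, groups) if g == group_b]
    if not a or not b:
        return None
    return sum(a) / len(a) - sum(b) / len(b)


def bootstrap_gap_ci(y_pred, groups, group_a, group_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the selection-rate gap."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(y_pred)
    gaps = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        gap = selection_rate_gap(
            [y_pred[i] for i in idx], [groups[i] for i in idx], group_a, group_b
        )
        if gap is not None:  # skip rare resamples where one group is absent
            gaps.append(gap)
    if not gaps:
        raise ValueError("no valid bootstrap resamples")
    gaps.sort()
    lower = gaps[int((alpha / 2) * len(gaps))]
    upper = gaps[min(len(gaps) - 1, int((1 - alpha / 2) * len(gaps)))]
    return lower, upper


# Synthetic predictions: group "a" is selected 70% of the time, group "b" 40%.
y_pred = [1] * 70 + [0] * 30 + [1] * 40 + [0] * 60
groups = ["a"] * 100 + ["b"] * 100

point_gap = selection_rate_gap(y_pred, groups, "a", "b")
lower, upper = bootstrap_gap_ci(y_pred, groups, "a", "b")
```

If the interval excludes zero, the gap is unlikely to be a product of sampling noise alone; if it straddles zero, the point estimate by itself should not drive the decision.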

Accomplishments that we're proud of

We are proud that we built a system that evaluates fairness at the exact moment of model selection, rather than after deployment when harm is already done. We implemented a multi-model comparison framework that evaluates both fairness and accuracy under statistically controlled conditions, and we incorporated uncertainty estimation into the fairness metrics so that decisions rest on statistically validated results rather than single-point estimates. We designed a Pareto-based selection step that keeps only models that are not dominated on both fairness and accuracy. We also created a multi-audience reporting system that explains results differently for engineers, regulators, and the general public while preserving technical correctness.
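The Pareto-based selection can be sketched in a few lines. In this illustration a model is dominated when another model is at least as good on both accuracy and fairness gap and strictly better on at least one. The dictionary keys and all scores below are hypothetical values invented for the example, not results from NEXORA.

```python
def dominates(a, b):
    """True if model a is no worse than b on both criteria and strictly better on one.

    Higher accuracy is better; a lower fairness gap is better.
    """
    no_worse = a["accuracy"] >= b["accuracy"] and a["fairness_gap"] <= b["fairness_gap"]
    strictly_better = a["accuracy"] > b["accuracy"] or a["fairness_gap"] < b["fairness_gap"]
    return no_worse and strictly_better


def pareto_front(models):
    """Keep only the models that no other model dominates."""
    return [
        m for m in models
        if not any(dominates(other, m) for other in models if other is not m)
    ]


# Hypothetical scores for illustration only.
models = [
    {"name": "logistic_regression", "accuracy": 0.84, "fairness_gap": 0.04},
    {"name": "random_forest", "accuracy": 0.88, "fairness_gap": 0.09},
    {"name": "gradient_boosting", "accuracy": 0.87, "fairness_gap": 0.12},
    {"name": "svm", "accuracy": 0.83, "fairness_gap": 0.06},
]

front_names = [m["name"] for m in pareto_front(models)]
```

With these made-up numbers, gradient boosting and the SVM drop out because another model beats each of them on both axes, leaving logistic regression and random forest as the real tradeoff to decide between.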

What we learned

We learned that fairness in machine learning cannot be reduced to a single metric and instead must be understood as a set of statistical behaviors that only become visible when multiple models are compared under controlled conditions. We also learned that optimizing for accuracy alone can easily hide meaningful disparities between groups, even when models appear to perform well overall. Another important lesson was that fairness analysis is only useful when it can be clearly communicated, because technical correctness alone is not sufficient for real-world decision-making if stakeholders cannot understand the results.

What's next for NEXORA

Next, we plan to extend NEXORA to support more complex data types such as text, images, and time-series data so that it can be applied to a wider range of real-world systems. We also plan to expand the fairness analysis to include intersectional groups so that overlapping identities such as race and gender can be evaluated together. We aim to add continuous monitoring so that fairness can be tracked over time after deployment to detect drift. We also plan to improve causal analysis so that the system can better distinguish between true predictive signals and proxy variables that indirectly encode sensitive attributes. Finally, we want to package NEXORA as a plug-in tool for existing machine learning platforms so that fairness-aware model selection becomes a standard step in every ML workflow.
