Inspiration

As three seniors at MIT, we know all too well the negative effects of SFFA v. Harvard. At MIT, enrollment of Black and Latino students has dropped by 12%, and we have watched this effect in real time as our cultural groups' numbers dwindle before our eyes. We've had several conversations with the MIT admissions office to understand the challenges it faces when trying to pick a diverse class of students. We created a solution to some of the pains elite schools experience as they adapt to this new post-affirmative-action academic landscape.

What it does

The Problem

To understand Diversify, you first need to understand how college admissions work. This is somewhat of an oversimplification, but most elite schools choose applicants in two batches. The first batch filters for applicants the committee believes can handle the rigor of the institution, as measured by test scores and grades. This batch can include tens of thousands of students, still too many to make up a class, yet URMs such as Black and Latino students are already grossly underrepresented in this initial selection. Previously, admissions offices could use affirmative action to compensate for this underrepresentation as they prepared the final selection. Now, however, admissions officers are not allowed to consider race at all, even when reading essays. This is why the Diversify algorithm operates between these two batches, helping pick a diverse set of candidates from an already qualified pool. Unlike traditional affirmative action, Diversify skips race and instead optimizes for intellectual diversity, with the hypothesis that intellectual diversity will lead to racial diversity. So what does it actually do?

The Solution

The Diversify algorithm reads over applicants' essays, personal statements, and supplemental info to create a profile of each applicant's experience, compares that profile to the rest of the pool, and assigns each applicant a score indicating how intellectually and experientially different they are from the other candidates. This gives admissions officers and recruiters an easy benchmark for how diverse a cohort they are creating. And it does this without directly considering race or breaking any laws.
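As a toy illustration of this kind of scoring (not the production algorithm, which is described under "How we built it"), one simple "difference" measure is each applicant's mean cosine distance to the rest of the pool:

```python
import numpy as np

def distinctiveness(embeddings: np.ndarray) -> np.ndarray:
    """Toy difference score: each applicant's mean cosine distance to
    every other applicant in the pool. Higher = more distinct.
    `embeddings` is an (n_applicants, dim) array of essay embeddings."""
    # L2-normalize rows so dot products become cosine similarities
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T                      # pairwise cosine similarity
    n = len(embeddings)
    # Drop self-similarity (the diagonal of 1s) before averaging
    mean_sim = (sims.sum(axis=1) - 1.0) / (n - 1)
    return 1.0 - mean_sim                     # cosine distance
```

An applicant whose essay sits far from everyone else's in embedding space gets a high score, while near-duplicate essays drag each other's scores down.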

How does this relate to Skills-based workforce development?

Diversity is crucial to a skills-based workforce. By ensuring that individuals from diverse backgrounds are equipped with the skills needed to contribute, we can address the unique challenges faced by every community. When people from these communities are empowered with the right skills, they are better positioned to develop solutions to the problems they understand best. A prime example: three Black students being equipped with the tools to solve a problem faced by Black students across the nation 😉. Additionally, the Diversify algorithm doesn't have to stop at college admissions. It can also be applied to job applicants to ensure a diverse selection of thinkers in our workforce.

How does this relate to Cybersecurity and risk?

The Cybersecurity and Risk challenge is "focused on developing tools that enhance security in digital environments, particularly for protecting communities of color and activists." As students, we love the college admissions use case. However, Diversify is an anti-bias algorithm, and it can be extended to set benchmarks that challenge bias-rigged systems such as loan approvals and job applications.

How we built it

We used Next.js for the website and Flask as a Python server to process files. The files are converted into 256-dimensional embedding vectors using the OpenAI embeddings API. Once we have all the embeddings, we use one of two algorithms to score applicants. NullLoader is a custom algorithm our team created: it assigns scores by projecting each embedding vector onto the nullspace of the entire corpus and taking the norm of the projection. However, NullLoader needs a set of essays larger than the dimensionality of the embedding vectors. If the set is smaller, we fall back to k-means clustering with k = 6 and assign scores based on cluster size.
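The description above leaves NullLoader's exact math open to interpretation, so here is a minimal sketch under one reading: treat "the nullspace of the corpus" as the space of linear dependencies among the essay embeddings (vectors c with Xᵀc = 0), and score essay i by the norm of the projection of the i-th standard basis vector onto that space. The function names, the dependency interpretation, and the inverse-cluster-size scoring for the fallback are our assumptions, not the team's actual code:

```python
import numpy as np
from sklearn.cluster import KMeans

def nullloader_scores(X: np.ndarray) -> np.ndarray:
    """Sketch of NullLoader (one interpretation, not the real code).
    X: (n_essays, dim) embedding matrix, requires n_essays > dim so
    that linear dependencies among the essays must exist."""
    n, d = X.shape
    if n <= d:
        raise ValueError("NullLoader needs more essays than embedding dims")
    # SVD of X^T (d x n): right-singular vectors with ~zero singular
    # values span null(X^T), i.e. the dependencies among essays.
    _, s, vt = np.linalg.svd(X.T)
    rank = int((s > 1e-10).sum())
    null_basis = vt[rank:]            # orthonormal rows, shape (n - rank, n)
    # Norm of e_i projected onto the nullspace = norm of column i.
    return np.linalg.norm(null_basis, axis=0)

def kmeans_scores(X: np.ndarray, k: int = 6, seed: int = 0) -> np.ndarray:
    """Fallback for small pools: cluster with k-means and score by
    cluster size (here, inverse size: rarer essay types score higher)."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    sizes = np.bincount(labels, minlength=k)
    return 1.0 / sizes[labels]
```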

How to test

Website

The website is pretty basic since most of our work went into researching and developing the algorithm. However, it's pretty easy to use: simply upload a compressed folder of .txt files containing the essays, and each applicant will be assigned a score that can be searched on the website. You can download an example zip file here.

Colab

We highly recommend playing around with the sample data and the notebook in the experiments folder of the Git repo. There you can see how the algorithm gives underrepresented groups more representation than they would get from a random sampling. You can also find the synthetic_dataset.pkl file in the experiments folder.

Challenges we ran into

Given that personal statements are, well... personal, there are lots of privacy hurdles when trying to find a dataset. We tried to make our own dataset but ultimately couldn't gather enough data. Instead, we decided to make a batch request to OpenAI and create a synthetic dataset. The dataset turned out surprisingly well. We'll also say it's interesting walking the line between tech and policy.
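For reference, a batch job like ours boils down to writing a JSONL file with one chat-completion request per line and submitting it to the OpenAI Batch API. A sketch of assembling that file locally (the prompt wording, model name, and function name are placeholders, not our exact generation script):

```python
import json

def build_batch_file(prompts, path, model="gpt-4o-mini"):
    """Write a JSONL file for the OpenAI Batch API: one chat-completion
    request per line. Sketch only; prompts and model are placeholders."""
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts):
            request = {
                "custom_id": f"essay-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [
                        {"role": "system",
                         "content": "Write a realistic 650-word college personal statement."},
                        {"role": "user", "content": prompt},
                    ],
                },
            }
            f.write(json.dumps(request) + "\n")
```

The resulting file is then uploaded with `purpose="batch"` and submitted through the Batches endpoint, which returns all completions asynchronously.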

Accomplishments that we're proud of

In our Google Colab experiment, you can see we were able to double the representation of Black candidates and quadruple the representation of American Indian and Pacific Islander candidates, despite both groups being extremely underrepresented in the initial pool. We compared this to our control, a random sampling, and confirmed that these gains held up.

What we learned

Hyperspace is weird, and there's no perfect way to select a completely diverse set of applicants. But this project gave us a good start on thinking about ways to address diversity post-affirmative action.

What's next for Diversify

  • Making a more usable website
  • Hosting the website
  • Applying transforms to the latent space to get better results

It's also important to note that, because of the NullLoader algorithm, the size of the dataset limits the dimensionality of the vector representations. With larger datasets of tens of thousands of applicants, you could increase the dimensionality of the vectors, better represent the data, and possibly get even better results.
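Under one reading of NullLoader (treating the nullspace as the set of linear dependencies among the essay embeddings, which is our assumption), this constraint is easy to see with random data: dependencies only exist once the number of essays exceeds the embedding dimension, and the nullspace then grows with the pool.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # embedding dimension (illustrative; the app itself uses 256)

# Fewer essays than dimensions: a random corpus has no linear
# dependencies, so the dependency nullspace is empty and
# nullspace-projection scores all collapse to zero.
small = rng.normal(size=(32, d))
print(small.shape[0] - np.linalg.matrix_rank(small))   # 0

# More essays than dimensions: dependencies must exist, giving the
# scores room to vary across applicants.
big = rng.normal(size=(200, d))
print(big.shape[0] - np.linalg.matrix_rank(big))       # 136
```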

Built With

next.js, python, flask, openai