ScholarForge: Democratizing Access to Education

Inspiration

We’ve all been there—spending hours scrolling through scholarship websites, only to find that most opportunities don't match our qualifications. "This one requires residency... this one needs a 4.0 GPA... this one is for nursing majors." Then, when you finally find a match, you face the second barrier: staring at a blank document, trying to write your tenth essay of the week.

The statistics are staggering: $7 billion in scholarship money goes unclaimed every year while students graduate with crushing debt. We realized this is caused by two fundamental friction points:

The Discovery Problem: Keyword searches fail to capture a student's nuance (leadership, values, identity).
The Application Barrier: The time cost of writing authentic essays limits students to only 5–10 applications.

We built ScholarForge to bridge this gap, transforming a 20-hour process into a 30-minute workflow.

How We Built It

ScholarForge is a two-part AI system: an Intelligent Discovery Engine and a Strategy-Based Essay Generator.

1. The Discovery Engine (Vector Search)

Traditional search relies on keyword matching, which misses opportunities that describe the same qualities in different words. We used Nomic embeddings to convert both student profiles and scholarship descriptions into high-dimensional vectors.

We store these vectors in ChromaDB. To find the best matches, we calculate the cosine similarity between the student vector ($\mathbf{S}$) and the scholarship vector ($\mathbf{O}$). The similarity score is the cosine of the angle $\theta$ between the two vectors:

$$\text{similarity} = \cos(\theta) = \frac{\mathbf{S} \cdot \mathbf{O}}{|\mathbf{S}| |\mathbf{O}|} = \frac{\sum_{i=1}^{n} S_i O_i}{\sqrt{\sum_{i=1}^{n} S_i^2} \sqrt{\sum_{i=1}^{n} O_i^2}}$$

Where $n=768$ (the dimension of Nomic embeddings). This allows us to mathematically determine how closely a scholarship's "values" align with a student's "experiences," regardless of the specific keywords used.
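As a minimal sketch of the formula above, the score can be computed in plain Python. The function name and the toy 3-dimensional vectors are illustrative assumptions; the real vectors have 768 dimensions, and in practice the vector store can compute this distance itself.

```python
import math
from typing import Sequence


def cosine_similarity(s: Sequence[float], o: Sequence[float]) -> float:
    """Cosine of the angle between a student vector s and a scholarship vector o."""
    if len(s) != len(o):
        raise ValueError("vectors must have the same dimension")
    dot = sum(si * oi for si, oi in zip(s, o))
    norm_s = math.sqrt(sum(si * si for si in s))
    norm_o = math.sqrt(sum(oi * oi for oi in o))
    if norm_s == 0.0 or norm_o == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_s * norm_o)


# Toy 3-dimensional vectors stand in for real 768-dimensional embeddings.
student = [1.0, 2.0, 2.0]
scholarship = [2.0, 4.0, 4.0]  # same direction as the student, different length
unrelated = [2.0, -1.0, 0.0]   # orthogonal to the student vector

print(cosine_similarity(student, scholarship))
print(cosine_similarity(student, unrelated))
```

Because the score depends only on direction, a long scholarship description and a short student profile can still score highly if they point the same way.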

2. The Essay Generator

We didn't want generic AI text. We scraped 14 real winning scholarship essays and used Anthropic’s Claude Sonnet 4 to reverse-engineer them into 7 distinct rhetorical strategies (e.g., "The Hero's Journey," "Identity & Belonging").

The pipeline works as follows:

Scraping: We use SerpAPI and Zyte to fetch live data from Building-U.com.
Feasibility Filtering: The system analyzes the student's profile against the chosen strategy. If a strategy requires a "hardship narrative" but the student profile lacks relevant data, the system rejects that strategy to maintain authenticity.
Generation: Claude drafts the essay using the student's actual anecdotes mapped to the winning structural framework.
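The feasibility filtering step above can be sketched roughly as follows. This is an illustration under stated assumptions, not our production code: the profile field names and the mapping from each strategy to its required fields are hypothetical.

```python
from typing import Dict, List

# Hypothetical mapping from strategy name to the profile fields it needs.
STRATEGY_REQUIREMENTS: Dict[str, List[str]] = {
    "The Hero's Journey": ["hardships", "achievements"],
    "Identity & Belonging": ["identity", "community_involvement"],
}


def feasible_strategies(profile: Dict[str, List[str]]) -> List[str]:
    """Return the strategies whose required fields are all non-empty in the profile."""
    feasible = []
    for name, required in STRATEGY_REQUIREMENTS.items():
        # A missing or empty field means the student has no real material
        # for that strategy, so it is rejected rather than invented.
        if all(profile.get(field) for field in required):
            feasible.append(name)
    return feasible


profile = {
    "achievements": ["Founded the school robotics club"],
    "identity": ["First-generation student"],
    "community_involvement": ["Weekly tutoring at the local library"],
    "hardships": [],  # no hardship data, so hardship-based strategies are rejected
}
print(feasible_strategies(profile))
```

Only the strategies that pass this check would be offered to the generation step, so the model is never asked to write about experiences the profile does not contain.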

Challenges We Faced

The "Semantic Gap"

Initially, our vector search was inaccurate. A student profile looks very different textually from a scholarship listing. Even if they were a perfect match, their vectors ($\mathbf{S}$ and $\mathbf{O}$) pointed in different directions in the latent space. Solution: We implemented Bidirectional Enhancement. We used an LLM to rewrite the student profile to sound more like a scholarship requirement, and the scholarship description to sound more like a student resume. This aligned the semantic spaces, dramatically improving recall.
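A rough sketch of the rewriting step, assuming the LLM is passed in as a plain callable. The prompt templates, function names, and the `fake_llm` stand-in are all hypothetical and exist only so the example runs offline; the actual prompts are not reproduced here.

```python
from typing import Callable, Tuple

# Hypothetical prompt templates for the two rewriting directions.
PROFILE_TO_LISTING = (
    "Rewrite this student profile as if it were the eligibility and values "
    "section of a scholarship listing:\n\n{text}"
)
LISTING_TO_PROFILE = (
    "Rewrite this scholarship description as if it were the resume of its "
    "ideal applicant:\n\n{text}"
)


def enhance_pair(
    profile_text: str, listing_text: str, llm: Callable[[str], str]
) -> Tuple[str, str]:
    """Rewrite each side toward the other's register before embedding."""
    enhanced_profile = llm(PROFILE_TO_LISTING.format(text=profile_text))
    enhanced_listing = llm(LISTING_TO_PROFILE.format(text=listing_text))
    return enhanced_profile, enhanced_listing


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call: echoes the last line of the prompt."""
    return "[rewritten] " + prompt.splitlines()[-1]


profile_out, listing_out = enhance_pair(
    "Captain of the debate team", "For students who show leadership", fake_llm
)
print(profile_out)
print(listing_out)
```

The rewritten texts, not the raw ones, would then be embedded and compared, so both sides are described in a similar register.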

The "Robot Voice" Problem

Early prototypes produced essays that sounded like ChatGPT wrote them—perfect grammar, but zero soul. Solution: We moved away from "prompting for an essay" to "prompting for a strategy." By forcing the model to adhere to the structural beats of human-written winning essays, we successfully mimicked human pacing and storytelling.

What We Learned

Vector Recall vs. LLM Precision: Vector search is excellent for filtering 7,000 scholarships down to 50, but it lacks nuance. LLMs are slow but smart. The winning architecture was a hybrid: Vector Search for broad retrieval $\rightarrow$ LLM for intelligent re-ranking.
Authenticity is a Constraint: The most dangerous thing an AI application tool can do is hallucinate experiences. Implementing strict "fact-checking" constraints against the user's JSON profile was critical for ethical usage.
Prompt Engineering is Data Science: We learned that simply "weighting" text by repeating key phrases in the embedding input (e.g., repeating a student's leadership role 3 times) noticeably shifted the resulting vector toward those phrases, allowing us to "tune" the search engine without retraining the model.
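Two of the lessons above lend themselves to a short sketch: repeating key phrases in the embedding input, and retrieving broadly before re-ranking. The function names, parameters, and precomputed scores below are hypothetical; in the real system the coarse score would come from vector search and the fine score from an LLM.

```python
from typing import Any, Callable, Dict, List, Sequence

Candidate = Dict[str, Any]


def weighted_embedding_text(
    base_text: str, key_phrases: Sequence[str], repeats: int = 3
) -> str:
    """Append each key phrase `repeats` times to emphasize it in the embedding input."""
    emphasized = [phrase for phrase in key_phrases for _ in range(repeats)]
    return " ".join([base_text, *emphasized])


def retrieve_then_rerank(
    candidates: List[Candidate],
    coarse_score: Callable[[Candidate], float],
    fine_score: Callable[[Candidate], float],
    shortlist_size: int = 50,
    top_k: int = 5,
) -> List[Candidate]:
    """Shortlist with the cheap scorer, then order the shortlist with the expensive one."""
    shortlist = sorted(candidates, key=coarse_score, reverse=True)[:shortlist_size]
    return sorted(shortlist, key=fine_score, reverse=True)[:top_k]


# Toy candidates with made-up scores standing in for vector search and LLM output.
candidates = [
    {"name": "A", "vector_score": 0.9, "llm_score": 0.2},
    {"name": "B", "vector_score": 0.8, "llm_score": 0.9},
    {"name": "C", "vector_score": 0.1, "llm_score": 1.0},
]
ranked = retrieve_then_rerank(
    candidates,
    coarse_score=lambda c: c["vector_score"],
    fine_score=lambda c: c["llm_score"],
    shortlist_size=2,
    top_k=2,
)
print([c["name"] for c in ranked])
print(weighted_embedding_text("Biology major.", ["debate team captain"]))
```

In this toy run, candidate C never reaches the re-ranker because the coarse stage drops it, which is the trade-off of the hybrid design: the expensive scorer only sees what the cheap one lets through.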

Built With

chromadb
claude
json
python
serpapi
zyteapi

Updates

Ethan Yuen started this project — Nov 23, 2025 01:22 AM EST
