Inspiration
As a high school student, I struggled to find universities that matched both my interests and the facilities they offered. This inspired me to build a curated system that recommends universities with strong programs based on a user’s interests.
What it does
The user provides an interest (for example, 'I like coding'), and the system processes the input to classify it into a relevant category, then returns a list of matching majors and recommended colleges.
How I built it
This project was built using HTML, CSS, and JavaScript. The frontend provides a simple interface for user input and result visualization. The backend logic is implemented in JavaScript and uses a lightweight Natural Language Processing (NLP) approach based on TF-IDF (Term Frequency–Inverse Document Frequency) combined with cosine similarity to compare user input with predefined domain-specific text data. Instead of using heavy pre-trained models, a custom dataset of academic interest categories was created.
Model Details
The system uses a TF-IDF + Cosine Similarity text classification model. A custom dataset was manually created for this project.
Structure:
Each academic domain contains:
- A list of interest phrases (training text examples)
- A list of relevant majors
- A list of recommended universities
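As a rough sketch, one domain entry might look like the object below. The field names (`phrases`, `majors`, `universities`) and the sample universities are illustrative assumptions, not the project's actual dataset:

```javascript
// Hypothetical shape of a single domain entry in the custom dataset.
// Field names and sample values are illustrative, not the project's real data.
const codingDomain = {
  phrases: [                       // training text examples for TF-IDF
    "software engineering and web development",
    "programming and algorithms"
  ],
  majors: ["Computer Science", "Software Engineering"],
  universities: ["MIT", "Stanford University"]
};
```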
Domains included:
- Medicine & Healthcare
- Computer Science & Coding
- Law & Legal Studies
- Arts & Design
- Music & Performing Arts
- Space Science & Astronomy
Example training phrases:
- “software engineering and web development” → Coding
- “clinical medicine and patient care” → Medicine
- “rocket science and space exploration” → Space
Training Process:
This project does not use heavy model training. Instead, it follows a statistical text modeling process:
- The dataset is preprocessed into tokens.
- A vocabulary of domain-specific terms is constructed.
- TF-IDF values are computed to determine word importance.
- Text vectors are generated for both dataset entries and user input.
- Cosine similarity is used to compare input with each category.
- The category with the highest similarity score is selected as output.
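The steps above can be sketched in plain JavaScript. This is a minimal illustration of TF-IDF vectorization plus cosine similarity, not the project's actual code; the dataset contents and function names are assumptions:

```javascript
// Minimal TF-IDF + cosine-similarity classifier sketch.
// Dataset contents and function names are illustrative.
const dataset = {
  coding: ["software engineering and web development", "programming and algorithms"],
  medicine: ["clinical medicine and patient care", "anatomy and healthcare"],
  space: ["rocket science and space exploration", "astronomy and astrophysics"]
};

// 1. Preprocess text into lowercase tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z]+/g) || [];
}

// 2.–3. Build the vocabulary and smoothed IDF values, treating each
// category's joined phrases as one document.
const docs = Object.entries(dataset).map(([label, phrases]) => ({
  label,
  tokens: tokenize(phrases.join(" "))
}));
const vocab = [...new Set(docs.flatMap(d => d.tokens))];
const idf = {};
for (const term of vocab) {
  const df = docs.filter(d => d.tokens.includes(term)).length;
  idf[term] = Math.log((docs.length + 1) / (df + 1)) + 1;
}

// 4. Turn a token list into a TF-IDF vector over the vocabulary.
function vectorize(tokens) {
  return vocab.map(term => {
    const tf = tokens.filter(t => t === term).length / (tokens.length || 1);
    return tf * idf[term];
  });
}

// 5. Cosine similarity between two vectors.
function cosine(a, b) {
  const dot = a.reduce((s, v, i) => s + v * b[i], 0);
  const norm = v => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

// 6. Pick the category with the highest similarity to the input;
// the score itself can serve as a confidence indicator.
function classify(input) {
  const inputVec = vectorize(tokenize(input));
  let best = { label: null, score: 0 };
  for (const d of docs) {
    const score = cosine(inputVec, vectorize(d.tokens));
    if (score > best.score) best = { label: d.label, score };
  }
  return best;
}

const result = classify("software engineering and programming");
// result.label is the best-matching category; result.score its similarity
```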
Evaluation Metrics
The model’s performance was evaluated using a confidence score and manual testing. For the confidence score, the cosine similarity of the winning category is used as a confidence indicator for each prediction. For manual testing, the model was exercised with varied sentence structures to confirm that it handles natural-language variation rather than relying on fixed keywords.
Challenges I ran into
At the beginning of the project, I explored multiple machine learning approaches including Naive Bayes, ml5.js, and TensorFlow.js to build the classification model. However, I encountered several issues related to implementation complexity, library compatibility, and persistent runtime errors. After multiple iterations and debugging attempts, it became clear that the model performance was not stable due to the limitations of the approach combined with an inefficient and inconsistent dataset. The training data was not well-balanced across categories, which led to biased and unreliable predictions. This experience helped me understand that in machine learning, the quality and structure of the dataset is just as important as the choice of model. As a result, I shifted to a simpler and more controlled TF-IDF-based approach, which significantly improved stability and interpretability.