Inspiration
Many students studying Machine Learning turn to Large Language Models for explanations of the core concepts. However, LLMs tend to provide answers independent of context, limiting their ability to serve as study tools. LLMs may also "hallucinate", misquoting or fabricating information. This motivated the creation of a learning platform that enables LLMs to provide answers with direct references to textbook and lecture materials.
What it does
MLChat is a learning platform that uses Retrieval Augmented Generation to answer machine learning questions with direct references to (publicly available) textbook material. Each answer includes a hyperlink to a relevant textbook chapter that is directly readable as a PDF within the app. Additionally, it may include a direct quote from the textbook chapter.
How I built it
MLChat is implemented as a Retrieval Augmented Generation system. It uses the mixedbread-ai embeddings API to vectorize the user's question, retrieves relevant textbook material from the Pinecone vector database, and generates a response using the Google Gemini LLM.
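The embed, retrieve, and generate flow described above can be illustrated with a minimal sketch. This is not MLChat's code: the bag-of-words `embed` function stands in for the mixedbread-ai embeddings API, the in-memory list stands in for the Pinecone index, and `generate` is any callable standing in for the Gemini call. The `Chunk` class and all function names are hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass
class Chunk:
    text: str
    source: str  # hypothetical field, e.g. a textbook chapter identifier


def embed(text: str) -> Counter:
    """Toy stand-in for an embeddings API: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[Chunk], top_k: int = 2) -> list[Chunk]:
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c.text)), reverse=True)
    return ranked[:top_k]


def answer(question: str, chunks: list[Chunk], generate: Callable[[str], str]) -> str:
    """Build a context-grounded prompt and pass it to the language model."""
    retrieved = retrieve(question, chunks)
    context = "\n".join(f"[{c.source}] {c.text}" for c in retrieved)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)
```

In a real deployment, `embed` and `retrieve` would be replaced by calls to the hosted embedding model and vector database, while the overall shape of `answer` would stay roughly the same.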
Challenges I ran into
I found prompt engineering surprisingly challenging: the system had to follow the response format while remaining conversationally robust. Creating a data pipeline that cleanly splits textbook PDFs into 500-token plaintext documents was also a challenge.
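One way to make a response format easier to enforce is to request structured output and validate it before showing it to the user. The sketch below assumes a JSON format with `answer`, `chapter`, and `quote` keys. That format, the prompt wording, and the function names are illustrative assumptions, not MLChat's actual prompt.

```python
import json
from typing import Optional

# Hypothetical prompt template; the wording and JSON keys are assumptions.
PROMPT_TEMPLATE = """You are a machine learning tutor.
Answer the question using only the excerpts below.
Reply with a JSON object with keys "answer", "chapter", and "quote".
"quote" must be copied verbatim from an excerpt, or null if none fits.

Excerpts:
{excerpts}

Question: {question}
"""


def build_prompt(question: str, excerpts: dict[str, str]) -> str:
    """Fill the template with chapter-labelled excerpts and the question."""
    listing = "\n".join(f"[{chapter}] {text}" for chapter, text in excerpts.items())
    return PROMPT_TEMPLATE.format(excerpts=listing, question=question)


def parse_reply(reply: str, excerpts: dict[str, str]) -> Optional[dict]:
    """Return the parsed reply, or None if it breaks the requested format."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not {"answer", "chapter", "quote"} <= data.keys():
        return None
    chapter = data["chapter"]
    if not isinstance(chapter, str) or chapter not in excerpts:
        return None
    quote = data["quote"]
    # Reject quotes that do not appear verbatim in the cited excerpt.
    if quote is not None and (not isinstance(quote, str) or quote not in excerpts[chapter]):
        return None
    return data
```

A reply that fails validation could be retried or shown without the quote, so a fabricated quotation never reaches the student.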
Accomplishments that I'm proud of
I'm proud of MLChat. It is a useful learning platform that equips the state-of-the-art Google Gemini LLM with the knowledge of 5 textbooks, providing students with accurate answers and relevant sources. Additionally, I'm proud of how smoothly the development process went, and how clean and modular the code is.
I'm also proud of having spotted 2 syntax errors in the Python documentation for the mixedbread-ai API, resulting in a pleasant dialogue with the company's CEO.
What I learned
This was my first experience using every tool in the tech stack, except for Angular. Thus, I'm proud of having learned a number of new tools, including Google Gemini, Google Cloud Run, Pinecone, mixedbread-ai, Docker, and FastAPI.
What's next for MLChat: A Learning Platform
In its current state, MLChat is ready to serve students around the world, providing accurate answers with relevant textbook material. Of course, there is always room for improvement.
MLChat uses a knowledge base consisting of plain-text documents with LaTeX-encoded symbols. This knowledge base was created using an ETL pipeline that automatically split documents into 500-token chunks, some of which cut paragraphs at uneven points. I will improve this pipeline so that chunks no longer split paragraphs unevenly, which should result in more accurate and relevant responses. Afterwards, I hope to deploy it as a standalone application, enabling developers to easily create their own RAG tutor with their own documents.
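As a rough sketch of what cleaner chunking could look like, the function below packs whole paragraphs into chunks under a token budget instead of cutting at a fixed offset. It assumes paragraphs are separated by blank lines and approximates tokens by whitespace-separated words. A real pipeline would count tokens with the embedding model's own tokenizer, and the function name is hypothetical.

```python
def chunk_paragraphs(text: str, max_tokens: int = 500) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_tokens words."""
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        words = para.split()

        # A paragraph longer than the budget is split by word count on its own.
        if len(words) > max_tokens:
            if current:
                chunks.append("\n\n".join(current))
                current, current_len = [], 0
            for i in range(0, len(words), max_tokens):
                chunks.append(" ".join(words[i:i + max_tokens]))
            continue

        # Start a new chunk when this paragraph would overflow the budget.
        if current and current_len + len(words) > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0

        current.append(para)
        current_len += len(words)

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The trade-off is that chunks vary in length, but each one starts and ends at a paragraph boundary unless a single paragraph exceeds the budget.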
MLChat, as an information retrieval system, generates useful data that can be used to perform analytics, improve its performance, and train further models. Thus, there is great value in adding user feedback functionality and developing an analytics pipeline.
Built With
- angular.js
- docker
- fastapi
- google-cloud
- google-gemini
- mixedbread-ai
- pinecone
- python
- typescript