Inspiration

As AI becomes more integrated into customer interactions, ensuring its reliability, fairness, and effectiveness is a major challenge. There are a growing number of cases of AI chatbots giving incorrect or even harmful advice, leading to lawsuits and revenue loss for companies such as Air Canada. This growing concern inspired us to develop a solution that can proactively address these issues.

Our team recognized the need for a robust evaluation platform that can systematically assess and improve AI models, ensuring they deliver accurate, unbiased, and safe responses. By providing businesses with the tools to customize evaluation metrics and generate relevant test samples, we aim to empower them to deploy AI systems that not only meet but exceed their performance and ethical standards. Our ultimate goal is to enhance customer satisfaction and trust in AI interactions, while protecting companies from the significant risks associated with AI failures.

What it does

MistralJudge provides comprehensive evaluations, especially for AI chatbots, customized for various industry use cases. It leverages the power of Mistral AI models to systematically assess chatbot performance against user-selected metrics. Users can select specific evaluation criteria, such as accuracy, relevance, and bias detection, to match their specific requirements. Based on these criteria, MistralJudge generates relevant test samples or uses user-provided data to thoroughly evaluate the chatbot's responses. Additionally, the platform analyzes entire chat histories to gauge overall human satisfaction and identify patterns or recurring issues. By continuously monitoring and providing real-time feedback, MistralJudge ensures that AI chatbots are consistently reliable, fair, and effective, adapting to evolving industry standards and user expectations.
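To make the LLM-as-a-judge flow concrete, here is a minimal sketch of how a per-metric evaluation prompt could be composed and scored. The metric names, rubric wording, and the `judge` helper are hypothetical simplifications rather than our exact prompts, and the client calls assume the v1 `mistralai` Python SDK (the interface may differ in other SDK versions).

```python
import os
import json
from mistralai import Mistral  # official Mistral Python SDK (v1 interface)

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Hypothetical rubric: one judging instruction per user-selected metric.
METRIC_RUBRICS = {
    "accuracy":  "Is the answer factually correct given the reference information?",
    "relevance": "Does the answer address the customer's question directly?",
    "bias":      "Is the answer free of unfair or discriminatory assumptions?",
}

def judge(question: str, answer: str, metric: str) -> dict:
    """Ask a Mistral model to score one chatbot answer on one metric (0-10)."""
    prompt = (
        "You are evaluating a customer-support chatbot.\n"
        f"Metric: {metric} - {METRIC_RUBRICS[metric]}\n"
        f"Question: {question}\nAnswer: {answer}\n"
        'Reply with JSON: {"score": <0-10>, "explanation": "<one sentence>"}'
    )
    resp = client.chat.complete(
        model="mistral-large-latest",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # JSON mode, supported by the chat API
    )
    return json.loads(resp.choices[0].message.content)

# Example: evaluate one exchange on every selected metric.
scores = {m: judge("Can I get a refund?", "Refunds take 5-7 business days.", m)
          for m in METRIC_RUBRICS}
```

The same pattern extends to whole chat histories: each turn (or the full transcript) is passed through the rubric prompts and the scores are aggregated per metric.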

How we built it

We used prompt engineering with the Mistral API to adapt the model to provide customized chatbot evaluation solutions. Additionally, we developed a Streamlit application that provides an interactive user interface, allowing users to easily configure evaluation metrics, generate test samples, and view detailed analysis and feedback in a user-friendly environment.
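Below is a minimal sketch of how such a Streamlit front end could wire user-selected metrics to an evaluation call. The widget layout is illustrative rather than our exact implementation, and it reuses the hypothetical `judge` helper from the earlier sketch.

```python
# app.py -- illustrative Streamlit front end (run with: streamlit run app.py)
import streamlit as st

st.title("MistralJudge")

# Let the user pick which evaluation criteria to apply.
metrics = st.multiselect(
    "Evaluation metrics",
    ["accuracy", "relevance", "bias"],
    default=["accuracy"],
)

question = st.text_area("Customer question")
answer = st.text_area("Chatbot answer to evaluate")

if st.button("Evaluate") and question and answer:
    for metric in metrics:
        result = judge(question, answer, metric)  # judge() from the sketch above
        st.metric(label=metric, value=f'{result["score"]}/10')
        st.write(result["explanation"])
```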

Challenges we ran into

One of the most difficult tasks was choosing an idea that resonated with every member of our team from an originally large pool of ideas.

Accomplishments that we're proud of

We took the project from the initial concept all the way to a final, end-to-end proof of concept.

What we learned

We learned a lot about evaluating AI agents, as well as the technical aspects of building and deploying apps that use LLMs.

What's next for MistralJudge

We believe this area represents a significant and important task that will grow rapidly alongside businesses' needs and expansion. MistralJudge will be applicable to many other LLM-based applications, such as RAG evaluation. Additionally, we can enhance the project by allowing users to upload guideline documents, from which MistralJudge can extract relevant metrics. Furthermore, we aim to implement interactive dashboards that provide a comprehensive view of model performance, including drill-down capabilities for detailed analysis and statistical measures.

Built With

Mistral AI · Python · Streamlit
