Inspiration

AI adoption is moving faster than most teams’ ability to measure trust, safety, and consistency. We wanted to build something that feels like CI/CD for AI: a platform that helps agencies and developers compare models side by side, evaluate them using TEVVRL-style metrics, and understand which models are actually worth using.
What it does

AIMetrics is an AI evaluation dashboard that compares LLMs such as GPT, DeepSeek, Gemini, and a USDA-tuned model across Business, Science, Healthcare, Math, and Art prompts. It scores each model on Test, Evaluation, Verification, Reliability, and Leniency, calculates a composite validation score, shows which model performs better, and visualizes current versus average performance over time. It also includes a backend API for model comparison and trend retrieval.
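To make the scoring concrete, here is a minimal sketch of how a composite TEVVRL validation score could be computed. It assumes equal weights and 0–100 sub-scores; the key names, weights, and example numbers are illustrative shorthand, not the engine’s actual API or real results.

```js
// Illustrative only: equal-weight average of five TEVVRL dimension scores,
// each assumed to be on a 0-100 scale. Key names are our own shorthand.
const TEVVRL_KEYS = ["test", "evaluation", "verification", "reliability", "leniency"];

function compositeScore(scores) {
  // Sum the five dimension scores and divide by five for one 0-100 value.
  const total = TEVVRL_KEYS.reduce((sum, key) => sum + scores[key], 0);
  return total / TEVVRL_KEYS.length;
}

// Hypothetical comparison of two models on the same prompt set.
const modelA = { test: 88, evaluation: 82, verification: 90, reliability: 85, leniency: 70 };
const modelB = { test: 84, evaluation: 86, verification: 87, reliability: 88, leniency: 75 };
console.log(compositeScore(modelA), compositeScore(modelB));
```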
How we built it

We built AIMetrics as a full-stack React application with a Node.js backend. The frontend uses React, Tailwind CSS, and a custom dark visual system, plus a Three.js-powered hero visualization for a more interactive experience. The backend exposes API endpoints for health checks, model metadata, trend retrieval, single-model evaluation, and head-to-head comparison. We also centralized the scoring logic into a shared eval engine so both frontend and backend use the same TEVVRL comparison rules.
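In outline, the comparison endpoint looks something like the sketch below. We show it with Express for brevity; the route paths, port, and the shared engine’s compareModels() function are simplified stand-ins, not the real module’s exact API.

```js
// Simplified sketch of the backend, assuming Express and an ESM setup.
// Route paths and the shared engine import are illustrative.
import express from "express";
import { compareModels } from "./shared/evalEngine.js"; // same TEVVRL rules the frontend uses

const app = express();
app.use(express.json());

// Health check used by the frontend to decide between live and static mode.
app.get("/api/health", (req, res) => res.json({ status: "ok" }));

// Head-to-head comparison; metadata and trend routes follow the same shape.
app.post("/api/compare", (req, res) => {
  const { modelA, modelB, domain } = req.body;
  res.json(compareModels(modelA, modelB, domain));
});

app.listen(3001, () => console.log("AIMetrics API on :3001"));
```

Because the eval engine is a shared module, the frontend can import the same compareModels() in static mode, so the scores stay identical in both environments.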
Challenges we ran into

One challenge was making the project work both locally and on GitHub Pages. Locally, the frontend could talk to the backend through Vite’s dev proxy, but GitHub Pages only supports static hosting, so the deployed frontend could not access the Node backend. We solved that by adding a static fallback mode for Pages while also preparing the backend for separate deployment. Another challenge was making the 3D hero feel meaningful instead of decorative, which led us to redesign it around actual comparison data.
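The fallback itself is simple in outline: try the live API first, and fall back to bundled sample data when no backend is reachable. This is a sketch, not the exact code; the API_BASE value, JSON module path, and key format are assumptions.

```js
// Sketch of static fallback mode: use the live API when available,
// bundled baselines otherwise (e.g. on GitHub Pages). Names are illustrative.
import staticBaselines from "./data/staticBaselines.json";

const API_BASE = import.meta.env.VITE_API_BASE ?? "/api";

async function fetchComparison(modelA, modelB) {
  try {
    const res = await fetch(`${API_BASE}/compare`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ modelA, modelB }),
    });
    if (!res.ok) throw new Error(`API error ${res.status}`);
    return await res.json();
  } catch {
    // Static hosting: no backend, so serve the bundled comparison data.
    return staticBaselines[`${modelA}-vs-${modelB}`];
  }
}
```

Locally, Vite’s dev proxy (the server.proxy option in vite.config.js) forwards /api requests to the Node server, so the same fetch path works in both environments.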
Accomplishments that we're proud of

We’re proud that AIMetrics became more than a mockup. It now has a real backend API, shared scoring logic, multiple AI model options including Gemini, a usable comparison flow, live-looking trend charts, and a data-driven 3D visual that actually explains the comparison instead of just looking flashy. We also improved the deployment flow so the project can support both static hosting and a separately deployed backend.
What we learned

We learned that AI evaluation is not just about accuracy. Trust also depends on repeatability, leniency, reliability, explainability, and how a model performs in different domains. We also learned how important deployment architecture is for AI tools: a frontend may look complete locally, but real deployment constraints like static hosting and backend availability change how the experience has to be designed.
What's next for AIMetrics

Next, we want to connect AIMetrics to real model providers instead of simulated baselines, store evaluation history in a database, add authentication and saved reports, and support larger automated benchmark suites. We also want to deploy the backend publicly so the GitHub-hosted frontend can run in live API mode, and keep refining the visual analytics so teams can spot drift, regressions, and safety issues faster.
Built With
- javascript
- node.js
- react
- tailwind
- three.js
- vite