Inspiration

We were inspired by people telling us that agents are unpredictable in production, especially when they use tools and operate autonomously for long periods. Agents often choose the wrong tools, are sensitive to small changes in system prompts, and are hard to evaluate consistently. Existing tools like LangChain require a lot of code and are difficult to set up, while platforms like MLflow and Hugging Face focus on infrastructure for model weights and datasets rather than evals.

What it does

Select a model, write a system prompt, and choose an eval. Specify evaluation parameters such as temperature, epochs, and number of samples, then run eval jobs. For example, one evaluation asks the agent to find the cheapest iPhone 15 Pro, and another has it analyze a competitor's webpage. We use LLM-as-a-judge to grade model performance. Once a job finishes running, click Logs to view detailed agent traces. A sketch of what submitting a job could look like is shown below.
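
To make the flow concrete, here's a minimal sketch of submitting an eval job from a script. The endpoint path, field names, model names, and eval IDs are illustrative assumptions, not the actual Evalpedia API.

```python
# Illustrative sketch only: the /jobs endpoint and its fields are assumptions,
# not the real Evalpedia API.
import requests

job = {
    "model": "gpt-4o-mini",              # model under test (example name)
    "system_prompt": "You are a careful shopping assistant.",
    "eval": "cheapest-iphone-15-pro",    # one of the built-in evals
    "temperature": 0.2,
    "epochs": 3,                         # repeat runs to smooth out variance
    "num_samples": 10,
    "judge_model": "gpt-4o",             # LLM-as-a-judge grader (example)
}

resp = requests.post("http://localhost:8000/jobs", json=job, timeout=30)
resp.raise_for_status()
print("queued job:", resp.json()["job_id"])
```

Running multiple epochs and samples matters because a single agent run can succeed or fail by luck; aggregating judge scores over repeats gives a more stable signal.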

How we built it

We used Apify actors to scrape Google, Facebook Marketplace, Amazon, and generic website content. The backend uses FastAPI with a task queue for long-running evals, and the frontend is Next.js 15. A rough sketch of the queueing pattern is shown below.
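
Here's a generic sketch of the pattern: a FastAPI endpoint enqueues jobs and a background worker drains the queue so long evals don't block the API. This is an illustrative, in-memory version under assumed field names; the real backend could just as well use Celery, RQ, or a database-backed queue.

```python
# Sketch of a FastAPI service with an in-process queue for long-running evals.
# The job fields and in-memory store are illustrative assumptions.
import asyncio
import uuid

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
jobs: dict[str, dict] = {}             # in-memory job store (demo only)
queue: asyncio.Queue = asyncio.Queue()

class EvalJob(BaseModel):
    model: str
    system_prompt: str
    eval: str
    temperature: float = 0.0
    epochs: int = 1
    num_samples: int = 10

@app.post("/jobs")
async def create_job(job: EvalJob):
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "queued", "config": job.model_dump()}  # Pydantic v2
    await queue.put(job_id)
    return {"job_id": job_id}

@app.get("/jobs/{job_id}")
async def get_job(job_id: str):
    return jobs[job_id]

async def run_eval(job_id: str) -> None:
    await asyncio.sleep(1)              # stand-in for agent runs + LLM judging

async def worker():
    # Pull queued jobs one at a time so a long eval never blocks request handling.
    while True:
        job_id = await queue.get()
        jobs[job_id]["status"] = "running"
        await run_eval(job_id)
        jobs[job_id]["status"] = "done"

@app.on_event("startup")
async def start_worker():
    asyncio.create_task(worker())
```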

Challenges we ran into

We ran into some issues with Apify actors, such as long timeouts, inconsistent API schemas between actors, and inconsistent results across runs. We worked around this with retries and by normalizing results into one schema, roughly as sketched below.
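
The actor call below follows the standard Apify Python client usage (`ApifyClient(...).actor(...).call(...)` plus reading the run's default dataset); the retry/backoff wrapper, the example actor fields, and the normalization mapping are our own illustrative additions, not Apify APIs.

```python
# Sketch: retry flaky actor runs and normalize differently-shaped results.
# Field names in normalize_listing are examples; check each actor's docs.
import time

from apify_client import ApifyClient

client = ApifyClient("<APIFY_TOKEN>")

def call_actor_with_retries(actor_id: str, run_input: dict, attempts: int = 3):
    last_err = None
    for attempt in range(attempts):
        try:
            run = client.actor(actor_id).call(run_input=run_input)
            dataset = client.dataset(run["defaultDatasetId"])
            return list(dataset.iterate_items())
        except Exception as err:          # timeouts, transient API errors
            last_err = err
            time.sleep(2 ** attempt)      # simple exponential backoff
    raise RuntimeError(f"actor {actor_id} failed after {attempts} attempts") from last_err

def normalize_listing(item: dict) -> dict:
    # Different actors name the same fields differently; map them to one schema.
    return {
        "title": item.get("title") or item.get("name"),
        "price": item.get("price") or item.get("priceAmount"),
        "url": item.get("url") or item.get("link"),
    }
```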

Accomplishments that we're proud of

We're proud of the frontend UI, a working job queue for long-running evals, and a log viewer that makes agent traces easy to inspect.

What we learned

Evals are hard but cool

What's next for Evalpedia

Sell to OpenAI and raise $10M in VC money

Built With

Apify · FastAPI · Next.js 15