Inspiration

We were inspired by people telling us that agents are unpredictable in production, especially when they use tools and operate autonomously for long periods. Agents often choose the wrong tools, are sensitive to small changes in system prompts, and are hard to evaluate consistently. Existing tools like LangChain require a lot of code and are difficult to set up, while platforms like MLflow and Hugging Face focus on infrastructure for model weights and datasets rather than evals.

What it does

Select a model, write a system prompt, and choose an eval. Specify evaluation parameters such as temperature, epochs, and number of samples, then run eval jobs. For example, one evaluation asks the agent to find the cheapest iPhone 15 Pro, and another has it analyze a competitor's webpage. We use LLM-as-a-judge to grade model performance. Once a job finishes running, click Logs to view detailed agent traces. A sketch of what submitting a job could look like is shown below.
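
To make the flow concrete, here's a minimal sketch of submitting an eval job from a script. The endpoint path, field names, model names, and eval IDs are illustrative assumptions, not the actual Evalpedia API.

```python
# Illustrative sketch only: the /jobs endpoint and its fields are assumptions,
# not the real Evalpedia API.
import requests

job = {
    "model": "gpt-4o-mini",              # model under test (example name)
    "system_prompt": "You are a careful shopping assistant.",
    "eval": "cheapest-iphone-15-pro",    # one of the built-in evals
    "temperature": 0.2,
    "epochs": 3,                         # repeat runs to smooth out variance
    "num_samples": 10,
    "judge_model": "gpt-4o",             # LLM-as-a-judge grader (example)
}

resp = requests.post("http://localhost:8000/jobs", json=job, timeout=30)
resp.raise_for_status()
print("queued job:", resp.json()["job_id"])
```

Running multiple epochs and samples matters because a single agent run can succeed or fail by luck; aggregating judge scores over repeats gives a more stable signal.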

How we built it

We used Apify actors to scrape Google, Facebook Marketplace, Amazon, and generic website content. The backend uses FastAPI with a task queue for long-running evals, and the frontend is Next.js 15. A rough sketch of the queueing pattern is shown below.
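
Here's a generic sketch of the pattern: a FastAPI endpoint enqueues jobs and a background worker drains the queue so long evals don't block the API. This is an illustrative, in-memory version under assumed field names; the real backend could just as well use Celery, RQ, or a database-backed queue.

```python
# Sketch of a FastAPI service with an in-process queue for long-running evals.
# The job fields and in-memory store are illustrative assumptions.
import asyncio
import uuid

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
jobs: dict[str, dict] = {}             # in-memory job store (demo only)
queue: asyncio.Queue = asyncio.Queue()

class EvalJob(BaseModel):
    model: str
    system_prompt: str
    eval: str
    temperature: float = 0.0
    epochs: int = 1
    num_samples: int = 10

@app.post("/jobs")
async def create_job(job: EvalJob):
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "queued", "config": job.model_dump()}  # Pydantic v2
    await queue.put(job_id)
    return {"job_id": job_id}

@app.get("/jobs/{job_id}")
async def get_job(job_id: str):
    return jobs[job_id]

async def run_eval(job_id: str) -> None:
    await asyncio.sleep(1)              # stand-in for agent runs + LLM judging

async def worker():
    # Pull queued jobs one at a time so a long eval never blocks request handling.
    while True:
        job_id = await queue.get()
        jobs[job_id]["status"] = "running"
        await run_eval(job_id)
        jobs[job_id]["status"] = "done"

@app.on_event("startup")
async def start_worker():
    asyncio.create_task(worker())
```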

Challenges we ran into

We ran into some issues with Apify actors, such as long timeouts, inconsistent API schemas between actors, and inconsistent results across runs. We worked around this with retries and by normalizing results into one schema, roughly as sketched below.
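
The actor call below follows the standard Apify Python client usage (`ApifyClient(...).actor(...).call(...)` plus reading the run's default dataset); the retry/backoff wrapper, the example actor fields, and the normalization mapping are our own illustrative additions, not Apify APIs.

```python
# Sketch: retry flaky actor runs and normalize differently-shaped results.
# Field names in normalize_listing are examples; check each actor's docs.
import time

from apify_client import ApifyClient

client = ApifyClient("<APIFY_TOKEN>")

def call_actor_with_retries(actor_id: str, run_input: dict, attempts: int = 3):
    last_err = None
    for attempt in range(attempts):
        try:
            run = client.actor(actor_id).call(run_input=run_input)
            dataset = client.dataset(run["defaultDatasetId"])
            return list(dataset.iterate_items())
        except Exception as err:          # timeouts, transient API errors
            last_err = err
            time.sleep(2 ** attempt)      # simple exponential backoff
    raise RuntimeError(f"actor {actor_id} failed after {attempts} attempts") from last_err

def normalize_listing(item: dict) -> dict:
    # Different actors name the same fields differently; map them to one schema.
    return {
        "title": item.get("title") or item.get("name"),
        "price": item.get("price") or item.get("priceAmount"),
        "url": item.get("url") or item.get("link"),
    }
```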

Accomplishments that we're proud of

We're proud of the frontend UI, a working job queue for long-running evals, and a log viewer that makes agent traces easy to inspect.

What we learned

Evals are hard but cool

What's next for Evalpedia

Sell to OpenAI and raise $10M in VC money

Built With

Apify · FastAPI · Next.js 15