Inspiration
Our inspiration for this project was recent cases where AI agents were trusted with control over important parts of people's lives such as the case of Meta's alignment director using OpenClaw, which proceeded to mass-delete her emails. Another recent example is Lobstar Wilde, an AI agent with access to a crypto wallet, that accidentally gave away hundreds of thousands of dollars worth of crypto. Anthropic has also tested the ability of agents to operate on their own with their vending machine experiment, this experiment also led to undesired effects and hallucinations. We wanted to create an evaluation/simulation program to test the ability of AI agents to act independently and on long term tasks.
What it does
Lasso is an AI evaluation system that simulates the job of a bank teller through a simplified game. Throughout the game, the user builds a prompt to direct an AI agent to act as a bank teller and respond appropriately to customer requests. However, some of the customer requests are malicious. It is up to the agent to determine how to act based on its initial prompt. The agent's actions are evaluated on multiple criteria, including deterministic metrics like working within its given authority and approving valid requests. We also use Gemini 2.5 flash to evaluate less concretely defined metrics, such as proper handling of private information, proper escalation for out-of-scope requests by customers, and adherence to the user prompt.
How we built it
Our program is centered around the bank teller agent, who can use tools to interact with the bank database. The teller agent receives input from customers via prespecified messages—some of which are malicious. The sophistication of the malicious messages increases as the number of turns increases. It responds to customer requests with tool calls and reasoning behind its actions. The interaction is scored on multiple metrics by Gemini. After the selected number of interactions is completed the program provides a scorecard with how well the agent fared against adversarial input. We used Claude and Claude Code throughout this project. We started using it to test the robustness of our ideas during brainstorming. We then used it to refine our chosen topic and think through the details of the implementation. Throughout the Hackathon, we got to work on our use of AI in programming — refining our planning and delegating skills, and further understanding the current limitations of AI in programming, such as UI.
Our tech stack for this project was Python + FastAPI, Snowflake, Gemini 2.5 Flash, Qwen 2.5 14b running on Vercel, React + Vite, and CSS.
Challenges we ran into
One challenge we ran into was coming up with criteria on how to evaluate the agents. We were uncertain about whether we should make the evaluation criteria deterministic or determined by a grader LLM. We ended up implementing both types of evaluations as some grading criteria required subjective grading. Another challenge in designing the simulation was making sure the criteria were robust and reflected the true capability and risks of the agent's behavior. For example we made sure to track successful responses to normal commands so that an agent that refused all requests wouldn't be able to game the evaluation. Another unique issue we ran into was in making the UI western themed, we had to pay attention to small details such as the movement of the tumbleweed and the background of our simulation.
Accomplishments that we're proud of
We're proud of developing an idea for the project that we believe is a unique take on AI evaluations. Our evaluation provides a fun way to learn about the risks of giving AI agents autonomous control over critical systems.
What we learned
We learned more about the critical thinking that is required to make good AI evaluations. By testing different models for a potential agent, we found that agents tend to stick to their natural actions and shy away from performing human actions. We also learned that it's good to use different-sized models for agentic and evaluation purposes; larger models are less likely to take human input and diverge away from aligning towards a specific task when given a prompt, while smaller models are more capable of following human leads towards an objective.
What's next for Lasso
We want to use the feedback generated from the simulation to fine tune a model to help make more robust agents. We were also planning on implementing other tasks that simulate other real world use cases such as a store clerk, that could represent a manager, who has to keep track of things like inventory, sales, and profits.
Log in or sign up for Devpost to join the conversation.