Inspiration

I'm a big fan of LMArena and their benchmarks for Design, Chat, Coding, etc. Unfortunately, they don't have one for losing money as fast as possible, so I thought I'd make one for AI investing workflows.

What it does

It's a frontend UI for comparing models on investment analysis tasks. Models from Anthropic and OpenAI face off in epic battles, make a bunch of tool calls to Yahoo Finance, Tavily, and the SEC, and the user chooses which one is better.

How we built it

The backend is a Mastra server running on an EC2 medium instance with 3 MCP servers (Yahoo Finance, Tavily, SEC EDGAR). The frontend is a Next.js app with MastraClient and CopilotKit components. Logs and votes are saved to an RDS Postgres instance.

Challenges we ran into

  • Connecting the frontend to the Mastra server (the AI SDK from Vercel didn't work, so I used CopilotKit instead)
  • Setting up networking and deploying to EC2
  • CopilotChat observability: passing the messages to parent components was tricky
  • It's not possible to switch an agent's model from the frontend, so I had to make a bunch of separate Mastra agents
  • Serializing the message and tool call objects to Postgres jsonb was hard
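For the jsonb problem, here is a minimal sketch of one way to flatten message and tool call objects into plain JSON before inserting them. It is not the code used in the project: `toJsonSafe`, the message shape, and the `getQuote` tool name are all hypothetical. It assumes the usual trouble spots are `undefined` fields, dates, bigints, circular references, and NUL characters, which Postgres does not accept in jsonb text.

```typescript
// Hypothetical helper, not the project's actual code: converts an arbitrary
// value into plain JSON that can be stored in a Postgres jsonb column.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function toJsonSafe(value: unknown, seen: WeakSet<object> = new WeakSet<object>()): Json {
  if (value === null || value === undefined) return null;
  // jsonb cannot store the NUL character (\u0000), so strip it from strings.
  if (typeof value === "string") return value.replace(/\u0000/g, "");
  if (typeof value === "boolean") return value;
  // NaN and Infinity are not valid JSON numbers.
  if (typeof value === "number") return Number.isFinite(value) ? value : null;
  if (typeof value === "bigint") return String(value);
  if (typeof value === "function" || typeof value === "symbol") return null;
  if (value instanceof Date) {
    return Number.isNaN(value.getTime()) ? null : value.toISOString();
  }
  if (value instanceof Error) return { name: value.name, message: value.message };

  const obj = value as object;
  // Replace cycles with a marker instead of letting JSON.stringify throw.
  if (seen.has(obj)) return "[Circular]";
  seen.add(obj);

  let result: Json;
  if (Array.isArray(obj)) {
    result = obj.map((item) => toJsonSafe(item, seen));
  } else {
    const out: { [key: string]: Json } = {};
    for (const [key, field] of Object.entries(obj)) {
      // Drop fields that JSON cannot represent rather than storing nulls.
      if (field === undefined || typeof field === "function") continue;
      out[key] = toJsonSafe(field, seen);
    }
    result = out;
  }
  // Only ancestors count as cycles; the same object may appear in siblings.
  seen.delete(obj);
  return result;
}

// Usage with a made-up message shape (the tool name is illustrative only).
const exampleMessage = {
  role: "assistant",
  createdAt: new Date("2025-01-01T00:00:00Z"),
  toolCalls: [{ name: "getQuote", args: { ticker: "AAPL" }, result: undefined }],
};
const examplePayload = JSON.stringify(toJsonSafe(exampleMessage));
```

The resulting string can be passed as a query parameter and cast to jsonb on the database side. Note that `Map` and `Set` values come out as empty objects here, so they would need their own branches if the message objects contain them.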

Accomplishments that we're proud of

  • It works!

What we learned

  • Integration is hard
  • Deployment is harder
  • MCPs for Claude Code docs are not perfect

What's next for Wall Street Bench

  • Add more models through OpenRouter
  • Launch publicly
  • Make it more stable
  • Make the leaderboard real
  • Calculate stats
  • Clean up the code
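As a rough idea of what the stats step could look like, here is a sketch that turns saved votes into a simple win-rate leaderboard. The `Vote` shape and the function name are assumptions, since the actual table columns aren't described here, and counting a tie as half a win is just one possible convention. Rating systems such as Elo are a common next step for arena-style leaderboards, but this sketch sticks to plain win rates.

```typescript
// Hypothetical vote record; the real schema may differ.
interface Vote {
  modelA: string;
  modelB: string;
  winner: "A" | "B" | "tie";
}

interface LeaderboardRow {
  model: string;
  battles: number;
  wins: number;
  ties: number;
  winRate: number; // (wins + 0.5 * ties) / battles
}

function buildLeaderboard(votes: Vote[]): LeaderboardRow[] {
  const rows = new Map<string, LeaderboardRow>();

  // Fetch the row for a model, creating it on first sight.
  const rowFor = (model: string): LeaderboardRow => {
    let row = rows.get(model);
    if (!row) {
      row = { model, battles: 0, wins: 0, ties: 0, winRate: 0 };
      rows.set(model, row);
    }
    return row;
  };

  for (const vote of votes) {
    const a = rowFor(vote.modelA);
    const b = rowFor(vote.modelB);
    a.battles += 1;
    b.battles += 1;
    if (vote.winner === "A") {
      a.wins += 1;
    } else if (vote.winner === "B") {
      b.wins += 1;
    } else {
      a.ties += 1;
      b.ties += 1;
    }
  }

  const board: LeaderboardRow[] = [];
  rows.forEach((row) => {
    row.winRate = (row.wins + 0.5 * row.ties) / row.battles;
    board.push(row);
  });

  // Highest win rate first; break ties by number of battles.
  return board.sort((x, y) => y.winRate - x.winRate || y.battles - x.battles);
}

// Usage with placeholder model names.
const sampleBoard = buildLeaderboard([
  { modelA: "model-x", modelB: "model-y", winner: "A" },
  { modelA: "model-y", modelB: "model-x", winner: "B" },
  { modelA: "model-x", modelB: "model-y", winner: "tie" },
]);
```

With only a handful of votes per model the win rates are noisy, so showing the battle count next to each rate is probably worth it.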

Built With

  • claude
  • copilotkit
  • mastra
  • nextjs
  • python
  • tavily