Inspiration

I love games. I am a huge fan of almost every game out there: Tetris, Pokemon, D&D, and, of course, Catan. What I love most about these games is the marriage between optimal play and intuition. Chess is a perfect-information game: every player sees the full state, and the optimal lines are defined and knowable. Games like Catan, which add randomness and hidden information, don't have the same kind of solutions. What seems like an optimal move now can become a terrible blunder later, at the whim of things you can't control or can't see. So I decided to test whether agentic AI, combined with machine learning, could iterate on what is currently available and add an intuitive thinking element to bots like AlphaBeta, which play Catan with hardcoded rules.

What it does

CatanTactics is a multi-agent AI system where four players with distinct personalities compete and negotiate in a full game of Catan. Each agent runs a three-layer decision pipeline:

  1. RL Value Network — scores every legal action by predicted win probability, trained entirely from self-play with no human knowledge of Catan injected
  2. Agentic LLM Layer — orchestrates tool calls each turn to assess position, identify threats, find the best move for a stated goal, and verify tactical soundness with AlphaBeta lookahead
  3. Negotiation System — after each dice roll, players can request favors from opponents with reciprocal offers, accept or decline based on VP gaps and reputation, and track broken promises across the game

The entire system runs locally — no cloud APIs, no external services. The board renders live at localhost:3000 via a Docker-hosted web UI, and a custom negotiation viewer displays the full conversation timeline in real time.

How we built it

The pipeline has four distinct components built from scratch:

Reinforcement Learning The value network is a feedforward neural network mapping 30 board features — VP gaps, resource counts, production scores, reachable settlement spots, road counts, opponent VP differentials — to a win probability between 0 and 1. It trains through three curriculum phases:

$$\text{Phase 1: Random} \rightarrow \text{Phase 2: AlphaBeta} \rightarrow \text{Phase 3: Self-play}$$
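The value network itself can be sketched as a small feedforward net. This is a minimal illustration, assuming ReLU hidden layers and a sigmoid output; the layer sizes (30 → 64 → 32 → 1) are assumptions, as the writeup only specifies the 30-feature input and the win-probability output:

```python
import numpy as np

rng = np.random.default_rng(0)

class ValueNetwork:
    """Feedforward net mapping 30 board features to a win probability.

    The hidden-layer widths are illustrative assumptions; only the
    30-dim input and scalar [0, 1] output come from the writeup."""

    def __init__(self, sizes=(30, 64, 32, 1)):
        # He-style initialization for the ReLU layers
        self.weights = [rng.normal(0, np.sqrt(2 / m), (m, n))
                        for m, n in zip(sizes, sizes[1:])]
        self.biases = [np.zeros(n) for n in sizes[1:]]

    def predict(self, features):
        x = np.asarray(features, dtype=float)
        for W, b in zip(self.weights[:-1], self.biases[:-1]):
            x = np.maximum(0.0, x @ W + b)                # ReLU hidden layers
        logit = (x @ self.weights[-1] + self.biases[-1]).item()
        return 1.0 / (1.0 + np.exp(-logit))               # sigmoid -> [0, 1]

net = ValueNetwork()
p = net.predict(np.zeros(30))  # untrained net on a neutral feature vector
```

During curriculum training, each phase simply changes which opponent generates the self-play games the network is fitted on; the architecture stays fixed.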

Labels use a linear ramp with VP bonuses to solve the credit assignment problem:

$$\text{label}_t = \begin{cases} 0.3 + 0.7 \cdot \frac{t}{n} & \text{winner} \\ 0.7 - 0.7 \cdot \frac{t}{n} & \text{loser} \end{cases} + 0.12 \cdot \Delta\text{VP}_t$$

A second fine-tuning phase supervises the model directly on AlphaBeta evaluations, blending 80% AlphaBeta signal with 20% outcome ramp to teach the model what good positions look like at every decision point — not just at game end.
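The labeling scheme above translates directly into code. This sketch implements the ramp formula and the 80/20 fine-tuning blend from the text; the clipping of the blended label to [0, 1] is an assumption:

```python
def outcome_label(t, n, winner, delta_vp, vp_bonus=0.12):
    """Linear credit-assignment ramp from the label formula.

    t: index of this decision, n: total decisions in the game,
    delta_vp: VP differential at time t."""
    ramp = 0.3 + 0.7 * t / n if winner else 0.7 - 0.7 * t / n
    return ramp + vp_bonus * delta_vp

def finetune_label(alphabeta_eval, t, n, winner, delta_vp, blend=0.8):
    """Blend 80% AlphaBeta evaluation with 20% outcome ramp.

    The blend weight comes from the writeup; clipping to [0, 1]
    is an assumption to keep labels valid probabilities."""
    label = (blend * alphabeta_eval
             + (1 - blend) * outcome_label(t, n, winner, delta_vp))
    return min(1.0, max(0.0, label))
```

A winner's final move labels to 1.0 and a loser's to 0.0 (before the VP bonus), so the network learns to sharpen its estimates as the game progresses rather than only at the terminal state.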

Agentic Tool-Calling Since qwen2.5:14b running locally via Ollama lacks native tool-calling support, a JSON-based fallback protocol was implemented. Each turn the LLM receives the board state and available actions, then calls tools one at a time via structured JSON responses. The conversation accumulates as a multi-turn exchange until execute_action is called. The RL model is exposed as the get_best_move tool and AlphaBeta as simulate_outcome, making the LLM a strategic orchestrator rather than a decision-maker.
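The fallback protocol can be sketched as a loop that parses one JSON tool call per model reply. Everything here is a hypothetical reconstruction: the tool registry, message shapes, and stubbed tool results are illustrative stand-ins, not the actual implementation:

```python
import json

# Hypothetical tool registry; the names mirror the writeup
# (get_best_move, simulate_outcome, execute_action) but the
# signatures and stubbed results are assumptions.
TOOLS = {
    "get_best_move": lambda args: {"move": "build_settlement", "win_prob": 0.61},
    "simulate_outcome": lambda args: {"score": 0.58},
}

def run_turn(llm, board_state, max_steps=8):
    """Multi-turn JSON tool-calling loop.

    The LLM emits one JSON object per step; the loop feeds tool
    results back into the conversation and ends when the model
    calls execute_action."""
    messages = [{"role": "user", "content": json.dumps(board_state)}]
    for _ in range(max_steps):
        reply = llm(messages)                      # returns a JSON string
        call = json.loads(reply)
        if call["tool"] == "execute_action":
            return call["args"]["action"]          # final decision
        result = TOOLS[call["tool"]](call.get("args", {}))
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "end_turn"                              # fallback if no decision

# Scripted stand-in for the local model, for illustration only
script = iter([
    '{"tool": "get_best_move", "args": {"goal": "longest_road"}}',
    '{"tool": "execute_action", "args": {"action": "build_settlement"}}',
])
action = run_turn(lambda messages: next(script), {"vp": 4, "turn": 12})
```

The key design property is that the loop, not the model, owns control flow: a malformed or runaway exchange is bounded by max_steps rather than trusting the model to terminate.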

Negotiation A custom NegotiationGame subclass intercepts each dice roll to trigger a negotiation phase. Favor requests require a reciprocal offer — one-sided requests are rejected. Reputation scores are computed from the history of honored and broken promises and injected into every LLM context, creating emergent trust dynamics across the game.
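A minimal sketch of the reputation mechanic, under loudly stated assumptions: the writeup does not give the actual scoring formula, so the Laplace-smoothed ratio and the acceptance rule below are illustrative choices, not the real implementation:

```python
def reputation(history):
    """Reputation from honored vs. broken promises.

    The Laplace-smoothed ratio (prior of 0.5 with no history) is an
    illustrative assumption; the actual formula isn't specified."""
    honored = sum(1 for p in history if p == "honored")
    broken = sum(1 for p in history if p == "broken")
    return (honored + 1) / (honored + broken + 2)

def should_accept(vp_gap, rep, threshold=0.5):
    """Accept a favor only from a trustworthy requester who is not
    ahead on victory points. vp_gap = requester VP - our VP; the
    threshold value is an assumption."""
    return rep >= threshold and vp_gap <= 0
```

Because reputation is injected into every LLM context, an agent that breaks promises early finds its later requests declined, which is where the emergent trust dynamics come from.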

Infrastructure

Challenges we ran into

Experience I've never been to a hackathon, and I had never looked at agentic AI beyond the user-facing parts, so I decided to dedicate my weekend to learning about both. My inexperience with hackathons caused me to lose focus on what the project actually needed in favor of what I found interesting or most fun in the moment, and my limited understanding of agentic AI meant I didn't fully grasp how powerful and broadly applicable it is. Overall, my lack of experience definitely felt like the biggest thing holding me back.

Resources Agentic AI is expensive. The daily limits of Gemini's or Claude's APIs are not enough for a use case this large; I blew through my daily Gemini allowance before even beginning to code. However, I am blessed to live near campus and to have access to a relatively powerful computer that could run LLMs locally. Time was also at a premium: every second spent on something trivial was an hour lost at the end of the day. Being under such a time crunch was very foreign to me, and I found it challenging to balance staying healthy and safe with finishing my project.

Panic Time crunches can make tiny syntax errors feel like the most massive walls you've ever seen. I think the initial panic of setting up the repo, finding an alternative to Gemini, and really outlining my project took away from my critical thinking and reduced the time I was actually productive.

Accomplishments that we're proud of

The first step is always the most difficult one to take. I've never done a hackathon, and I'd never even touched agentic AI. Until my roommates convinced me, I was completely okay with never experiencing either. Thanks to their pestering, I decided to take a leap of faith and try two new things at once. I am proud of my dedication, and I am proud of my pipeline, model, and results, but what I am most proud of is knowing that I left it all out there at BitCamp, and that I took that first step.

What we learned

The biggest lesson was that each AI paradigm has a distinct strength that the others can't replicate. RL is excellent at pattern recognition across thousands of games but terrible at multi-step strategic reasoning. LLMs reason well about goals and social dynamics but can't evaluate 50 board positions quickly. AlphaBeta searches the game tree precisely but has no model of opponent intentions. The insight that made CatanTactics work was assigning each system exclusively to the problem it's designed for, rather than trying to make any single approach do everything.

What's next for CatanTactics

The most compelling extension is resource trading — allowing agents to propose actual resource exchanges rather than behavioral favors. This would require modeling opponent hand states under imperfect information and reasoning about trade value relative to current board position, making the negotiation system substantially more complex and realistic.

A second direction is goal-conditioned RL — training the value network on (state, goal) pairs rather than just states, so get_best_move produces action recommendations that are genuinely optimized for the LLM's stated goal rather than general win probability.

The broader vision is applying this architecture to real multi-agent problems. Catan is a proxy for supply chain negotiation, financial trading, and diplomatic coordination — domains with the same combination of imperfect information, adversarial opponents, and social dynamics that make the game hard. The three-layer pipeline generalizes directly.
