Council of Alphas

Inspiration

LLMs are increasingly used to generate trading strategies, but they have a well-documented tendency to converge on the same familiar patterns - a problem known as mode collapse. Ask Claude to "write a trading strategy" ten times and you'll get variations of the same moving average crossover. The literature confirms this: AlphaAgent (KDD 2025) identifies factor homogenization as a core issue, and EMNLP 2025 work shows that strict prompting templates further reduce output diversity.

At the same time, pure evolutionary search avoids mode collapse through population diversity, but requires hundreds or thousands of candidate evaluations - each involving a full backtest. That's prohibitively expensive when your generation operator is an LLM API call.

We wanted to combine the best of both: the diversity guarantees of evolutionary algorithms with the high-quality starting points that LLMs provide.

What it does

Council of Alphas is an evolutionary multi-agent pipeline that generates, selects, hybridizes, and filters trading strategies for SOL/USD.

12 strategies (3/family) → 4 champions (1/family) → 3 hybrids → regime filter → ranked survivors

Speciation - 4 specialist agents (Claude Opus) generate 3 candidate strategies each (12 total). Every agent is locked to one strategy family (trend, momentum, volatility, or volume) and receives a randomized indicator subset
Niche Selection - The best strategy per family survives as a champion. Fitness = $\text{Sharpe} \times \ln(N) \times \text{Coverage}$
Hybridization - 3 deterministic templates (Consensus Gate, Regime Router, Weighted Combination) combine all 4 champions into hybrid strategies
Regime Filtering - A deterministic 2D filter classifies each bar into regime buckets (session $\times$ trend $\times$ volatility = 24 buckets) and disables trading wherever the strategy has negative Sharpe
Ranking - Survivors are ranked by fitness score
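
As a rough illustration of the niche-selection step, the sketch below scores candidates with the fitness formula and keeps the best one per family. It is not the project's code: the `Candidate` class and `select_champions` function are hypothetical names, and it assumes that $N$ is the trade count and that Coverage is a fraction between 0 and 1, neither of which is spelled out above.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    family: str
    sharpe: float    # annualized Sharpe from the backtest
    n_trades: int    # assumed meaning of N in the fitness formula
    coverage: float  # assumed to be a fraction in [0, 1]


def fitness(c: Candidate) -> float:
    # ln(N) is undefined for N = 0, so a strategy with no trades scores zero.
    if c.n_trades < 1:
        return 0.0
    return c.sharpe * math.log(c.n_trades) * c.coverage


def select_champions(candidates):
    """Keep the highest-fitness candidate in each family."""
    champions = {}
    for c in candidates:
        best = champions.get(c.family)
        if best is None or fitness(c) > fitness(best):
            champions[c.family] = c
    return champions


candidates = [
    Candidate("trend_a", "trend", sharpe=1.0, n_trades=100, coverage=0.5),
    Candidate("trend_b", "trend", sharpe=2.0, n_trades=20, coverage=0.5),
    Candidate("volume_a", "volume", sharpe=0.5, n_trades=50, coverage=1.0),
]
champions = select_champions(candidates)
```

Because the logarithm grows slowly, a strategy cannot win on trade count alone: here `trend_b` beats `trend_a` despite having a fifth of the trades, since its Sharpe is twice as high.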

The result: diverse, regime-aware strategies produced in a single round of LLM calls and one deterministic optimization pass.
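
The deterministic optimization pass can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the actual optimizer: the regime labels, the `fit_regime_mask` and `apply_mask` names, and the use of an unannualized per-trade Sharpe are all illustrative choices. Buckets with fewer than two trades are left enabled here, which is also an assumption.

```python
from collections import defaultdict
from statistics import mean, stdev


def bucket_sharpe(returns):
    # Unannualized per-trade Sharpe; too few samples or zero variance gives 0.0.
    if len(returns) < 2:
        return 0.0
    sd = stdev(returns)
    return mean(returns) / sd if sd > 0 else 0.0


def fit_regime_mask(trades):
    """trades: list of (session, trend, volatility, trade_return) tuples.

    Returns {bucket: enabled}, disabling buckets with negative Sharpe.
    """
    by_bucket = defaultdict(list)
    for session, trend, vol, ret in trades:
        by_bucket[(session, trend, vol)].append(ret)
    return {b: bucket_sharpe(r) >= 0.0 for b, r in by_bucket.items()}


def apply_mask(trades, mask):
    # Buckets never seen while fitting stay enabled (an assumption).
    return [t for t in trades if mask.get((t[0], t[1], t[2]), True)]


# Hypothetical regime labels, for illustration only.
trades = [
    ("asia", "up", "high", 0.02),
    ("asia", "up", "high", 0.01),
    ("us", "down", "low", -0.01),
    ("us", "down", "low", -0.03),
    ("eu", "up", "low", 0.05),
]
mask = fit_regime_mask(trades)
kept = apply_mask(trades, mask)
```

In this toy input the `("us", "down", "low")` bucket has a negative Sharpe and is switched off, so only the three trades from the other buckets remain.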

How we built it

Backend (Python): The core engine builds a state matrix from raw OHLCV data - adding technical indicators, regime labels (session, trend, volatility), and Triple Barrier Method labels. A numba-accelerated vectorized backtester simulates each strategy with compounding equity, ATR-based position sizing, and realistic fees. The pipeline module handles LLM interaction: prompt construction with randomized indicator subsets, async parallel specialist calls, champion selection, and hybrid building. The optimizer applies the regime filter post-pipeline.
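
For readers unfamiliar with the Triple Barrier Method, here is a simplified sketch of how one long entry could be labeled. It is an illustration only: the function name is hypothetical, the barriers are fixed percentages (the real engine may size them differently, for example from ATR), and it checks closing prices rather than intrabar highs and lows.

```python
def triple_barrier_label(closes, entry_idx, tp_pct, sl_pct, max_bars):
    """Label a long entry: +1 take-profit hit, -1 stop-loss hit, 0 timeout."""
    entry = closes[entry_idx]
    upper = entry * (1 + tp_pct)  # take-profit barrier
    lower = entry * (1 - sl_pct)  # stop-loss barrier
    # The time barrier: stop looking after max_bars or at the end of the data.
    last = min(entry_idx + max_bars, len(closes) - 1)
    for i in range(entry_idx + 1, last + 1):
        if closes[i] >= upper:
            return 1
        if closes[i] <= lower:
            return -1
    return 0  # neither price barrier was touched before the time barrier


label_tp = triple_barrier_label([100, 101, 103, 99], 0, 0.02, 0.02, 3)
label_sl = triple_barrier_label([100, 99, 97, 105], 0, 0.02, 0.02, 3)
label_timeout = triple_barrier_label([100, 100.5, 99.5, 100], 0, 0.02, 0.02, 3)
```

Whichever barrier is reached first decides the label, so in the second example the later rally to 105 does not matter once the stop-loss has been hit.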

Frontend (React + Vite + Tailwind): A 6-tab dashboard visualizes the entire pipeline: architecture overview, lineage tree showing the evolutionary funnel, regime filter explainer, per-strategy tearsheets with equity curves and drawdown charts, leaderboard with podium, and PnL correlation matrix between profitable hybrids.

Orchestration: A single orchestrator.py ties everything together - load data, build state matrix, run speciation, select champions, build hybrids, optimize, rank. One command: python run_stage3.py.

Challenges we ran into

Mode collapse was real. Early runs with unconstrained prompts produced 12 near-identical moving average strategies. Locking each specialist to a strategy family and randomizing their indicator subsets was the key fix - diversity by construction rather than by hope.
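
A sketch of what "diversity by construction" might look like in code is shown below. The indicator pools, the subset size, and the prompt wording are all hypothetical, since the writeup does not list them. The point is only that family and indicator constraints are fixed before any LLM call is made.

```python
import random

# Hypothetical indicator pools; the project's actual lists are not given here.
FAMILY_INDICATORS = {
    "trend": ["ema", "sma", "adx", "supertrend", "ichimoku"],
    "momentum": ["rsi", "macd", "stochastic", "roc", "cci"],
    "volatility": ["atr", "bollinger_bands", "keltner_channels", "donchian_channels"],
    "volume": ["obv", "vwap", "mfi", "cmf"],
}


def build_assignments(per_family=3, subset_size=2, seed=None):
    """Draw one random indicator subset per candidate slot in each family."""
    rng = random.Random(seed)
    assignments = []
    for family, pool in FAMILY_INDICATORS.items():
        for slot in range(per_family):
            subset = sorted(rng.sample(pool, subset_size))
            assignments.append({"family": family, "slot": slot, "indicators": subset})
    return assignments


def build_prompt(assignment):
    # Illustrative wording only; the real prompts are not shown in this writeup.
    family = assignment["family"]
    indicators = ", ".join(assignment["indicators"])
    return (
        f"You are the {family} specialist. Write one {family} strategy "
        f"using only these indicators: {indicators}."
    )


assignments = build_assignments(seed=0)
prompts = [build_prompt(a) for a in assignments]
```

Two slots in the same family can still draw the same subset by chance. A stricter version could resample until every subset within a family is distinct.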

Critic parsing. We originally had an LLM-based Scientist/Critic/Refiner loop for post-processing. The critic's verdict field kept breaking - markdown asterisks (**CONTINUE**) were parsed as UNVIABLE, and truncated responses from insufficient max_tokens caused silent failures. We eventually replaced the entire LLM-based refinement with a deterministic 2D regime filter - simpler, faster, and more reliable.
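
The parsing failure described above is the kind of bug a small normalization step can surface early. The sketch below is hypothetical and was not part of the pipeline: it assumes the verdict is one of the two values named above (the real critic may have had more) and returns `None` for anything unrecognized instead of silently falling back to UNVIABLE.

```python
import re

# Only the two verdicts mentioned in the text; the full set is an assumption.
VALID_VERDICTS = {"CONTINUE", "UNVIABLE"}


def parse_verdict(raw):
    """Return a normalized verdict, or None if the field is missing or unrecognized."""
    if not raw:
        return None
    # Drop markdown emphasis and backticks, then trim whitespace and punctuation.
    cleaned = re.sub(r"[*_`]", "", raw).strip().strip(".:").upper()
    return cleaned if cleaned in VALID_VERDICTS else None


bold = parse_verdict("**CONTINUE**")
sloppy = parse_verdict(" unviable. ")
truncated = parse_verdict("CONTIN")
```

Returning `None` lets the caller retry or log the raw response, so a truncated field shows up as an explicit error rather than a wrong verdict.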

Timeframe sensitivity. Our first pipeline ran on 15-minute candles and produced strategies with near-zero Sharpe. Switching to 1-hour candles was a turning point - the signal-to-noise ratio improved dramatically, and the pipeline started producing genuinely profitable hybrids.

Trade counting and win rate. We initially counted timeouts (trades that reach the time barrier without hitting TP or SL) as losses, which deflated win rates. Computing the win rate over TP and SL exits only gave a more accurate picture of strategy quality.
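
The difference between the two conventions is easy to see on a toy example. The outcome labels (`"tp"`, `"sl"`, `"timeout"`) and the `win_rate` function below are hypothetical, chosen only for illustration.

```python
from collections import Counter


def win_rate(outcomes, count_timeouts_as_losses=False):
    """outcomes: iterable of 'tp', 'sl', or 'timeout'. Returns None if nothing qualifies."""
    counts = Counter(outcomes)
    wins, losses, timeouts = counts["tp"], counts["sl"], counts["timeout"]
    denominator = wins + losses + (timeouts if count_timeouts_as_losses else 0)
    return wins / denominator if denominator else None


outcomes = ["tp"] * 3 + ["sl"] * 2 + ["timeout"] * 5
excluding_timeouts = win_rate(outcomes)                              # 3 / 5
timeouts_as_losses = win_rate(outcomes, count_timeouts_as_losses=True)  # 3 / 10
```

With half the trades timing out, the same strategy reports a 60% win rate under one convention and 30% under the other, which is why the choice needs to be made explicitly.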

Accomplishments that we're proud of

All 3 hybrids survived the regime filter in our best run - the first time the full pipeline produced 3 ranked strategies instead of filtering most out
Consensus Gate achieved an annualized Sharpe of 1.76, a +82.4% return, and a max drawdown of just -8.0%, with only 499 trades over 4 years
The regime filter works. It identified and silenced unprofitable regime buckets (e.g., low-volatility consolidation periods) without any LLM involvement - pure math
12 Opus calls total. The entire pipeline runs in one round of LLM generation. No iterative refinement loops, no hundreds of evaluations. Efficient by design
The dashboard. Six tabs that tell the full story from raw pipeline architecture to per-strategy tearsheets. Deployed live at https://council-of-alphas.vercel.app

What we learned

Structural diversity beats prompt engineering. No amount of "be creative" in the prompt matches the effect of locking agents to different strategy families with different indicator subsets
Deterministic post-processing beats LLM refinement for optimization tasks where the search space is well-defined. The regime filter is faster, cheaper, and more reliable than the Scientist loop it replaced
Regime awareness is critical. A strategy that's profitable overall can still lose money consistently in specific market conditions. Filtering by regime bucket is a simple but powerful improvement
Evaluation design matters more than generation. The fitness function ($\text{Sharpe} \times \ln(N) \times \text{Coverage}$) shaped the entire pipeline's behavior - rewarding both quality and breadth of trading activity