Inspiration

Our inspiration is twofold: the limitations of classical finance when handling large, volatile trades, and the conviction that finance should not be limited to those with institutional means - everyone deserves equitable opportunity. The foundation of OpenFinance is challenging the status quo of institutional trading.

Major players like JPMorgan Chase, Goldman Sachs, Citadel Securities, Virtu Financial, Jane Street, and Hudson River Trading all use proprietary optimal execution algorithms, leaving a gap for open, interpretable innovation and raising questions about community impact.

The Limitation of Analytical Solutions

  • Classical Optimal Execution models, like the original Almgren-Chriss (A-C) framework, rely on mathematical tractability, yielding a non-adaptive, deterministic solution like TWAP (Time-Weighted Average Price). But markets are stochastic - TWAP can't react to price trends, volatility spikes, or microstructure shifts, as the sketch below illustrates.
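For intuition, here is a minimal sketch (our own illustration, not part of the A-C framework) of why TWAP is non-adaptive: the entire schedule is fixed before the first trade and never consults market state.

```python
import numpy as np

def twap_schedule(total_shares: int, n_periods: int) -> np.ndarray:
    """Deterministic TWAP: the same slice every period, decided up front."""
    return np.full(n_periods, total_shares / n_periods)

# Liquidate 10,000 shares over 20 periods: always 500 shares per period,
# regardless of price trends, volatility spikes, or microstructure shifts.
print(twap_schedule(10_000, 20))
```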

What We Learned

Institutional traders don't just care about average returns (mean); they also want to minimize execution variance. This tradeoff between expected cost and uncertainty defines risk aversion, captured through the Mean-Variance Utility Criterion. Our agent explicitly balances profit vs. risk, adapting its behavior to volatility in real time.

What it does

OpenFinance is an advanced Deep Reinforcement Learning Agent that determines the optimal pace of trade execution for large stock orders, with risk management as its central principle.

  • Continuous Control: Standard RL agents act discretely (Buy/Hold/Sell). Our DDPG agent outputs a continuous action, e.g., "Sell 12.4% of remaining inventory."
  • Risk Aversion Objective: The agent's reward is a Mean-Variance Utility Function: R_t = -(E[C_t] + lambda * Var[C_t]). Setting lambda = 2.0 makes the policy risk-averse, forcing faster liquidation under volatility while still seeking favorable prices (see the sketch after this list).
  • Performance: Achieved an average normalized reward of -0.0747, i.e., an average execution cost of about 7 cents per share. This is a 92% improvement over naive execution methods like TWAP, which can lose several dollars per share to poor timing and market impact - institutional-grade, adaptive performance. To put it in context, selling 10,000 shares at $500 a share (a $5M trade) would cost only about $747 total in execution costs with our agent. Professional firms consider under 5 cents per share excellent performance, so our result is competitive with industry standards.
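A minimal sketch of the mean-variance reward (the episode-cost samples below are hypothetical; lambda = 2.0 matches our agent):

```python
import numpy as np

def mean_variance_reward(cost_samples: np.ndarray, lam: float = 2.0) -> float:
    """R = -(E[C] + lambda * Var[C]): high expected cost and high cost
    variance both push the reward down, so a large lambda prefers
    predictable execution over a slightly better but riskier average."""
    return -(cost_samples.mean() + lam * cost_samples.var())

# Two hypothetical per-share cost streams with the SAME mean (7.5 cents):
steady  = np.array([0.07, 0.08, 0.07, 0.08])   # low variance
erratic = np.array([0.00, 0.15, 0.01, 0.14])   # high variance

print(mean_variance_reward(steady))   # ~ -0.0751  (mildly penalized)
print(mean_variance_reward(erratic))  # ~ -0.0849  (strongly penalized)
```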

How we built it

OpenFinance was built over a weekend using a highly parallelized Actor-Critic (DDPG) architecture, trained entirely on a custom high-fidelity Almgren-Chriss simulator. We built the environment and agent from scratch rather than relying on existing RL libraries or environment wrappers. Unlike prior research implementations that assume perfect data and stationary dynamics, we redesigned the Almgren-Chriss simulator for RL compatibility, introducing stochastic volatility, high-frequency real market data, and adaptive reward scaling. We also reinterpreted A-C as a learnable environment, enabling the agent to interact with market microstructure rather than merely backtest static data.

This shift from theoretical replication to interactive, teachable finance physics is what makes OpenFinance an innovation in accessible quantitative research, not a reproduction of it. Most people use A-C as a static mathematical benchmark; we turned it into a simulated world for an interactive agent (with step(), state vectors, and stochastic noise), sketched below. Our 10-dimensional state vector is another custom design, not copied: it is our way of approximating the belief state of a POMDP.
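To make the interaction loop concrete, here is a heavily simplified sketch of such an environment (class and attribute names are our illustrative choices; the real simulator also replays 1-minute NVDA bars, maintains the full 10-dim state, and applies adaptive reward scaling and the variance penalty):

```python
import numpy as np

class AlmgrenChrissEnv:
    """Sketch of a discrete-time Almgren-Chriss world with step()."""

    def __init__(self, total_shares=10_000, n_steps=60, arrival_price=500.0,
                 sigma=0.5, eta=3e-6, gamma=1e-7):
        self.eta, self.gamma = eta, gamma        # temporary / permanent impact
        self.sigma = sigma                       # per-step price noise
        self.total, self.n_steps = total_shares, n_steps
        self.arrival = arrival_price
        self.reset()

    def reset(self):
        self.price, self.inventory, self.t = self.arrival, float(self.total), 0
        return self._state()

    def _state(self):
        # Stand-in for the real 10-dim vector (lagged returns, impact
        # history, volatility estimates, ...).
        return np.array([self.inventory / self.total,
                         1.0 - self.t / self.n_steps])

    def step(self, action: float):
        """action in [0, 1] = fraction of remaining inventory to sell now."""
        shares = float(np.clip(action, 0.0, 1.0)) * self.inventory
        exec_price = self.price - self.eta * shares    # temporary impact
        self.price -= self.gamma * shares              # permanent impact
        self.price += self.sigma * np.random.randn()   # stochastic noise term
        self.inventory -= shares
        self.t += 1
        cost = (self.arrival - exec_price) * shares    # implementation shortfall
        done = self.t >= self.n_steps or self.inventory <= 0.0
        return self._state(), -cost, done
```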

  • RL Architecture (DDPG Core): Implemented a Deep Deterministic Policy Gradient framework with: 1) Actor, Critic, and two Target Networks, 2) Actor outputs continuous actions, 3) Critic evaluates them via Q-values, 4) Target networks mitigate the "Deadly Triad" (bootstrapping, off-policy learning, function approximation).
  • Market Simulator (Environment Physics): Built the Discrete-Time Almgren-Chriss model with a temporary impact coefficient of 3e-6, a permanent impact coefficient of 1e-7, and a stochastic noise term. We used 50,000 real 1-minute NVIDIA price bars to ensure realism. Each episode simulates a full liquidation horizon under stochastic volatility.
  • Stabilization: Experience Replay Buffer to break temporal correlations. Polyak Averaging (tau = 0.0005) for stable target updates. Early Stopping (500-episode patience) to prevent catastrophic forgetting.
  • State Representation: A 10-dimensional augmented state vector, including: 5 lags of normalized returns, the fraction of inventory remaining, temporary and permanent impact history, and recent volatility estimates. This approximates the Belief State in a Partially Observable MDP (POMDP); a sketch of the assembly follows this list.
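One way such a vector could be assembled (a sketch; the 10 components are as listed above, but the window sizes and the two-volatility layout here are our illustrative assumptions):

```python
import numpy as np

def build_state(prices: np.ndarray, inventory_frac: float,
                temp_impact: float, perm_impact: float) -> np.ndarray:
    """Assemble a 10-dim belief-state proxy:
       [0:5]  five lagged normalized returns (memory / trend),
       [5]    fraction of inventory remaining,
       [6:8]  recent temporary and permanent impact,
       [8:10] short- and long-window volatility estimates."""
    returns = np.diff(prices) / prices[:-1]      # normalized returns
    lags = returns[-5:]                          # last 5 lags
    vol_short = returns[-10:].std()              # fast volatility estimate
    vol_long = returns[-60:].std()               # slow volatility estimate
    return np.concatenate([lags,
                           [inventory_frac, temp_impact, perm_impact,
                            vol_short, vol_long]])

# Example: 100 simulated prices, 40% of the order still unsold.
prices = 500.0 + np.cumsum(np.random.randn(100))
print(build_state(prices, 0.40, 3e-6, 1e-7).shape)  # (10,)
```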

Challenges we ran into

  1. Continuous Control Convergence: DDPG is unstable due to the aforementioned Deadly Triad. We mitigated this with small tau updates, decaying Ornstein-Uhlenbeck noise, and conservative learning rates (the first two are sketched after this list).
  2. Market Physics Calibration: Tuning A-C parameters against a high tau (the simulation time step) made market behavior highly sensitive. Treating the simulator itself as a stochastic control system yielded stable, realistic liquidation dynamics.
  3. Data Fidelity vs Quantity: API pagination limits forced us to use 9 months of 1-minute, high-resolution data instead of multiple years - but this improved realism over longer, low-fidelity samples.
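For reference, minimal sketches of the two stabilizers from challenge 1 (tau = 0.0005 is from our setup; the OU parameters theta, sigma, and the decay rate shown here are typical DDPG defaults, not necessarily the exact values we used):

```python
import numpy as np

class OUNoise:
    """Decaying Ornstein-Uhlenbeck exploration noise for continuous actions."""

    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, decay=0.999):
        self.mu, self.theta = mu, theta
        self.sigma, self.decay = sigma, decay
        self.x = mu

    def sample(self) -> float:
        # Mean-reverting random step; sigma decays so exploration fades
        # as training progresses.
        self.x += self.theta * (self.mu - self.x) + self.sigma * np.random.randn()
        self.sigma *= self.decay
        return self.x

def polyak_update(target_weights, online_weights, tau=0.0005):
    """Soft target update: target <- tau * online + (1 - tau) * target.
    A tiny tau makes the target networks drift slowly, damping the
    feedback loops behind the Deadly Triad."""
    return [tau * w + (1.0 - tau) * tw
            for tw, w in zip(target_weights, online_weights)]
```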

Accomplishments that we're proud of

  • Institutional-Level Performance: Cost reduced by 92% compared to TWAP, converging at 7 cents per share - about 1.5 basis points of execution cost on a $500 stock.
  • Adaptive Intelligence: The agent occasionally achieved positive returns (+1.79) by opportunistically selling into favorable price movements - impossible for deterministic benchmarks.
  • Theoretical Discipline: Fully grounded in utility theory, policy gradient optimization, and finite-horizon MDPs as outlined in Foundations of Reinforcement Learning with Applications in Finance, a textbook recommended by our professor.
  • Open Innovation: We didn’t just implement the Almgren–Chriss model. We turned it into an interactive RL environment that learns market dynamics from real 1-minute data. We engineered a custom 10-dimensional state representation capturing historical trends and impact effects, bridging theoretical finance with modern reinforcement learning. Our innovation wasn’t inventing a new model, but making institutional-grade execution research open, transparent, and teachable.

What we learned

  • Policy Gradient Power: Continuous-action problems require policy gradient methods. DDPG's Actor-Critic loop proved indispensable.
  • Risk-Return Tradeoff (lambda): Lambda is the control knob of behavior. Low lambda means a patient, opportunistic strategy; high lambda means a defensive, volatility-averse strategy. We demonstrated direct control over trading temperament.
  • State History as Belief: Adding history lags was an effective way to encode memory, approximating the belief state of real-world POMDPs under hackathon constraints.

What's next for AlphaExec - Risk Averse DDPG

We want to continue improving public quantitative finance research and knowledge, showcase its relevance and applicability, and encourage people outside of Wall Street to build their own tools - narrowing the disparity between the ultra-wealthy and regular people when it comes to trading algorithms.

Built With

  • ddpg-reinforcement-learning
  • lucide-react
  • matplotlib
  • netlify
  • numpy
  • nvda-historical-stock-data
  • pandas
  • polygon.io-api
  • python
  • react
  • recharts
  • tailwind-css
  • tensorflow
  • typescript
  • vite