Inspiration
Financial markets no longer move solely on fundamentals; they also react to information flow. We wanted to test whether structured volatility modeling combined with unstructured news embeddings could capture regime shifts in the S&P 500.
Instead of building another app, we built a signal engine designed to survive both bull markets and volatility spikes.
What it does
VolSent predicts next-session S&P 500 volatility magnitude using:
- 100 days of lagged returns
- Financial news headlines available before market open
The model outputs a volatility estimate that can be translated into position sizing or Buy/Sell/Hold decisions via risk thresholds.
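As a rough illustration of how a volatility estimate could feed position sizing or a Buy/Sell/Hold rule, here is a minimal sketch. The function names, the 1% daily volatility target, and the threshold values are all hypothetical and chosen only for illustration; they are not the thresholds used in VolSent.

```python
def position_size(pred_vol, target_vol=0.01, max_exposure=1.0):
    """Volatility targeting: scale exposure inversely to predicted volatility.

    target_vol and max_exposure are illustrative assumptions.
    """
    if pred_vol <= 0:
        raise ValueError("pred_vol must be positive")
    return min(max_exposure, target_vol / pred_vol)


def risk_signal(pred_vol, low=0.008, high=0.02):
    """Map a volatility estimate to a coarse decision using two hypothetical thresholds."""
    if pred_vol < low:
        return "Buy"   # calm regime: take risk
    if pred_vol > high:
        return "Sell"  # stressed regime: cut risk
    return "Hold"


for vol in (0.005, 0.012, 0.03):
    print(vol, round(position_size(vol), 3), risk_signal(vol))
```

In this sketch, exposure is capped at `max_exposure` in calm markets and shrinks proportionally as predicted volatility rises above the target.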
How we built it
We built a hybrid quantitative + NLP architecture.
Price Feature Engineering
From lagged returns we engineered:
- Multi-horizon realized volatility (5–100 day windows)
- Shock counters (>2%, >3% return days)
- Skew and kurtosis (tail modeling)
- Exponentially Weighted Moving Volatility (EWM)
- Volatility regime ratios (short vs long-term)
These features explicitly model volatility clustering and regime persistence.
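The price features listed above can be sketched for a single row of lagged returns as follows. This is a simplified illustration, not the notebook's code: the window choices, the 0.94 decay factor, counting shocks on absolute returns, and the `price_features` name are all assumptions.

```python
import numpy as np


def price_features(returns, windows=(5, 20, 100), decay=0.94):
    """Compute volatility features from lagged daily returns (oldest first)."""
    r = np.asarray(returns, dtype=float)
    feats = {}

    # Multi-horizon realized volatility: std of the most recent w returns.
    for w in windows:
        feats[f"rv_{w}"] = float(np.std(r[-w:]))

    # Shock counters on absolute returns (assumed: both up and down moves count).
    feats["shocks_2pct"] = int(np.sum(np.abs(r) > 0.02))
    feats["shocks_3pct"] = int(np.sum(np.abs(r) > 0.03))

    # Skew and excess kurtosis from standardized central moments.
    centered = r - r.mean()
    sd = centered.std()
    feats["skew"] = float(np.mean(centered**3) / sd**3) if sd > 0 else 0.0
    feats["excess_kurtosis"] = float(np.mean(centered**4) / sd**4 - 3.0) if sd > 0 else 0.0

    # Exponentially weighted volatility: the most recent day gets the largest weight.
    weights = decay ** np.arange(len(r))[::-1]
    weights /= weights.sum()
    feats["ewm_vol"] = float(np.sqrt(np.sum(weights * r**2)))

    # Regime ratio: shortest-window volatility relative to longest-window volatility.
    short, long_ = feats[f"rv_{windows[0]}"], feats[f"rv_{windows[-1]}"]
    feats["regime_ratio"] = short / long_ if long_ > 0 else 1.0
    return feats


# 95 calm days followed by a 5-day burst of large moves.
calm = np.tile([0.001, -0.001], 48)[:95]
burst = np.array([0.03, -0.04, 0.025, -0.035, 0.05])
print(price_features(np.concatenate([calm, burst])))
```

A regime ratio well above 1 indicates that recent volatility exceeds its long-run level, which is the kind of transition the features are meant to flag.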
NLP Pipeline
Headlines were converted into numeric embeddings using:
- Word-level TF-IDF (1–2 grams)
- Character-level TF-IDF (3–5 grams) to capture tickers, acronyms, and crisis terms
- Truncated SVD to produce dense latent topic embeddings
Character n-grams were especially impactful in detecting crisis-related language and macroeconomic shocks.
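One way to assemble the word-level and character-level TF-IDF features and reduce them with Truncated SVD in scikit-learn is sketched below. The n-gram ranges come from the description above; the `char_wb` analyzer, `sublinear_tf`, the number of SVD components, the `build_text_embedder` name, and the sample headlines are assumptions made for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline


def build_text_embedder(n_components=64, random_state=0):
    """Word + character TF-IDF, concatenated and compressed into dense embeddings."""
    tfidf = FeatureUnion([
        # Word-level unigrams and bigrams.
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2), sublinear_tf=True)),
        # Character 3-5 grams within word boundaries (assumed analyzer choice).
        ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5), sublinear_tf=True)),
    ])
    svd = TruncatedSVD(n_components=n_components, random_state=random_state)
    return make_pipeline(tfidf, svd)


# Made-up headlines, used only to show the shapes involved.
headlines = [
    "Fed signals rate hike as inflation fears mount",
    "Stocks rally on strong earnings from tech giants",
    "Banking crisis deepens after regional lender collapses",
    "Oil prices steady ahead of OPEC meeting",
]
embedder = build_text_embedder(n_components=2)
embeddings = embedder.fit_transform(headlines)
print(embeddings.shape)
```

The embedder should be fitted on training headlines only and then applied to later dates with `transform`, so that no vocabulary from the future leaks into the features.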
We trained multiple models:
- CatBoost (price + text embeddings)
- LightGBM (price-only baseline)
- Ridge regression (text-only baseline)
The target was log-transformed to stabilize heavy tails. Final predictions were produced via weighted ensembling for robustness across regimes.
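The log-transformed target and the weighted ensemble can be sketched as follows. The writeup does not specify the exact transform or the blend weights, so `log1p`, blending in log space, and the 0.6/0.3/0.1 weights below are assumptions for illustration.

```python
import numpy as np


def to_log_target(vol):
    """Compress heavy right tails before training (log1p is an assumed choice)."""
    return np.log1p(vol)


def from_log_target(log_pred):
    """Invert the transform to recover predictions on the original scale."""
    return np.expm1(log_pred)


def weighted_ensemble(log_preds, weights):
    """Blend per-model predictions in log space, then map back to volatility."""
    names = list(log_preds)
    w = np.array([weights[name] for name in names], dtype=float)
    w /= w.sum()  # normalize so the weights sum to 1
    stacked = np.vstack([np.asarray(log_preds[name], dtype=float) for name in names])
    return from_log_target(w @ stacked)


# Hypothetical log-space predictions from the three models for two sessions.
log_preds = {
    "catboost": to_log_target(np.array([0.010, 0.030])),
    "lightgbm": to_log_target(np.array([0.012, 0.026])),
    "ridge": to_log_target(np.array([0.009, 0.035])),
}
blend = weighted_ensemble(log_preds, {"catboost": 0.6, "lightgbm": 0.3, "ridge": 0.1})
print(blend)
```

In practice the weights would be tuned on a held-out validation period rather than fixed by hand.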
Challenges we ran into
The biggest issue was underpredicting volatility spikes. RMSE heavily penalizes missing tail events. Early models performed well in calm markets but failed during regime transitions. We addressed this by:
- Adding EWM volatility features
- Introducing character-level NLP modeling
- Using model ensembling to reduce variance
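To see why a single missed spike dominates RMSE, consider a toy example with made-up numbers: nine calm sessions predicted almost perfectly and one spike that the model misses. The values and function names are illustrative only.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def squared_error_share(y_true, y_pred, index):
    """Fraction of the total squared error contributed by one observation."""
    errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return errors[index] / sum(errors)


# Nine calm sessions (true 1.0%, predicted 1.1%) and one missed spike (true 5.0%, predicted 1.0%).
y_true = [0.010] * 9 + [0.050]
y_pred = [0.011] * 9 + [0.010]

print(rmse(y_true, y_pred))
print(squared_error_share(y_true, y_pred, index=9))
```

Here the one spike accounts for more than 99% of the total squared error, so any improvement on calm days is invisible in the score until the tail events are handled.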
Accomplishments that we're proud of
- Reduced public leaderboard score from 0.26 to 0.15
- Built a fully reproducible Kaggle notebook
- Developed a regime-aware ensemble rather than a single black box model
- Balanced structured stochastic modeling with unstructured NLP
What we learned
- Volatility forecasting is fundamentally a regime problem
- Character-level NLP captures financial language shifts better than pure word models
- Ensembling reduces catastrophic tail errors
- Feature engineering still matters in the transformer era
What's next for VolSent: Regime-Aware NLP Volatility Engine
- Replace TF-IDF with FinBERT embeddings
- Convert volatility prediction into a fully backtested Sharpe-optimized trading strategy
- Deploy as a real-time risk monitoring dashboard
Strategy Memo (PDF): https://drive.google.com/file/d/1TTrJ19sb5yzfzUc9s0ksvTV8FYoKRd1P/view?usp=sharing
Built With
- catboost
- kaggle
- lightgbm
- numpy
- pandas
- python
- scikit-learn
- tf-idf
- truncatedsvd
