VolSent: Regime-Aware NLP Volatility Engine

Inspiration

Financial markets no longer move solely on fundamentals; they react to information flow. We wanted to test whether structured volatility modeling combined with unstructured news embeddings could capture regime shifts in the S&P 500.

Instead of building another app, we built a signal engine designed to survive both bull markets and volatility spikes.

What it does

VolSent predicts next-session S&P 500 volatility magnitude using:

100 days of lagged returns
Financial news headlines available before market open

The model outputs a volatility estimate that can be translated into position sizing or Buy/Sell/Hold decisions via risk thresholds.
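As a rough illustration of how a volatility estimate could be translated into exposure or a coarse action, here is a minimal sketch. The function names, the `target_vol` value, and the `low`/`high` thresholds are hypothetical placeholders, not values taken from the project.

```python
def position_size(pred_vol, target_vol=0.01, max_leverage=1.0):
    """Volatility targeting: scale exposure inversely to the forecast.

    target_vol and max_leverage are illustrative assumptions.
    """
    if pred_vol <= 0:
        raise ValueError("pred_vol must be positive")
    return min(max_leverage, target_vol / pred_vol)


def signal(pred_vol, low=0.008, high=0.02):
    """Map a volatility forecast to Buy/Sell/Hold via two risk thresholds.

    The threshold values are assumptions chosen only for the example.
    """
    if pred_vol < low:
        return "Buy"
    if pred_vol > high:
        return "Sell"
    return "Hold"
```

With these placeholder settings, a forecast of twice the target volatility halves the position, and forecasts between the two thresholds map to Hold.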

How we built it

We built a hybrid quantitative + NLP architecture.

Price Feature Engineering

From lagged returns we engineered:
Multi-horizon realized volatility (5–100 day windows)
Shock counters (>2%, >3% return days)
Skew and kurtosis (tail modeling)
Exponentially Weighted Moving Volatility (EWM)
Volatility regime ratios (short vs long-term)

These features explicitly model volatility clustering and regime persistence.
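The features listed above can be sketched with pandas rolling windows. This is a simplified version under stated assumptions: it treats the returns as one chronological series, and the specific windows (5, 20, 100), the 20-day shock window, and the EWM span of 10 are illustrative choices, not the project's actual settings.

```python
import numpy as np
import pandas as pd


def price_features(returns: pd.Series) -> pd.DataFrame:
    """Build volatility features from daily returns ordered oldest to newest."""
    feats = pd.DataFrame(index=returns.index)

    # Multi-horizon realized volatility (window lengths are assumptions).
    for w in (5, 20, 100):
        feats[f"rv_{w}"] = returns.rolling(w).std()

    # Shock counters: number of days with |return| above 2% and 3%.
    abs_ret = returns.abs()
    feats["shock2_20"] = (abs_ret > 0.02).astype(float).rolling(20).sum()
    feats["shock3_20"] = (abs_ret > 0.03).astype(float).rolling(20).sum()

    # Tail shape of the recent return distribution.
    feats["skew_20"] = returns.rolling(20).skew()
    feats["kurt_20"] = returns.rolling(20).kurt()

    # Exponentially weighted volatility reacts faster to recent shocks.
    feats["ewm_vol"] = returns.ewm(span=10).std()

    # Regime ratio: short-term vs long-term volatility.
    feats["regime_ratio"] = feats["rv_5"] / feats["rv_100"]
    return feats
```

A regime ratio well above 1 indicates that recent volatility exceeds its long-run level, which is the kind of signal a tree model can use to detect a transition.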

NLP Pipeline

Headlines were converted into numeric embeddings using:

Word-level TF-IDF (1–2 grams)
Character-level TF-IDF (3–5 grams) to capture tickers, acronyms, and crisis terms
Truncated SVD to produce dense latent topic embeddings

Character n-grams were especially impactful in detecting crisis-related language and macroeconomic shocks.
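A pipeline of this shape can be assembled from scikit-learn components. The sketch below is one plausible arrangement, not the project's exact code: the helper name `build_text_embedder`, the `char_wb` analyzer, `sublinear_tf`, and the number of SVD components are all assumptions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline


def build_text_embedder(n_components=2, seed=0):
    """Word + character TF-IDF, concatenated, then reduced with Truncated SVD."""
    tfidf = FeatureUnion([
        # Word-level unigrams and bigrams.
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                                 sublinear_tf=True)),
        # Character 3-5 grams within word boundaries (analyzer is an assumption).
        ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5),
                                 sublinear_tf=True)),
    ])
    # SVD turns the sparse TF-IDF matrix into a small dense embedding.
    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    return make_pipeline(tfidf, svd)
```

The embedder would be fit on training headlines only and then applied to later headlines with `transform`, so no vocabulary or SVD information leaks from the evaluation period.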

We trained multiple models:

CatBoost (price + text embeddings)
LightGBM (price-only baseline)
Ridge regression (text-only baseline)

The target was log-transformed to stabilize heavy tails. Final predictions were produced via weighted ensembling for robustness across regimes.
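The log transform and weighted blend can be sketched as follows. This assumes `log1p`/`expm1` as the transform pair and blending in log space; the source does not specify either detail, and the weights shown in the usage note are placeholders.

```python
import numpy as np


def to_log_target(y):
    """Compress heavy right tails; log1p is an assumed choice of transform."""
    return np.log1p(np.asarray(y, dtype=float))


def from_log_target(z):
    """Invert to_log_target back to the original volatility scale."""
    return np.expm1(np.asarray(z, dtype=float))


def weighted_ensemble(log_preds, weights):
    """Blend per-model log-space predictions, then invert the transform.

    log_preds: list of 1-D arrays, one per model, all the same length.
    weights:   one non-negative weight per model (normalized here).
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    stacked = np.column_stack(log_preds)  # shape: (n_samples, n_models)
    return from_log_target(stacked @ w)
```

For example, `weighted_ensemble([cat_pred, lgbm_pred, ridge_pred], [0.5, 0.3, 0.2])` would blend three hypothetical prediction arrays; in practice the weights would be tuned on a validation split.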

Challenges we ran into

The biggest issue was underpredicting volatility spikes. Because RMSE squares each error, a single missed tail event can dominate the score. Early models performed well in calm markets but failed during regime transitions. We addressed this by:

Adding EWM volatility features
Introducing character-level NLP modeling
Using model ensembling to reduce variance
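The tail sensitivity of RMSE can be seen in a small synthetic example. The error values below are made up for illustration: 99 calm days with a small error and one spike day that the model badly underpredicts.

```python
import numpy as np

# Hypothetical absolute errors: 99 calm days and one missed spike.
errors = np.full(100, 0.001)
errors[-1] = 0.03

rmse = np.sqrt(np.mean(errors ** 2))
mae = np.mean(np.abs(errors))

# Fraction of the total squared error that comes from the single spike day.
spike_share = errors[-1] ** 2 / np.sum(errors ** 2)

print(f"RMSE={rmse:.5f}  MAE={mae:.5f}  spike share={spike_share:.1%}")
```

In this toy setup the single spike day accounts for roughly 90% of the squared error, which is why fixes aimed at regime transitions move an RMSE-based score far more than further gains on calm days.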

Accomplishments that we're proud of

Reduced public leaderboard score from 0.26 to 0.15
Built a fully reproducible Kaggle notebook
Developed a regime-aware ensemble rather than a single black box model
Balanced structured stochastic modeling with unstructured NLP

What we learned

Volatility forecasting is fundamentally a regime problem
Character-level NLP captures financial language shifts better than pure word models
Ensembling reduces catastrophic tail errors
Feature engineering still matters in the transformer era

What's next for VolSent: Regime-Aware NLP Volatility Engine

Replace TF-IDF with FinBERT embeddings
Convert volatility prediction into a fully backtested Sharpe-optimized trading strategy
Deploy as a real-time risk monitoring dashboard

Strategy Memo (PDF): https://drive.google.com/file/d/1TTrJ19sb5yzfzUc9s0ksvTV8FYoKRd1P/view?usp=sharing

Built With

catboost
kaggle
lightgbm
numpy
pandas
python
scikit-learn
tf-idf
truncatedsvd

Updates

Mário Ferreira started this project — Feb 28, 2026 05:25 PM EST
