BENJI — Project overview
BENJI is a multimodal neural speech decoding system: it learns a shared representation from brain recordings during language processing, maps that representation to words, and reconstructs modality-specific neural patterns from those semantic codes.
Inspiration
We started from a simple question: can we read out what word someone is processing from their brain activity, and can we invert that abstraction back into realistic neural signals? That connects several motivations:
- Scientific: Relating high-dimensional EEG, MEG, and fMRI to word-level semantics helps test hypotheses about shared representations across modalities and time scales (e.g., evoked responses vs. slow BOLD).
- Engineering: A two-stage design—first alignment / decoding into a semantic space, then reconstruction—mirrors how we might eventually build interpretable BCIs or analysis tools without collapsing everything into a single end-to-end black box.
- Practical: Real datasets differ in format, channels, and preprocessing; we wanted a pipeline that could combine modalities when labels align and still run when one modality is missing in a batch.
What it does
Stage 1 — Speech decoding (semantic retrieval):
Modality-specific encoders map neural inputs to a shared embedding space:
- EEG / MEG: EEGNet-style convolutional encoders with multi-scale temporal filtering (three parallel branches: ~200 ms, ~400 ms, ~600 ms kernels capturing delta/theta, alpha, and beta dynamics), followed by depthwise spatial and separable conv blocks.
- fMRI: An MLP over PCA-reduced voxel patterns (81k raw voxels → 1000 PCA components; HRF peak sampled ~4 s after word onset).
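The multi-scale EEG/MEG encoder above can be sketched roughly as follows. This is an illustrative PyTorch module, not the project's exact architecture: the layer widths, sampling rate, and the simple crop-and-concatenate merge of the three temporal branches are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleTemporalEncoder(nn.Module):
    """Illustrative EEGNet-style encoder: three parallel temporal conv
    branches (~200/400/600 ms kernels), a depthwise spatial conv over
    electrodes, and a projection head to a 300-dim embedding space.
    Sizes are assumptions, not the project's trained architecture."""

    def __init__(self, n_channels=128, sfreq=100, embed_dim=300):
        super().__init__()
        kernels = [int(sfreq * s) for s in (0.2, 0.4, 0.6)]  # samples per window
        self.branches = nn.ModuleList(
            nn.Conv2d(1, 8, (1, k), padding=(0, k // 2)) for k in kernels
        )
        # depthwise conv collapses the electrode dimension per feature map
        self.spatial = nn.Conv2d(24, 24, (n_channels, 1), groups=24)
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(24, embed_dim)
        )

    def forward(self, x):                     # x: (batch, channels, time)
        x = x.unsqueeze(1)                    # add a singleton "image" channel
        feats = [b(x) for b in self.branches]
        t = min(f.shape[-1] for f in feats)   # crop branches to a common length
        h = torch.cat([f[..., :t] for f in feats], dim=1)
        return self.head(F.elu(self.spatial(h)))

enc = MultiScaleTemporalEncoder()
z = enc(torch.randn(4, 128, 200))             # 2 s of 128-channel EEG at 100 Hz
```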
Training aligns neural embeddings to frozen GloVe word vectors (300-dim, pairwise cosine sim ~0.08–0.43 for concrete nouns) via a small trainable SharedEmbeddingProjector, using three complementary loss terms:
- Supervised contrastive loss (SupCon): same-word samples are treated as multiple positives in the batch, giving richer gradient signal than diagonal CLIP pairing.
- Cross-modal alignment loss: contrastive pairs across EEG/MEG/fMRI when the same stimulus appears in multiple modalities.
- Subject adversarial loss: gradient reversal on a subject classifier to discourage subject-specific shortcuts in the shared space.
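The SupCon term above can be sketched in a few lines of NumPy. This is a minimal illustration of the many-positives-per-class idea, with an assumed temperature and illustrative names; the project's actual loss runs in PyTorch with additional cross-modal and adversarial terms.

```python
import numpy as np

def supcon_loss(emb, labels, tau=0.1):
    """Minimal sketch of supervised contrastive loss: every other
    same-word sample in the batch counts as a positive, unlike CLIP's
    one-positive diagonal pairing. Temperature tau is illustrative."""
    z = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & ~np.eye(len(labels), dtype=bool)
    # mean log-probability over each anchor's positives, then over anchors
    per_anchor = np.where(pos, logp, 0.0).sum(1) / np.maximum(pos.sum(1), 1)
    return -per_anchor[pos.any(1)].mean()

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])   # each word appears twice
loss = supcon_loss(emb, labels)
```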
At evaluation time, a sample is classified by cosine-similarity retrieval over a fixed vocabulary embedding matrix (top-1 and top-3 accuracy).
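The retrieval evaluation reduces to ranking the vocabulary by cosine similarity; a minimal sketch (function and variable names are illustrative):

```python
import numpy as np

def topk_retrieval(pred_emb, vocab_emb, targets, k=3):
    """Rank vocabulary entries by cosine similarity to each predicted
    embedding; report top-1 and top-k accuracy over the batch."""
    p = pred_emb / np.linalg.norm(pred_emb, axis=1, keepdims=True)
    v = vocab_emb / np.linalg.norm(vocab_emb, axis=1, keepdims=True)
    ranked = np.argsort(-(p @ v.T), axis=1)            # best match first
    top1 = float((ranked[:, 0] == targets).mean())
    topk = float(np.any(ranked[:, :k] == targets[:, None], axis=1).mean())
    return top1, topk

# toy check: predictions that sit near their own vocabulary vector
vocab = np.eye(6)
pred = vocab + 0.01
top1, top3 = topk_retrieval(pred, vocab, np.arange(6))
```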
Stage 2 — Reconstruction:
Given a word embedding (through the same projector used in Stage 1) and a subject embedding, modality-specific decoders predict neural signals:
- EEG / MEG: a _TemporalDecoder with FiLM subject conditioning (gamma/beta scale-shift at each MLP layer) and transposed convolutions to upsample to the full time series.
- fMRI: MLP decoder to voxel vectors.
Stage 1 encoder weights are fully frozen during Stage 2; only decoders and subject embeddings are trained. Losses combine MSE, frequency-band penalties for EEG/MEG (delta–beta), and spatial smoothness for fMRI.
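The FiLM conditioning used in the Stage 2 decoders is just a learned per-unit scale and shift derived from the subject embedding; a NumPy sketch with random placeholder weights (not the project's trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

def film(h, subj_emb, W_gamma, W_beta):
    """FiLM sketch: the subject embedding is mapped to a per-unit scale
    (gamma) and shift (beta) that modulate a hidden activation. In the
    real decoder this happens at each MLP layer with trained weights."""
    gamma = subj_emb @ W_gamma                 # (batch, hidden)
    beta = subj_emb @ W_beta
    return gamma * h + beta

h = rng.normal(size=(4, 64))                   # decoder hidden activations
s = rng.normal(size=(4, 16))                   # learned subject embeddings
out = film(h, s, rng.normal(size=(16, 64)), rng.normal(size=(16, 64)))
```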
Orchestration: Training can run locally or on Modal (A100 GPU jobs, data/checkpoint volumes) via scripts/modal_train.py.
How we built it
- Stack: Python 3.10+, PyTorch, YAML configs under configs/, and a modular layout where workers/ holds implementations (models, training, data, eval) and root src/ re-exports for a simple import path.
- Data: Three real datasets — N400 EEG (128-channel), Sherlock MEG (306-channel), and Huth et al. fMRI (HDF5 BOLD timeseries + TextGrid word alignments). JSON splits index processed .npz files; MultiModalDataset builds a flat index over modalities and collates per-modality batches so EEG, MEG, and fMRI can coexist in one training step without forcing trial-aligned tuples.
- Text embeddings: GloVe-wiki-gigaword-300 (300-dim) with a deterministic hash fallback if gensim is unavailable; BERT [CLS] is available as an alternative but empirically produces near-identical embeddings (cosine sim 0.95+) for concrete nouns, making it unsuitable as a retrieval target.
- Vocabulary splits: A small closed vocabulary (~6–8 words) with 2 words held out as out-of-set (OOS) for zero-shot evaluation; test subjects are also held out entirely to evaluate cross-subject generalization. Within-subject val fraction is 0.2.
- Quality bar: Unit tests for encoders, losses, training glue, and integration-style checks; logging to CSV and metric curves saved under checkpoint dirs.
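The per-modality collation described above can be sketched as follows; this is an illustrative simplification of what MultiModalDataset's collate step does, with assumed field names:

```python
from collections import defaultdict

def collate_multimodal(samples):
    """Sketch of per-modality collation: mixed EEG/MEG/fMRI samples are
    grouped into separate sub-batches so one training step can use
    whichever modalities happen to be present, without requiring
    trial-aligned tuples. Field names are illustrative."""
    grouped = defaultdict(list)
    for s in samples:
        grouped[s["modality"]].append(s)
    return {
        m: {"x": [s["x"] for s in ss], "word": [s["word"] for s in ss]}
        for m, ss in grouped.items()
    }

batch = collate_multimodal([
    {"modality": "eeg", "x": [0.1], "word": "dog"},
    {"modality": "fmri", "x": [0.2], "word": "dog"},
    {"modality": "eeg", "x": [0.3], "word": "cat"},
])
```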
Challenges we ran into
- Multimodal alignment: EEG, MEG, and fMRI do not share the same physics, noise, or temporal resolution; a single embedding space required careful loss weighting and handling of missing modalities per batch.
- Subject variability: The same word looks different across people and sessions; subject adversarial training (Stage 1) and FiLM conditioning (Stage 2) partially address this but do not solve domain shift entirely.
- fMRI preprocessing: Raw HDF5 BOLD timeseries contain NaN values for out-of-brain voxels; these must be replaced before PCA. HRF delay (~4–6 s) requires careful TR-offset alignment between word onset times (from forced-alignment TextGrids) and the sampled volume.
- Text embedding geometry: BERT [CLS] embeddings for single concrete nouns cluster near cosine sim 0.95+, effectively making every word look the same to the contrastive loss. Switching to GloVe (sim 0.08–0.43) was necessary for the retrieval task to be learnable.
- Small vocabulary, small datasets: With only 6–8 training words and ~100–200 trials per modality per subject, the signal-to-noise ratio is low and val accuracy can be noisy (EEG val set is ~16 samples even at 20% fraction).
- Closed vocabulary: Retrieval is always over a fixed set of words; true open-vocabulary decoding is out of scope for the current design.
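The fMRI preprocessing challenge above (NaN out-of-brain voxels, HRF-delayed sampling) can be sketched as a single alignment step. The TR and delay values follow the write-up; the zero-fill and nearest-TR rounding rules here are illustrative choices.

```python
import numpy as np

def sample_bold_at_word(bold, onset_s, tr=2.0, hrf_delay=4.0):
    """Sketch: fill out-of-brain NaN voxels (they would otherwise break
    PCA), then take the volume nearest onset + HRF delay."""
    bold = np.nan_to_num(bold, nan=0.0)
    idx = int(round((onset_s + hrf_delay) / tr))   # TR-offset alignment
    idx = min(idx, bold.shape[0] - 1)              # clamp at end of run
    return bold[idx]

# toy run: row i of the timeseries holds the value i, one NaN voxel column
ts = np.arange(10.0)[:, None] * np.ones((1, 5))
ts[:, 0] = np.nan
vol = sample_bold_at_word(ts, onset_s=10.0)        # volume round((10+4)/2) = 7
```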
Accomplishments that we're proud of
- A full two-stage pipeline that is documented in code: encode → align to language → decode words; then word + subject → reconstruct signals with physiology-aware auxiliary losses (frequency bands, spatial smoothness).
- Meaningful retrieval above chance: MEG achieves ~0.46 top-1 (vs. ~0.17 chance for 6 words), fMRI ~0.31, demonstrating that GloVe-aligned SupCon training extracts word-discriminative features from brain signals.
- Thoughtful architecture choices: multi-scale temporal branches (delta/theta/alpha/beta), FiLM subject conditioning in decoders, SupCon with multiple positives per class, and gradient reversal for subject-invariant representations.
- Three-modality integration: EEG (N400), MEG (Sherlock), and fMRI (Huth et al.) are preprocessed and trained together in a unified pipeline, each with appropriate encoder/decoder architectures.
- Cloud-ready training on Modal (A100, persistent volumes) so experiments are not tied to a local machine.
What we learned
- Contrastive learning is subtle: CLIP-style diagonal pairing and SupCon (many positives per class) behave very differently when words repeat in a batch; SupCon gives richer gradient signal for word-level alignment and is the right choice here.
- Text embedding geometry matters more than model size: A 300-dim GloVe model provides far better contrastive targets for this task than BERT, simply because its geometry for concrete nouns is well-separated.
- fMRI is surprisingly competitive: Despite operating on much slower signals (TR=2s BOLD vs. millisecond EEG/MEG), PCA-reduced fMRI responses carry enough word-discriminative signal to surpass EEG in this setup.
- Subject identity belongs in Stage 1 and Stage 2 differently: adversarial regularization for invariance in Stage 1, FiLM modulation for reconstruction diversity in Stage 2.
What's next for BENJI
- Stronger generalization: Subject-adaptive layers, test-time adaptation, or explicit domain alignment for new sessions and devices.
- Richer fMRI modeling: Optional 3D CNNs or surface-based models if spatial neighborhoods are preserved rather than flat PCA vectors.
- Tighter train/eval alignment: Projected vocabulary banks for retrieval metrics, calibration, and abstention when confidence is low.
- Open-vocabulary direction: Retrieval over large embedding stores, or coupling to phoneme/subword units, for vocabularies beyond a fixed list.
- Science-facing evaluation: N400-style analyses, ROI-level fMRI agreement, and round-trip consistency reports as first-class artifacts in every run.
- Productization: Cleaner CLI, packaged configs, and documentation for collaborators who are not deep in the repo every week.
Built With
- cnn
- glove
- mlp
- python
- transformers