The ability of large networks to retain emergent behaviours depends heavily on storing a large body of data, keeping the network's bank of experiences diverse and vast. Even if no data is discarded, adding more recent data from the world shifts the distribution within that bank in hard-to-measure ways. This shift is known to induce catastrophic forgetting in both LLMs and reinforcement learning models, though the former are more robust, owing largely to their enormous parameter counts. Resistance to forgetting, or memory stability, is a major theme of research in continual learning: how well can we retain a mapping from a data distribution we no longer have access to? Current approaches using backpropagation require a combination of techniques to retain information, and sometimes incur a tradeoff between memory stability (holding on to memories) and plasticity (acquiring new memories).

Almost all current research in continual learning uses backpropagation as the fundamental tool for tuning networks; only recently have local and semi-local learning algorithms taken the stage, mostly to see whether they can match or supplement the reliable backpropagation algorithm, and very rarely to compare the two on catastrophic-forgetting benchmarks (Feng et al., 2025). The principle motivating local learning is biological plausibility. Backpropagation applies a global learning rule to every layer and every neuron in the network during an update, giving each layer upstream of the forward information flow knowledge of exactly how wrong the following layers are; information thus flows directly between neurons that are connected only transitively. This is far different from the very local rules of biological learning, which depend only on pre- and post-synaptic activations and do not necessarily require an explicit error signal to correct themselves.
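To make the contrast concrete, here is a minimal sketch of what "local" means in this sense: a Hebbian-style update in which each synapse changes based only on its own pre- and post-synaptic activations, with no global error signal. The function name `hebbian_update` and the toy dimensions are my own illustration, not a rule used by any specific method discussed here.

```python
import numpy as np

def hebbian_update(W, pre, post, lr=0.01):
    """Local weight update: each synapse W[i, j] changes based only on
    its own pre-synaptic activation pre[j] and post-synaptic activation
    post[i] -- no error is propagated back from later layers."""
    return W + lr * np.outer(post, pre)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 8))  # 8 inputs -> 4 outputs
pre = rng.standard_normal(8)            # pre-synaptic activations
post = np.maximum(W @ pre, 0.0)         # post-synaptic activations (ReLU)
W = hebbian_update(W, pre, post)
```

Contrast this with backpropagation, where the update to `W` would also depend on gradients flowing back from every layer downstream of it.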

The forward-forward (FF) algorithm is one such implementation of local learning rules, and it does not even require an output layer for classification tasks. Each layer in an FF network contributes individually to a prediction while passing increasingly refined representations on to the next. There is no notion of "loss" in the backpropagation sense here. The objective maximized, then, is congruence (high neuronal activation) when seeing a good example and incongruence (low neuronal activation) otherwise. The notion of congruence (or goodness, in Hinton's preliminary report) is left intentionally abstract: it could mean anything from the label "cat" concatenated onto a picture of that creature (congruent) to the label "cat" paired with a picture of a bird (incongruent). Combining a very general energy-minimizing objective with local learning rules lends credence to the algorithm's biological plausibility; given that, it merits investigation whether it provides resistance to forgetting similar to that of animal brains.
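A single FF layer can be sketched as follows, assuming the common choice of goodness as the sum of squared ReLU activations pushed above or below a threshold via a logistic objective (the class name `FFLayer`, the threshold value, and the random stand-in inputs are my assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class FFLayer:
    """One forward-forward layer, trained with a purely local objective:
    push goodness (sum of squared activations) above a threshold theta
    for positive (congruent) inputs and below it for negative ones."""

    def __init__(self, n_in, n_out, theta=2.0, lr=0.03, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(n_out, n_in))
        self.theta, self.lr = theta, lr

    def forward(self, x):
        # Normalize the input so goodness from the previous layer
        # cannot leak through to this one.
        x = x / (np.linalg.norm(x) + 1e-8)
        return np.maximum(self.W @ x, 0.0)  # ReLU activations

    def local_update(self, x, positive):
        x = x / (np.linalg.norm(x) + 1e-8)
        h = np.maximum(self.W @ x, 0.0)
        goodness = np.sum(h ** 2)
        sign = 1.0 if positive else -1.0
        # Gradient of -log(sigmoid(sign * (goodness - theta))) w.r.t. W;
        # uses only this layer's own input x and activations h.
        p = sigmoid(sign * (goodness - self.theta))
        grad = -(1.0 - p) * sign * 2.0 * np.outer(h, x)
        self.W -= self.lr * grad
        return goodness

rng = np.random.default_rng(1)
layer = FFLayer(794, 64)            # 784 MNIST pixels + 10-way label one-hot
x_pos = rng.standard_normal(794)    # stand-in: image with its correct label
x_neg = rng.standard_normal(794)    # stand-in: same image with a wrong label
for _ in range(10):
    layer.local_update(x_pos, positive=True)
    layer.local_update(x_neg, positive=False)
```

After a few updates the layer's goodness on the congruent input rises above that of the incongruent one, which is all a prediction requires: at test time, the label yielding the highest total goodness across layers wins.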

In this preliminary study, I analyze the performance of an FF-trained neural network on the MNIST digit classification task with full data and track its degradation over time using a confusion matrix, comparing it against a backpropagation-trained baseline with hyperparameters matched as closely as possible.
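The tracking itself is straightforward; a minimal sketch of the bookkeeping (the helper names `confusion_matrix` and `per_class_accuracy` are mine, not from a particular library):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=10):
    """Rows = true digit, columns = predicted digit; entry [i, j]
    counts how often digit i was predicted as digit j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_accuracy(cm):
    # Diagonal over row sums; an entry falling across training
    # phases flags forgetting of that particular digit.
    return np.diag(cm) / np.maximum(cm.sum(axis=1), 1)
```

Recomputing this matrix on a held-out set after each training phase, for both the FF network and the BP baseline, gives a per-digit view of which classes degrade and how quickly.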

Other studies using different local training algorithms, such as EASE and APER (Feng et al., 2025), have been carried out with BP as a baseline, but they mostly target large vision-transformer domains; I limit my study to small classification networks to see the effect in relatively lower-dimensional settings.
