Interest and Understanding of Deep Learning:
Inspired by the paper “Less is More: Recursive Reasoning with Tiny Networks”, we plan to implement a hybrid architecture built on the Tiny Recursive Model (TRM). We plan to combine attention mechanisms with TRM-style recursion to create a language model that iteratively improves its outputs internally and, on simple reasoning tasks, performs as well as or better than a traditional transformer language model of the same size.
This architecture interests us for several reasons, and implementing it is good practice for our understanding of deep learning. First, as the name suggests, these networks are small and thus require less computation and memory than typical deep learning models. As outlined in the paper's abstract, TRMs can achieve similar or better results on certain puzzle tasks with roughly 0.01% of the parameters required by traditional LLMs. We can therefore most likely train these models locally and quickly, allowing us to run many experimental iterations.
Second, while TRMs have been tested on grid-like puzzle tasks such as Sudoku and ARC, to our knowledge they have not yet been applied to text-based reasoning tasks. We aim to extend the TRM idea in this direction.
Our goal is to explore the recursive reasoning architecture from the TRM paper in tiny Transformer-based language models. Our idea is to create two models: a standard transformer model and a variant with recursive reasoning. Both models would have the same number of parameters, roughly 10 to 20 million each. By evaluating both models on short math and logic word problems, we aim to understand whether recursion allows small models to reason better than the traditional approach at the same parameter count. We will compare the models on final accuracy and study how the recursive model's performance changes as a function of recursion depth.
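To make the equal-parameter comparison concrete, here is a rough parameter-budget sketch. It uses the common approximation of about 12·d² weights per Transformer layer (attention plus a 4x MLP) and ignores biases, layer norms, and positional embeddings. The configuration values (`d_model=384`, `n_layers=6`, `vocab_size=8192`) are illustrative assumptions, not settings we have committed to. It also assumes the recursive variant reuses the same weights at every step, so recursion adds compute but not parameters.

```python
def transformer_param_count(d_model, n_layers, vocab_size, mlp_ratio=4):
    """Approximate weight count of a small Transformer.

    Ignores biases, layer norms, and positional embeddings, and assumes
    the output head is tied to the token embeddings.
    """
    attention = 4 * d_model * d_model        # Q, K, V, and output projections
    mlp = 2 * mlp_ratio * d_model * d_model  # up- and down-projection
    embeddings = vocab_size * d_model        # token embedding table
    return n_layers * (attention + mlp) + embeddings


def effective_depth(n_layers, recursion_steps):
    """Layers applied per forward pass when the same block is run repeatedly."""
    return n_layers * recursion_steps


# Hypothetical configuration chosen to land inside the 10 to 20M budget.
baseline_params = transformer_param_count(d_model=384, n_layers=6, vocab_size=8192)
print(f"parameters: {baseline_params:,}")

# With shared weights, the parameter count stays fixed while depth grows.
for steps in (1, 2, 4, 8):
    print(f"recursion steps={steps}: effective depth={effective_depth(6, steps)}")
```

Under these assumptions both models share one parameter budget, and recursion depth becomes the single variable we sweep when comparing accuracy.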
Unlike the original TRM architecture, which used CNNs and worked only on grid-based problems, our model replaces the CNN block with a Transformer encoder layer that incorporates self-attention, allowing it to process text sequentially and produce an initial answer. As in TRM, the generated answer is then concatenated with the original question and fed back into the model, so the model can refine its previous output. This loop runs for a specified number of steps, which effectively lets our model “think” iteratively.
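The refinement loop described above can be sketched in plain Python. This is a minimal illustration, not our implementation: `recursive_refine`, the `SEP` separator token, and `toy_model` are hypothetical names, and `toy_model` is only a stand-in for the trained network so that the control flow can be run end to end.

```python
from typing import Callable, List, Tuple

Tokens = List[int]
SEP = -1  # hypothetical separator between the question and the current answer


def recursive_refine(
    model: Callable[[Tokens], Tokens],
    question: Tokens,
    num_steps: int,
) -> Tuple[Tokens, List[Tokens]]:
    """Produce an initial answer, then refine it for num_steps iterations."""
    answer = model(question + [SEP])  # initial answer from the question alone
    history = [answer]
    for _ in range(num_steps):
        # Feed the previous answer back in, concatenated with the question.
        answer = model(question + [SEP] + answer)
        history.append(answer)
    return answer, history


def toy_model(tokens: Tokens) -> Tokens:
    """Stand-in for the network: nudges a one-token answer toward sum(question)."""
    split = tokens.index(SEP)
    question, previous = tokens[:split], tokens[split + 1:]
    target = sum(question)
    guess = previous[0] if previous else 0
    if guess < target:
        guess += 1
    return [guess]


final, history = recursive_refine(toy_model, question=[1, 2], num_steps=3)
print(final, history)
```

Keeping the per-step answers in `history` is a design choice for evaluation: it lets us score the answer at every depth from a single run when studying accuracy as a function of recursion depth.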
We plan to use filtered training data from datasets like Microsoft’s Orca-Math-Word-Problems-200k and OpenAI’s GSM8K. Filtering is required because of the minuscule size of our model: we need to ensure that the inputs for reasoning tasks are simple and short enough for the model to process. These datasets contain question and answer pairs for reasoning tasks, such as (“Jungkook is the 5th place. Find the number of people who crossed the finish line faster than Jungkook.”, “If Jungkook is in 5th place, then 4 people crossed the finish line faster than him.”).
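One simple way to filter is by length, sketched below. Using word count as a proxy for difficulty is an assumption on our part, and the thresholds (`max_question_words=40`, `max_answer_words=60`) are placeholders we would tune. The function names are hypothetical, and the sketch works on plain (question, answer) string pairs rather than on any particular dataset loader.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def is_simple_example(
    question: str,
    answer: str,
    max_question_words: int = 40,  # placeholder threshold
    max_answer_words: int = 60,    # placeholder threshold
) -> bool:
    """Keep only pairs whose question and answer are both short."""
    return (
        len(question.split()) <= max_question_words
        and len(answer.split()) <= max_answer_words
    )


def filter_dataset(pairs: Iterable[Pair]) -> List[Pair]:
    """Return the subset of pairs that pass the length filter."""
    return [(q, a) for q, a in pairs if is_simple_example(q, a)]


examples = [
    (
        "Jungkook is the 5th place. Find the number of people who crossed "
        "the finish line faster than Jungkook.",
        "If Jungkook is in 5th place, then 4 people crossed the finish line "
        "faster than him.",
    ),
    ("A long multi-step problem. " * 50, "A long worked solution. " * 50),
]

kept = filter_dataset(examples)
print(f"kept {len(kept)} of {len(examples)} examples")
```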
Key Limitations:
One key limitation we anticipate is getting the right data. Given the small size of our model (10 to 20M parameters), most relevant reasoning datasets available online contain problems that are too long and require more complex reasoning than we intend to evaluate. We will therefore have to select and filter datasets carefully to suit our study.
Furthermore, building this model from scratch will pose a significant challenge. Keeping track of gradients and implementing backpropagation and attention for a recursive model will require a deep understanding of the framework and careful attention to detail to ensure that weights are updated properly. Existing deep learning frameworks will be useful, but we will not be able to depend on them for every operation in this model.
Finally, as mentioned above, because of the size and training constraints of our model, we intend it to work only on small reasoning tasks. It cannot, and is not meant to, compete with general-purpose LLMs, just as the original TRM was not meant to rival LLMs.