Inspiration

The method we implemented is already used in robotics, so we decided to apply it to LLMs.

What it does

It applies direct preference optimization (DPO) to the output of a large model to train a smaller model, increasing the task difficulty at each step. Because the large model generates the training data, a model can be trained on any desired task without an existing dataset. A user interface makes the project easier to use.
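As a rough illustration of the objective behind direct preference optimization, the sketch below computes the standard DPO loss for a single preference pair from summed log-probabilities. It is not the project's training code: the function name `dpo_loss`, the argument names, and the default `beta=0.1` are assumptions made for this example.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the summed log-probability of a full response under
    either the model being trained (policy) or the frozen reference model.
    All names and the beta default are illustrative assumptions.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return loss, chosen_reward, rejected_reward


# Policy identical to the reference: the margin is 0 and the loss is log(2).
neutral, _, _ = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Policy that favours the chosen response more than the reference does.
improved, _, _ = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

The loss shrinks as the policy assigns relatively more probability to the preferred response than the reference model does, which is the signal the smaller model is trained on.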

How we built it

We built a pipeline that prompts the large model and parses its output using LangChain. The generated dataset is then passed to the training step, which uses DPOTrainer from Hugging Face's TRL library. Finally, we select the worst-ranked exercises by average reward and prompt the large model again to generate more tasks of the same type.
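The selection step at the end of the pipeline can be sketched as follows. This is a minimal, library-free illustration under stated assumptions: the reward log format, the helper names `worst_exercises` and `build_followup_prompt`, and the prompt wording are all hypothetical and are not taken from the project's code.

```python
from collections import defaultdict
from statistics import mean


def worst_exercises(reward_log, k=2):
    """Return the k exercise types with the lowest average reward.

    reward_log is a list of (exercise_type, reward) tuples collected
    during training (an assumed format for this sketch).
    """
    rewards_by_type = defaultdict(list)
    for exercise_type, reward in reward_log:
        rewards_by_type[exercise_type].append(reward)

    averages = {t: mean(rs) for t, rs in rewards_by_type.items()}
    # Sort ascending by average reward so the weakest types come first.
    return sorted(averages, key=averages.get)[:k]


def build_followup_prompt(exercise_types, n_new=5):
    """Build a (hypothetical) prompt asking the large model for more tasks."""
    topics = ", ".join(exercise_types)
    return (
        f"Generate {n_new} new, slightly harder exercises "
        f"for each of these task types: {topics}."
    )


reward_log = [
    ("sorting", 0.9), ("sorting", 0.7),
    ("recursion", 0.1), ("recursion", 0.3),
    ("regex", 0.4),
]
weakest = worst_exercises(reward_log, k=2)
prompt = build_followup_prompt(weakest)
```

Feeding the weakest task types back into the generation prompt is what lets the dataset concentrate on the areas where the smaller model performs worst.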

Challenges we ran into

We faced several challenges, chiefly generating a unique, high-quality dataset. Implementing the whole pipeline in one go was also one of the most difficult parts.

Accomplishments that we're proud of


What we learned

We learned prompting and output-parsing methods as well as new RL techniques.

What's next for smol.

Support for more training algorithms and more precise dataset generation.

Built With

  • huggingface
  • langchain
  • nebius
  • python