RLAIF using LLMs

Inspiration

We are a group of PhD students in AI and Robotics at UCL.

We train RL policies in the style of RLHF, but with a reward model trained on LLM feedback instead of human feedback.
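
To make the idea concrete, here is a minimal sketch of fitting a reward model from pairwise preferences with a Bradley-Terry loss, a common choice in RLHF-style pipelines. It is an illustration under stated assumptions, not our project code: the linear reward, the one-dimensional "distance to goal" feature, and the `stand_in_llm_preference` function (which replaces a real LLM query with a simple rule) are all hypothetical.

```python
import math
import random


def reward(weights, features):
    """Linear reward model: r(s) = w . phi(s)."""
    return sum(w * x for w, x in zip(weights, features))


def segment_return(weights, segment):
    """Sum of predicted rewards over a trajectory segment."""
    return sum(reward(weights, features) for features in segment)


def train_reward_model(pairs, n_features, lr=0.1, epochs=200):
    """Fit weights by gradient ascent on the Bradley-Terry log-likelihood.

    pairs: list of (segment_a, segment_b, label), where label is 1.0 if
    segment_a was preferred and 0.0 otherwise.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for seg_a, seg_b, label in pairs:
            diff = segment_return(weights, seg_a) - segment_return(weights, seg_b)
            diff = max(-30.0, min(30.0, diff))  # keep exp() well-behaved
            p_a = 1.0 / (1.0 + math.exp(-diff))  # P(a preferred over b)
            for i in range(n_features):
                feat_diff = sum(f[i] for f in seg_a) - sum(f[i] for f in seg_b)
                weights[i] += lr * (label - p_a) * feat_diff
    return weights


def stand_in_llm_preference(seg_a, seg_b):
    """Hypothetical placeholder for an LLM query.

    A real pipeline would describe both segments in text and ask the LLM
    which one is better; here we simply prefer the smaller total distance.
    """
    dist_a = sum(f[0] for f in seg_a)
    dist_b = sum(f[0] for f in seg_b)
    return 1.0 if dist_a < dist_b else 0.0


def make_demo_pairs(n_pairs=50, seg_len=4, seed=0):
    """Random segments whose only feature is a made-up distance to the goal."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        seg_a = [(rng.random(),) for _ in range(seg_len)]
        seg_b = [(rng.random(),) for _ in range(seg_len)]
        pairs.append((seg_a, seg_b, stand_in_llm_preference(seg_a, seg_b)))
    return pairs


demo_pairs = make_demo_pairs()
learned_weights = train_reward_model(demo_pairs, n_features=1)
print(learned_weights)
```

The learned weight on the distance feature should come out negative, meaning states closer to the goal receive higher reward. The fitted `reward` function could then stand in for the environment reward when training the policy with a standard RL algorithm.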

Python

RL is hard...

We managed to train a MiniGrid agent using a reward model.

Translating observations into text is difficult – we expect VLMs to help here, but Claude is not multimodal yet
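
As a rough illustration of what this translation step involves, the sketch below turns a small character grid into a text description that could be placed in an LLM prompt. The character encoding, the `LEGEND` mapping, and the function names are assumptions made for this example; this is not MiniGrid's actual observation format.

```python
# Hypothetical symbol legend for this example only; walls ("#") and empty
# cells (".") are deliberately left out of the description.
LEGEND = {"G": "goal", "K": "key", "D": "door"}


def offset_phrase(d_row, d_col):
    """Describe a (row, column) offset relative to the agent in words."""
    parts = []
    if d_row:
        parts.append(f"{abs(d_row)} {'down' if d_row > 0 else 'up'}")
    if d_col:
        parts.append(f"{abs(d_col)} {'right' if d_col > 0 else 'left'}")
    if not parts:
        return "on the agent's cell"
    return " and ".join(parts) + " of the agent"


def describe_grid(rows, agent_symbol="A"):
    """Turn a list of equal-length strings into a short text description."""
    agent = None
    objects = []
    for r, row in enumerate(rows):
        for c, ch in enumerate(row):
            if ch == agent_symbol:
                agent = (r, c)
            elif ch in LEGEND:
                objects.append((LEGEND[ch], r, c))
    if agent is None:
        raise ValueError("grid contains no agent symbol")
    sentences = [f"The agent is at row {agent[0]}, column {agent[1]}."]
    for name, r, c in objects:
        phrase = offset_phrase(r - agent[0], c - agent[1])
        sentences.append(f"The {name} is {phrase}.")
    return " ".join(sentences)


example_grid = [
    "#####",
    "#A..#",
    "#..G#",
    "#####",
]
description = describe_grid(example_grid)
print(description)
```

For the example grid this prints "The agent is at row 1, column 1. The goal is 1 down and 2 right of the agent." Even in this toy case, the description drops walls and layout, which hints at why richer environments are hard to put into text.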

Continue scaling to more challenging environments!
