Inspiration
We are a group of PhD students in AI and Robotics at UCL.
What it does
Trains RL policies in an RLHF-style loop, except that the reward model is trained on feedback from an LLM rather than from human labelers.
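As a rough illustration of the reward-model step, here is a minimal sketch that fits a linear reward model to pairwise preferences with a Bradley-Terry (logistic) loss. Everything in it is an assumption made for illustration: the write-up does not say whether feedback was collected as pairwise preferences, the two-number feature vectors are made up, and `toy_labeler` is a hypothetical stand-in for the LLM query.

```python
import math
import random


def reward(weights, features):
    """Linear reward: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def train_reward_model(preferences, n_features, lr=0.1, epochs=200):
    """Fit weights by gradient ascent on the Bradley-Terry log-likelihood.

    preferences: list of (preferred_features, rejected_features) pairs.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for good, bad in preferences:
            margin = reward(weights, good) - reward(weights, bad)
            p_good = 1.0 / (1.0 + math.exp(-margin))
            # d/dw log sigmoid(margin) = (1 - p_good) * (good - bad)
            scale = 1.0 - p_good
            for i in range(n_features):
                weights[i] += lr * scale * (good[i] - bad[i])
    return weights


def toy_labeler(seg_a, seg_b):
    """Hypothetical stand-in for the LLM: prefers the segment whose first
    feature (assumed here to be distance to the goal) is smaller."""
    return (seg_a, seg_b) if seg_a[0] < seg_b[0] else (seg_b, seg_a)


rng = random.Random(0)
# Assumed features per segment: [distance_to_goal, unrelated_noise]
segments = [[rng.random(), rng.random()] for _ in range(40)]
pairs = [toy_labeler(segments[i], segments[i + 1]) for i in range(0, 40, 2)]
weights = train_reward_model(pairs, n_features=2)
```

The learned `reward` could then replace the environment reward when training the policy. In a real setup the linear model would likely be a neural network and the labeler would be a prompt to the LLM.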
How we built it
Python
Challenges we ran into
RL is hard...
Accomplishments that we're proud of
We trained a MiniGrid agent using the learned reward model.
What we learned
Translating observations into text is difficult. We expect vision-language models (VLMs) to help here, but Claude is not multimodal yet.
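To make the difficulty concrete, here is a minimal sketch of one way a grid observation could be turned into text for an LLM. The character-based grid encoding, the `CELL_NAMES` mapping, the direction ordering, and `describe_observation` are all hypothetical and are not the environment's actual observation format or our exact prompt.

```python
# Hypothetical cell encoding: '#' wall, '.' empty, 'G' goal, 'K' key, 'D' door
CELL_NAMES = {"G": "goal", "K": "key", "D": "door"}
# Assumed direction indexing for this sketch
DIRECTIONS = ["east", "south", "west", "north"]


def describe_observation(grid, agent_pos, agent_dir):
    """Return a plain-English description of the agent and notable objects."""
    row, col = agent_pos
    sentences = [
        f"The agent is at row {row}, column {col}, "
        f"facing {DIRECTIONS[agent_dir]}."
    ]
    for r, line in enumerate(grid):
        for c, cell in enumerate(line):
            if cell in CELL_NAMES:
                # Manhattan distance, which ignores walls
                steps = abs(r - row) + abs(c - col)
                sentences.append(
                    f"There is a {CELL_NAMES[cell]} at row {r}, column {c} "
                    f"({steps} steps away ignoring walls)."
                )
    return " ".join(sentences)


grid = [
    "#####",
    "#..G#",
    "#.#.#",
    "#...#",
    "#####",
]
text = describe_observation(grid, agent_pos=(3, 1), agent_dir=0)
print(text)
```

Even in this toy case the description drops information such as wall layout, which is part of why we expect image-based inputs to help.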
What's next for RLAIF using LLMs
Continue scaling to more challenging environments!