Inspiration

We are a group of PhD students in AI and Robotics at UCL.

What it does

Trains RL policies in the style of RLHF, but the reward model is trained on feedback from an LLM instead of feedback from humans (RLAIF).
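As a rough illustration of the idea, the sketch below fits a reward model from pairwise preferences with a Bradley-Terry (logistic) loss, which is a common choice in RLHF-style setups. It is not our actual code: `stub_llm_prefers_first`, the 4x4 grid, the goal position and the linear features are all assumptions made so the example runs offline, and the stub replaces the real LLM query with a simple distance heuristic.

```python
import math
import random


def stub_llm_prefers_first(state_a, state_b, goal=(3, 3)):
    """Hypothetical stand-in for an LLM preference query.

    A real pipeline would describe both states in text and ask the LLM which
    one looks closer to solving the task. Here the answer is faked with the
    Manhattan distance to an assumed goal cell.
    """
    def dist(s):
        return abs(s[0] - goal[0]) + abs(s[1] - goal[1])
    return dist(state_a) < dist(state_b)


def features(state):
    # Assumed feature vector: just the (x, y) grid coordinates.
    x, y = state
    return [float(x), float(y)]


def reward(weights, state):
    # Linear reward model: r(s) = w . phi(s)
    return sum(w * f for w, f in zip(weights, features(state)))


def train_reward_model(pairs, labels, lr=0.05, epochs=200):
    """Fit the linear reward with the Bradley-Terry preference loss."""
    weights = [0.0, 0.0]
    for _ in range(epochs):
        for (a, b), a_preferred in zip(pairs, labels):
            # Modelled probability that a is preferred: sigmoid(r(a) - r(b))
            p = 1.0 / (1.0 + math.exp(-(reward(weights, a) - reward(weights, b))))
            target = 1.0 if a_preferred else 0.0
            fa, fb = features(a), features(b)
            # Gradient of the log-loss is (p - target) * (phi(a) - phi(b))
            for i in range(len(weights)):
                weights[i] -= lr * (p - target) * (fa[i] - fb[i])
    return weights


rng = random.Random(0)
states = [(x, y) for x in range(4) for y in range(4)]
candidates = [(rng.choice(states), rng.choice(states)) for _ in range(200)]
# Keep only pairs the stub ranks one way or the other (drops ties).
pairs = [(a, b) for a, b in candidates
         if stub_llm_prefers_first(a, b) != stub_llm_prefers_first(b, a)]
labels = [stub_llm_prefers_first(a, b) for a, b in pairs]

weights = train_reward_model(pairs, labels)
print("reward at start:", reward(weights, (0, 0)))
print("reward at goal: ", reward(weights, (3, 3)))
```

The learned `reward` function can then replace the environment reward when training the policy with any standard RL algorithm.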

How we built it

Python

Challenges we ran into

RL is hard...

Accomplishments that we're proud of

We managed to train a MiniGrid agent using a learned reward model.

What we learned

Translating observations into text is difficult. We expect vision-language models (VLMs) to help here, but Claude is not multimodal yet.
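To give a feel for what this translation step involves, here is a minimal sketch that turns a small symbolic grid into a sentence an LLM could read. The character-based grid encoding, the `describe_observation` function and the object names are hypothetical and chosen for illustration; they are not the MiniGrid observation format or our actual prompt.

```python
def describe_observation(grid, agent_pos):
    """Render a small symbolic grid as one sentence of plain text.

    grid is a list of equal-length strings (assumed encoding: "." is empty,
    "K" is a key, "D" is a door, "G" is the goal); agent_pos is (column, row).
    """
    names = {"K": "key", "D": "door", "G": "goal"}
    ax, ay = agent_pos
    parts = []
    for y, row in enumerate(grid):
        for x, cell in enumerate(row):
            if cell not in names:
                continue  # skip empty cells and anything unrecognised
            dx, dy = x - ax, y - ay
            horiz = f"{abs(dx)} right" if dx > 0 else f"{abs(dx)} left" if dx < 0 else ""
            vert = f"{abs(dy)} down" if dy > 0 else f"{abs(dy)} up" if dy < 0 else ""
            where = " and ".join(p for p in (horiz, vert) if p) or "here"
            parts.append(f"a {names[cell]} {where}")
    seen = "; ".join(parts) if parts else "nothing of interest"
    return f"The agent is at column {ax}, row {ay}. It sees: {seen}."


example_grid = [
    "....",
    ".K..",
    "...G",
]
text = describe_observation(example_grid, (0, 0))
print(text)
```

Even in this toy version, choices such as relative versus absolute positions or what to leave out change what the LLM can infer, which is part of why the step is hard to get right.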

What's next for RLAIF using LLMs

Continue scaling to more challenging environments!
