Dolly AI

Inspiration

The inspiration comes from the challenges facing UK education and human interaction. Rising absenteeism, mental health difficulties among children, and declining communication skills post-COVID have created a need for innovative solutions that support students and revolutionize human interaction.

What it does

Dolly AI lets you create lifelike AI avatars from a single image and hold real-time conversations with them. Dolly AI leverages Pixtral 12B to reason and answer questions about its surroundings. It aims to revolutionize human interaction by allowing people to connect with anyone, anytime, anywhere through AI-powered avatars. We focus on the specific use case of providing access to high-quality, engaging tutoring for students.

How we built it

The user's audio is passed through the GPT-4o Realtime API. This then controls Pixtral 12B, which also takes the video stream as input so it can answer questions about its surroundings. The output audio from GPT-4o then drives "Real3D-Portrait", a deep learning model that generates the lip and body movements from an input image. Finally, ElevenLabs dubs the voice.
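The flow of one conversational turn can be sketched roughly as below. This is only an illustration of the ordering described above, not our actual code: the `Stages` container, `run_turn`, and every callable signature are hypothetical stand-ins, and the real services are replaced with toy lambdas. It also simplifies the control flow by always querying the vision model, whereas in practice GPT-4o decides when to do so.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stages:
    """Hypothetical stand-ins for the four services in the pipeline."""
    describe_scene: Callable[[bytes], str]          # Pixtral 12B: camera frame -> scene notes
    converse: Callable[[bytes, str], bytes]         # GPT-4o Realtime: user audio + notes -> reply audio
    animate: Callable[[bytes, bytes], List[bytes]]  # Real3D-Portrait: portrait + audio -> video frames
    dub: Callable[[bytes], bytes]                   # ElevenLabs: reply audio -> re-voiced audio


def run_turn(stages: Stages, user_audio: bytes, camera_frame: bytes,
             portrait: bytes) -> Tuple[List[bytes], bytes]:
    """Run one turn: look at the scene, reply, animate the portrait, dub the voice."""
    scene = stages.describe_scene(camera_frame)
    reply_audio = stages.converse(user_audio, scene)
    frames = stages.animate(portrait, reply_audio)
    voiced = stages.dub(reply_audio)
    return frames, voiced


# Toy stages so the sketch runs without any external service.
demo = Stages(
    describe_scene=lambda frame: "a desk with a laptop",
    converse=lambda audio, scene: b"reply about " + scene.encode(),
    animate=lambda portrait, audio: [portrait + b"#" + bytes([i]) for i in range(3)],
    dub=lambda audio: b"voiced:" + audio,
)
frames, voiced = run_turn(demo, b"user", b"frame", b"face")
```

Keeping each stage behind a plain callable makes it easy to swap a slow model for a stub while testing the rest of the chain.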

Challenges we ran into

The biggest challenge was speeding up inference for Real3D-Portrait, since it is a large model. The other main challenge was streaming the output video from the model and switching the stream between generations.
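One way to think about the switching problem is a small frame source that plays each generated clip in order and falls back to an idle frame while the next one is still rendering. The sketch below is a simplified illustration under that assumption, not a description of what we shipped; `FrameSwitcher`, `submit`, and `next_frame` are hypothetical names, and frames are plain strings here for readability.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Optional


class FrameSwitcher:
    """Serve frames from queued generations, or an idle frame when none is ready."""

    def __init__(self, idle_frame: str) -> None:
        self.idle_frame = idle_frame
        self._pending: Deque[Iterator[str]] = deque()
        self._current: Optional[Iterator[str]] = None

    def submit(self, frames: Iterable[str]) -> None:
        """Queue a newly generated clip (any iterable of frames)."""
        self._pending.append(iter(frames))

    def next_frame(self) -> str:
        """Return the next frame to display, switching clips when one runs out."""
        while True:
            if self._current is None:
                if not self._pending:
                    return self.idle_frame  # nothing generated yet: hold the idle pose
                self._current = self._pending.popleft()
            try:
                return next(self._current)
            except StopIteration:
                self._current = None  # clip finished, try the next queued one


switcher = FrameSwitcher("idle")
first = switcher.next_frame()
switcher.submit(["a1", "a2"])
switcher.submit(["b1"])
played = [switcher.next_frame() for _ in range(4)]
```

Because `submit` accepts any iterable, a generator that yields frames as the model produces them could be queued directly, so playback can start before a clip is fully rendered.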