Inspiration

The inspiration comes from challenges facing UK education and human interaction. Rising absenteeism, mental health difficulties among children, and declining communication skills since COVID have created a need for innovative solutions that support students and revolutionize human interaction.

What it does

Dolly AI lets you create lifelike AI avatars from a single image and hold real-time conversations with them. It leverages Pixtral 12B to reason and answer questions about its surroundings. It aims to revolutionize human interaction by allowing people to connect with anyone, anytime, anywhere through AI-powered avatars. We focus on the specific use case of giving students access to high-quality, engaging tutoring.

How we built it

The user's audio is passed through the GPT-4o Realtime API. This then controls Pixtral 12B, which also takes the video stream as input, to answer questions about its surroundings. The output audio from GPT-4o then drives "Real3D-Portrait", a deep learning model that generates lip and body movements from an input image. Finally, ElevenLabs dubs the voice.
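The flow above can be sketched as a chain of four stages. This is only a minimal illustration of the data flow, not our actual code: `AvatarPipeline`, `run_turn`, and the four stage callables are hypothetical placeholders standing in for the GPT-4o Realtime API, Pixtral 12B, Real3D-Portrait, and ElevenLabs, and none of them are real SDK calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Every name here is a hypothetical placeholder, not a real SDK call.
VisionTool = Callable[[str], str]


@dataclass
class AvatarPipeline:
    # Speech model: takes user audio plus a vision tool it may call, returns reply audio.
    converse: Callable[[bytes, VisionTool], bytes]
    # Vision model: takes a camera frame and a question, returns a text answer.
    describe_scene: Callable[[bytes, str], str]
    # Portrait model: takes the source image and driving audio, returns video frames.
    animate: Callable[[bytes, bytes], List[bytes]]
    # Voice service: takes the reply audio, returns the dubbed audio.
    dub: Callable[[bytes], bytes]

    def run_turn(
        self, user_audio: bytes, camera_frame: bytes, portrait: bytes
    ) -> Tuple[List[bytes], bytes]:
        """Run one conversational turn and return (video frames, dubbed audio)."""

        # Bind the vision model to the current camera frame so the speech
        # model can ask about the surroundings when it needs to.
        def look(question: str) -> str:
            return self.describe_scene(camera_frame, question)

        reply_audio = self.converse(user_audio, look)
        frames = self.animate(portrait, reply_audio)
        voice = self.dub(reply_audio)
        return frames, voice
```

Passing each stage in as a callable keeps the sketch independent of any particular client library, so a stage can be swapped or stubbed out while testing the rest.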

Challenges we ran into

The biggest challenge was speeding up the inference of Real3D-Portrait, since it is a large model. The other main challenge was streaming the output video data from the model and switching it between generations.
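To make the switching problem concrete, here is one possible way to hand a continuous stream over from one generated clip to the next. It is a sketch under assumptions, not a description of our implementation: `ClipSwitcher`, `enqueue`, and `next_frame` are hypothetical names, frames are treated as opaque bytes, and the idea of looping idle frames between generations is an assumption made for illustration.

```python
from collections import deque
from itertools import cycle
from typing import Deque, Iterator, List


class ClipSwitcher:
    """Serves frames one at a time, looping idle frames between generated clips."""

    def __init__(self, idle_frames: List[bytes]) -> None:
        if not idle_frames:
            raise ValueError("at least one idle frame is required")
        self._idle = cycle(idle_frames)
        self._pending: Deque[List[bytes]] = deque()
        self._current: Iterator[bytes] = iter(())

    def enqueue(self, clip: List[bytes]) -> None:
        """Queue a freshly generated clip to play after the current one ends."""
        self._pending.append(clip)

    def next_frame(self) -> bytes:
        """Return the next frame to send to the viewer."""
        # Keep playing the current clip while it still has frames.
        frame = next(self._current, None)
        if frame is not None:
            return frame
        # The current clip is finished: start the next queued one, if any.
        if self._pending:
            self._current = iter(self._pending.popleft())
            return self.next_frame()
        # Nothing is queued, so fall back to the idle loop.
        return next(self._idle)
```

The streaming loop only ever calls `next_frame()`, so the viewer never sees a gap while the model is still producing the next clip.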

Accomplishments that we're proud of

Solving the issues above! Also, since the Realtime API is very recent, this is probably the first project of its kind.

What we learned

Portrait animation models are great, but there's still room for improvement in emotional expressiveness and in more lightweight models.

What's next for Dolly AI

Improve the facial emotions of our generations. Provide agentic capabilities to the model (e.g. scheduling meetings) and expand its tool usage.
