Inspiration
Dolly AI was inspired by the challenges facing UK education and human interaction. Rising absenteeism, mental health difficulties among children, and a decline in communication skills since COVID have created a need for innovative solutions to support students and revolutionize human interaction.
What it does
Dolly AI allows you to create lifelike AI avatars from a single image and have real-time conversations with them. Dolly AI leverages Pixtral 12B to reason and answer questions about its surroundings. It aims to revolutionize human interaction by allowing people to connect with anyone, anytime, anywhere through AI-powered avatars. We focus on the specific use case of providing access to high-quality and engaging tutoring for students.
How we built it
The user's audio is passed through the GPT-4o Realtime API. GPT-4o then controls Pixtral 12B, which also takes the video stream as input, to answer questions about its surroundings. The output audio from GPT-4o then drives "Real3D-Portrait", a deep learning model that generates lip and body movements from an input image. Finally, ElevenLabs dubs the voice.
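The flow above can be sketched roughly as follows. This is only an illustrative outline, not our actual code: `run_turn`, `TurnResult`, and the four stage callables (`converse`, `describe_scene`, `animate`, `dub`) are hypothetical names standing in for the GPT-4o Realtime API, Pixtral 12B, Real3D-Portrait, and ElevenLabs respectively, and audio, frames, and video are treated as opaque bytes.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TurnResult:
    audio: bytes                      # reply audio after the dubbing stage
    video: bytes                      # animated portrait output (opaque bytes here)
    scene_notes: List[str] = field(default_factory=list)  # answers from the vision stage


def run_turn(
    user_audio: bytes,
    camera_frame: bytes,
    portrait: bytes,
    *,
    converse: Callable[[bytes, Callable[[str], str]], bytes],
    describe_scene: Callable[[bytes, str], str],
    animate: Callable[[bytes, bytes], bytes],
    dub: Callable[[bytes], bytes],
) -> TurnResult:
    """Run one conversational turn through the four stages (all hypothetical stand-ins)."""
    notes: List[str] = []

    def look(question: str) -> str:
        # The speech model can call this to ask the vision model about the current frame.
        answer = describe_scene(camera_frame, question)
        notes.append(answer)
        return answer

    reply_audio = converse(user_audio, look)   # speech model produces the reply audio
    video = animate(portrait, reply_audio)     # reply audio drives the portrait animation
    final_audio = dub(reply_audio)             # voice is dubbed as the last step
    return TurnResult(audio=final_audio, video=video, scene_notes=notes)
```

Passing the stages in as callables keeps the sketch independent of any particular SDK, so each stage can be swapped for a stub when testing.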
Challenges we ran into
The biggest challenge was speeding up the inference of Real3D-Portrait, since it is a large model. The other main challenge was streaming the output video data from the model and switching it between generations.
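One possible way to handle switching between generations is sketched below. It is an assumption-laden illustration rather than our implementation: `FrameSwitcher` and its methods are hypothetical names, frames are arbitrary objects, and the idea is simply to tag each generation with an id, drop frames from stale generations, and fall back to an idle frame when nothing is queued.

```python
from collections import deque
from typing import Any, Deque


class FrameSwitcher:
    """Hypothetical helper that serves frames from the newest generation only."""

    def __init__(self, idle_frame: Any) -> None:
        self._idle = idle_frame            # shown when no generated frame is waiting
        self._queue: Deque[Any] = deque()  # frames of the current generation
        self._current_gen = 0

    def start_generation(self) -> int:
        """Begin a new generation and discard queued frames from the previous one."""
        self._current_gen += 1
        self._queue.clear()
        return self._current_gen

    def push(self, gen_id: int, frame: Any) -> bool:
        """Queue a frame; frames from an outdated generation are rejected."""
        if gen_id != self._current_gen:
            return False
        self._queue.append(frame)
        return True

    def next_frame(self) -> Any:
        """Return the next frame to display, or the idle frame if the queue is empty."""
        return self._queue.popleft() if self._queue else self._idle
```

A display loop would call `next_frame()` at a fixed rate, while the model's output thread calls `push()` with the id returned by `start_generation()`. In a real multi-threaded setup the queue access would also need a lock.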
Accomplishments that we're proud of
Solving the above issues! Also, since the realtime API is very recent, this is probably the first project of its kind.
What we learned
Portrait animation models are great, but there is still room for improvement in terms of emotions and more lightweight models.
What's next for Dolly AI
- Improve the facial emotions of our generations.
- Provide agentic capabilities to the model (e.g. scheduling meetings) and expand its tool usage.
Built With
- elevenlabs
- javascript
- mistral
- openai
- python
- pytorch