Inspiration
An image can convey thousand words, can the same image be used to generate ideas or debates. We ponder how much expressive can an image be, especially with Multi Modal LLMs like Pixtral
What it does
Given an image , it generates podcast, ideating on the content of the image , we can choose to select the style that we want for it to respond using the language as well as visual capabilities of the Pixtral Model.
How we built it
We use Pixtral API to generate the understanding of the image and then ask it to generate a two person podcast on the topic of the given image. We use Real Time Open AI API to generate the voice from the text generated from Pixtral Model.
Challenges we ran into
The prompt engineering for Pixtral to steer the format and style of responses of Pixtral model.
Accomplishments that we're proud of
The model works as expected can capture the true understanding of it and create flexible dual person perspective podcast ideating on the given prompt and ideas.
What we learned
Multi Modal models
What's next for Untitled
Built With
- api
- mistal
- pixtral
- realtime
- voice
Log in or sign up for Devpost to join the conversation.