Inspiration

An image can convey thousand words, can the same image be used to generate ideas or debates. We ponder how much expressive can an image be, especially with Multi Modal LLMs like Pixtral

What it does

Given an image , it generates podcast, ideating on the content of the image , we can choose to select the style that we want for it to respond using the language as well as visual capabilities of the Pixtral Model.

How we built it

We use Pixtral API to generate the understanding of the image and then ask it to generate a two person podcast on the topic of the given image. We use Real Time Open AI API to generate the voice from the text generated from Pixtral Model.

Challenges we ran into

The prompt engineering for Pixtral to steer the format and style of responses of Pixtral model.

Accomplishments that we're proud of

The model works as expected can capture the true understanding of it and create flexible dual person perspective podcast ideating on the given prompt and ideas.

What we learned

Multi Modal models

What's next for Untitled

Built With

  • api
  • mistal
  • pixtral
  • realtime
  • voice
Share this project:

Updates