Inspiration
Control-Net was an exciting advancement in the field, but it focused on a single modality. The world is largely multi-modal, and there is no critic that governs Control-Net or co-pilots in general. So we set out to build a Control-Net (solely via prompting for now) for the multi-modal world, with a critic governing it. We thought the best way to show this is through literature!
What it does
We built a visual novel/poem book in which authors can generate content from images (or text) and change the story at any point, based on the rest of the book. This is inspired by the latest coding co-pilots, where the best performance comes from ingesting the entire codebase. All changes are monitored and evaluated by the critic so that only the best changes are kept.
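The critic-governed edit loop described above can be sketched roughly as follows. This is a minimal illustration and not our actual implementation: the `Book` class, the `Critic` signature, and `toy_critic` are hypothetical names, and the toy critic is a word-overlap stand-in for the model-based critic. It also covers text pages only, not images.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical critic signature (an assumption for this sketch):
# (all pages, index of the page under review, text to score) -> score, higher is better.
Critic = Callable[[List[str], int, str], float]


@dataclass
class Book:
    pages: List[str]
    # Log of every proposal: (page index, candidate text, candidate score, accepted?)
    history: List[Tuple[int, str, float, bool]] = field(default_factory=list)

    def propose_edit(self, index: int, candidate: str, critic: Critic) -> bool:
        """Keep the candidate only if the critic scores it above the current page."""
        current_score = critic(self.pages, index, self.pages[index])
        candidate_score = critic(self.pages, index, candidate)
        accepted = candidate_score > current_score
        self.history.append((index, candidate, candidate_score, accepted))
        if accepted:
            self.pages[index] = candidate
        return accepted


def toy_critic(pages: List[str], index: int, text: str) -> float:
    """Stand-in critic: fraction of words in `text` that also appear on other pages."""
    context = {
        word.lower()
        for i, page in enumerate(pages)
        if i != index
        for word in page.split()
    }
    words = [word.lower() for word in text.split()]
    if not words:
        return 0.0
    return sum(word in context for word in words) / len(words)
```

In practice `toy_critic` would be replaced by a call to a multi-modal model that returns a score. The point of the sketch is only that every proposed change is judged against the whole book before it replaces the current page.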
How we built it
We started from basic Python scripts to work out the prompts and their use cases, then built an interactive UI around them.
Challenges we ran into
Maintaining context while dynamically changing any part of a book was difficult from a prompting perspective, because the critic was easily confused. We had to design its prompt carefully so that its evaluations were fair and sensible.
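One way to keep the critic oriented is to show it the whole book and explicitly mark the page under review. The sketch below illustrates that idea only. The function name `build_critic_prompt`, the prompt wording, and the 1 to 10 scale are assumptions for illustration, not the prompt we actually used, and image inputs are left out.

```python
from typing import List


def build_critic_prompt(pages: List[str], index: int, candidate: str) -> str:
    """Assemble a critic prompt that shows every page and marks the one being changed."""
    if not 0 <= index < len(pages):
        raise IndexError(f"page index {index} out of range")

    lines = [
        "You are a critic reviewing one proposed change to a book.",
        "",
        "BOOK:",
    ]
    for i, page in enumerate(pages):
        # Only the page being edited gets the marker, so the critic knows where to look.
        marker = " (UNDER REVIEW)" if i == index else ""
        lines.append(f"[Page {i + 1}{marker}] {page}")

    lines += [
        "",
        f"PROPOSED REPLACEMENT FOR PAGE {index + 1}:",
        candidate,
        "",
        "Judge only whether the replacement fits the rest of the book better "
        "than the marked page. Reply with a score from 1 to 10 and a one-sentence reason.",
    ]
    return "\n".join(lines)
```

Numbering pages from 1 and repeating the page number in the replacement header gives the critic two consistent anchors for the same location, which is one plausible way to reduce confusion when many pages change over time.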
Accomplishments that we're proud of
Multi-modal critics for multi-modal co-pilots are a very new concept (https://arxiv.org/abs/2410.02712). We are quite happy with our use case and its integration with Pixtral.
What we learned
The nuances of prompting text-only models differ considerably from those of prompting multi-modal models. Most of our time went into adapting to this (and experimenting with Mistral's suite).
What's next for Co-Author
- Building a better UI for ease of use for actual users.
- Integrating more sophisticated prompts to maximize context understanding.
- Fine-tuning models for our use case, especially the critic.
Built With
- mistral
- python
- streamlit