Inspiration
I already had some simple workflows in pure Python for creating, improving, and automating text2image generation, either running locally on a Mac with tools like Ollama, LM Studio, and Draw Things, or on cloud platforms such as Groq and Cloudflare.
I've experimented with various AI agent libraries, including Agno, CrewAI, and DSPy. However, they all focus on LLMs and are not well suited to building multimodal workflows. So I decided to develop a Python library for that purpose.
What it does
It allows you to create multimodal workflows, for example:
- automating image captioning
- improving a "basic" prompt and generating images with text2image models
- analysing an existing image and generating a text2image prompt to create something similar
- automating batch image editing
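To make the idea concrete, here is a minimal sketch of how such a workflow can be chained in plain Python. This is an illustration, not the library's actual API: the `Workflow` class and the two stub steps are hypothetical, and a real step would call a backend such as Ollama or Draw Things instead of returning a placeholder string.

```python
from dataclasses import dataclass
from typing import Callable

# A workflow step takes a payload dict (prompt text, image data, ...)
# and returns an updated payload.
Step = Callable[[dict], dict]

@dataclass
class Workflow:
    steps: list

    def run(self, payload: dict) -> dict:
        # Each step's output feeds the next step's input.
        for step in self.steps:
            payload = step(payload)
        return payload

def improve_prompt(payload: dict) -> dict:
    # Stub: a real step would ask an LLM to enrich the prompt.
    payload["prompt"] += ", highly detailed, studio lighting"
    return payload

def generate_image(payload: dict) -> dict:
    # Stub: a real step would call a text2image backend here.
    payload["image"] = f"<image generated from: {payload['prompt']}>"
    return payload

wf = Workflow(steps=[improve_prompt, generate_image])
result = wf.run({"prompt": "a red bicycle"})
print(result["prompt"])
```

Swapping a stub for a real model call is what the "prompt improvement + generation" workflow above boils down to.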
How we built it
I created a project skeleton using a prompt on chat.qwen.ai. I then used this skeleton as a starting point and partially merged it with some of the code I already had.
Challenges we ran into
Some platforms lack documentation, particularly on how to pass images to them (for image2image).
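For reference, one widely used pattern on OpenAI-compatible chat endpoints is to embed the image as a base64 data URL inside the message content. The sketch below shows only the payload construction (no network call); the dummy `image_bytes` stand in for a real file read such as `open("photo.png", "rb").read()`, and whether a given platform accepts this shape still has to be checked per provider.

```python
import base64

# Dummy bytes standing in for a real PNG file's contents.
image_bytes = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8

# Encode the raw bytes and wrap them in a data URL.
b64 = base64.b64encode(image_bytes).decode("ascii")
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image"},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        },
    ],
}
print(message["content"][1]["image_url"]["url"][:30])
```

The round trip is lossless: decoding the base64 portion of the URL gives back the original bytes.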
Accomplishments that we're proud of
I have developed a first version that works, and I have tested some simple workflows.
What we learned
I have improved some of my Python skills.
What's next for MultimodalAgentsForFlux
Improve the library:
- make it more generic — for now quite a lot of things are hard coded
- try to create other workflows with sound or video
- add more tools, models, and platforms
Built With
- cloudflare
- drawthings
- groq
- lmstudio
- ollama
- python
