Inspiration
I already had some simple workflows in pure Python for creating, improving, and automating text2image generation, either running locally on a Mac with tools like Ollama, LM Studio, and Draw Things, or on cloud platforms such as Groq and Cloudflare.
I've experimented with various AI agent libraries, including Agno, CrewAI, and DSPy. However, they all focus on LLMs and are not well suited to building multimodal workflows. So I decided to develop a Python library for that purpose.
What it does
It allows you to create multimodal workflows, for example:
- automating image captioning
- improving a "basic" prompt and generating images with text2image models
- analysing an existing image and generating a text2image prompt to create something similar
- automating batch image editing
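To make the idea concrete, here is a minimal sketch of how such a workflow can be chained in plain Python. This is an illustration, not the library's actual API: the `Workflow` class and the two stub steps are hypothetical, and a real step would call a backend such as Ollama or Draw Things instead of returning a placeholder string.

```python
from dataclasses import dataclass
from typing import Callable

# A workflow step takes a payload dict (prompt text, image data, ...)
# and returns an updated payload.
Step = Callable[[dict], dict]

@dataclass
class Workflow:
    steps: list

    def run(self, payload: dict) -> dict:
        # Each step's output feeds the next step's input.
        for step in self.steps:
            payload = step(payload)
        return payload

def improve_prompt(payload: dict) -> dict:
    # Stub: a real step would ask an LLM to enrich the prompt.
    payload["prompt"] += ", highly detailed, studio lighting"
    return payload

def generate_image(payload: dict) -> dict:
    # Stub: a real step would call a text2image backend here.
    payload["image"] = f"<image generated from: {payload['prompt']}>"
    return payload

wf = Workflow(steps=[improve_prompt, generate_image])
result = wf.run({"prompt": "a red bicycle"})
print(result["prompt"])
```

Swapping a stub for a real model call is what the "prompt improvement + generation" workflow above boils down to.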
How we built it
I created a project skeleton using a prompt on chat.qwen.ai. I then used this skeleton as a starting point and partially merged it with some of the code I already had.
Challenges we ran into
Some platforms lack documentation, particularly on how to pass images to them (for image2image).
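For reference, one widely used pattern on OpenAI-compatible chat endpoints is to embed the image as a base64 data URL inside the message content. The sketch below shows only the payload construction (no network call); the dummy `image_bytes` stand in for a real file read such as `open("photo.png", "rb").read()`, and whether a given platform accepts this shape still has to be checked per provider.

```python
import base64

# Dummy bytes standing in for a real PNG file's contents.
image_bytes = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8

# Encode the raw bytes and wrap them in a data URL.
b64 = base64.b64encode(image_bytes).decode("ascii")
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image"},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        },
    ],
}
print(message["content"][1]["image_url"]["url"][:30])
```

The round trip is lossless: decoding the base64 portion of the URL gives back the original bytes.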
Accomplishments that we're proud of
I have developed a first version that works, and I have tested some simple workflows.
What we learned
I have improved some of my Python skills.
What's next for MultimodalAgentsForFlux
Improve the library:
- make it more generic — for now quite a lot of things are hard coded
- try to create other workflows with sound or video
- add more tools, models, and platforms
Built With
- cloudflare
- drawthings
- groq
- lmstudio
- ollama
- python
