Inspiration
Image editing with diffusion models is powerful today, but often frustrating: you cannot control an edit in detail or inject your own DIY input, and fine-grained instructions or reference content are hard to describe in text alone.
The truth is that you write a prompt and hope the model guesses what you meant. If it fails, you tweak the prompt, try a different seed, and repeat. The result is unpredictable and completely disconnected from user intent.
So we wanted to change this workflow, and we built **SPICE**. The inspiration behind SPICE is a simple question: "what if you could just show the model exactly what you want, by marking the editing region and quickly sketching the change?"
What it does
SPICE uses a two-layer approach: a "hint layer", where you paint color and structure cues directly onto the original image, and a "mask layer", where you define the edit region.
How we built it
SPICE is an image editing tool built on top of FLUX.1-Fill, a state-of-the-art inpainting model from Black Forest Labs.
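For a rough idea of the inference step, here is a minimal sketch using diffusers' FluxFillPipeline; the prompt, file names, and sampling settings are illustrative placeholders, not our exact configuration.

```python
# Minimal sketch of running FLUX.1-Fill-dev via diffusers' FluxFillPipeline.
# Assumes access to the gated repo has been granted; prompt, paths, and
# sampling parameters below are placeholders for illustration.
import torch
from diffusers import FluxFillPipeline
from PIL import Image

pipe = FluxFillPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Fill-dev", torch_dtype=torch.bfloat16
).to("cuda")

source = Image.open("source_with_hint.png")   # source image with the hint already blended in
mask = Image.open("mask.png").convert("L")    # white = region the model may repaint

result = pipe(
    prompt="follow the painted hint",
    image=source,
    mask_image=mask,
    num_inference_steps=30,
    guidance_scale=30.0,
).images[0]
result.save("edit.png")
```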
Frontend: React + Vite + TypeScript. The canvas editor is built from scratch using the HTML5 canvas API with pointer event handling.
Backend: Node.js + Fastify. The backend receives the three image layers (source, hint, and mask) as base64 data URLs, composites the hint into the source image within the masked region, and dispatches the editing task to a local Python inference script based on PyTorch.
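For illustration, turning one of those base64 data URLs into an image on the Python side can be as simple as the hypothetical helper below (it assumes standard `data:image/png;base64,...` URLs).

```python
# Hypothetical helper: decode a base64 data URL (as sent from the Fastify backend)
# into a PIL image the inference script can work with.
import base64
import io
from PIL import Image

def decode_data_url(data_url: str) -> Image.Image:
    # Drop the "data:image/png;base64," prefix, then decode the payload
    _, payload = data_url.split(",", 1)
    return Image.open(io.BytesIO(base64.b64decode(payload))).convert("RGBA")
```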
Challenges we ran into
Environment config
FLUX.1-Fill-dev is a gated repository on Hugging Face, so we had to request access and configure an environment with packages such as torch, diffusers, and transformers.
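As a sketch, the gated-access part boils down to requesting access on the model page and authenticating before the weights download; `HF_TOKEN` is the standard Hugging Face token environment variable.

```python
# Sketch: authenticate with Hugging Face before downloading the gated weights.
# Assumes access to FLUX.1-Fill-dev has already been granted to your account.
import os
from huggingface_hub import login

login(token=os.environ["HF_TOKEN"])  # or run `huggingface-cli login` once in the shell
```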
Adapting to touch devices
The web version worked fine on desktop, but on phones we found that swiping the screen with the brush active would accidentally paint on the image. This motivated us to add a neutral default state that we call the "Pointer brush".
Inpainting hint blending function
Finding the right Gaussian blur radius and composite mode for the hint-to-source blend required iteration. Too much blur loses the hint's structural information; too little creates a hard seam that confuses the model at mask boundaries.
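A rough sketch of the blend, assuming PIL; the blur radius here is illustrative, since the right value depends on image resolution.

```python
# Sketch of the hint-to-source blend, assuming PIL.
# The blur radius is illustrative; the right value depends on image resolution.
from PIL import Image, ImageFilter

def blend_hint(source: Image.Image, hint: Image.Image, mask: Image.Image,
               blur_radius: float = 4.0) -> Image.Image:
    # Feather the mask so the hint fades into the source instead of ending at a hard seam
    soft_mask = mask.convert("L").filter(ImageFilter.GaussianBlur(blur_radius))
    # Keep the hint where the (feathered) mask is active, the original source elsewhere
    return Image.composite(hint.convert("RGB"), source.convert("RGB"), soft_mask)
```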
Accomplishments that we're proud of
- We render the mask as a transparent overlay, so users can clearly tell apart the region they marked for editing from the region the model will actually edit. This keeps what the user sees aligned with what the model sees.
- End-to-end latency of ~4 seconds per edit on a single consumer-grade GPU makes the iterative workflow feel fluid and responsive.
- The "Use result as new source" feature enables the core SPICE loop of generate, inspect, load result, refine, without any manual file export/import.
What we learned
- How to deploy an image-editing model on the web with a frontend and backend.
- How to configure a PyTorch environment for image-editing work, including pip installs and conda environment creation.
What's next for SPICE
We hope to add more dynamic editing to SPICE, such as vivid facial expression edits, and to bring it to more users.
Oh! Please watch the provided video and try the demo!
Built With
- computervision
- javascript
- python
- react