Inspiration
Vision models have made a lot of progress recently. Is it possible for them to translate images into code?
What it does
Takes an image of a workflow and translates it into executable Python code.
How we built it
Built in Python with (currently) OpenAI models. We use a vision model to interpret the image, then feed its output into GPT-4o, along with a set of system prompts, to generate code. The result is executable code that implements the illustrated workflow.
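A minimal sketch of what this two-stage pipeline could look like with the OpenAI Python SDK. The function names and prompt text are illustrative, not our actual prompts:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_workflow(image_path: str) -> str:
    """Stage 1: ask a vision model to describe the workflow in the image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the steps of the workflow drawn in this image."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def generate_code(description: str) -> str:
    """Stage 2: turn the textual description into runnable Python."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You translate workflow descriptions into runnable Python. Output only code."},
            {"role": "user", "content": description},
        ],
    )
    return resp.choices[0].message.content
```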
Challenges we ran into
Hallucinations - sometimes the generated code is not executable; usually a rerun solves the problem (a simple automated retry is sketched after this list).
Prompt engineering - it took time to arrive at a set of prompts that produce reliable results.
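One simple way to automate the rerun, assuming the `generate_code` helper from the earlier sketch. Note that `compile()` only catches syntax errors, not runtime failures:

```python
def generate_until_executable(description: str, max_tries: int = 3) -> str:
    """Rerun generation until the code at least compiles (hallucination guard)."""
    for attempt in range(max_tries):
        code = generate_code(description)
        try:
            compile(code, "<generated>", "exec")  # cheap syntax check, no execution
            return code
        except SyntaxError:
            continue  # hallucinated/broken code: try again
    raise RuntimeError(f"No executable code after {max_tries} attempts")
```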
Accomplishments that we're proud of
The generated code actually executes.
What we learned
Prompting can only take you so far. We would love to do fine-tuning to get more predictable results.
What's next for NapkinToCode
More complex workflows
The workflow is extensible to other LLM providers (Anthropic, Cohere, Meta), and additional features (audio or video playback) can be added.
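One way the provider swap could look, assuming each provider's official Python SDK; the class names and model strings here are illustrative, not part of the project:

```python
from typing import Protocol

class CodeGenerator(Protocol):
    """Anything that turns a workflow description into code."""
    def generate(self, description: str) -> str: ...

class OpenAIGenerator:
    def __init__(self):
        from openai import OpenAI
        self.client = OpenAI()

    def generate(self, description: str) -> str:
        resp = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": description}],
        )
        return resp.choices[0].message.content

class AnthropicGenerator:
    def __init__(self):
        import anthropic
        self.client = anthropic.Anthropic()

    def generate(self, description: str) -> str:
        resp = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=2048,
            messages=[{"role": "user", "content": description}],
        )
        return resp.content[0].text
```

The rest of the pipeline would depend only on the `CodeGenerator` protocol, so adding a provider means adding one class.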
You can find out more in the slides here: https://docs.google.com/presentation/d/11NCw_F9JUDe-BCJn6XCtJV7cbpcmLKQD/edit?usp=sharing&ouid=109164369295310357886&rtpof=true&sd=true