Inspiration

Vision models have made a lot of progress recently. Is it possible for them to translate images into code?

What it does

NapkinToCode takes an image of a workflow and translates it into executable Python code.

How we built it

Built in Python with (currently) OpenAI models. We use a vision model to interpret the image, then feed its output, along with some system prompts, into gpt-4o to generate code. The result is executable code that implements the illustrated workflow.
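The two-stage pipeline described above could be sketched roughly like this. This is an illustrative sketch, not the project's actual source: the function names, prompts, and the choice to use gpt-4o for both stages are assumptions.

```python
# Hypothetical sketch of the image -> description -> code pipeline.
# `client` is assumed to be an openai.OpenAI instance (pip install openai).

SYSTEM_PROMPT = (
    "You are a code generator. Given a plain-text description of a "
    "workflow, emit executable Python that implements each step."
)

def build_codegen_messages(workflow_description: str) -> list:
    """Assemble the chat messages for the second (code-generation) stage."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": workflow_description},
    ]

def image_to_code(image_url: str, client) -> str:
    # Stage 1: a vision-capable model describes the workflow in the image.
    vision = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this workflow step by step."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    description = vision.choices[0].message.content

    # Stage 2: feed that description into gpt-4o to generate code.
    codegen = client.chat.completions.create(
        model="gpt-4o",
        messages=build_codegen_messages(description),
    )
    return codegen.choices[0].message.content
```

Keeping the description step separate from the generation step makes it easy to swap prompts or models independently at either stage.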

Challenges we ran into

Hallucinations - sometimes the generated code is not executable; usually a rerun solves the problem.
Prompt engineering - it took some time to arrive at a set of prompts that produce reliable results.
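The rerun-on-failure behavior described above could be automated with a small retry loop that syntax-checks each generation before accepting it. This is a hypothetical sketch; the function names and the choice of a compile-only check are assumptions, not the project's code.

```python
# Regenerate when the produced code fails to compile (illustrative only).
def generate_until_compilable(generate, max_attempts=3):
    """Call `generate()` (an LLM code-generation call) until the result
    passes a Python syntax check, rerunning on failure."""
    last_error = None
    for _ in range(max_attempts):
        code = generate()
        try:
            compile(code, "<generated>", "exec")  # syntax check only
            return code
        except SyntaxError as err:
            last_error = err  # hallucinated/broken code: rerun
    raise RuntimeError(
        f"no executable code after {max_attempts} attempts"
    ) from last_error
```

Note that `compile()` only catches syntax errors; runtime failures (missing imports, bad logic) would need an execution-based check instead.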

Accomplishments that we're proud of

The code generated does execute.

What we learned

Prompting can only take you so far. We would love to do fine-tuning to get more predictable results.

What's next for NapkinToCode

More complex workflows

The workflow is extensible to other LLM providers (Anthropic, Cohere, Meta), and additional features (audio or video playback) can be added.
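The provider extensibility mentioned above could be expressed with a small provider interface, so each LLM vendor plugs in behind the same call. This is a hypothetical sketch; the `CodegenProvider` protocol and class names are illustrative, not the project's design.

```python
# A provider-agnostic seam for the code-generation step (illustrative).
from typing import Protocol

class CodegenProvider(Protocol):
    def complete(self, system_prompt: str, user_prompt: str) -> str: ...

class OpenAIProvider:
    """Backs the interface with OpenAI; Anthropic/Cohere/Meta providers
    would implement the same `complete` method."""
    def complete(self, system_prompt: str, user_prompt: str) -> str:
        from openai import OpenAI  # requires `pip install openai`
        client = OpenAI()
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
        )
        return resp.choices[0].message.content

def generate_code(provider: CodegenProvider, workflow_description: str) -> str:
    return provider.complete(
        "Emit executable Python for the described workflow.",
        workflow_description,
    )
```

Because the pipeline depends only on the `complete` interface, adding a new provider means writing one class rather than touching the pipeline itself.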

You can find out more with the slides here: https://docs.google.com/presentation/d/11NCw_F9JUDe-BCJn6XCtJV7cbpcmLKQD/edit?usp=sharing&ouid=109164369295310357886&rtpof=true&sd=true
