Inspiration
Vision models have made a lot of progress recently. Is it possible for them to translate images into code?
What it does
Takes an image of a workflow and translates it into executable Python code.
How we built it
Built in Python with (currently) OpenAI models. We use a vision model to interpret the image, then feed its output into GPT-4o, along with a set of system prompts, to generate code. The result is executable code that implements the illustrated workflow.
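A minimal sketch of what this two-stage pipeline could look like with the OpenAI Python SDK. The function names and prompt text are illustrative, not our actual prompts:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_workflow(image_path: str) -> str:
    """Stage 1: ask a vision model to describe the workflow in the image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the steps of the workflow drawn in this image."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def generate_code(description: str) -> str:
    """Stage 2: turn the textual description into runnable Python."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You translate workflow descriptions into runnable Python. Output only code."},
            {"role": "user", "content": description},
        ],
    )
    return resp.choices[0].message.content
```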
Challenges we ran into
Hallucinations - sometimes the generated code is not executable; usually a rerun solves the problem (a simple automated retry is sketched after this list).
Prompt engineering - it took time to arrive at a set of prompts that produce reliable results.
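One simple way to automate the rerun, assuming the `generate_code` helper from the earlier sketch. Note that `compile()` only catches syntax errors, not runtime failures:

```python
def generate_until_executable(description: str, max_tries: int = 3) -> str:
    """Rerun generation until the code at least compiles (hallucination guard)."""
    for attempt in range(max_tries):
        code = generate_code(description)
        try:
            compile(code, "<generated>", "exec")  # cheap syntax check, no execution
            return code
        except SyntaxError:
            continue  # hallucinated/broken code: try again
    raise RuntimeError(f"No executable code after {max_tries} attempts")
```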
Accomplishments that we're proud of
The generated code actually executes.
What we learned
Prompting can only take you so far. We would love to do fine-tuning to get more predictable results.
What's next for NapkinToCode
More complex workflows
The workflow is extensible to other LLM providers (Anthropic, Cohere, Meta), and additional features (audio or video playback) can be added.
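One way the provider swap could look, assuming each provider's official Python SDK; the class names and model strings here are illustrative, not part of the project:

```python
from typing import Protocol

class CodeGenerator(Protocol):
    """Anything that turns a workflow description into code."""
    def generate(self, description: str) -> str: ...

class OpenAIGenerator:
    def __init__(self):
        from openai import OpenAI
        self.client = OpenAI()

    def generate(self, description: str) -> str:
        resp = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": description}],
        )
        return resp.choices[0].message.content

class AnthropicGenerator:
    def __init__(self):
        import anthropic
        self.client = anthropic.Anthropic()

    def generate(self, description: str) -> str:
        resp = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=2048,
            messages=[{"role": "user", "content": description}],
        )
        return resp.content[0].text
```

The rest of the pipeline would depend only on the `CodeGenerator` protocol, so adding a provider means adding one class.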
You can find out more in the slides here: https://docs.google.com/presentation/d/11NCw_F9JUDe-BCJn6XCtJV7cbpcmLKQD/edit?usp=sharing&ouid=109164369295310357886&rtpof=true&sd=true