Inspiration
Our project, A Picture is Worth a Thousand Words, was inspired by our desire to tackle two major challenges when working with large language models: reducing token costs and making complex conversations easier to understand. We observed that traditional text-based chat consumes tokens quickly, driving up costs for long interactions. At the same time, navigating lengthy conversations can be overwhelming, and it’s often difficult to see the structure and flow of the dialogue. By combining visual representation with efficient token usage, we set out to make conversations both cheaper and clearer.
What it does
The platform solves both problems through a set of integrated features. First, it converts conversations into image-based context, drastically reducing token usage. Users can then explore these interactions via interactive conversation tree visualizations, which reveal the structure and progression of the dialogue. Real-time token analytics track usage and potential savings, which can range from 60-80%. Additional capabilities include exporting conversation trees and managing conversation branches for easier navigation.
How it works:
- Retrieves conversation path
- Converts to image
- Sends to Gemini Vision API
- Calculates savings
- Updates tree
- Token Savings Formula: Text: (Characters + Overhead) ÷ 4 Image: ⌈Height ÷ 768⌉ × 258 Savings: Text - Image
How we built it
Tech Stack:
- Backend: Python, FastAPI, Gemini API, Google Cloud, PIL, Pydantic
- Frontend: React, TypeScript, React Flow, Vite, React Markdown ###Architecture Diagram: React UI → FastAPI → Gemini Vision API
Key Components:
Image Context Storage (gemini_image_context_chat.py) API Server (api_server.py) Conversation Tree UI (Context_Trees.tsx) Analytics Dashboard (TokenAnalytics.tsx)
Challenges we ran into
Accurately calculating token usage, particularly for short conversations, was tricky. Optimizing images to efficiently store dense text while remaining readable proved to be more complex than we thought.
Accomplishments that we're proud of
Despite these challenges, the project achieved major milestones. A Picture is Worth a Thousand Words delivers 60-80% token cost savings, tracks real-dollar costs in real time, and provides interactive conversation trees that make exploring and understanding complex dialogues intuitive and engaging.
What we learned
We learned how vision tokens (768×768 tiles, 258 tokens each) can encode large amounts of text efficiently. Most importantly, we confirmed that scalability is critical: small efficiencies multiply rapidly in long conversations, translating to meaningful real-world cost savings.
What's next for a Picture is Worth a Thousand Words
On the efficiency side, we plan to explore smarter image compression and adaptive tiling, inspired by OCR research, to further reduce token usage without sacrificing accuracy. On the visualization side, we aim to enhance conversation tree intelligence: automatically summarizing branches, highlighting decision points, and predicting likely next interactions, drawing from neural tree search techniques.


Log in or sign up for Devpost to join the conversation.