A Picture is Worth a Thousand Words

Inspiration

Our project, A Picture is Worth a Thousand Words, was inspired by our desire to tackle two major challenges when working with large language models: reducing token costs and making complex conversations easier to understand. We observed that traditional text-based chat consumes tokens quickly, driving up costs for long interactions. At the same time, navigating lengthy conversations can be overwhelming, and it’s often difficult to see the structure and flow of the dialogue. By combining visual representation with efficient token usage, we set out to make conversations both cheaper and clearer.

What it does

The platform solves both problems through a set of integrated features. First, it converts conversations into image-based context, drastically reducing token usage. Users can then explore them through interactive conversation tree visualizations, which reveal the structure and progression of the dialogue. Real-time token analytics track usage and potential savings, which can range from 60% to 80%. Additional capabilities include exporting conversation trees and managing conversation branches for easier navigation.

How it works:

Retrieves conversation path
Converts to image
Sends to Gemini Vision API
Calculates savings
Updates tree
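
The first step, retrieving the conversation path, can be sketched as a walk from the selected message back to the root of the tree. This is a minimal illustration, not the code in gemini_image_context_chat.py: the `Node` fields and the `conversation_path` helper are hypothetical names, and it assumes each message stores the id of its parent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One message in the conversation tree (hypothetical shape)."""
    node_id: str
    role: str                        # e.g. "user" or "model"
    text: str
    parent_id: Optional[str] = None  # None marks the root message


def conversation_path(nodes: dict[str, Node], leaf_id: str) -> list[Node]:
    """Return the messages from the root down to `leaf_id`, in order."""
    path: list[Node] = []
    seen: set[str] = set()
    current: Optional[str] = leaf_id
    while current is not None:
        if current in seen:
            # Guard against malformed data that would loop forever.
            raise ValueError(f"cycle detected at node {current!r}")
        seen.add(current)
        node = nodes[current]
        path.append(node)
        current = node.parent_id
    path.reverse()  # collected leaf-to-root, returned root-to-leaf
    return path
```

Sibling branches share their ancestors, so two branches that fork from the same message return the same prefix and differ only after the fork.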

Token Savings Formula:

Text: (Characters + Overhead) ÷ 4
Image: ⌈Height ÷ 768⌉ × 258
Savings: Text - Image
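
The formula above translates directly into a few lines of Python. This is a sketch for illustration only: the function names are hypothetical, the constants come straight from the formula, and the text estimate is left unrounded because the formula does not say how to round.

```python
import math

CHARS_PER_TOKEN = 4    # from the formula: text tokens = (chars + overhead) / 4
TILE_HEIGHT_PX = 768   # from the formula: one tile per 768 px of image height
TOKENS_PER_TILE = 258  # from the formula: 258 tokens per tile


def text_tokens(characters: int, overhead: int = 0) -> float:
    """Estimated tokens when the conversation is sent as plain text."""
    return (characters + overhead) / CHARS_PER_TOKEN


def image_tokens(height_px: int) -> int:
    """Estimated tokens when the conversation is sent as one image."""
    return math.ceil(height_px / TILE_HEIGHT_PX) * TOKENS_PER_TILE


def token_savings(characters: int, height_px: int, overhead: int = 0) -> float:
    """Text estimate minus image estimate; negative means the image costs more."""
    return text_tokens(characters, overhead) - image_tokens(height_px)
```

With made-up inputs, `token_savings(8000, 1536)` gives 1484.0 (2,000 text tokens against 516 image tokens), while `token_savings(400, 768)` gives -158.0. Under this formula, a short exchange can cost more as an image than as text, because even one tile costs 258 tokens.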

How we built it

Tech Stack:

Backend: Python, FastAPI, Gemini API, Google Cloud, PIL, Pydantic
Frontend: React, TypeScript, React Flow, Vite, React Markdown

Architecture Diagram:

React UI → FastAPI → Gemini Vision API

Key Components:

Image Context Storage (gemini_image_context_chat.py)
API Server (api_server.py)
Conversation Tree UI (Context_Trees.tsx)
Analytics Dashboard (TokenAnalytics.tsx)

Challenges we ran into

Accurately calculating token usage, particularly for short conversations, was tricky. Optimizing images to store dense text efficiently while keeping it readable proved more complex than we expected.

Accomplishments that we're proud of

Despite these challenges, the project achieved major milestones. A Picture is Worth a Thousand Words delivers 60-80% token cost savings, tracks real-dollar costs in real time, and provides interactive conversation trees that make exploring and understanding complex dialogues intuitive and engaging.

What we learned

We learned how vision tokens (768×768 tiles, 258 tokens each) can encode large amounts of text efficiently. Most importantly, we confirmed that scalability is critical: small efficiencies multiply rapidly in long conversations, translating to meaningful real-world cost savings.
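
To illustrate how savings compound, the sketch below totals tokens over a whole conversation, reusing the text and image formulas from "How it works". It assumes that every turn resends the full history, either as text or as one image of the whole path. The function name and the per-turn character and pixel figures are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def cumulative_tokens(turns: int, chars_per_turn: int, px_per_turn: int) -> tuple[float, int]:
    """Total (text, image) tokens sent over `turns` requests.

    Assumption: request k carries the full history so far, i.e.
    k * chars_per_turn characters as text, or an image that is
    k * px_per_turn pixels tall.
    """
    text_total = 0.0
    image_total = 0
    for k in range(1, turns + 1):
        text_total += (k * chars_per_turn) / 4                   # text formula
        image_total += math.ceil(k * px_per_turn / 768) * 258    # image formula
    return text_total, image_total
```

With 2,000 characters and 384 pixels of image height added per turn (made-up values), `cumulative_tokens(1, 2000, 384)` returns `(500.0, 258)` and `cumulative_tokens(10, 2000, 384)` returns `(27500.0, 7740)`. The gap grows from 242 tokens after one turn to 19,760 after ten.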

What's next for A Picture is Worth a Thousand Words

On the efficiency side, we plan to explore smarter image compression and adaptive tiling, inspired by OCR research, to further reduce token usage without sacrificing accuracy. On the visualization side, we aim to enhance conversation tree intelligence: automatically summarizing branches, highlighting decision points, and predicting likely next interactions, drawing from neural tree search techniques.