🎨 StoryFrame AI - Project Story

💡 The Spark of Inspiration

The idea for StoryFrame AI was born during a quiet evening when I was reading a literary novel. As I immersed myself in the narrative, I found myself constantly pausing to visualize the scenes in my mind. "What if I could actually see these moments?" I wondered. The descriptions were vivid, but I craved something more: a visual representation that could bring the story to life.

I thought: "How can I make this visualization happen for anyone, anywhere, while reading anything online?"

For months, this idea sat dormant. I lacked the technical knowledge and tools to make it a reality. But everything changed when Google announced Gemini Nano and Gemini 2.5 Flash Image. The combination of on-device AI for privacy-focused text analysis and cost-effective cloud image generation was the breakthrough I needed.

I realized: This is possible now. And not just for novels, but for articles, PDFs, educational content, and any text a user encounters online. That's when StoryFrame AI was born.


🛠️ How I Built It

Tech Stack & Architecture

StoryFrame AI is a Chrome Extension built with a hybrid AI architecture that leverages both client-side and cloud-based models:

Frontend

  • HTML5, CSS3, JavaScript (ES6+) - Pure vanilla JS, no frameworks
  • Chrome Extension APIs - Manifest V3, Side Panel API, Context Menus API, Storage API
  • Modern UI/UX - Gradient-based design with smooth animations and progressive loading

AI & APIs

  1. Gemini Nano (Prompt API) - On-device text analysis and scene breakdown
  2. Summarizer API - On-device text summarization for long content (70+ words)
  3. Gemini 2.5 Flash - Cloud-based prompt generation fallback
  4. Gemini 2.5 Flash Image - Primary cloud image generation
  5. OpenRouter - Alternative API gateway for broader model access

Key Features

  • Multi-tab support with per-tab state management
  • Complete history system with local storage
  • Smart text processing with automatic summarization
  • Fallback mechanisms for reliability
  • Context menu integration that works even on PDFs

Architecture Decisions

The hybrid approach was crucial:

User selects text
    ↓
[On-Device] Gemini Nano analyzes narrative structure (privacy-first)
    ↓
[On-Device] Summarizer API condenses long text (if needed)
    ↓
[Cloud] Gemini 2.5 Flash Image generates comic-style visualization
    ↓
User views multi-panel story in side panel

Why this architecture?

  • Privacy: Sensitive text analysis happens locally
  • Quality: Cloud models produce high-quality images
  • Cost: Efficient resource usage with on-device processing
  • Reliability: Multiple fallback paths keep the extension working when one model fails

🎓 What I Learned

1. Chrome Built-in AI is a Game-Changer

Before this project, I had no idea Gemini Nano existed. I had been using Ollama and other local models for sensitive file processing, running heavy LLMs on my machine. Discovering that Chrome now has built-in AI capabilities was mind-blowing:

  • Text analysis runs inside the browser, with no separate model runtime to install
  • No GPU required on the user's machine
  • Privacy-preserving by design
  • Seamless integration with web content

This opened my eyes to the future of client-side AI.

2. Chrome Extension Development is Complex (But Rewarding)

This was my first-ever Chrome extension, and the learning curve was steep:

Challenge #1: Service Workers vs. Background Scripts

  • Manifest V3 requires service workers (no persistent background pages)
  • Had to completely rethink state management
  • Learned about importScripts() vs. ES6 modules
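Because a Manifest V3 service worker can be stopped while idle, anything held in plain variables can disappear between events. A minimal sketch of one way to handle this is to keep state in a `chrome.storage` area instead. The key name `storyframeState` and the helper names are hypothetical, and `storageArea` is assumed to be any object with the promise-based `get`/`set` shape of `chrome.storage.session` or `chrome.storage.local`.

```javascript
// Hypothetical key name; the project's real key may differ.
const STATE_KEY = 'storyframeState';

// Read the persisted state from a chrome.storage-style area.
async function loadState(storageArea) {
  const stored = await storageArea.get(STATE_KEY);
  return stored[STATE_KEY] ?? {};
}

// Merge a partial update into the persisted state and write it back.
async function updateState(storageArea, patch) {
  const next = { ...(await loadState(storageArea)), ...patch };
  await storageArea.set({ [STATE_KEY]: next });
  return next;
}

// In-memory stand-in with the same get/set shape, for trying this outside Chrome.
function createMemoryArea() {
  const data = {};
  return {
    async get(key) {
      return key in data ? { [key]: data[key] } : {};
    },
    async set(items) {
      Object.assign(data, items);
    },
  };
}
```

Inside the extension, the same helpers would be called as `updateState(chrome.storage.session, { status: 'analyzing' })`, so each service worker wake-up starts from whatever the previous one saved.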

Challenge #2: PDF Interaction

  • Chrome loads PDFs in a highly secure sandboxed environment
  • Content scripts cannot inject into PDF viewers
  • Discovered this limitation after hours of debugging
  • Solution: Created a workaround using chrome.storage.local to pass data and context menus for triggering
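The PDF workaround can be sketched roughly as follows. The context menu's click event already carries the selected text, so no content script has to run inside the PDF viewer; the handler only stores the selection for the side panel to pick up. The menu id, storage key, and function name here are hypothetical, and the `chrome` object is passed in as `api` so the wiring can be tried with a stub.

```javascript
// Hypothetical identifiers, not taken from the project's source.
const MENU_ID = 'storyframe-visualize';
const PENDING_KEY = 'pendingSelection';

// `api` is the global `chrome` object in a real extension.
function registerPdfFriendlyMenu(api) {
  // Create the menu entry once, when the extension is installed or updated.
  api.runtime.onInstalled.addListener(() => {
    api.contextMenus.create({
      id: MENU_ID,
      title: 'Visualize with StoryFrame AI',
      contexts: ['selection'],
    });
  });

  api.contextMenus.onClicked.addListener(async (info, tab) => {
    if (info.menuItemId !== MENU_ID || !info.selectionText) return;
    // The browser supplies the selected text, so nothing is injected into the page.
    await api.storage.local.set({
      [PENDING_KEY]: { text: info.selectionText, tabId: tab?.id, at: Date.now() },
    });
  });
}
```

In the service worker this would be called as `registerPdfFriendlyMenu(chrome)`. The side panel can then read `pendingSelection` from `chrome.storage.local`, or listen on `chrome.storage.onChanged`, to start processing.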

Challenge #3: Side Panel API

  • Relatively new API with limited documentation
  • Had to experiment extensively to get it working across different scenarios
  • Learned about window/tab-specific panel opening restrictions

Challenge #4: Cross-Origin Restrictions

  • Can't directly manipulate certain web contexts
  • Had to design message-passing architecture between content scripts, background service worker, and side panel
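One way to structure that message passing is a small router that maps a message `type` to a handler and always answers with the same envelope. This is only a sketch: the `{ type, payload }` message shape, the `{ ok, result | error }` reply shape, and the name `createMessageRouter` are assumptions, not the project's actual protocol.

```javascript
// Build a listener suitable for chrome.runtime.onMessage from a table of handlers.
function createMessageRouter(handlers) {
  return (message, sender, sendResponse) => {
    const type = message?.type;
    if (!Object.hasOwn(handlers, type)) return false; // not ours; let other listeners answer

    Promise.resolve()
      .then(() => handlers[type](message.payload, sender))
      .then((result) => sendResponse({ ok: true, result }))
      .catch((error) => sendResponse({ ok: false, error: String(error?.message ?? error) }));

    return true; // keep the channel open until sendResponse runs asynchronously
  };
}
```

In the service worker it would be attached with `chrome.runtime.onMessage.addListener(createMessageRouter({ ... }))`, and a content script or the side panel would call `chrome.runtime.sendMessage({ type, payload })` and inspect `ok` on the reply.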

3. AI Model Limitations & Workarounds

Gemini Nano Context Window Issue:

The biggest technical challenge was Gemini Nano's limited context window and processing speed:

  • Maximum ~3,000 characters (after summarization)
  • Takes 60-180 seconds for complex prompts
  • Performance varies significantly based on device hardware
  • Even on my high-end CPU, it struggled with longer texts

My Solution: Multi-Layered Fallback System

// Intelligent fallback hierarchy
1. Try Gemini Nano (80-second timeout)
   ↓ [if timeout/error]
2. Use Summarizer API to condense text
   ↓ [if still too long]
3. Fall back to Gemini 2.5 Flash (cloud)
   ↓ [if Google AI fails]
4. Use OpenRouter as final fallback

This ensures reliability without sacrificing user experience.
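The hierarchy above can be sketched as an ordered list of providers, each with its own time budget. The names `withTimeout` and `generateWithFallback` and the `{ name, run, timeoutMs }` provider shape are hypothetical; the summarization step could be folded into a provider's `run` function. Note that the timeout only stops waiting for a provider and does not cancel work that is already running.

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Try each provider in order and return the first successful result.
async function generateWithFallback(text, providers) {
  const failures = [];
  for (const { name, run, timeoutMs } of providers) {
    try {
      const attempt = Promise.resolve().then(() => run(text));
      const result = await withTimeout(attempt, timeoutMs, name);
      return { provider: name, result };
    } catch (error) {
      failures.push(`${name}: ${error.message}`);
    }
  }
  throw new Error(`All providers failed (${failures.join('; ')})`);
}
```

A call might pass providers named for Gemini Nano, Gemini 2.5 Flash, and OpenRouter in that order, with the timeouts taken from configuration rather than hard-coded.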

4. The Genre-Agnostic Challenge

Initially, my prompts forced a "funny, comedic" style on ALL content. This was terrible for serious literature, dramatic scenes, or horror stories.

Learning: AI prompt engineering requires context-aware instructions. I refactored the system prompts to:

  • First analyze the text's genre and tone
  • Adapt visual style, colors, and expressions accordingly
  • Match the original narrative's emotional intent
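As an illustration of that refactor, the image prompt can be assembled from a detected tone rather than a fixed comedic style. The style table, the wording of the instructions, and the name `buildComicPrompt` are all hypothetical; the project's real prompts are not shown in this write-up. The tone is assumed to come from an earlier analysis step.

```javascript
// Hypothetical style table keyed by detected tone.
const STYLE_BY_TONE = {
  comedic: 'bright colors, exaggerated expressions, playful linework',
  dramatic: 'muted palette, strong shadows, restrained expressions',
  horror: 'dark desaturated palette, heavy contrast, uneasy framing',
  romantic: 'warm soft lighting, gentle expressions',
};
const DEFAULT_STYLE = 'neutral palette, naturalistic expressions';

// Build one prompt describing a single multi-panel comic page.
function buildComicPrompt({ tone, scenes }) {
  const style = Object.hasOwn(STYLE_BY_TONE, tone) ? STYLE_BY_TONE[tone] : DEFAULT_STYLE;
  const panels = scenes.map((scene, i) => `Panel ${i + 1}: ${scene}`);
  return [
    `Create one comic page with ${scenes.length} panels.`,
    `Visual style: ${style}.`,
    'Match the emotional intent of the source text; do not add humor unless the text is humorous.',
    ...panels,
  ].join('\n');
}
```

Unknown tones fall back to a neutral style instead of defaulting to comedy, which is the failure mode described above.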

🚧 Challenges I Faced

Challenge #1: Image Generation Cost 💸

The biggest barrier was cost. While Gemini Nano is free (it runs on-device), Gemini 2.5 Flash Image can only be used once billing is enabled on your Google Cloud project.

Current State:

  • Google AI Studio requires billing for image generation
  • No generous free tier like the one offered for text models
  • This limits accessibility for users without API credits

My Workaround:

  • Added OpenRouter support as an alternative
  • Used my existing OpenRouter credits during development
  • Created dual-API architecture so users can choose

Future Hope: I believe (and hope) that as Gemini models mature, Google will make image generation more accessible, potentially offering a free tier similar to text models. Once that happens, StoryFrame AI will become truly accessible to everyone.

Challenge #2: Gemini Nano Performance ⏱️

Even with high-end hardware, Gemini Nano is:

  • Slow: 60-180 seconds for prompt generation
  • Context-limited: Can't handle long texts directly
  • Unpredictable: Timeout issues on complex prompts

My Solution:

  1. Implemented Summarizer API to pre-process long texts
  2. Increased timeout to 180 seconds (3 minutes)
  3. Optimized session parameters (temperature: 0.7, topK: 30)
  4. Added comprehensive cloud fallback to Gemini 2.5 Flash
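A sketch of steps 2 and 3, assuming the Prompt API surface documented by Chrome at the time of writing (`LanguageModel.availability()`, `LanguageModel.create()`, `session.prompt()`, `session.destroy()`), which may change between Chrome versions. The function name `promptNano` and the injectable `languageModel` option are hypothetical; the injection exists only so the logic can be tried with a stub.

```javascript
// Prompt Gemini Nano with explicit sampling parameters and a hard time limit.
async function promptNano(
  text,
  { languageModel = globalThis.LanguageModel, timeoutMs = 180_000 } = {}
) {
  if (!languageModel) throw new Error('Prompt API is not available in this context');

  const availability = await languageModel.availability();
  if (availability === 'unavailable') {
    throw new Error('Gemini Nano is unavailable on this device');
  }

  // temperature and topK are the values quoted in the list above.
  const session = await languageModel.create({ temperature: 0.7, topK: 30 });
  try {
    // AbortSignal.timeout aborts the prompt once the time budget is spent.
    return await session.prompt(text, { signal: AbortSignal.timeout(timeoutMs) });
  } finally {
    session.destroy(); // release the on-device session whether or not the prompt succeeded
  }
}
```

When this function rejects, whether from a timeout or an unavailable model, the caller can move on to the cloud fallback.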

Trade-off: While I wanted to keep everything on-device for privacy, reliability had to take priority. The hybrid approach ensures users always get results, even if it means using cloud APIs.

Challenge #3: Multi-Tab State Management

Initially, the extension used global state variables, causing chaos when users opened multiple tabs:

  • State would overwrite between tabs
  • Prompts from different stories would mix
  • Results would appear in the wrong tab

Solution: Implemented per-tab state management using a Map<tabId, state> structure, ensuring complete independence between tabs.
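A minimal sketch of that per-tab structure follows. The state fields (`selectedText`, `prompt`, `imageUrl`, `status`) and the helper names are hypothetical and stand in for whatever the extension actually tracks.

```javascript
// One independent state object per tab, keyed by tabId.
const tabStates = new Map();

function createInitialState() {
  return { selectedText: '', prompt: null, imageUrl: null, status: 'idle' };
}

// Return the state for a tab, creating it on first access.
function getTabState(tabId) {
  if (!tabStates.has(tabId)) tabStates.set(tabId, createInitialState());
  return tabStates.get(tabId);
}

// Apply a partial update to one tab without touching any other tab.
function updateTabState(tabId, patch) {
  const next = { ...getTabState(tabId), ...patch };
  tabStates.set(tabId, next);
  return next;
}

// Intended to be called from chrome.tabs.onRemoved so closed tabs do not leak state.
function clearTabState(tabId) {
  return tabStates.delete(tabId);
}
```

Every read and write goes through a `tabId`, so a result arriving for one tab cannot overwrite what another tab is showing.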


🔬 Technical Deep Dive

Why Summarizer API?

For texts over 70 words, I use the Summarizer API to:

  • Preserve key narrative elements (characters, emotions, actions)
  • Reduce token count for faster Nano processing
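  • Maintain story coherence for visualization

The summarizer is created with these options:

// Must run inside an async function or an ES module (top-level await)
const summarizer = await self.Summarizer.create({
  type: 'key-points',    // extract the main points rather than a teaser or headline
  format: 'plain-text',  // plain text output instead of markdown
  length: 'medium',
  sharedContext: 'Preserve complete narrative story...'
});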
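The 70-word gate in front of the summarizer might look like the sketch below. The helper names are hypothetical, and treating exactly 70 words as "short" is an assumption, since the write-up only says "over 70 words". `summarize` is any async function from text to shorter text, for example a thin wrapper around `summarizer.summarize(text)`.

```javascript
const WORD_THRESHOLD = 70; // texts longer than this are summarized first

// Count whitespace-separated words.
function countWords(text) {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

// Pass short text through unchanged; condense long text with the given summarizer.
async function condenseIfLong(text, summarize) {
  if (countWords(text) <= WORD_THRESHOLD) {
    return { text, summarized: false };
  }
  return { text: await summarize(text), summarized: true };
}
```

Returning the `summarized` flag lets the caller log or display whether the visualization was based on the full selection or on a condensed version.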

Why Multi-Panel Comics?

Instead of generating separate images, I create one image with multiple panels because:

  • Maintains story continuity visually
  • Reduces API calls and cost
  • Creates a cohesive narrative flow
  • Mimics actual comic book reading experience

🎯 What Makes StoryFrame AI Special

  1. Privacy-First: Text analysis happens on your device
  2. Context-Aware: Adapts to any genre (comedy, drama, horror, romance)
  3. Reliable: Multiple fallback mechanisms keep it working when one model fails
  4. Accessible: Works on PDFs, articles, novels, any text online
  5. Complete History: Track and download all your visualizations
  6. Multi-Tab Support: Use independently across browser tabs

🚀 Future Vision

Once Gemini 2.5 Flash Image becomes more affordable/free, StoryFrame AI will:

  • Remove cost barriers for users
  • Enable unlimited creative visualization
  • Become a standard tool for readers, educators, and creators
  • Potentially support real-time visualization as you scroll

🙏 Acknowledgments

This project wouldn't exist without:

  • Google's Chrome Built-in AI Challenge for inspiring innovation
  • Gemini Nano for bringing AI to the browser
  • The Chrome team for powerful extension APIs
  • OpenRouter for alternative API access during development

Built with ❤️ for the Google Chrome AI Challenge 2025

"Making stories visible, one panel at a time."
