🎨 ChalkAI: From Rough Sketches to Professional Diagrams

The Spark of Inspiration

Have you ever sketched a brilliant idea on a whiteboard, only to realize it looked nothing like what you envisioned? We've all been there: frantically drawing flowcharts during meetings, creating system architecture diagrams that resemble abstract art, or struggling to make our hand-drawn concepts look presentable.

This frustration became the catalyst for ChalkAI. We wanted to bridge the gap between the raw creativity of freehand sketching and the polished professionalism of digital diagrams. The question was simple: What if AI could understand our messy sketches and transform them into publication-ready diagrams?

The idea crystallized when we discovered that while design tools like Figma and diagram software like Lucidchart exist, they often interrupt the creative flow. You spend more time fighting with alignment tools and choosing the right connector than actually thinking. We wanted something different: a canvas where you could think freely, sketch naturally, and let AI handle the refinement.

What We Learned

1. The Power of Multimodal AI

Our biggest revelation came from working with Google Gemini 2.5 Flash. We initially underestimated how well modern AI could interpret visual information combined with textual context. The model doesn't just see pixels; it understands intent.

For example, when given a rough sketch with the prompt "authentication flow," Gemini recognizes:

  • Boxes as process steps
  • Arrows as data flow
  • Relative positioning as sequence importance
  • Even incomplete shapes as intentional elements

Informally, this can be viewed as a function that maps a sketch and a text prompt to a diagram:

$$f : (I, T) \mapsto D$$

Where:

  • $I$ represents the input image (sketch)
  • $T$ represents the textual intent
  • $D$ is the output diagram
  • $f$ is the Gemini model's transformation function

2. Voice Input Changes Everything

Adding voice recognition wasn't in our original plan, but it became one of the most valuable features. We learned that users think differently when speaking versus typing. Voice descriptions tend to be:

  • More natural and conversational
  • Richer in context
  • Faster to input

We implemented an idle detection system with a 4.5-second threshold ($t_{\text{idle}} = 4500\,\text{ms}$), which we found through experimentation to be the best balance between recognition accuracy and wait time. As an informal heuristic rather than a measured metric:

$$\text{UX Quality} = \frac{\text{Recognition Accuracy}}{\text{Wait Time}}$$

Too short, and users feel rushed. Too long, and it breaks their flow.
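The idle check itself can be sketched without any browser APIs. The block below is a minimal illustration, assuming timestamps in milliseconds; the `IdleTracker` class and its method names are hypothetical and not taken from the ChalkAI codebase.

```typescript
// Idle threshold from the article: 4.5 seconds.
const IDLE_THRESHOLD_MS = 4500;

// Hypothetical helper that remembers when speech was last heard.
class IdleTracker {
  private lastActivityMs: number;
  private readonly thresholdMs: number;

  constructor(startMs: number, thresholdMs: number = IDLE_THRESHOLD_MS) {
    this.lastActivityMs = startMs;
    this.thresholdMs = thresholdMs;
  }

  // Call whenever a new interim or final speech result arrives.
  touch(nowMs: number): void {
    this.lastActivityMs = nowMs;
  }

  // True once the silence has lasted at least the threshold.
  isIdle(nowMs: number): boolean {
    return nowMs - this.lastActivityMs >= this.thresholdMs;
  }
}

const tracker = new IdleTracker(0);
tracker.touch(1000); // last speech heard at t = 1000 ms
console.log(tracker.isIdle(5000)); // false: only 4000 ms of silence
console.log(tracker.isIdle(5500)); // true: 4500 ms of silence
```

Passing the clock in as an argument keeps the logic easy to test; in the app, a timer callback would play the role of the second call.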

3. The Selection Context Problem

One of our most interesting challenges was handling partial canvas refinement. Initially, we processed the entire canvas every time. But users wanted to refine just a portion, say, fixing one flowchart branch while keeping others intact.

We implemented selection-aware export, which creates a bounded export based on user selection:

$$\text{Export Region} = \begin{cases} \text{Selection Bounds} & \text{if shapes selected} \\ \text{Full Canvas} & \text{otherwise} \end{cases}$$

This seemingly small feature dramatically improved the user experience.
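As a rough sketch of that rule, the export region can be computed as the union of the selected shapes' bounding boxes, falling back to the full canvas when nothing is selected. The `Box` type and both function names are assumptions made for illustration; the real implementation presumably relies on tldraw's own bounds utilities.

```typescript
// Axis-aligned rectangle in canvas coordinates (hypothetical type for this sketch).
interface Box {
  x: number;
  y: number;
  w: number;
  h: number;
}

// Smallest rectangle that contains every box in a non-empty list.
function unionBoxes(boxes: Box[]): Box {
  const minX = Math.min(...boxes.map((b) => b.x));
  const minY = Math.min(...boxes.map((b) => b.y));
  const maxX = Math.max(...boxes.map((b) => b.x + b.w));
  const maxY = Math.max(...boxes.map((b) => b.y + b.h));
  return { x: minX, y: minY, w: maxX - minX, h: maxY - minY };
}

// Selection bounds if any shapes are selected, otherwise the full canvas.
function chooseExportRegion(selected: Box[], fullCanvas: Box): Box {
  return selected.length > 0 ? unionBoxes(selected) : fullCanvas;
}

const canvasBox: Box = { x: 0, y: 0, w: 1000, h: 800 };
const selectedBoxes: Box[] = [
  { x: 0, y: 0, w: 10, h: 10 },
  { x: 20, y: 5, w: 10, h: 10 },
];
console.log(chooseExportRegion(selectedBoxes, canvasBox)); // { x: 0, y: 0, w: 30, h: 15 }
console.log(chooseExportRegion([], canvasBox)); // the full canvas
```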

4. Real-Time Feedback is Non-Negotiable

We learned that users need immediate visual confirmation. Our initial implementation had a several-second black hole between clicking "Generate" and seeing results. This created anxiety and made users unsure if anything was happening.

We added:

  • Loading states with glassmorphism effects
  • Preview panels that appear before the full generation completes
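  • Smooth Framer Motion animations with spring physics, where the position $x(t)$ is pulled toward its target $x_1$ by a damped spring with stiffness $k$, damping $c$, and mass $m$:

$$m\,\ddot{x}(t) = -k\,\bigl(x(t) - x_1\bigr) - c\,\dot{x}(t)$$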

The difference in perceived performance was remarkable, even though actual processing time remained the same.

How We Built It

Architecture Overview

ChalkAI is built on a modern Next.js 16 App Router architecture with three core layers:

┌─────────────────┐
│   Frontend UI   │ ← React 19 + TypeScript
├─────────────────┤
│   API Routes    │ ← Next.js Edge Functions
├─────────────────┤
│   AI Service    │ ← Vercel AI SDK + Gemini
└─────────────────┘

1. The Canvas Layer (tldraw Integration)

We chose tldraw v4 as our whiteboard foundation because it provides:

  • Production-ready drawing primitives
  • Built-in undo/redo history
  • Performant rendering
  • Extensible architecture

The core integration in whiteboard.tsx:

const [editor, setEditor] = useState<Editor | null>(null);

const handleMount = useCallback((editor: Editor) => {
  setEditor(editor);
}, []);

return (
  <Tldraw 
    onMount={handleMount}
    // ... configuration
  />
);

2. The AI Pipeline

Our diagram refinement pipeline follows this flow:

Step 1: Canvas Export

const imageData = await exportToBase64(editor);
// Uses tldraw's native export with custom bounds

Step 2: Prompt Construction

We discovered that prompt engineering was crucial. Our final prompt structure:

SYSTEM: You are a professional diagram designer...
USER: [Base64 Image] + "Create a refined version of: {user_intent}"

The prompt includes specific constraints:

  • Output dimensions: 1024×1024 px
  • Style guidelines: clean, minimal, professional
  • Color palette restrictions
  • Text rendering requirements
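The prompt structure and constraints above could be assembled along these lines. This is only a sketch: the function name, the `PromptParts` type, and the exact wording of the system text and constraint lines are illustrative assumptions, not the prompt ChalkAI actually ships.

```typescript
// Hypothetical container for the two prompt parts described above.
interface PromptParts {
  system: string;
  user: string;
}

// Builds a system prompt with the listed constraints and a user prompt
// that wraps the user's stated intent. All wording here is illustrative.
function buildDiagramPrompt(userIntent: string): PromptParts {
  const constraints = [
    "Output dimensions: 1024x1024 px",
    "Style: clean, minimal, professional",
    "Use a restricted color palette",
    "Render all text legibly",
  ];
  const constraintLines = constraints.map((c) => `- ${c}`).join("\n");
  return {
    system: `You are a professional diagram designer.\nConstraints:\n${constraintLines}`,
    user: `Create a refined version of: ${userIntent.trim()}`,
  };
}

const promptParts = buildDiagramPrompt("  authentication flow ");
console.log(promptParts.user); // Create a refined version of: authentication flow
```

Keeping the constraints in an array makes it straightforward to adjust one guideline without touching the rest of the prompt.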

Step 3: Gemini Processing

const result = await generateImage(model, {
  prompt: enhancedPrompt,
  image_data: imageData,
  config: {
    temperature: 0.7,  // Balanced creativity
    topK: 40,          // Token sampling
    topP: 0.95         // Nucleus sampling
  }
});

We tuned these sampling parameters to balance consistency against creativity:

  • Temperature ($\tau$): Controls randomness
    • Too low (0.2): Overly rigid, loses creativity
    • Too high (1.2): Inconsistent results
    • Our sweet spot: 0.7
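To illustrate why temperature controls randomness, here is a generic softmax-with-temperature sketch. It is a textbook illustration of the sampling idea, not Gemini's internal implementation, and the logits are made-up numbers.

```typescript
// Converts raw scores (logits) into probabilities, scaled by temperature tau.
// Lower tau sharpens the distribution; higher tau flattens it.
function softmaxWithTemperature(logits: number[], tau: number): number[] {
  if (tau <= 0) {
    throw new RangeError("tau must be positive");
  }
  const scaled = logits.map((z) => z / tau);
  const maxScaled = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((z) => Math.exp(z - maxScaled));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

const exampleLogits = [2, 1, 0]; // made-up scores for three candidate tokens
for (const tau of [0.2, 0.7, 1.2]) {
  const probs = softmaxWithTemperature(exampleLogits, tau);
  console.log(tau, probs.map((p) => p.toFixed(3)));
}
```

At 0.2 nearly all probability mass sits on the top candidate, which matches the "overly rigid" behavior; at 1.2 the alternatives get a much larger share, which matches the inconsistent results.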

Step 4: Result Integration

The generated image is returned as a Base64-encoded PNG, which we load into an <Image> component with:

  • Automatic aspect ratio preservation
  • Error boundary protection
  • Lazy loading for performance

3. Voice Recognition System

We implemented the Web Speech API with custom idle detection:

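// Prefer the standard constructor and fall back to the WebKit-prefixed one.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;
const recognition = new SpeechRecognitionImpl();
recognition.continuous = true;
recognition.interimResults = true;

// ReturnType<typeof setTimeout> is valid under both browser and Node typings.
let idleTimer: ReturnType<typeof setTimeout> | undefined;

recognition.onresult = (event: any) => {
  // Rebuild the transcript from every result received so far.
  const transcript = Array.from(event.results as ArrayLike<any>)
    .map((result) => result[0].transcript)
    .join('');

  // Each new result restarts the idle countdown.
  clearTimeout(idleTimer);
  idleTimer = setTimeout(() => {
    recognition.stop();
    processVoiceInput(transcript);
  }, 4500); // Idle threshold: 4.5 seconds without new speech
};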

4. Keyboard-Driven Workflow

We added keyboard shortcuts inspired by Vim and VS Code:

  • Tab: Accept suggestion (most natural "forward" action)
  • Esc: Reject suggestion (universal "cancel")

The implementation uses native browser events:

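useEffect(() => {
  const handleKeyDown = (e: KeyboardEvent) => {
    if (!generatedImage) return; // Shortcuts only apply while a suggestion is shown
    if (e.key === 'Tab') {
      e.preventDefault(); // Keep Tab from moving focus
      acceptSuggestion();
    } else if (e.key === 'Escape') {
      rejectSuggestion(); // Assumed counterpart to acceptSuggestion
    }
  };

  window.addEventListener('keydown', handleKeyDown);
  return () => window.removeEventListener('keydown', handleKeyDown);
}, [generatedImage, acceptSuggestion, rejectSuggestion]);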
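The shortcut logic can also be pulled into a small pure function, which makes it testable without a DOM. This is a sketch; `keyToAction` and the `SuggestionAction` type are hypothetical names. Note that browsers report the Esc key as `'Escape'` in `KeyboardEvent.key`.

```typescript
type SuggestionAction = "accept" | "reject" | "none";

// Maps a KeyboardEvent.key value to an action, but only while a suggestion is visible.
function keyToAction(key: string, hasSuggestion: boolean): SuggestionAction {
  if (!hasSuggestion) return "none";
  if (key === "Tab") return "accept";
  if (key === "Escape") return "reject";
  return "none";
}

console.log(keyToAction("Tab", true)); // accept
console.log(keyToAction("Escape", true)); // reject
console.log(keyToAction("Tab", false)); // none
```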

5. UI/UX Design Decisions

Glassmorphism Theme

We used Tailwind CSS with custom blur and transparency utilities:

.glass {
  background: rgba(255, 255, 255, 0.05);
  backdrop-filter: blur(10px);
  border: 1px solid rgba(255, 255, 255, 0.1);
}

Animation System

Framer Motion powers all our transitions with spring physics:

<motion.div
  initial={{ opacity: 0, y: 20 }}
  animate={{ opacity: 1, y: 0 }}
  transition={{ 
    type: "spring",
    stiffness: 300,
    damping: 30 
  }}
/>

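Stiffness: $k = 300$. Damping coefficient: $c = 30$. Framer Motion's damping value is a coefficient, not the dimensionless damping ratio; assuming the default mass $m = 1$, the ratio is $\zeta = c / (2\sqrt{km}) \approx 0.87$, so the spring is slightly underdamped.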
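To get a feel for how a spring with stiffness 300 and damping 30 settles, here is a small numerical sketch that integrates a damped spring from the initial offset `y: 20` to the target `y: 0`. It assumes a mass of 1 and uses simple semi-implicit Euler steps; it does not reproduce Framer Motion's actual solver, and `simulateSpring` is a hypothetical name.

```typescript
interface SpringConfig {
  stiffness: number;
  damping: number;
  mass: number;
}

// Integrates m*x'' = -k*(x - to) - c*x' with semi-implicit Euler steps,
// starting at rest, and returns the position after `seconds`.
function simulateSpring(
  from: number,
  to: number,
  cfg: SpringConfig,
  seconds: number,
  dt: number = 1 / 600
): number {
  let x = from;
  let v = 0;
  const steps = Math.round(seconds / dt);
  for (let i = 0; i < steps; i++) {
    const acceleration = (-cfg.stiffness * (x - to) - cfg.damping * v) / cfg.mass;
    v += acceleration * dt; // update velocity first
    x += v * dt; // then advance position with the new velocity
  }
  return x;
}

const articleSpring: SpringConfig = { stiffness: 300, damping: 30, mass: 1 };
console.log(simulateSpring(20, 0, articleSpring, 0.05)); // still well away from 0
console.log(simulateSpring(20, 0, articleSpring, 1)); // effectively settled at 0
```

With these values the element is visually settled well within a second, which fits the snappy feel the transitions aim for.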

The Challenges We Faced

Challenge 1: Image Quality vs. Processing Speed

Problem: High-resolution exports (2048×2048) produced stunning results but took 8–12 seconds to process.

Solution: We settled on 1024×1024 as the optimal resolution, which gave us:

  • Processing time: 3–5 seconds
  • Quality: Sufficient for most use cases
  • File size: ~200KB (manageable for web)

The relationship between resolution $R$ (side length in pixels) and processing time $T$ (in seconds) was roughly quadratic:

$$T(R) \approx k \cdot R^2$$

Where $k \approx 3 \times 10^{-6}$ seconds per squared pixel in our testing.
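Plugging the two resolutions into this model gives a quick sanity check. The sketch below assumes $R$ is the side length in pixels and $T$ is in seconds; the function and constant names are hypothetical.

```typescript
// Fitted constant from the article, assumed to be in seconds per squared pixel.
const K_SECONDS_PER_PX2 = 3e-6;

// Quadratic estimate of processing time for a square export of the given side length.
function estimateProcessingSeconds(sidePx: number, k: number = K_SECONDS_PER_PX2): number {
  return k * sidePx * sidePx;
}

console.log(estimateProcessingSeconds(1024).toFixed(2)); // "3.15"
console.log(estimateProcessingSeconds(2048).toFixed(2)); // "12.58"
```

The 1024 estimate lands at the low end of the observed 3–5 second range, and the 2048 estimate sits just above the observed 8–12 seconds, which is consistent with the fit being only approximate.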

Challenge 2: Context Window Limitations

Problem: Gemini 2.5 Flash has a context window limit. Large canvases with detailed annotations hit this ceiling.

Solution:

  1. Implemented automatic image compression for exports exceeding 5MB
  2. Added smart cropping for selection-based refinement
  3. Warned users when canvas complexity was too high
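The size check behind the first step can be sketched by estimating the decoded size of a Base64 export from its string length. This assumes "5MB" means 5 × 1024 × 1024 bytes, and the helper names are hypothetical; the compression itself is not shown.

```typescript
// Assumed interpretation of the 5MB limit from the article.
const MAX_EXPORT_BYTES = 5 * 1024 * 1024;

// Returns the decoded byte length of a Base64 string, with or without a data URL prefix.
function base64DecodedBytes(b64: string): number {
  const commaIndex = b64.indexOf(",");
  const data = commaIndex >= 0 ? b64.slice(commaIndex + 1) : b64;
  const padding = data.endsWith("==") ? 2 : data.endsWith("=") ? 1 : 0;
  return Math.floor((data.length * 3) / 4) - padding;
}

// True when the export should be compressed before it is sent to the model.
function exceedsExportLimit(b64: string, limitBytes: number = MAX_EXPORT_BYTES): boolean {
  return base64DecodedBytes(b64) > limitBytes;
}

console.log(base64DecodedBytes("aGVsbG8=")); // 5 ("hello")
console.log(exceedsExportLimit("data:image/png;base64,aGVsbG8=")); // false
```

Working from the string length avoids decoding a multi-megabyte image just to measure it.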

Challenge 3: Voice Recognition Accuracy

Problem: Web Speech API struggles with technical terminology like "Kubernetes pod" or "OAuth 2.0 flow."

Solution:

  • Added a transcript review step
  • Allowed users to edit voice input before submission
  • Created a custom vocabulary hint system (though browser support is limited)

Challenge 4: Cross-Browser Compatibility

Problem: webkitSpeechRecognition is not available in every browser; in our experience it works reliably only in Chromium-based browsers.

Solution:

  • Feature detection with graceful degradation
  • Clear messaging when voice features aren't available
  • Fallback to text-only input

const hasSpeechRecognition = 
  'webkitSpeechRecognition' in window ||
  'SpeechRecognition' in window;
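Building on that check, the graceful-degradation step can be expressed as a function that resolves whichever constructor exists and otherwise selects text input. The `SpeechHost` interface and both function names are assumptions for this sketch; in the browser the host object would be `window`.

```typescript
type SpeechCtor = new () => unknown;

// Minimal view of the object that may expose a speech recognition constructor.
interface SpeechHost {
  SpeechRecognition?: SpeechCtor;
  webkitSpeechRecognition?: SpeechCtor;
}

// Prefers the standard name, then the WebKit-prefixed one, else null.
function resolveSpeechRecognition(host: SpeechHost): SpeechCtor | null {
  return host.SpeechRecognition ?? host.webkitSpeechRecognition ?? null;
}

// Decides which input mode the UI should offer.
function pickInputMode(host: SpeechHost): "voice" | "text" {
  return resolveSpeechRecognition(host) !== null ? "voice" : "text";
}

class FakeRecognition {}
console.log(pickInputMode({ webkitSpeechRecognition: FakeRecognition })); // voice
console.log(pickInputMode({})); // text
```

Taking the host as a parameter keeps the detection logic testable outside a browser.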

Challenge 5: LightningCSS Binary Issues

Problem: Windows builds failing with lightningcss.win32-x64-msvc.node errors.

Solution: We documented a manual workaround and submitted an issue to the Next.js team. This taught us the importance of:

  • Platform-specific testing
  • Clear troubleshooting documentation
  • Community engagement

Performance Optimizations

1. Debounced Drawing Detection

To avoid triggering voice idle detection on every stroke:

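// Generic debounce: delays fn until `delay` ms have passed without a new call.
const debounce = <Args extends unknown[]>(
  fn: (...args: Args) => void,
  delay: number
) => {
  // ReturnType<typeof setTimeout> is valid under both browser and Node typings.
  let timeoutId: ReturnType<typeof setTimeout> | undefined;
  return (...args: Args) => {
    clearTimeout(timeoutId);
    timeoutId = setTimeout(() => fn(...args), delay);
  };
};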

2. Lazy Image Loading

Generated images only load when visible in the preview panel:

<Image 
  src={generatedImage}
  alt="AI-refined diagram preview"
  loading="lazy"
  decoding="async"
/>

3. Edge Function Deployment

Our /api/complete-diagram route uses Vercel Edge Runtime for:

  • Global distribution (lower latency)
  • Automatic scaling
  • Minimal cold starts

What's Next?

While ChalkAI is functional, we have exciting plans:

  1. Collaborative Features: Real-time multiplayer sketching with WebRTC
  2. History & Versions: Track diagram evolution with a timeline UI
  3. Export Formats: SVG, PDF, and PowerPoint support
  4. Template Library: Pre-built diagram templates users can start from
  5. Custom Style Training: Fine-tune Gemini on user-specific design preferences
  6. Mobile App: Native iOS/Android apps with Apple Pencil support

Reflection

Building ChalkAI taught us that the best tools disappear into the background. Users shouldn't think about our technology; they should think about their ideas. Every design decision, from the 4.5-second idle timeout to the glassmorphism UI, was made to reduce friction.

The most rewarding moment was watching someone sketch a complex system architecture, click a button, and see their rough idea transformed into something they were genuinely excited to share with their team.

That's the magic of combining human creativity with AI capability. We're not replacing designers; we're giving everyone superpowers to communicate visually.


Try ChalkAI: Live Demo
Built by: Abhishek Sonje & Jaydeep

"The best interface is the one you never notice."
