🎨 ChalkAI: From Rough Sketches to Professional Diagrams
The Spark of Inspiration
Have you ever sketched a brilliant idea on a whiteboard, only to realize it looked nothing like what you envisioned? We've all been there: frantically drawing flowcharts during meetings, creating system architecture diagrams that resemble abstract art, or struggling to make our hand-drawn concepts look presentable.
This frustration became the catalyst for ChalkAI. We wanted to bridge the gap between the raw creativity of freehand sketching and the polished professionalism of digital diagrams. The question was simple: What if AI could understand our messy sketches and transform them into publication-ready diagrams?
The idea crystallized when we discovered that while design tools like Figma and diagram software like Lucidchart exist, they often interrupt the creative flow. You spend more time fighting with alignment tools and choosing the right connector than actually thinking. We wanted something different: a canvas where you could think freely, sketch naturally, and let AI handle the refinement.
What We Learned
1. The Power of Multimodal AI
Our biggest revelation came from working with Google Gemini 2.5 Flash. We initially underestimated how well modern AI could interpret visual information combined with textual context. The model doesn't just see pixels; it understands intent.
For example, when given a rough sketch with the prompt "authentication flow," Gemini recognizes:
- Boxes as process steps
- Arrows as data flow
- Relative positioning as sequence importance
- Even incomplete shapes as intentional elements
Conceptually, this can be written as a function that maps an image and a text prompt to a diagram:
$$f(I, T) \rightarrow D$$
Where:
- $I$ represents the input image (sketch)
- $T$ represents the textual intent
- $D$ is the output diagram
- $f$ is the Gemini model's transformation function
2. Voice Input Changes Everything
Adding voice recognition wasn't in our original plan, but it became one of the most valuable features. We learned that users think differently when speaking versus typing. Voice descriptions tend to be:
- More natural and conversational
- Richer in context
- Faster to input
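We implemented an idle detection system with a 4.5-second threshold ($t_{\text{idle}} = 4500\,\text{ms}$). Through experimentation, we found this value to be the best trade-off between recognition accuracy and wait time, which we summarize with an informal heuristic: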
$$\text{UX Quality} = \frac{\text{Recognition Accuracy}}{\text{Wait Time}}$$
Too short, and users feel rushed. Too long, and it breaks their flow.
3. The Selection Context Problem
One of our most interesting challenges was handling partial canvas refinement. Initially, we processed the entire canvas every time. But users wanted to refine just a portion (say, fixing one flowchart branch while keeping others intact).
We implemented selection-aware export, which creates a bounded export based on user selection:
$$\text{Export Region} = \begin{cases} \text{Selection Bounds} & \text{if shapes selected} \\ \text{Full Canvas} & \text{otherwise} \end{cases}$$
This seemingly small feature dramatically improved the user experience.
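A minimal sketch of that rule in TypeScript, assuming each shape exposes an axis-aligned bounding box. The `Box` type and the `unionBounds` and `resolveExportRegion` helpers are hypothetical names used for illustration; they are not tldraw APIs.

```typescript
// Axis-aligned bounding box in canvas coordinates (hypothetical type).
interface Box {
  x: number;
  y: number;
  w: number;
  h: number;
}

// Smallest box that contains every box in the list (list must be non-empty).
function unionBounds(boxes: Box[]): Box {
  const minX = Math.min(...boxes.map((b) => b.x));
  const minY = Math.min(...boxes.map((b) => b.y));
  const maxX = Math.max(...boxes.map((b) => b.x + b.w));
  const maxY = Math.max(...boxes.map((b) => b.y + b.h));
  return { x: minX, y: minY, w: maxX - minX, h: maxY - minY };
}

// Selection bounds if any shapes are selected, otherwise the full canvas.
function resolveExportRegion(selected: Box[], fullCanvas: Box): Box {
  return selected.length > 0 ? unionBounds(selected) : fullCanvas;
}
```

With two selected shapes at `{x: 0, y: 0, w: 10, h: 10}` and `{x: 20, y: 5, w: 10, h: 10}`, the export region covers both: `{x: 0, y: 0, w: 30, h: 15}`.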
4. Real-Time Feedback is Non-Negotiable
We learned that users need immediate visual confirmation. Our initial implementation had a several-second black hole between clicking "Generate" and seeing results. This created anxiety and made users unsure if anything was happening.
We added:
- Loading states with glassmorphism effects
- Preview panels that appear before the full generation completes
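- Smooth Framer Motion animations. Conceptually, each transition moves a value from $x_0$ to $x_1$ as progress $t$ runs from 0 to 1 (shown here with an ease-in-out curve; our production transitions use spring physics):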
$$x(t) = x_0 + (x_1 - x_0) \cdot \text{easeInOut}(t)$$
The difference in perceived performance was remarkable, even though actual processing time remained the same.
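The interpolation formula above can be sketched in a few lines of TypeScript. The cubic ease-in-out curve used here is one common choice and is an assumption on our part; Framer Motion ships its own easing and spring implementations, so this only illustrates the formula.

```typescript
// Cubic ease-in-out: slow start, fast middle, slow end. Input is clamped to [0, 1].
function easeInOut(t: number): number {
  const c = Math.min(1, Math.max(0, t));
  return c < 0.5 ? 4 * c * c * c : 1 - Math.pow(-2 * c + 2, 3) / 2;
}

// x(t) = x0 + (x1 - x0) * easeInOut(t)
function interpolate(x0: number, x1: number, t: number): number {
  return x0 + (x1 - x0) * easeInOut(t);
}
```

For a panel that slides from `y = 20` to `y = 0`, `interpolate(20, 0, 0.5)` gives the halfway position, 10.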
How We Built It
Architecture Overview
ChalkAI is built on a modern Next.js 16 App Router architecture with three core layers:
┌─────────────────┐
│   Frontend UI   │ ← React 19 + TypeScript
├─────────────────┤
│   API Routes    │ ← Next.js Edge Functions
├─────────────────┤
│   AI Service    │ ← Vercel AI SDK + Gemini
└─────────────────┘
1. The Canvas Layer (tldraw Integration)
We chose tldraw v4 as our whiteboard foundation because it provides:
- Production-ready drawing primitives
- Built-in undo/redo with operational transforms
- Performant rendering (uses <canvas> with WebGL)
- Extensible architecture
The core integration in whiteboard.tsx:
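import { useCallback, useState } from "react";
import { Tldraw, type Editor } from "tldraw";

// Inside the whiteboard component: keep a handle to the tldraw editor once it mounts
const [editor, setEditor] = useState<Editor | null>(null);
const handleMount = useCallback((editor: Editor) => {
  setEditor(editor);
}, []);
return (
  <Tldraw
    onMount={handleMount}
    // ... configuration
  />
);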
2. The AI Pipeline
Our diagram refinement pipeline follows this flow:
Step 1: Canvas Export
const imageData = await exportToBase64(editor);
// Uses tldraw's native export with custom bounds
Step 2: Prompt Construction
We discovered that prompt engineering was crucial. Our final prompt structure:
SYSTEM: You are a professional diagram designer...
USER: [Base64 Image] + "Create a refined version of: {user_intent}"
The prompt includes specific constraints:
- Output dimensions: 1024×1024px
- Style guidelines: clean, minimal, professional
- Color palette restrictions
- Text rendering requirements
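As a rough sketch, the prompt structure and constraints could be assembled like this. The exact wording of our production prompt is not reproduced here; the `buildPrompt` helper, the `PromptParts` type, and the constraint sentences are hypothetical stand-ins that follow the structure described above.

```typescript
// System and user parts of the prompt (hypothetical shape).
interface PromptParts {
  system: string;
  user: string;
}

// Assemble a prompt from the user's intent and the output constraints.
// All wording below is illustrative, not the production prompt.
function buildPrompt(userIntent: string, sizePx: number = 1024): PromptParts {
  const intent = userIntent.trim() || "diagram";
  const constraints = [
    `Output dimensions: ${sizePx}x${sizePx}px.`,
    "Style: clean, minimal, professional.",
    "Use a restrained color palette.", // placeholder for the palette restrictions
    "Render all text legibly.", // placeholder for the text rendering requirements
  ];
  return {
    system: ["You are a professional diagram designer.", ...constraints].join(" "),
    user: `Create a refined version of: ${intent}`,
  };
}
```

Calling `buildPrompt("authentication flow")` yields a user message of `Create a refined version of: authentication flow`; the sketch image is attached to the request separately.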
Step 3: Gemini Processing
const result = await generateImage(model, {
  prompt: enhancedPrompt,
  image_data: imageData,
  config: {
    temperature: 0.7, // Balanced creativity
    topK: 40,         // Token sampling
    topP: 0.95        // Nucleus sampling
  }
});
We tuned these sampling parameters. For temperature, the trade-off looked like this:
- Temperature ($\tau$): Controls randomness
  - Too low (0.2): Overly rigid, loses creativity
  - Too high (1.2): Inconsistent results
  - Our sweet spot: 0.7
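To make the effect of temperature concrete, here is the standard temperature-scaled softmax in TypeScript. This is the textbook definition, offered as an illustration; how Gemini applies temperature internally is not something we can inspect, and the logits in the note below are made-up numbers.

```typescript
// Convert raw scores (logits) into probabilities, scaled by temperature tau.
// Lower tau sharpens the distribution; higher tau flattens it.
function softmaxWithTemperature(logits: number[], tau: number): number[] {
  if (tau <= 0) {
    throw new RangeError("temperature must be positive");
  }
  const scaled = logits.map((l) => l / tau);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}
```

For logits `[2, 1, 0]`, the top option's probability is highest at $\tau = 0.2$, lower at $\tau = 0.7$, and lowest at $\tau = 1.2$, which matches the rigid-to-inconsistent range we observed.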
Step 4: Result Integration
The generated image is returned as Base64 PNG, which we load into an <Image> component with:
- Automatic aspect ratio preservation
- Error boundary protection
- Lazy loading for performance
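One small detail: a raw Base64 string needs a data URL prefix before it can be used as an image `src`. A minimal sketch, where `toPngDataUrl` is a hypothetical helper name:

```typescript
// Wrap a raw Base64 PNG payload in a data URL; leave existing data URLs untouched.
function toPngDataUrl(base64Png: string): string {
  return base64Png.startsWith("data:")
    ? base64Png
    : `data:image/png;base64,${base64Png}`;
}
```

The result can then be passed as the `src` of the image component.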
3. Voice Recognition System
We implemented the Web Speech API with custom idle detection:
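// Chromium exposes the API under a webkit prefix.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;
const recognition = new SpeechRecognitionImpl();
recognition.continuous = true;
recognition.interimResults = true;
let idleTimer: ReturnType<typeof setTimeout> | undefined;
let transcript = "";
recognition.onresult = (event: any) => {
  // Rebuild the transcript from every result received so far.
  transcript = Array.from(event.results as ArrayLike<any>)
    .map((result) => result[0].transcript)
    .join("");
  // Restart the idle countdown whenever new speech arrives.
  clearTimeout(idleTimer);
  idleTimer = setTimeout(() => {
    recognition.stop();
    processVoiceInput(transcript);
  }, 4500); // Idle threshold in milliseconds
};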
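The idle logic can also be separated from the browser API so it is easier to test. The sketch below is one way to do that, not our exact implementation: `createIdleDetector`, `Schedule`, and `Cancel` are hypothetical names, and the scheduler is injectable so a test can replace real timers.

```typescript
type Cancel = () => void;
type Schedule = (fn: () => void, ms: number) => Cancel;

// Default scheduler backed by setTimeout.
const timeoutSchedule: Schedule = (fn, ms) => {
  const id = setTimeout(fn, ms);
  return () => clearTimeout(id);
};

// Calls onIdle once no activity has been reported for thresholdMs.
function createIdleDetector(
  onIdle: () => void,
  thresholdMs: number = 4500,
  schedule: Schedule = timeoutSchedule
) {
  let cancel: Cancel | null = null;
  return {
    // Report activity: cancel the pending countdown and start a new one.
    activity(): void {
      if (cancel) cancel();
      cancel = schedule(() => {
        cancel = null;
        onIdle();
      }, thresholdMs);
    },
    // Stop watching without firing onIdle.
    stop(): void {
      if (cancel) cancel();
      cancel = null;
    },
  };
}
```

In the voice flow, `activity()` would be called from `recognition.onresult`, and `onIdle` would stop recognition and submit the transcript.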
4. Keyboard-Driven Workflow
We added keyboard shortcuts inspired by Vim and VS Code:
- Tab: Accept suggestion (most natural "forward" action)
- Esc: Reject suggestion (universal "cancel")
The implementation uses native browser events:
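useEffect(() => {
  const handleKeyDown = (e: KeyboardEvent) => {
    if (!generatedImage) return; // Shortcuts only apply while a suggestion is shown
    if (e.key === 'Tab') {
      e.preventDefault(); // Keep Tab from moving focus
      acceptSuggestion();
    } else if (e.key === 'Escape') {
      rejectSuggestion(); // Assumed counterpart to acceptSuggestion
    }
  };
  window.addEventListener('keydown', handleKeyDown);
  return () => window.removeEventListener('keydown', handleKeyDown);
}, [generatedImage, acceptSuggestion, rejectSuggestion]);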
5. UI/UX Design Decisions
Glassmorphism Theme
We used Tailwind CSS with custom blur and transparency utilities:
.glass {
  background: rgba(255, 255, 255, 0.05);
  backdrop-filter: blur(10px);
  border: 1px solid rgba(255, 255, 255, 0.1);
}
Animation System
Framer Motion powers all our transitions with spring physics:
<motion.div
  initial={{ opacity: 0, y: 20 }}
  animate={{ opacity: 1, y: 0 }}
  transition={{
    type: "spring",
    stiffness: 300,
    damping: 30
  }}
/>
Stiffness (spring constant): $k = 300$, damping coefficient: $c = 30$
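For reference, the dimensionless damping ratio of a spring with stiffness $k$, damping coefficient $c$, and mass $m$ is $\zeta = c / (2\sqrt{km})$. The sketch below computes it for the values above, assuming Framer Motion's default mass of 1; `dampingRatio` is a hypothetical helper, not part of the library.

```typescript
// Damping ratio of a mass-spring-damper: zeta = c / (2 * sqrt(k * m)).
// zeta < 1 is underdamped (slight overshoot), zeta = 1 is critically damped.
function dampingRatio(stiffness: number, damping: number, mass: number = 1): number {
  return damping / (2 * Math.sqrt(stiffness * mass));
}

const zeta = dampingRatio(300, 30); // about 0.87
```

A ratio just under 1 means the animation settles quickly with only a barely visible overshoot, which fits the snappy feel we were after.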
The Challenges We Faced
Challenge 1: Image Quality vs. Processing Speed
Problem: High-resolution exports (2048×2048) produced stunning results but took 8–12 seconds to process.
Solution: We settled on 1024×1024 as the optimal resolution, which gave us:
- Processing time: 3–5 seconds
- Quality: Sufficient for most use cases
- File size: ~200KB (manageable for web)
The relationship between resolution $R$ (side length in pixels) and processing time $T$ (in seconds) was roughly quadratic:
$$T(R) \approx k \cdot R^2$$
Where $k \approx 3 \times 10^{-6}$ seconds per squared pixel in our testing.
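Plugging the two resolutions into this model gives a quick sanity check. The sketch assumes $R$ is the side length in pixels and $T$ is in seconds; `estimateProcessingSeconds` is a hypothetical helper.

```typescript
// Empirical constant from our testing (seconds per squared pixel).
const K_SECONDS_PER_PX2 = 3e-6;

// T(R) = k * R^2, with R the side length of the square export in pixels.
function estimateProcessingSeconds(sidePx: number, k: number = K_SECONDS_PER_PX2): number {
  return k * sidePx * sidePx;
}

const at1024 = estimateProcessingSeconds(1024); // about 3.1 s
const at2048 = estimateProcessingSeconds(2048); // about 12.6 s
```

Doubling the side length quadruples the estimate, which lines up with the roughly 3–5 second and 8–12 second ranges we measured, given that the fit is only approximate.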
Challenge 2: Context Window Limitations
Problem: Gemini 2.5 Flash has a context window limit. Large canvases with detailed annotations hit this ceiling.
Solution:
- Implemented automatic image compression for exports exceeding 5MB
- Added smart cropping for selection-based refinement
- Warned users when canvas complexity was too high
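The size check behind the compression step can be sketched as follows. It estimates the decoded size of a Base64 payload without decoding it. The helper names are hypothetical, and treating 5MB as 5 × 1024 × 1024 bytes is an assumption.

```typescript
// Assumed limit: 5 MiB.
const MAX_EXPORT_BYTES = 5 * 1024 * 1024;

// Decoded byte length of a Base64 string (an optional data URL prefix is ignored).
function base64DecodedBytes(b64: string): number {
  const comma = b64.indexOf(",");
  const data = comma >= 0 ? b64.slice(comma + 1) : b64;
  const padding = data.endsWith("==") ? 2 : data.endsWith("=") ? 1 : 0;
  return Math.floor((data.length * 3) / 4) - padding;
}

// True when the export is large enough that it should be compressed first.
function needsCompression(b64: string, limitBytes: number = MAX_EXPORT_BYTES): boolean {
  return base64DecodedBytes(b64) > limitBytes;
}
```

Every four Base64 characters encode three bytes, so the check runs in constant time regardless of image size.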
Challenge 3: Voice Recognition Accuracy
Problem: Web Speech API struggles with technical terminology like "Kubernetes pod" or "OAuth 2.0 flow."
Solution:
- Added a transcript review step
- Allowed users to edit voice input before submission
- Created a custom vocabulary hint system (though browser support is limited)
Challenge 4: Cross-Browser Compatibility
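Problem: webkitSpeechRecognition is not available in every browser. Chromium-based browsers expose it, but Firefox, for example, does not enable speech recognition by default.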
Solution:
- Feature detection with graceful degradation
- Clear messaging when voice features aren't available
- Fallback to text-only input
const hasSpeechRecognition =
  'webkitSpeechRecognition' in window ||
  'SpeechRecognition' in window;
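Building on that check, a small helper can return the constructor itself, or `null` when voice input should fall back to text. This is a sketch: `getSpeechRecognitionCtor` is a hypothetical name, and the scope is passed in as a parameter so the function can be exercised outside a browser.

```typescript
type SpeechRecognitionCtor = new () => unknown;

// Prefer the standard name, then the webkit-prefixed one; null means unsupported.
function getSpeechRecognitionCtor(
  scope: Record<string, unknown>
): SpeechRecognitionCtor | null {
  const candidate = scope["SpeechRecognition"] ?? scope["webkitSpeechRecognition"];
  return typeof candidate === "function" ? (candidate as SpeechRecognitionCtor) : null;
}
```

In the browser this would be called as `getSpeechRecognitionCtor(window as unknown as Record<string, unknown>)`, with the voice button hidden when the result is `null`.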
Challenge 5: LightningCSS Binary Issues
Problem: Windows builds failing with lightningcss.win32-x64-msvc.node errors.
Solution: We documented a manual workaround and submitted an issue to the Next.js team. This taught us the importance of:
- Platform-specific testing
- Clear troubleshooting documentation
- Community engagement
Performance Optimizations
1. Debounced Drawing Detection
To avoid triggering voice idle detection on every stroke:
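const debounce = <A extends unknown[]>(fn: (...args: A) => void, delay: number) => {
  // ReturnType<typeof setTimeout> works in both browser and Node typings
  let timeoutId: ReturnType<typeof setTimeout> | undefined;
  return (...args: A) => {
    clearTimeout(timeoutId); // Cancel the pending call, if any
    timeoutId = setTimeout(() => fn(...args), delay);
  };
};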
2. Lazy Image Loading
Generated images only load when visible in the preview panel:
<Image
  src={generatedImage}
  loading="lazy"
  decoding="async"
/>
3. Edge Function Deployment
Our /api/complete-diagram route uses Vercel Edge Runtime for:
- Global distribution (lower latency)
- Automatic scaling
- Near-zero cold starts
What's Next?
While ChalkAI is functional, we have exciting plans:
- Collaborative Features: Real-time multiplayer sketching with WebRTC
- History & Versions: Track diagram evolution with a timeline UI
- Export Formats: SVG, PDF, and PowerPoint support
- Template Library: Pre-built diagram templates users can start from
- Custom Style Training: Fine-tune Gemini on user-specific design preferences
- Mobile App: Native iOS/Android apps with Apple Pencil support
Reflection
Building ChalkAI taught us that the best tools disappear into the background. Users shouldn't think about our technology; they should think about their ideas. Every design decision, from the 4.5-second idle timeout to the glassmorphism UI, was made to reduce friction.
The most rewarding moment was watching someone sketch a complex system architecture, click a button, and see their rough idea transformed into something they were genuinely excited to share with their team.
That's the magic of combining human creativity with AI capability. We're not replacing designers; we're giving everyone superpowers to communicate visually.
Try ChalkAI: Live Demo
Built by: Abhishek Sonje & Jaydeep
"The best interface is the one you never notice."