Inspiration

HandHold was born from a simple observation: AI chatbots can answer questions, but they can't actually show users how to accomplish tasks on websites. We realized that combining voice interaction with visual demonstration could create a self-driving customer support experience. Imagine having a knowledgeable friend who can not only tell you how to do something but also show you, by moving your cursor and interacting with the interface in real time.

What it does

HandHold is a self-driving customer support system that changes how users interact with websites. Using voice commands in any language, users can ask how to accomplish a task, and HandHold doesn't just tell them, it shows them. Through a combination of natural language processing and browser automation, HandHold:

  • Provides real-time visual demonstrations by controlling the cursor and highlighting relevant elements
  • Offers multilingual voice-based interaction, using DeepL for translation
  • Simulates human-like interactions by clicking, typing, and scrolling through interfaces
  • Creates an accessible experience for users of all technical abilities
  • Reduces support tickets by letting users learn through visual demonstration

How we built it

We engineered HandHold using a stack of modern technologies:

Core Infrastructure:
  • React/TypeScript for the frontend interface
  • Vapi.ai for voice interaction and natural conversation flow
  • DeepL API for real-time language translation
  • Rime for browser automation and cursor control

Technical Architecture:
  • Custom browser action protocol for translating natural language commands into UI interactions
  • Cursor animation system for smooth, human-like movements
  • Element highlighting system for visual feedback
  • State management for coordinating voice, translation, and browser actions

Integration Layer:
  • Anthropic's Model Context Protocol (MCP) for enhanced reasoning capabilities
  • WebSocket-based real-time communication
  • Event-driven architecture for handling voice and browser events
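
To make the browser action protocol and cursor animation ideas more concrete, here is a minimal sketch in TypeScript. It is not HandHold's actual code: the `BrowserAction` shape, its field names, and the helpers `parseAction`, `easeInOutCubic`, and `cursorPath` are all hypothetical names chosen for illustration. The sketch assumes actions arrive as untrusted JSON (for example, from a language model) and that the cursor travels in a straight line with ease-in-out timing.

```typescript
// Hypothetical action protocol. These names are illustrative assumptions,
// not HandHold's real schema.
type BrowserAction =
  | { kind: "move"; selector: string }
  | { kind: "click"; selector: string }
  | { kind: "highlight"; selector: string }
  | { kind: "type"; selector: string; text: string }
  | { kind: "scroll"; deltaY: number };

// Validate an untrusted value before acting on it; returns null if malformed.
function parseAction(raw: unknown): BrowserAction | null {
  if (typeof raw !== "object" || raw === null) return null;
  const a = raw as Record<string, unknown>;
  const selector = typeof a.selector === "string" ? a.selector : null;
  switch (a.kind) {
    case "move":
      return selector !== null ? { kind: "move", selector } : null;
    case "click":
      return selector !== null ? { kind: "click", selector } : null;
    case "highlight":
      return selector !== null ? { kind: "highlight", selector } : null;
    case "type":
      return selector !== null && typeof a.text === "string"
        ? { kind: "type", selector, text: a.text }
        : null;
    case "scroll":
      return typeof a.deltaY === "number"
        ? { kind: "scroll", deltaY: a.deltaY }
        : null;
    default:
      return null;
  }
}

interface Point {
  x: number;
  y: number;
}

// Ease-in-out cubic: slow start, faster middle, slow stop.
function easeInOutCubic(t: number): number {
  return t < 0.5 ? 4 * t * t * t : 1 - Math.pow(-2 * t + 2, 3) / 2;
}

// Sample steps + 1 points on the straight line between two positions,
// spaced according to the easing curve rather than evenly.
function cursorPath(from: Point, to: Point, steps: number): Point[] {
  const path: Point[] = [];
  for (let i = 0; i <= steps; i++) {
    const t = easeInOutCubic(i / steps);
    path.push({
      x: from.x + (to.x - from.x) * t,
      y: from.y + (to.y - from.y) * t,
    });
  }
  return path;
}
```

In a browser, each point from `cursorPath` could be applied to an overlay cursor element on successive animation frames. A discriminated union keeps the set of allowed actions small and lets the compiler check that every action kind is handled.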

Challenges we ran into

  • Synchronizing voice interactions with visual demonstrations while maintaining natural timing
  • Implementing smooth cursor animations that feel human-like rather than robotic
  • Handling complex DOM interactions across different website structures
  • Managing state between multiple async processes (voice, translation, browser actions)
  • Ensuring cross-browser compatibility for cursor control and element highlighting
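
One common way to tame state shared between several async processes is to route every event through a single pure transition function. The sketch below illustrates that idea only; it is not HandHold's implementation. The phase names, the event names, and the `transition` function are assumptions made for the example.

```typescript
// Hypothetical phases and events; illustrative names only.
type Phase = "idle" | "listening" | "translating" | "demonstrating";

type DemoEvent =
  | { type: "voice_start" }
  | { type: "voice_end"; transcript: string }
  | { type: "translated"; text: string }
  | { type: "actions_done" }
  | { type: "error"; message: string };

interface State {
  phase: Phase;
  request: string | null;
  lastError: string | null;
}

const initialState: State = { phase: "idle", request: null, lastError: null };

// Pure transition function: given the current state and an event, return the
// next state. Events that do not fit the current phase are ignored, so a late
// callback from one service cannot corrupt the flow.
function transition(state: State, event: DemoEvent): State {
  if (event.type === "error") {
    // Any failure resets the flow and records the message.
    return { phase: "idle", request: null, lastError: event.message };
  }
  switch (state.phase) {
    case "idle":
      return event.type === "voice_start"
        ? { phase: "listening", request: null, lastError: null }
        : state;
    case "listening":
      return event.type === "voice_end"
        ? { ...state, phase: "translating", request: event.transcript }
        : state;
    case "translating":
      return event.type === "translated"
        ? { ...state, phase: "demonstrating", request: event.text }
        : state;
    case "demonstrating":
      return event.type === "actions_done"
        ? { ...state, phase: "idle", request: null }
        : state;
    default:
      return state;
  }
}
```

Because `transition` has no side effects, the voice, translation, and browser layers can each dispatch events as they complete, and the ordering rules live in one place that is easy to test.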

Accomplishments that we're proud of

  • Created a first-of-its-kind visual demonstration system that actually shows users how to accomplish tasks
  • Successfully integrated three sponsor tools (Vapi, DeepL, and Rime) in a novel way that enhances the user experience
  • Built a scalable architecture that can be easily extended to support more languages and browser actions
  • Achieved natural-feeling cursor movements that users can follow and learn from
  • Developed a system that makes web navigation accessible to users regardless of technical expertise or language

What we learned

  • The importance of timing in human-computer interaction: too fast feels robotic, too slow feels unnatural
  • How to coordinate multiple AI services (voice, translation, automation) into a cohesive experience
  • The complexities of browser automation across different website structures
  • The value of visual demonstration in learning and user support
  • Techniques for making AI interactions feel more human and approachable

What's next for HandHold

Integrate with popular CMS platforms for easy installation, starting with Zendesk.

Built With

  • deepl
  • mcp
  • rime
  • vapi