Speechium in Use

What Inspired Us?

Our team has always been passionate about voice interfaces. As humans, we often prefer 1-on-1 calls rather than text messages to share complex ideas, so why to limit our interaction with computers to just a keyboard and mouse?

Specific to this Cal Hacks project, we noticed that while AI has become mainstream, many people with disabilities still lack full access to AI due to visual impairment or challenges with traditional keyboard and mouse inputs.

The end of this document contains takeaways from our startup.

Video Demo: https://youtu.be/ay7bvQEETsM

Slide deck for this project: https://docs.google.com/presentation/d/1tsx9EmV3Mx0U4Hcv9Hhew0FdMP4ybgdOQPDxFbFZ_DA/edit?usp=sharing

The Team

We are Wako.AI - a startup in the University of California CITRIS Foundry Cohort. Our main business vertical involves creating conversational voice AI agents for phone customer service. The team has a wealth of experience in relevant fields, with work experience from NASA, HP Enterprise, Rouzzy AI, and OSS Projects. We just graduated two days ago and look forward to working on this startup full-time.

How Did We Build it & What Did We Build?

We built a Chrome extension to connect the web browser to our voice interface. Using the extension, we extracted the DOM elements on every page load and compressed it using graph traversal to only extract visible content. Then, we provided Open AI's LLM with all available JS event objects to generate a sequence of actions satisfying what the user desires to do at that temporal state. Once the list of events is generated, we extract the appropriate event listeners and dispatch those events respectively.

Speechium's software, interactions, presentation, and slide deck were fully started and completed within the timeframe of this hackathon. The underlying end-to-end voice architecture was created before the hackathon.

Challenges We Faced

DOM Compression/Serialization

The biggest challenge was to reduce the size of DOM elements on a page to make it capable of processing under a 16K token limit. We solved this problem by traversing the DOM elements tree, only retrieving user-visible elements, allowing us to reduce the size by 20 times.

Event Listener Extraction

The challenge with event listener extraction was primarily with prompt engineering. When trying to generate an event and execute it on the appropriate element, the required target value for each relevant event listener is missing.

To tackle this, we minified the DOM when the page loads. Using the tags from the minified DOM, we tell the LLM to give a query selector of where the element should originate from, solving this problem.

Dispatch Event Generation

In the DOM, we have JS objects that point directly to the element. However, when sending over JSONs, the element target is missing, so it needs to be communicated where it comes from.

We solved this problem by using a feature released by Open AI two days ago. Using the target values for the event listeners and the JS events documentation, we generated Open AI function calls for the exact events.

Things We Learned

Open AI Function Calls

Open AI came out with Function Calls two days ago, so this is a new feature we learned to use. This was highly beneficial when dealing with Dispatch Event Generation to generate exact JS events.

Applied Graph Transversal Theories

As our team is all Computer Science majors, we have all had experience in using Graph Transversal in theory. Still, none of us had implemented Graph Transversal in a real-world scenario, as an underlying library often handles it. However, due to our novel approach, we had to implement graph transversal ourselves in a real-world application.

Unique Voice Technologies

Wako AI's voice interface isn't just simple speech-to-text; we own a wealth of proprietary architecture that allows people to interact with computers naturally through conversation rather than through command-based intent flows like our competitors.

Language is inherently imprecise, so our intent, context, and vector approach is designed to allow anyone to execute complicated tasks through the natural iteration of a conversation, narrowing down an initially imprecise intent to precise ones.

Some of the technologies we have that the competition lacks are context tracking, intent vectorization, and speech pausing. Live examples of these can be seen in our video demos playlist: https://tinyurl.com/speakdemo.

Real-time context tracking remembers where the speaker is in the conversation. With traditional voice AI software, the user has to “push-to-talk” or prompt the assistant with a keyword. This renders long-form conversations impractical. With our voice AI interface, our engine understands when a person’s thought is complete, so it will not cut them off during natural thinking pauses. Keywords or push-to-talk are no longer necessary - creating a seamless conversation that genuinely understands how a conversation evolves over time.
Intent vectorization means that the user's speech is calculated mathematically rather than through literal definitions. This makes the voice interface language-agnostic since user intent is mathematical, not through grammar - meaning that commands automatically work across multiple languages. Developers do not have to think about how a user may want to phrase a command. They can create a vector representing a specific command intent, and our voice interface will automatically match it with the live transcript.
Speech pausing prevents the AI from interrupting someone's thought, just as a person knows if a thought is completed before talking. A typical example is when a person pauses to think, such as when someone says, "Can you please call my..." A traditional AI agent would interrupt, not knowing the incomplete sentence, but our software can wait until the user decides who they want to contact. We can do this by predicting when a user's sentence will end in the future.

These are just three highlights of what helps make our AI natural to talk to. Much development has been dedicated to creating the most natural cloud-powered conversational flows.

Conclusion & Company Summary

Thanks to all the staff, sponsors, and UCB for making this event possible!

Here are some quick takeaways from our startup (TLDR):

Launching the startup, full-time
Incorporated as a Delaware C-Corp through CITRIS & WilmerHale Legal
Seeking Angel/Pre-Seed Investments
Full End-to-End Voice Architecture is Completed
Websites: wako.ai, youtube.com/speakdemo, chatspert.com
Slide Deck: https://docs.google.com/presentation/d/1tsx9EmV3Mx0U4Hcv9Hhew0FdMP4ybgdOQPDxFbFZ_DA/edit?usp=sharing