Speech-to-Form

Inspiration

Ask any conversion expert and they will tell you that the best way to scare off potential customers is to present them with a huge form they have to fill in. Whether you want to buy something, subscribe to something, or register for a new service, you can be sure that at some point you will have to waste a few seconds of your life by filling in a form. This is especially annoying on devices without a physical keyboard.

Speech-to-Form solves this problem. Any website can add Speech-to-Form to its forms to make them more accessible and easier to fill in for users on any device.

What it does

To the user, Speech-to-Form adds a widget to the bottom right corner of a website. A button in the widget activates the user's microphone and starts an audio recording. After the recording is processed, the extracted data is used to automatically populate the form.

To the web developer, Speech-to-Form is a service that runs independently. Once Speech-to-Form has been provided with 1) a schema definition of the form and 2) a brief general description of the form's content, the service communicates with the frontend that displays the form and updates the fields.
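As an illustration, the two inputs might look like the sketch below. The field names, the `type` and `description` keys, and the `applyExtractedData` helper are hypothetical and not taken from the Speech-to-Form source; the sketch only shows how a schema could decide which extracted values are allowed to reach the form.

```javascript
// Hypothetical schema definition for a small registration form.
const formSchema = {
  firstName: { type: 'string', description: 'Given name of the user' },
  email: { type: 'string', description: 'Email address' },
  newsletter: { type: 'boolean', description: 'Whether the user wants the newsletter' },
};

// Hypothetical general description of the form's content.
const formDescription = 'Registration form for a newsletter service.';

// Merge extracted values into the current form state, keeping only values
// whose key exists in the schema and whose JavaScript type matches it.
function applyExtractedData(schema, currentValues, extracted) {
  const updated = { ...currentValues };
  for (const [name, value] of Object.entries(extracted)) {
    const known = Object.prototype.hasOwnProperty.call(schema, name);
    if (known && typeof value === schema[name].type) {
      updated[name] = value;
    }
  }
  return updated;
}
```

Filtering against the schema means that an unexpected key or a value of the wrong type in the extracted data is dropped instead of being written into the form.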

How we built it

Our submission to the Google AI Hackathon comprises two components: the core library of Speech-to-Form, and a demonstration of Speech-to-Form on a locally running webserver. Note that the core library is versatile and can be used in all sorts of scenarios.

Core Library

At its core, Speech-to-Form is a layer of abstraction around the Vertex AI SDK for Node.js. An abstract class LanguageModel allows developers to use different LLMs to extract structured data from the transcripts. Similarly, an abstract class SpeechToText lets developers use arbitrary models to transcribe the user recording.
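A minimal sketch of how such abstract classes could look in plain JavaScript is shown below. The method names `transcribe` and `extract`, and the two stand-in subclasses, are assumptions made for illustration; the actual class interfaces in Speech-to-Form may differ.

```javascript
// Abstract base: concrete subclasses wrap a specific speech-to-text service.
class SpeechToText {
  constructor() {
    if (new.target === SpeechToText) {
      throw new TypeError('SpeechToText is abstract; extend it instead.');
    }
  }

  // Subclasses turn recorded audio into a transcript string.
  async transcribe(audio) {
    throw new Error('transcribe() must be implemented by a subclass.');
  }
}

// Abstract base: concrete subclasses wrap a specific LLM.
class LanguageModel {
  constructor() {
    if (new.target === LanguageModel) {
      throw new TypeError('LanguageModel is abstract; extend it instead.');
    }
  }

  // Subclasses turn a transcript into an object keyed by form field name.
  async extract(transcript, schema) {
    throw new Error('extract() must be implemented by a subclass.');
  }
}

// Stand-in implementation: treats the "audio" buffer as UTF-8 text.
class EchoSpeechToText extends SpeechToText {
  async transcribe(audio) {
    return audio.toString('utf8');
  }
}

// Stand-in implementation: matches "<field> is <word>" instead of calling an LLM.
class KeywordLanguageModel extends LanguageModel {
  async extract(transcript, schema) {
    const result = {};
    for (const name of Object.keys(schema)) {
      const match = transcript.match(new RegExp(`${name} is (\\w+)`, 'i'));
      if (match) result[name] = match[1];
    }
    return result;
  }
}
```

Because callers only depend on `transcribe` and `extract`, swapping one provider for another would only require a new subclass.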

A controller called speech.js instantiates both LanguageModel and SpeechToText. It exposes three functions to the webserver: startRecording, stopRecording and processAudio.
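The controller's three functions could be sketched roughly as follows. The `createSpeechController` factory, the injected `recorder` object, and the method names on the injected models are hypothetical; the real speech.js instantiates its own models and may organise state differently.

```javascript
// Hypothetical controller factory. Dependencies are injected here only to
// keep the sketch self-contained.
function createSpeechController({ recorder, speechToText, languageModel }) {
  let recording = null; // handle of the recording in progress, if any

  return {
    // Begin capturing audio; only one recording may be active at a time.
    startRecording() {
      if (recording) throw new Error('A recording is already in progress.');
      recording = recorder.start();
    },

    // End the active recording and return the captured audio.
    stopRecording() {
      if (!recording) throw new Error('No recording in progress.');
      const audio = recording.stop();
      recording = null;
      return audio;
    },

    // Transcribe the audio, then extract structured form data from the transcript.
    async processAudio(audio, schema) {
      const transcript = await speechToText.transcribe(audio);
      return languageModel.extract(transcript, schema);
    },
  };
}
```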

Local Demo

An Express.js web server serves a website with a sample form, built using tailwindui, at http://localhost:3000. A WebSocket server running on the same port handles the communication between the Speech-to-Form widget and speech.js: it starts the recording, ends it, awaits processing, and sends the extracted data to the client.
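The message flow between the widget and the controller can be sketched as a small dispatcher. The message types (`start`, `stop`, `recording`, `processing`, `result`, `error`) and the `handleWidgetMessage` function are assumptions for illustration; the demo's actual wire format is not described here.

```javascript
// Route one JSON message from the widget to the controller and reply via `send`.
// `controller` is expected to offer startRecording, stopRecording and processAudio.
async function handleWidgetMessage(controller, rawMessage, send) {
  let message;
  try {
    message = JSON.parse(rawMessage);
  } catch (err) {
    send(JSON.stringify({ type: 'error', reason: 'invalid JSON' }));
    return;
  }

  const type = message && message.type;
  switch (type) {
    case 'start':
      controller.startRecording();
      send(JSON.stringify({ type: 'recording' }));
      break;
    case 'stop': {
      const audio = controller.stopRecording();
      send(JSON.stringify({ type: 'processing' }));
      // Wait for transcription and extraction before replying with the data.
      const data = await controller.processAudio(audio);
      send(JSON.stringify({ type: 'result', data }));
      break;
    }
    default:
      send(JSON.stringify({ type: 'error', reason: `unknown message type: ${type}` }));
  }
}
```

If the `ws` package were used for the WebSocket server, the dispatcher could be attached with `socket.on('message', (raw) => handleWidgetMessage(controller, raw.toString(), (m) => socket.send(m)))`.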

Challenges we ran into

Extracting structured data using Gemini is not trivial. We found that instructing Gemini to return structured data through the prompt was unreliable. Specifically, a prompt such as "... You return valid JSON that can be parsed by another machine. Do NOT use any markup or templating, such as Markdown, in your response." would not be reliably followed by Gemini. Instead, Gemini would often return a JSON object wrapped in Markdown, e.g.:

{'field_name': ...}

To circumvent the problem, we use Function Calling to extract structured data in JSON from the transcriptions when using Gemini10Pro and Gemini15Pro.
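A rough sketch of the function-calling approach is given below. It builds a function declaration from a hypothetical form schema and reads the arguments back from a response object shaped like a Gemini result. The function name `fill_form`, the schema layout, and both helper names are assumptions; only the general declaration shape (`name`, `description`, `parameters`) and the `functionCall` part of the response follow Gemini's function-calling format.

```javascript
// Build a function declaration from a hypothetical form schema of the form
// { fieldName: { type: 'string' | 'boolean' | 'number', description } }.
function buildFunctionDeclaration(schema, description) {
  const properties = {};
  for (const [name, field] of Object.entries(schema)) {
    properties[name] = {
      type: field.type.toUpperCase(), // e.g. 'STRING', 'BOOLEAN', 'NUMBER'
      description: field.description,
    };
  }
  return {
    name: 'fill_form', // hypothetical function name
    description,
    parameters: { type: 'OBJECT', properties },
  };
}

// Return the structured arguments of the first function call in the response,
// or null if the model answered with plain text instead.
function extractFunctionArgs(response) {
  const parts = response?.candidates?.[0]?.content?.parts ?? [];
  const call = parts.find((part) => part.functionCall)?.functionCall;
  return call ? call.args : null;
}
```

With the Vertex AI SDK for Node.js, a declaration like this would be supplied in the request's `tools` list; the exact field names depend on the SDK version, so the SDK reference should be checked. Because the model returns the arguments as an object instead of free text, no Markdown wrapping has to be stripped before parsing.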

Accomplishments that we're proud of

We believe that Speech-to-Form can be an easy way for websites to make their forms more accessible to users. Using mobile keyboards or any keyboard at all can be a major accessibility hurdle for users. And although Speech-to-Form is not nearly as powerful as specialised solutions for voice control (e.g. Talon), it can still improve accessibility for users.

What we learned

We compared Speech-to-Form powered by Google Cloud Speech-to-Text and Gemini 1.5 Pro to a version that uses OpenAI's Whisper and GPT-4 Turbo. We found the two versions to be on par in our comparison.

What's next for Speech-to-Form

While Speech-to-Form is already a powerful tool for web developers, we want to add the following features:

streaming support for Google Cloud Speech-to-Text to improve performance,
a more sophisticated demo application that uses the browser's microphone API instead of the Node.js library used in the demo,
improved support to iteratively edit and build on inputs made in a form, and
session support with a database, e.g. Firestore, to give the LLM more relevant context using RAG.

Built With

express.js
gemini1.0pro
gemini1.5pro
html
javascript
node.js
npm
speech-to-text
tailwindcss
tailwindui
vertexai
websockets

Updates

Fabian Benusch started this project — May 02, 2024 03:09 PM EDT
