Inspiration
The inspiration for the project came from a team member who was moved by a story a friend had told. This friend, who had recently moved from Pakistan to Canada, said that the language barrier caused him overwhelming social anxiety. As a result, he was afraid of everyday social situations such as dining at a restaurant or simply going to the grocery store.
Focusing on the restaurant industry, we realized that dining out can be very daunting when there is a language barrier, and we thought that there should be a solution to this problem.
Further motivation stems from the desire to cut costs and reduce human error. Integrating an automated system would eliminate the need for human servers, ultimately allowing for automation of the restaurant industry.
What it does
EZ-Serve automates restaurant tableside service, enhancing the customer experience with its multilingual capability and convenience.
In its current state, customers interact with EZ-Serve via audio commands, in the same manner as with a traditional server. EZ-Serve promotes multilingualism and will provide service in any language supported by Google. From answering a question about the menu or describing an item, to taking orders with or without modifications, to checking out, EZ-Serve provides a limited but flexible toolbox to satisfy customer needs. Once a prompt has been executed, EZ-Serve responds with natural-language audio output. After a successful order, which is processed in the Square network, the user can view their active order on the EZ-Serve virtual terminal.
How we built it
Our project stack includes Python 3, a LangChain custom agent, a Weaviate vector database, Google Cloud Platform APIs (Vertex AI, Speech-to-Text, Text-to-Speech, and Translation), the Square API, PyQt5, and a Raspberry Pi 4 with some hardware (keyboard, mouse, microphone, speakers).
We started with a desire to get some experience working with LLM agents, so research into that area was a priority. At the time, LangChain was a clear winner.
We also needed to work out what it would take to communicate with the Square API. Starting with the API Explorer and a dummy menu in the Square Sandbox, we pieced together which endpoints our product would need to interact with. After a quick look at the Square Python SDK, we didn’t foresee any issues with the API.
We started with a basic, preset LangChain agent with a single order tool. Its sole function was to take in unstructured text and output a structured JSON order item in the format we wanted to work with. This was a considerable challenge, and it took a few days of prompt engineering before it reliably output what we needed.
Once we saw that the JSON could be generated reliably, we abstracted the Square order into a class and began looking into the audio APIs available from Google.
Once we got the order_tool working and placing orders through the Square API, we started incorporating the Google translation, transcription, and text-to-speech APIs into the flow. This gave us a better idea of how the final product was going to look.
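The parsing step described above can be sketched roughly as follows. This is a minimal illustration, not our actual tool: the `OrderItem` fields (`name`, `quantity`, `modifications`) and the `parse_order_item` helper are hypothetical names chosen for the example, and only the Python standard library is used.

```python
import json
from dataclasses import dataclass, field


@dataclass
class OrderItem:
    """Hypothetical structured order item; the field names are assumptions."""
    name: str
    quantity: int = 1
    modifications: list[str] = field(default_factory=list)


def parse_order_item(llm_output: str) -> OrderItem:
    """Parse and validate the JSON text a model returned for one order item."""
    # Models often wrap JSON in extra prose, so keep only the outermost braces.
    start, end = llm_output.find("{"), llm_output.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(llm_output[start:end + 1])

    name = str(data.get("name", "")).strip()
    if not name:
        raise ValueError("order item is missing a name")
    quantity = int(data.get("quantity", 1))
    if quantity < 1:
        raise ValueError("quantity must be at least 1")
    modifications = [str(m) for m in data.get("modifications", [])]
    return OrderItem(name=name, quantity=quantity, modifications=modifications)


if __name__ == "__main__":
    raw = 'Sure! {"name": "cheeseburger", "quantity": 2, "modifications": ["no onions"]}'
    print(parse_order_item(raw))
```

Rejecting malformed output with a `ValueError` gives the calling code a single place to decide whether to re-prompt the model or ask the customer again.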
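One way to picture that flow is as a single function per customer turn. The sketch below is illustrative only: the stage functions (`transcribe`, `translate`, `run_agent`, `synthesize`) are hypothetical callables passed in as parameters rather than real Google client calls, and it assumes the agent works in English, which is a simplification made for the example.

```python
from typing import Callable


def handle_utterance(
    audio: bytes,
    transcribe: Callable[[bytes], tuple[str, str]],
    translate: Callable[[str, str], str],
    run_agent: Callable[[str], str],
    synthesize: Callable[[str, str], bytes],
) -> bytes:
    """Run one customer turn: speech in, speech out, in the customer's language."""
    # 1. Speech to text; this stage also reports the detected language code.
    text, language = transcribe(audio)
    # 2. Translate into English so the agent only ever sees one language.
    english = text if language == "en" else translate(text, "en")
    # 3. Let the agent answer the question or place the order.
    reply = run_agent(english)
    # 4. Translate the reply back and turn it into audio.
    localized = reply if language == "en" else translate(reply, language)
    return synthesize(localized, language)


if __name__ == "__main__":
    # Stand-in stages so the sketch runs without any cloud credentials.
    result = handle_utterance(
        b"deux croissants",
        transcribe=lambda audio: (audio.decode("utf-8"), "fr"),
        translate=lambda text, target: f"[{target}] {text}",
        run_agent=lambda english: f"Ordered: {english}",
        synthesize=lambda text, language: text.encode("utf-8"),
    )
    print(result)
```

Passing the stages in as callables keeps the flow testable on a laptop, with the real API clients swapped in on the device.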
The Pi we ordered was delivered about three weeks before submission. At that point the team divided the work: Alex worked on the virtual terminals, Vaughan on the agent, and Dom on getting audio working on the Pi. This is when debugging and cooperation became a challenge. At the time, the agent was running embedded ChromaDB on Windows, but we could not get that running on Linux, so we had to switch to Weaviate quickly to continue development.
About two weeks before submission, we pivoted to a custom agent. This allowed much more freedom but also gave the agent much more room to hallucinate. Overall it improved the performance and robustness of the agent, but it required three times as much debugging as we expected.
Around this time we finally worked out how we wanted the async record flow to work. We had initially been working with PyAudio, but it had a bug where the first 0.5 s of each recording was empty, which felt awkward for the user experience. Looking for an alternative, we found sounddevice, and testing confirmed that it solved our problem. Implementing sounddevice was a challenge in itself, because it is a completely different audio module built around callbacks. Eventually we figured it out and got the system running smoothly. The last two weeks before submission were filled with debugging and passing the Pi between team members at each meeting to maximize the amount of work we could get done on it.
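For readers unfamiliar with the callback style, a minimal sketch of recording with sounddevice looks something like this. It is not our production code: the names `on_audio`, `record`, and `audio_chunks` are made up for the example, the 16 kHz mono settings are assumptions, and sounddevice is imported lazily because it needs PortAudio and a working input device.

```python
import queue

# Filled by the audio thread, drained by the caller once recording stops.
audio_chunks: queue.Queue = queue.Queue()


def on_audio(indata, frames, time, status):
    """Stream callback: sounddevice calls this for every captured block."""
    if status:
        print(f"audio status: {status}")
    # The buffer is only valid during the callback, so store a copy.
    audio_chunks.put(indata.copy())


def record(seconds: float, samplerate: int = 16_000) -> list:
    """Record from the default microphone and return the captured blocks."""
    import sounddevice as sd  # lazy import; requires PortAudio on the machine

    with sd.InputStream(samplerate=samplerate, channels=1, callback=on_audio):
        sd.sleep(int(seconds * 1000))  # the callback keeps firing while we wait
    blocks = []
    while not audio_chunks.empty():
        blocks.append(audio_chunks.get())
    return blocks
```

Calling `record(3.0)` on a machine with a microphone would return a list of NumPy arrays, one per captured block, which can then be concatenated and sent on for transcription.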
Challenges we ran into
We encountered many software and hardware challenges throughout development, including:
- Getting audio to work was a real pain, and is currently only supported on the Pi.
- PyAudio configuration did not work, so we had to use py-sounddevice, which was challenging.
- The first version of the agent called order_tool for each individual item, which was an inefficient use of LLM calls, so we converted order_tool to accept a list of items. This drastically improved the speed of the program and reduced the number of calls.
- We realized that any agent that has to interact with humans is going to encounter plenty of edge cases that we didn’t have the time to exhaust, so we implemented plenty of error checking, and eventually saw the need for human-in-the-loop validation.
- Initially we used a bunch of regexes and brute-forced a list to find order items; we later upgraded to a vector database.
- We had many issues minimizing the amount of text going through the LLM; we changed to a custom agent framework to batch order calls.
- The LangChain agent prompt has a character limit.
- Use of asyncio and threading concurrently.
- Mapping Windows ports to Unix ports.
- ALSA audio configuration was painful.
- The Raspberry Pi has a maximum power output.
- Difficulties outputting audio from .wav files.
- Some things work, and don't work, in Python. Everything, mostly, works in Python3.
- Geany IDE doesn't like the simultaneous use of spaces and tabs.
- More
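Mixing asyncio and threads, as mentioned in the list above, mostly comes down to handing data from a worker thread to the event loop safely. The sketch below shows one common pattern using only the standard library; the `producer` and `collect` names are invented for the example, and the worker thread stands in for something like an audio callback thread.

```python
import asyncio
import threading


def producer(loop: asyncio.AbstractEventLoop, out: asyncio.Queue, chunks: list) -> None:
    """Runs in a worker thread, standing in for an audio callback thread."""
    for chunk in chunks:
        # asyncio.Queue is not thread-safe, so schedule the put on the loop's thread.
        loop.call_soon_threadsafe(out.put_nowait, chunk)
    loop.call_soon_threadsafe(out.put_nowait, None)  # sentinel: no more data


async def collect(chunks: list) -> list:
    """Start the worker thread and gather everything it sends, in order."""
    loop = asyncio.get_running_loop()
    out: asyncio.Queue = asyncio.Queue()
    worker = threading.Thread(target=producer, args=(loop, out, chunks), daemon=True)
    worker.start()
    received = []
    while (item := await out.get()) is not None:
        received.append(item)
    worker.join()
    return received


if __name__ == "__main__":
    print(asyncio.run(collect([b"chunk-1", b"chunk-2", b"chunk-3"])))
```

The important detail is `loop.call_soon_threadsafe`: calling `put_nowait` directly from the worker thread can appear to work and then fail intermittently.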
Accomplishments that we're proud of
Of all the accomplishments and milestones, the three we are proudest of are:
- Implementation of human-in-the-loop validation in the custom LLM agent.
- Sticking to our Linear project plan throughout the Software Development Life Cycle.
- Successful integration/configuration of ALSA audio sources/sinks on Raspberry Pi 4.
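To give a flavor of what human-in-the-loop validation means here, the following sketch reads a parsed order back to the customer and only submits it after an explicit confirmation. It is a simplified illustration under stated assumptions: `confirm_and_submit`, the `ask` and `submit` callables, and the item dictionary keys are hypothetical and do not reflect the real agent or the Square SDK.

```python
from typing import Callable


def confirm_and_submit(
    items: list[dict],
    ask: Callable[[str], str],
    submit: Callable[[list[dict]], str],
) -> str:
    """Read the parsed order back and submit it only after an explicit yes."""
    summary = ", ".join(f"{item['quantity']} x {item['name']}" for item in items)
    answer = ask(f"I have {summary}. Is that correct?")
    if answer.strip().lower() in {"yes", "y", "correct"}:
        return submit(items)  # one submission for the whole batch of items
    return "Order not placed. What would you like to change?"


if __name__ == "__main__":
    order = [{"name": "latte", "quantity": 1}, {"name": "bagel", "quantity": 2}]
    # Stand-ins for the spoken prompt and the order API call.
    print(confirm_and_submit(
        order,
        ask=lambda prompt: "yes",
        submit=lambda items: f"Placed {len(items)} items.",
    ))
```

Putting the check between parsing and submission means a hallucinated or misheard item is caught before it reaches the kitchen.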
What we learned
As a team, we are all passionate about self-development and eager to tackle complex challenges. In that spirit, we took the opportunity to learn as much as we could during development.
- Embedded development is difficult.
- Async audio is not fun to debug.
- Prompt Engineering
- Langchain Agents
- Vector databases
- Google APIs for speech
- VertexAI
- Square API
What's next for EZ-Serve
The next step for EZ-Serve is to build a prototype EZ-Server hardware device that integrates a Square terminal (NFC reader, etc.) with a built-in hardware screen to replace the virtual terminals. The device would be activated by a QR code that redirects the user to a React single-page application with the restaurant menu and, potentially, additional control and support for EZ-Serve. Lastly, we would continually improve the LLM, adding tools and refining its output. See “Ultimate Concept - Diagram” in the GitHub repository.
Built With
- gcp
- langchain
- pyqt5
- python
- raspberry-pi
- square
- weaviate