🌟 Our Story 🌟
As we continue to reimagine what our technology can do, we have a responsibility to reflect on who can realistically access these applications. Voice-powered navigation, a sci-fi classic brought to fruition across the tech industry, has become increasingly powerful with advancements in generative AI. We can now detect specific acoustic patterns, identify voice activity, and transcribe and process complex commands more accurately than ever.
People use TikTok everywhere, from the lunch table to under the covers before bed. Navigating the app solely by touch is a hindrance whenever users' hands are occupied or unavailable. We wanted to build a solution that makes the experience easier for anyone who would appreciate it and, more importantly, makes the app accessible to users with mobility constraints who could not plausibly use it before.
🤖 Who's Tikki? 🤖
Tikki is a generative AI-powered voice assistant that lets users navigate the app with a variety of voice commands. Just wake Tikki up with a simple "Hi Tikki," and your wish is its command. Tikki can carry out everything from page navigation to interacting with user profiles right from the For You page. Currently, the following languages are supported:
- Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.
🛠 Building Blocks 🛠
For this prototype, we recreated the TikTok web front-end from scratch with Next.js. Using a trained wake word detection neural network, the back-end matches real-time audio input against the acoustic patterns for "Hi Tikki." Once it identifies the wake word above a certain confidence threshold, it uses a voice activity detection model to record the rest of the user's speech until they stop talking. This audio is sent to the OpenAI Whisper large-v2 model for transcription into text, and then to GPT-4 to match that text against a list of pre-set, routed commands. The front-end app then carries out the command in the "TikTok" interface.
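The GPT-4 matching step can be sketched roughly like this. The command names, the prompt wording, and the function names below are all illustrative assumptions, not our exact implementation; the key idea is that the model is constrained to a fixed command list, and anything outside that list is rejected rather than executed.

```python
# Illustrative subset of routed commands; the real list is larger.
ROUTED_COMMANDS = ["like_video", "next_video", "previous_video",
                   "open_profile", "go_home"]

def build_prompt(transcript: str) -> str:
    """Prompt asking GPT-4 to pick exactly one pre-set command (hypothetical wording)."""
    return (
        "Map the user's request to exactly one of these commands, "
        f"or 'none' if nothing fits: {', '.join(ROUTED_COMMANDS)}.\n"
        f"Request: {transcript!r}\nAnswer with the command only."
    )

def validate_command(model_output: str) -> str:
    """Accept only commands from the routed list, so free-form model
    output can never trigger an unrouted action."""
    cmd = model_output.strip().lower()
    return cmd if cmd in ROUTED_COMMANDS else "none"
```

Validating the model's answer against the whitelist is what keeps vague or rambling requests safe: the worst case is a no-op, never an unintended action.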
Tools Used: OpenAI API (Whisper large-v2/GPT-4), Picovoice SDK (Cobra/Porcupine), Uvicorn, Socket.IO, Quart, AIOHTTP, NumPy
🚧 Cares & Concerns 🚧
We had two main concerns to take into account: user experience and user privacy.
User Experience: The core of our project is making TikTok easier to use for all people. We wanted the recording experience to feel natural and intuitive, rather than something the user needs to actively think about to trigger. This meant adapting our algorithm to account for natural speech variations, being conscious of how long the app waits before considering the user "done" talking, and making sure the entire process ran quickly enough not to annoy the user. We also had to anticipate the many phrasings users might attempt while using the app, and implement multi-language functionality for TikTok's global user base.
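The "done talking" decision above boils down to an endpointing heuristic over per-frame voice probabilities, as a Cobra-style VAD emits. This is a minimal sketch under assumed values: the frame length, silence threshold, and pause tolerance here are illustrative, not our tuned settings.

```python
FRAME_MS = 32            # Cobra-style VAD frames are ~32 ms at 16 kHz
SILENCE_THRESHOLD = 0.3  # below this probability, treat a frame as silence (assumed)
END_SILENCE_MS = 800     # tolerate natural mid-sentence pauses before cutting off

def end_of_speech_index(voice_probs):
    """Return the frame index at which recording should stop,
    or None if speech appears to be ongoing."""
    needed = END_SILENCE_MS // FRAME_MS
    run = 0
    for i, p in enumerate(voice_probs):
        run = run + 1 if p < SILENCE_THRESHOLD else 0
        if run >= needed:
            return i
    return None
```

Tuning END_SILENCE_MS is the UX trade-off: too short and the assistant clips people who pause to think, too long and every command feels sluggish.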
User Privacy: Privacy was another huge concern, especially when it comes to recording and storing audio data. While it would have been easier to record continuously, we wanted to make sure we weren't mindlessly capturing all audio input. Our wake word detection model ensures the system only processes audio after identifying "Hi Tikki" from its acoustic patterns. The command audio is saved to a temporary path while it is being processed, then deleted immediately after transcription.
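The ephemeral-audio pattern described above can be sketched as follows. The function name and the injected `transcribe` callable are stand-ins (in our build, that call goes to Whisper large-v2); the point is the `try`/`finally`, which guarantees the file is removed even if transcription fails.

```python
import os
import tempfile

def transcribe_and_discard(audio_bytes: bytes, transcribe) -> str:
    """Write command audio to a temporary file, transcribe it,
    and delete the file no matter what happens."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(audio_bytes)
        return transcribe(path)   # stand-in for the Whisper API call
    finally:
        os.remove(path)           # audio is gone immediately after transcription
```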
🏆 Not to brag, but... 🏆
Our wake word detection is impressively accurate, and the app uses generative AI models to understand even vaguely expressed commands. Tikki even understands long-winded dilemmas: "I really loved this video so much and thought it was so great and wish you could do something about it" led to an automated like.
🚀 What's Next for Tikki 🚀
This is a beta build, and due to time constraints, Tikki is currently only routed to navigate between pages, like videos, and scroll on the For You page. With more time, we would build out functionality to handle every interaction users have with the app.
Beyond that, our next steps would naturally be implementing Tikki in the mobile app, where most users access TikTok. We would also like to explore building a distinct character identity and personality for Tikki, similar to Iron Man's Jarvis assistant, to distinguish our assistant from similar systems (Siri, Cortana, Alexa) that lack that level of user interaction and humanization.
Built With
- machine-learning
- openai
- python
- react
- typescript
- voice-activity-detection