Raven

DJI Tello
Head tilt control UI
Speech Processing UI
Speech Processing UI 2

Inspiration

With recent innovations in the AI and wearables space we saw an opportunity to improve the way we interact with robots (from drones to computers and wheelchairs) by making it more seamless and accessible.

Our AirPods powered head tracking drone control solution was inspired by Apple’s spatial audio feature - a music feature where the audio tracks the orientation of your head.

Plus with the introduction of large multimodal foundation models, we decided to utilize their understanding of text and vision to create novel natural methods of human computer interaction.

What it does

Raven is a drone that can be controlled using Airpods, natural language, and gestures.

With Airpod Mode, Raven can be controlled by tilting one’s head on different axes, which can provide a fun, first-person flying experience. We extract head pose using the built in inertial measurement units located inside each airpod. With speech recognition mode, Raven can process long-form speech, and translate and execute this speech into discrete commands to execute. We hope these control options can help people who have a hard time using their arms to control.

With gesture recognition mode, Raven can interpret natural hand gestures and move in the associated directions. We hope this control option can help people with speech difficulties.

How we built it

We used the DJI Tello educational drone kit to build Raven. This drone kit comes with a Python API which users can connect to over a UDP stream and execute commands such as move_forward() and move_backward(). We designed our interfaces around this API to allow people to control the drone easily.

The Airpod Control We utilized Swift and Apple’s CMMotionManager class to extract raw imu pose estimates of the wearer's head. From this, we derive the roll, pitch, and yaw of the wearer's head to determine the direction the user wants to command the drone to fly. By tilting your head forward, the drone flies forward. Tilt your head backwards for backward and left/right to turn the drone in the respective direction.

To create the speech recognition mode, we first record speech from the user and transcribe it using OpenAI’s Whisper API. We then feed this transcription to GPT-4 to generate a list of associated commands to run that correspond to the user query.

To create the gesture recognition mode, we create an image stream of the user from their device’s webcam. We then process these images using the GPT-4 Vision API to process these images and interpret the user’s desired command.

Challenges we ran into

We ran into several challenges with connecting to the drone via udp and communicating with external api’s. To control the drone, you have to connect to the drone’s LAN which isn’t connected to the global WAN. This means that you cannot talk to external APIs when your connected to the drone. To solve this, we utilized a dual wifi card approach by modding our laptop to connect to two wifi networks in parallel.

Other challenges we ran into were getting a reliable video stream from the drone and displaying the stream to the user. We also ran into some latency issues with getting all the different processes involved talking to each other.

Accomplishments that we're proud of

Airpods control works and its super fun! Voice control feels like magic watching the drone do exactly what you say. We were also able to develop an end to end pipeline to translate speech into executable instructions. Lastly, we were able to solve our problems with receiving a reliable video stream just in time for vision based gesture control.

What we learned

We learned how to work with UDP streams which we used to communicate with our drone. We also utilized audio and image processing techniques to process input data for the speech and gesture recognition interfaces.

What's next for Raven

We view this project as an initial exploration of new interfaces unlocked by recent advancements in wearable and Artificial Intelligence. We think that such advances have significant implications for increasing accessibility.

We think the ideas behind this project could power toys that could open up a world of possibilities for kids who are unable to play with traditional RC toys

We also think the interfaces introduced in this project could power a whole host of robotic devices, including wheelchairs, that could drastically improve the quality of life for many individuals.