AudioVision provides navigation for the vision impaired. We are interested in sensory perception, replacement, and augmentation.
What it does
AudioVision uses a head-mounted depth camera and IMU to create a 3D representation of the environment, then maps that environment back into 3D space using directional audio.
How we built it
The core of our project is the Occipital Structure Sensor, a depth sensor similar to the Kinect. We use its depth data, combined with gyroscope and accelerometer data, to create a world-space mapping of our environment. We read frames from the camera as fast as possible while concurrently reading data from the gyroscope. For each frame, we first use an inverse perspective projection to transform the frame into camera space, then use the gyroscope data to transform it into world space based on orientation. We store these points and process them to remove noise and to determine interesting places in the world to position audio. This pipeline is expensive, sometimes running at only 10 FPS, so to lower latency we keep previously calculated sound points and transform them using gyroscope data between frames.
For each point cloud, we down-sample using a voxel grid, then use a clustering algorithm to remove small areas that are likely uninteresting or noise. From the remaining candidate points, we randomly sample a few as audio sources and create a square wave whose frequency is based on depth. We composite these points with those of past frames and use a head-related transfer function to create spatial sound.
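To illustrate the per-pixel transform, here is a minimal C++ sketch of the inverse perspective projection and orientation step. The pinhole intrinsics (fx, fy, cx, cy) and the 3x3 rotation matrix are illustrative assumptions; the real sensor calibration and IMU fusion code are not shown.

```cpp
#include <array>

struct Vec3 { double x, y, z; };
using Mat3 = std::array<std::array<double, 3>, 3>;

// Inverse perspective projection: lift a depth pixel (u, v, depth) into
// camera-space coordinates using a pinhole camera model.
inline Vec3 unproject(double u, double v, double depth,
                      double fx, double fy, double cx, double cy) {
    return { (u - cx) * depth / fx,
             (v - cy) * depth / fy,
             depth };
}

// Rotate a camera-space point into world space using the orientation
// estimated from the gyroscope/accelerometer.
inline Vec3 to_world(const Mat3& R, const Vec3& p) {
    return { R[0][0]*p.x + R[0][1]*p.y + R[0][2]*p.z,
             R[1][0]*p.x + R[1][1]*p.y + R[1][2]*p.z,
             R[2][0]*p.x + R[2][1]*p.y + R[2][2]*p.z };
}
```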
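The voxel-grid down-sampling step can be sketched as follows. This is a simplified version that keeps at most one point per voxel of side `leaf`; a production filter (e.g. PCL's VoxelGrid, which inspired this step) averages the points in each voxel instead of keeping the first.

```cpp
#include <cmath>
#include <cstdint>
#include <unordered_set>
#include <vector>

struct Point { float x, y, z; };

// Keep at most one point per (leaf x leaf x leaf) voxel by hashing each
// point's voxel indices and dropping points whose voxel was already seen.
inline std::vector<Point> voxel_downsample(const std::vector<Point>& cloud,
                                           float leaf) {
    std::unordered_set<std::uint64_t> seen;
    std::vector<Point> out;
    for (const Point& p : cloud) {
        // Pack the three voxel indices into one 64-bit key (21 bits each,
        // offset so negative coordinates map to non-negative indices).
        auto idx = [leaf](float c) {
            return static_cast<std::uint64_t>(
                static_cast<std::int64_t>(std::floor(c / leaf)) + (1 << 20));
        };
        std::uint64_t key = (idx(p.x) << 42) | (idx(p.y) << 21) | idx(p.z);
        if (seen.insert(key).second)
            out.push_back(p);
    }
    return out;
}
```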
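The depth-to-pitch mapping can be sketched like this: nearer obstacles get higher-pitched square waves. The frequency range (200–2000 Hz), the depth clamp, and the linear mapping are assumptions for illustration; the writeup only specifies that frequency is based on depth, with the spatial positioning handled separately by the HRTF.

```cpp
#include <cmath>
#include <vector>

// Map a depth in meters to a square-wave frequency: near objects are
// high-pitched, far objects low-pitched. Depths are clamped to [min_d, max_d].
inline double depth_to_frequency(double depth_m,
                                 double min_d = 0.5, double max_d = 5.0,
                                 double f_hi = 2000.0, double f_lo = 200.0) {
    double t = (depth_m - min_d) / (max_d - min_d);  // 0 = near, 1 = far
    t = std::fmin(1.0, std::fmax(0.0, t));
    return f_hi + t * (f_lo - f_hi);
}

// Generate mono square-wave samples at the given frequency; a spatial audio
// layer would then position the source in 3D.
inline std::vector<float> square_wave(double freq, double seconds,
                                      int sample_rate = 44100) {
    std::vector<float> samples(static_cast<std::size_t>(seconds * sample_rate));
    for (std::size_t i = 0; i < samples.size(); ++i) {
        double phase = std::fmod(freq * static_cast<double>(i) / sample_rate, 1.0);
        samples[i] = phase < 0.5 ? 1.0f : -1.0f;
    }
    return samples;
}
```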
Challenges we ran into
We had a lot of problems integrating all of our separate codebases into the final design.
A bug involving proxy types combined with type inference in C++, which only appeared in release builds, was particularly frustrating.
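The exact bug isn't described above, but the textbook instance of this pitfall is `std::vector<bool>`: `operator[]` returns a proxy object rather than a `bool`, so `auto` deduces the proxy type and the variable silently tracks later writes to the container. This sketch is a representative example, not our actual code.

```cpp
#include <vector>

// `auto` deduces std::vector<bool>::reference (a proxy), so the "captured"
// value changes when the container is later mutated.
inline bool snapshot_then_mutate() {
    std::vector<bool> v{true, false};
    auto b = v[0];                // proxy, not a bool copy
    v[0] = false;                 // mutating v changes what `b` observes
    return static_cast<bool>(b);  // false, not the `true` we meant to capture
}

// Spelling out the type forces a real copy, which behaves as expected.
inline bool explicit_copy_then_mutate() {
    std::vector<bool> v{true, false};
    bool b = v[0];                // genuine bool copy
    v[0] = false;
    return b;                     // still true
}
```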
A lot of work went into ensuring that our project was responsive to the wearer's movement. We run many expensive algorithms, but the system cannot lag behind someone quickly moving their head.
Accomplishments that we're proud of
Point orientation correction using IMU sensor readings. This was a huge milestone for us because it meant that our project was actually feasible.
Using pitch, in addition to gain, to convey how far away objects are. This greatly improved the usability of our project.
The structure of the code was well thought out at the start of the project, allowing each team member to work on their own abstraction throughout the entire process.
What we learned
Having a plan for integration from the start is important. We also learned several technical skills, such as OpenAL, clustering, voxel grids, and C++ in general.
What's next for AudioVision
We would like to add the ability to read text via OCR, generate a text-to-speech audio clip, and place that clip in the 3D environment at the position of the text. This would allow the user to read text in the environment that isn't braille, or is too far away for a touch-based text system. Examples include street signs, billboards, building labels, and addresses.