VR/AR technology is becoming mainstream as computer rigs get cheaper, smarter algorithms for video compression and display are implemented, and latency between the headset and the computer improves. One area that has not been explored much is hand and body pose estimation with conventional cameras: complete VR setups typically rely on two IR sensors placed in opposite corners of the room. Eliminating the setup and calibration of those IR sensors is one way to make VR/AR technology more accessible in the future.

What it does

We take image streams from two cameras pointed at the user from different angles and output a single struct containing fused keypoints for each body part.

How we built it

We use a pretrained PoseNet model to estimate poses from each image stream independently. With two camera streams, we get two estimates of the same person's pose from different angles. We then fuse the corresponding keypoints from the two streams using a Kalman filter.
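The fusion step can be sketched as one small Kalman filter per keypoint. This is a minimal illustration under stated assumptions, not the project's actual code: it assumes both cameras' keypoints are already expressed in one shared image frame, uses a constant-position (random-walk) motion model, and scales each camera's measurement noise by the inverse of PoseNet's per-keypoint confidence. The class name `KeypointKalman` and the values of `process_var` and `base_meas_var` are hypothetical choices.

```python
import numpy as np


class KeypointKalman:
    """Constant-position Kalman filter for one 2D keypoint (e.g. a wrist).

    Assumption: measurements from both cameras are already in the same frame.
    """

    def __init__(self, process_var=4.0, base_meas_var=25.0):
        self.q = process_var      # hypothetical frame-to-frame drift, pixels^2
        self.r0 = base_meas_var   # hypothetical noise of a fully confident detection
        self.x = None             # fused position [x, y]
        self.P = None             # 2x2 covariance of the estimate

    def _update(self, z, score):
        z = np.asarray(z, dtype=float)
        # Lower PoseNet confidence -> larger measurement noise -> less weight.
        R = (self.r0 / max(score, 1e-3)) * np.eye(2)
        if self.x is None:
            # Initialise from the first measurement seen.
            self.x, self.P = z, R
            return
        S = self.P + R                    # innovation covariance (H = identity)
        K = self.P @ np.linalg.inv(S)     # Kalman gain
        self.x = self.x + K @ (z - self.x)
        self.P = (np.eye(2) - K) @ self.P

    def step(self, measurements):
        """measurements: one (xy, score) pair per camera, or None if missed."""
        if self.P is not None:
            self.P = self.P + self.q * np.eye(2)  # predict: uncertainty grows
        for m in measurements:
            if m is not None:
                self._update(*m)
        return self.x


kf = KeypointKalman()
# Two cameras report the same keypoint with equal confidence.
fused = kf.step([((100.0, 200.0), 0.9), ((104.0, 198.0), 0.9)])
print(fused)  # equal confidences -> midpoint: [102. 199.]
```

Applying the two camera measurements as sequential updates within one time step is equivalent to a single joint update when their noises are independent, and it lets a frame where one camera misses a keypoint fall back gracefully to the other camera.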

Challenges we ran into

Matching keypoint indices between the two image streams; calibrating the two cameras against each other (we created a synthetic calibration matrix, but in practice this would have to be done with a checkerboard or a disparity map); filtering out other people in the frame; and performance, since the pipeline ran at only 10 fps.
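A minimal sketch of the preprocessing these challenges call for, assuming PoseNet-style output where each pose has a `score` and a list of `keypoints`, each with a `part` name, a `score`, and a `position` with `x`/`y` (shown here as plain Python dicts). It keeps only the highest-scoring person per frame as a crude way to ignore bystanders, pairs keypoints by part name so both streams refer to the same body part, and maps camera B points into camera A's frame with a 3x3 homography. The matrix `H_B_TO_A`, the `min_score` threshold, and all function names are hypothetical stand-ins for the synthetic calibration described above.

```python
import numpy as np

# Hypothetical synthetic calibration: homography from camera B pixels to
# camera A pixels. A real setup would estimate this (e.g. with a checkerboard).
H_B_TO_A = np.array([[1.0, 0.0, -40.0],
                     [0.0, 1.0, 10.0],
                     [0.0, 0.0, 1.0]])


def main_person(poses):
    """Keep only the highest-scoring pose; a crude way to drop other people."""
    return max(poses, key=lambda p: p["score"]) if poses else None


def to_frame_a(xy, H=H_B_TO_A):
    """Apply a 3x3 homography to one (x, y) point."""
    u, v, w = H @ np.array([xy[0], xy[1], 1.0])
    return (float(u / w), float(v / w))


def match_keypoints(pose_a, pose_b, min_score=0.3):
    """Pair keypoints by part name across the two cameras.

    Returns {part: [camera A entry, camera B entry]}, where each entry is
    ((x, y), score) in camera A's frame, or None if that camera missed it.
    """
    pairs = {}
    for cam, pose in enumerate((pose_a, pose_b)):
        if pose is None:
            continue
        for kp in pose["keypoints"]:
            if kp["score"] < min_score:
                continue  # drop low-confidence detections
            xy = (kp["position"]["x"], kp["position"]["y"])
            if cam == 1:
                xy = to_frame_a(xy)
            pairs.setdefault(kp["part"], [None, None])[cam] = (xy, kp["score"])
    return pairs


poses_a = [
    {"score": 0.8, "keypoints": [
        {"part": "nose", "score": 0.9, "position": {"x": 100.0, "y": 50.0}},
        {"part": "leftEye", "score": 0.1, "position": {"x": 95.0, "y": 45.0}},
    ]},
    {"score": 0.4, "keypoints": []},  # a bystander with a lower pose score
]
poses_b = [
    {"score": 0.7, "keypoints": [
        {"part": "nose", "score": 0.8, "position": {"x": 140.0, "y": 40.0}},
    ]},
]
pairs = match_keypoints(main_person(poses_a), main_person(poses_b))
print(pairs["nose"])
```

Matching by part name rather than list position avoids depending on both streams returning keypoints in the same order or count, for example when one detection is dropped by the confidence threshold.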

Accomplishments that we're proud of

We took a step toward making VR/AR technology more portable.

What we learned

Web development, and the nuanced details of building a web application (ours is still in its infancy).

What's next for Multi Sensor Pose Estimation for VR/AR

Shodan.io API integration (we haven't been able to make this work yet) to pull image streams from restaurants or other public places; performance optimization; turning this into a viable service that multiple people can use simultaneously; and testing it with a VR headset (which requires more powerful computers, or starting earlier).
