*Finally, something to do with your VR headset! (Friends sold separately.)* (repo)
Half of our team recently won VR headsets at a previous hackathon. Eventually, the novelty of Beat Saber, Gorilla Tag and other somewhat dubious activities wore off, so we decided to make something ourselves, drawing inspiration from various party games.
3Draw is the latest in a long line of thing-drawing games like skribbl.io and Gartic Phone. We stand out via our chaotic, AI-powered, voice-chat guessing system and largely dysfunctional artistic tools. Try it yourself at 3draw; I'm not going to tell you how it works! (If you have a headset and friends to play with, that is; our demo video will suffice otherwise.)
Usually we'd flex the complicated docker-compose architecture here; instead, we ended up with a pleasantly vanilla project. Thanks to our resident Nix enthusiast, yarn.nix dominated the line count, but the human-written code was (nearly) all HTML, CSS and JS.
We used Cloudflare Tunnels for both testing and deployment, which we had set up in the first 14 seconds of hacking; it was extremely convenient. We used WebXR, interfacing with it through the batteries-included (though dated) A-Frame framework. We used WebRTC to communicate between A-Frame instances via Networked A-Frame (NAF) and an EasyRTC instance. 3D modeling work was done in Blender; we borrowed a castle from here (attribution!). To manage state, we created a chimera of various techniques that could be (generously) described as an amalgamation of MVC (the fattest controllers you've ever seen) and React.
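For the curious, the NAF boilerplate we built on looks roughly like this. It's a minimal sketch rather than our actual scene: the room name, avatar template and geometry are placeholders, the script versions are whatever NAF's docs suggest, and it assumes the matching NAF/EasyRTC server is running.

```html
<html>
  <head>
    <!-- A-Frame plus Networked A-Frame; NAF's easyrtc adapter handles the WebRTC plumbing -->
    <script src="https://aframe.io/releases/1.3.0/aframe.min.js"></script>
    <script src="https://unpkg.com/networked-aframe/dist/networked-aframe.min.js"></script>
  </head>
  <body>
    <!-- audio: true carries the voice chat over the same EasyRTC connection -->
    <a-scene networked-scene="room: drawing-room; adapter: easyrtc; audio: true;">
      <a-assets>
        <!-- every connected player gets an instance of this template -->
        <template id="avatar-template">
          <a-entity>
            <a-sphere color="#5985ff" radius="0.2"></a-sphere>
          </a-entity>
        </template>
      </a-assets>

      <!-- the local player: camera + controls, synced to everyone else via the networked component -->
      <a-entity camera look-controls wasd-controls
                networked="template: #avatar-template; attachTemplateToLocal: false;">
      </a-entity>
    </a-scene>
  </body>
</html>
```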
Voice chat was conveyed over WebRTC (as were most things), but speech-to-text was a nightmare. The original plan was to use the Web Speech API; we thought the webkit-prefixed version would be available, since Oculus headsets are glorified Android devices and their browser is based on Chromium under the hood. Apparently, somewhere along the chain, that assumption broke down. Part of our team spent about 8 hours trying to implement client-side, limited-vocabulary recognition with a TFLite model running on TensorFlow.js, effectively reverse engineering the Web Speech API. We needed a model that would be easy to train and extremely low latency, so leveraging word embeddings was an obvious plan, until it wasn't: object guesses don't appear in the ordinary contexts embeddings are trained on, and it seems that SOTA techniques for limited-vocabulary detection are now much more sophisticated, with pipelines of several models.

As such, we turned to the omniscient GCP Cloud Speech API and had incredible difficulty streaming audio to it with low latency. In fact, the latency for the GCP Cloud Speech API was on the order of minutes. Whether that was due to misconfiguration on our end or issues on Google's end, the world may never know. Additionally, the speech recognition pipeline just wasn't very good.
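For reference, the original plan was barely more than this. A minimal sketch, assuming a browser that actually exposes the (possibly webkit-prefixed) Web Speech API; `checkGuess` is a hypothetical stand-in for our game-side handler:

```js
// The plan: lean on the browser's built-in recogniser and treat every
// (interim or final) transcript as a potential guess.
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;

if (!SpeechRecognition) {
  // This is the branch the Quest browser dumped us into.
  console.warn("No Web Speech API available; falling back to server-side recognition.");
} else {
  const recognition = new SpeechRecognition();
  recognition.continuous = true;      // keep listening for the whole round
  recognition.interimResults = true;  // guesses should feel instant

  recognition.onresult = (event) => {
    for (let i = event.resultIndex; i < event.results.length; i++) {
      const result = event.results[i];
      const transcript = result[0].transcript.trim().toLowerCase();
      // checkGuess (hypothetical) compares the transcript against the current prompt.
      checkGuess(transcript, result.isFinal);
    }
  };

  recognition.start();
}
```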
As for what's next: probably actually playing it in a setting other than (hackathon-induced) stress testing. If it's fun, perhaps we'll flesh it out a bit more; `rm -rf` will suffice otherwise.
I. McGraw et al., “Personalized Speech recognition on mobile devices,” arXiv:1603.03185 [cs], Mar. 2016, Accessed: Feb. 26, 2022. [Online]. Available: http://arxiv.org/abs/1603.03185
P. Warden, “Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition,” arXiv:1804.03209 [cs], Apr. 2018, Accessed: Feb. 26, 2022. [Online]. Available: http://arxiv.org/abs/1804.03209
J. Shor et al., “Towards Learning a Universal Non-Semantic Representation of Speech,” Interspeech 2020, pp. 140–144, Oct. 2020, doi: 10.21437/Interspeech.2020-1242.