Inspiration
Skilled physical workers with years of experience, such as welders, technicians, or assembly operators, are irreplaceable.
But when they fall ill, they can't perform their work, and if their condition worsens, their skill goes unused entirely. The employer, meanwhile, loses that experience on the factory floor.
We felt this deeply both during COVID-19 and the years that came after, and we aimed to solve it with ControVirtual.
What it does
ControVirtual connects a skilled operator wearing a Meta Quest 3S to a pair of LeRobot arms, from anywhere. The operator sees a live, immersive 3D camera feed of the robot's physical environment inside their Quest headset. To minimize onboarding difficulty, the operator controls the arms with natural-language commands; in other words, they can talk to the robot as if instructing a colleague.
Through this, any business with skilled workers can now retain them and their expertise, through illness, mobility limitations, or geographic distance, by investing in nothing more than a few Meta Quest headsets and a few pairs of LeRobot arms.
How we built it
The major components of our development include:
Robot Movement: We used the LeRobot API and behavioral cloning pipeline to train the follower arm from a small set of human demonstrations. The goal was a policy that could generalize to novel object positions rather than replay fixed scripts, while running inference fast enough to feel responsive in real time. We kept the architecture lightweight specifically to minimize latency and reduce the number of episodes needed to get a working deployment.
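At its core, behavioral cloning is supervised regression from observations onto the expert's actions. The sketch below shows a deliberately lightweight version of that idea in PyTorch; the feature and joint dimensions (`obs_dim=64`, `act_dim=6`) and the MLP shape are illustrative assumptions, not our exact architecture:

```python
import torch
import torch.nn as nn

class BCPolicy(nn.Module):
    """Lightweight MLP mapping observation features to joint targets."""
    def __init__(self, obs_dim=64, act_dim=6, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs):
        return self.net(obs)

def bc_train_step(policy, optimizer, obs, expert_actions):
    """One behavioral-cloning step: regress predictions onto expert actions."""
    pred = policy(obs)
    loss = nn.functional.mse_loss(pred, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

policy = BCPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
obs = torch.randn(32, 64)   # batch of demonstration observations
acts = torch.randn(32, 6)   # matching expert joint targets
loss = bc_train_step(policy, opt, obs, acts)
```

A small MLP like this keeps per-step inference well under a millisecond on CPU, which is what makes teleoperation-speed responsiveness feasible.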
Voice Control: Operator commands are captured on-device and processed through wit.ai, which parses natural speech into structured, precise motor commands for the arms.
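wit.ai returns parsed utterances as JSON with ranked `intents` and resolved `entities`. The sketch below shows the kind of translation layer this step implies; the specific intent and entity names (`move_arm`, `direction`, `distance`) are hypothetical, not our actual wit.ai app schema:

```python
def parse_wit_response(resp: dict) -> dict:
    """Convert a wit.ai-style JSON response into a structured arm command.
    Intent and entity names here are illustrative assumptions."""
    if not resp.get("intents"):
        return {"action": "noop"}
    intent = resp["intents"][0]["name"]   # highest-confidence intent first
    entities = resp.get("entities", {})

    def first_value(key, default=None):
        vals = entities.get(key, [])
        return vals[0]["value"] if vals else default

    return {
        "action": intent,
        "direction": first_value("direction:direction"),
        "distance_cm": first_value("distance:distance", 0),
    }

sample = {
    "intents": [{"name": "move_arm", "confidence": 0.97}],
    "entities": {
        "direction:direction": [{"value": "left"}],
        "distance:distance": [{"value": 10}],
    },
}
cmd = parse_wit_response(sample)
# {'action': 'move_arm', 'direction': 'left', 'distance_cm': 10}
```

Keeping this layer thin means the robot side only ever sees small, validated command dicts, never raw speech.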
VR Visualization: The robot's camera feed is streamed in real time to the Meta Quest 3S and rendered as a full 3D scene inside the headset using Unity. With complete spatial visibility of the robot's environment, the operator feels present in the scene and can focus entirely on the task. To keep this stable, we implemented careful frame synchronization in the pipeline connecting the camera to Unity's rendering loop, minimizing drift and latency.
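A common pattern for this kind of synchronization is a latest-frame buffer: the renderer always reads the newest frame, and stale or out-of-order frames are dropped rather than queued, so latency never accumulates. A minimal sketch of that idea (not our exact pipeline code):

```python
import threading

class LatestFrameBuffer:
    """Holds only the most recent camera frame so the renderer never
    falls behind: old frames are dropped instead of queued."""
    def __init__(self):
        self._lock = threading.Lock()
        self._frame = None
        self._stamp = -1.0

    def push(self, frame, timestamp):
        with self._lock:
            if timestamp > self._stamp:   # ignore out-of-order arrivals
                self._frame, self._stamp = frame, timestamp

    def latest(self):
        with self._lock:
            return self._frame, self._stamp

buf = LatestFrameBuffer()
buf.push(b"frame-a", 0.016)
buf.push(b"frame-c", 0.048)
buf.push(b"frame-b", 0.032)   # arrives late and is discarded
frame, ts = buf.latest()      # (b"frame-c", 0.048)
```

Trading occasional dropped frames for a bounded end-to-end delay is usually the right call for teleoperation, where a slightly choppier but current view beats a smooth but lagging one.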
Server: A local Python TCP/IP server bridges the Quest, the wit.ai responses, and the LeRobot API, keeping the entire setup fully local, which minimizes latency and makes the system easier for enterprises to adopt.
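The bridge amounts to a small message router: newline-delimited JSON in from the headset, a dispatch on message type, JSON back out. A self-contained sketch under those assumptions (the message fields and the single-shot server are illustrative, not our production loop):

```python
import json
import socket
import threading

def handle_message(raw: bytes) -> bytes:
    """Route one newline-delimited JSON message; field names are illustrative."""
    msg = json.loads(raw)
    if msg.get("type") == "voice_command":
        reply = {"status": "ok", "echo": msg["text"]}
    else:
        reply = {"status": "error", "reason": "unknown message type"}
    return (json.dumps(reply) + "\n").encode()

def serve_once(host="127.0.0.1"):
    """Accept one client, answer one message, then close (demo only)."""
    srv = socket.socket()
    srv.bind((host, 0))               # port 0 = pick a free ephemeral port
    srv.listen(1)
    port = srv.getsockname()[1]

    def run():
        conn, _ = srv.accept()
        with conn:
            line = conn.makefile("rb").readline()
            conn.sendall(handle_message(line))
        srv.close()

    threading.Thread(target=run, daemon=True).start()
    return port

port = serve_once()
with socket.create_connection(("127.0.0.1", port)) as c:
    c.sendall(b'{"type": "voice_command", "text": "move left"}\n')
    reply = json.loads(c.makefile("rb").readline())
# reply == {"status": "ok", "echo": "move left"}
```

Because everything rides on the local network, round trips stay in the low milliseconds and no frame or command ever leaves the premises.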
Challenges we ran into
The hardest challenge we faced was making our system practical for real-world deployment, such as on a factory floor. Since an enterprise cannot realistically record thousands of episodes for every task, and a pretrained model may not transfer cleanly to every environment, we had to design a training and inference approach that stayed fast while learning from a very small number of episodes.
To solve this, we modified the standard imitation-learning pipeline to learn from as few as 50 episodes, prioritizing diversity over quantity. More specifically, we restructured how demonstrations are sampled during training to maximize coverage of the task space, which allowed the policy to generalize from a small but well-curated dataset.
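One standard way to prioritize coverage over quantity is greedy farthest-point selection over per-episode feature summaries: each new episode chosen is the one farthest from everything already selected. This sketch illustrates that idea (the 8-dimensional episode features are a stand-in; the source does not specify the exact selection method):

```python
import numpy as np

def diverse_subset(features: np.ndarray, k: int) -> list:
    """Greedy farthest-point selection: pick k episodes whose feature
    vectors maximize coverage of the task space."""
    chosen = [0]                          # seed with the first episode
    dists = np.linalg.norm(features - features[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))       # episode farthest from chosen set
        chosen.append(nxt)
        dists = np.minimum(
            dists, np.linalg.norm(features - features[nxt], axis=1)
        )
    return chosen

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 8))    # e.g. start-pose / object-position summaries
picked = diverse_subset(feats, k=50) # indices of 50 well-spread demonstrations
```

The appeal of farthest-point selection here is that it needs no labels or reward signal, only a distance between episodes, yet it sharply reduces redundant demonstrations.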
Accomplishments that we're proud of
We're extremely proud of our VR visualization: a low-latency 3D scene of the robot's environment, rendered entirely from a private local server with no cloud infrastructure. We are also proud of our few-episode imitation-learning approach, which makes the system practical in real-world settings.
What we learned
We gained key insights into training imitation learning and other RL algorithms, in particular how dataset quality and diversity drive performance, and how to build datasets that strike the right balance between the two.
What's next for Team Too
We plan to implement simulations: virtual replicas of the physical environment inside the headset, to onboard and train new workers faster and to let them preview unfamiliar hardware safely. The biggest challenge, we estimate, will be building a headset-rendered replica that faithfully matches every real-world detail.
Built With
- imitation-learning
- natural-language-processing
- python
- pytorch
- tcp-ip
- unity
- wit.ai