Weseeon - See Through Sound, Feel Through Motion

Welcome to Weseeon!

Inspiration

Life isn’t about what you see, it’s about how you see.

We believe that vision doesn’t depend on sight. To move past the surface, we decided to go beyond the obvious. Rather than building mere interfaces, we tried building an experience. A design that isn’t immediately seen but felt.

We see possibilities. We see experience. We see on.

What it does

WeSeeOn transforms table tennis into a non-visual, sensory-driven experience.

Instead of relying on sight, players use haptic feedback and audio cues along with real-time AI guidance. A webcam tracks the player and racket. Based on the player's position relative to a virtual ball, a belt with buzzers guides left and right movement, while a phone used as a racket vibrates in patterns to indicate height. A voice system provides minimal, adaptive coaching.
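The left and right belt guidance can be illustrated with a short sketch. This is only an assumption about how such a cue might be computed, not the project's actual code: the function name `lateral_cue`, the normalized coordinates, and the dead-zone width are all hypothetical.

```python
def lateral_cue(player_x: float, ball_x: float, dead_zone: float = 0.05) -> str:
    """Return which belt buzzer to fire: 'left', 'right', or 'none'.

    Assumptions (hypothetical): both positions are normalized to the frame
    width (0.0 to 1.0), and x increases toward the player's right.
    """
    offset = ball_x - player_x
    if abs(offset) <= dead_zone:
        # Close enough to the virtual ball: no buzzer, the player stays put.
        return "none"
    return "right" if offset > 0 else "left"
```

A dead zone like this keeps the buzzers from flickering between sides when the player is already roughly lined up with the ball.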

A multi-agent system runs in the background to adjust difficulty, track performance, generate feedback, and narrate gameplay. The result is that a blind player can train independently and improve without needing a screen.

How we built it

The system combines computer vision, distributed hardware, real-time communication, and multi-agent AI.

The vision system is built in Python using OpenCV and Ultralytics YOLOv8 pose estimation. It tracks body position for lateral movement and uses HSV colour filtering to detect the phone for height guidance. A custom event-driven state machine controls calibration, target generation, guidance, and hit detection in real time.
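An event-driven state machine over the four phases named above could look roughly like the sketch below. The state names follow the text, but the event names (`calibrated`, `target_ready`, `in_position`, `hit`, `miss`), the transition table, and the class name are hypothetical, chosen only to show the pattern.

```python
from enum import Enum, auto


class State(Enum):
    CALIBRATION = auto()
    TARGET_GENERATION = auto()
    GUIDANCE = auto()
    HIT_DETECTION = auto()


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (State.CALIBRATION, "calibrated"): State.TARGET_GENERATION,
    (State.TARGET_GENERATION, "target_ready"): State.GUIDANCE,
    (State.GUIDANCE, "in_position"): State.HIT_DETECTION,
    (State.HIT_DETECTION, "hit"): State.TARGET_GENERATION,
    (State.HIT_DETECTION, "miss"): State.TARGET_GENERATION,
}


class GameStateMachine:
    def __init__(self) -> None:
        self.state = State.CALIBRATION

    def handle(self, event: str) -> State:
        """Advance on a known event; ignore events that do not apply."""
        self.state = TRANSITIONS.get((self.state, event), self.state)
        return self.state
```

Keeping the transitions in a single table makes it easy to see which events each phase responds to, and events that arrive in the wrong phase are simply ignored.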

The backend is structured using modular Python services with Flask and flask-cors for lightweight API handling and orchestration between components.

The AI system is composed of multiple agents using a modular architecture. We used Fetch.ai uAgents to coordinate agent communication and task execution. An opponent agent adapts difficulty. A coach agent analyzes performance and generates feedback. An umpire agent handles narration. A performance agent computes accuracy, reaction time, and movement speed.
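The three metrics the performance agent computes can be sketched in plain Python. The record layout (`Attempt`), the field names, and the exact definitions below (for example, reaction time as the gap between a cue and the first movement) are assumptions for illustration, not the project's actual formulas.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Attempt:
    """One guided attempt. All fields are hypothetical."""
    hit: bool
    cue_time: float    # seconds: when the guidance cue was issued
    move_start: float  # seconds: when the player started moving
    move_end: float    # seconds: when the player stopped moving
    distance: float    # distance covered, in whatever unit the tracker reports


def summarize(attempts: List[Attempt]) -> Dict[str, float]:
    """Compute accuracy, mean reaction time, and average movement speed."""
    if not attempts:
        return {"accuracy": 0.0, "reaction_time": 0.0, "movement_speed": 0.0}
    accuracy = sum(a.hit for a in attempts) / len(attempts)
    reaction_time = mean(a.move_start - a.cue_time for a in attempts)
    moving_time = sum(a.move_end - a.move_start for a in attempts)
    total_distance = sum(a.distance for a in attempts)
    # Guard against division by zero if no movement was recorded.
    speed = total_distance / moving_time if moving_time > 0 else 0.0
    return {
        "accuracy": accuracy,
        "reaction_time": reaction_time,
        "movement_speed": speed,
    }
```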

We used the Gemini API for feedback generation and reasoning. It processes gameplay data and generates structured coaching insights, natural language summaries, and adaptive suggestions for improvement.

We integrated Qualcomm AI-100 and the Qualcomm QNN SDK to accelerate performance analysis and support efficient computation of metrics such as reaction time and movement accuracy.

Audio output is handled through a hybrid text-to-speech pipeline using ElevenLabs, pyttsx3, and system-level speech engines. This keeps latency low and provides a fallback if one engine fails. On the frontend, the Web Speech API is used for browser-based voice interaction.
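The fallback idea can be shown with a small sketch. It assumes each speech engine is wrapped in a plain callable that raises an exception on failure; the wrappers themselves and the function name `speak_with_fallback` are hypothetical, and no real ElevenLabs or pyttsx3 calls appear here.

```python
from typing import Callable, Iterable, Optional, Tuple

# (engine name, function that speaks the text or raises on failure)
Engine = Tuple[str, Callable[[str], None]]


def speak_with_fallback(text: str, engines: Iterable[Engine]) -> Optional[str]:
    """Try each engine in order; return the name of the first that succeeds."""
    for name, speak in engines:
        try:
            speak(text)
            return name
        except Exception:
            # This engine failed (offline, missing voice, etc.): try the next.
            continue
    return None  # every engine failed
```

In a setup like the one described, the list would be ordered by preference, for example a cloud voice first and a local engine last, so the player still hears a cue when the network drops.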

All devices communicate over MQTT using the HiveMQ public broker with the paho-mqtt client. This enables asynchronous and decoupled communication between the laptop, Arduino, and mobile device.
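A rough sketch of how a message for one device might be built and published is below. The topic prefix, device names, and JSON payload shape are assumptions, not the project's actual scheme. `broker.hivemq.com` on port 1883 is the HiveMQ public broker, and `paho.mqtt.publish.single` is paho-mqtt's one-shot publish helper.

```python
import json
from typing import Tuple

BROKER_HOST = "broker.hivemq.com"  # HiveMQ public broker
BROKER_PORT = 1883
TOPIC_PREFIX = "weseeon/demo"      # hypothetical topic namespace


def build_message(device: str, command: dict) -> Tuple[str, str]:
    """Return (topic, JSON payload) for one device, e.g. 'belt' or 'phone'."""
    topic = f"{TOPIC_PREFIX}/{device}"
    payload = json.dumps(command, separators=(",", ":"))
    return topic, payload


def send(device: str, command: dict) -> None:
    """Publish one message. paho-mqtt is imported lazily so that
    build_message can be used and tested without it installed."""
    import paho.mqtt.publish as publish

    topic, payload = build_message(device, command)
    publish.single(topic, payload, qos=1, hostname=BROKER_HOST, port=BROKER_PORT)
```

Because a public broker is shared by everyone, a unique topic prefix helps avoid collisions with other users' messages.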

The hardware system uses an Arduino UNO Q programmed in C++ to control directional haptic feedback through buzzers. The phone uses the Web Vibration API to generate patterned vibration feedback based on incoming signals.
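The Web Vibration API accepts a list of millisecond durations that alternate between vibrating and pausing. One way to encode height as a pattern is sketched below; the discrete levels (for example 1 for low, 3 for high), the pulse timings, and the function name are assumptions for illustration only.

```python
from typing import List


def height_pattern(level: int, pulse_ms: int = 120, gap_ms: int = 80) -> List[int]:
    """Encode a height level as `level` short pulses.

    The result alternates vibrate and pause durations, [on, off, on, ...],
    which is the shape navigator.vibrate() expects on the phone.
    """
    if level < 1:
        return []
    pattern: List[int] = []
    for i in range(level):
        pattern.append(pulse_ms)
        if i < level - 1:
            pattern.append(gap_ms)  # pause between pulses, none after the last
    return pattern
```

For example, `height_pattern(3)` yields `[120, 80, 120, 80, 120]`, which could be sent to the phone as JSON and passed straight to the vibration call.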

The frontend is built using React, Vite, and modern JavaScript with HTML and CSS. It includes a landing page, a live gameplay interface, and a post-game dashboard. The interface is accessible, with semantic HTML and ARIA support for screen readers.

We used Node.js and npm for dependency management and build tooling. Development was managed using Python virtual environments for reproducibility.

A conversational and automation layer integrates Browser Use and TwelveLabs to support contextual understanding, gameplay summaries, and intelligent feedback generation across agents.

Challenges we ran into

We wanted vibration sensors. We did not have vibration sensors. So we used motors. Then we realized motors were not ideal and switched to buzzers taped to a belt. It worked and also looked slightly questionable.

At one point we needed to open a motor driver and realized no one had a screwdriver. We tried using a pepper spray can, chalk, and anything else that looked remotely sharp. Nothing worked well, but everything added character.

Getting multiple AI agents to cooperate felt like coordinating a group project where everyone is smart but no one listens. Eventually they agreed on something that resembled teamwork.

The coach agent occasionally decided to become overly enthusiastic. It echoed, overlapped, and sometimes sounded like it was delivering a speech to a stadium instead of one player.

Combining real-time vision, hardware, networking, and frontend meant that when something broke, it was never just one thing. It was usually everything at once.

Accomplishments that we're proud of

We built a system that allows table tennis to be played without relying on vision.

We designed a dual-channel haptic feedback system with directional cues on the belt and height cues on the phone.

We integrated multiple AI systems into a single gameplay loop that produces real-time coaching and post-game feedback.

We connected multiple hardware components so they operate together through wireless communication.

The webcam system accurately tracks the player and provides guidance that enables learning and improvement.

What we learned

Hardware takes patience. Software takes understanding. Both take sleep away.

If something does not work, it might be the code. It might be the wiring. It might be the network. It might be all three at the same time.

AI agents are powerful but only after you convince them to behave. Getting them to cooperate is less about intelligence and more about coordination.

Real-time systems do not fail quietly. They fail loudly and all at once, which makes debugging both painful and memorable.

We learned how to connect computer vision, frontend systems, embedded hardware, and multiple AI tools into one working pipeline. It was chaotic but it worked.