Stack

  • Used TensorFlow for model creation, training, and lazy data loading to reduce resource usage
  • Used MediaPipe for pose estimation, producing tensors to feed into the shot classifier models
  • Used YOLO for human detection and, after fine-tuning, for court detection and ball tracking
  • Used NumPy to implement complex mathematical formulas and loss functions
  • Used FastAPI for fast, lightweight endpoints
  • Used OpenCV for homography estimation and video data management
  • Used Boto3 (Cloudflare) to save processed videos to the cloud for long-term storage

Inspiration

I play tennis a lot, and it's a core part of my life. Recently I tried SwingVision, a tennis analytics app, and it was good, but it's very expensive and restrictive. It's also iOS-only, which leaves the majority of Android users without access, and no comparable app exists. So I decided to take matters into my own hands and build a cross-platform, budget-friendly version with many of its core features.

How I built it + challenges I faced

I'll give an explanation of each feature and how it works:

  1. Shot classification: I recorded a few hundred videos of myself hitting various shots in my basement. To build the dataset, I used YOLO to crop out the player, then MediaPipe to extract pose keypoints, and fed those into the model, which outputs forehand, backhand, slice/volley, or serve/overhead. Each stroke is represented as a temporal tensor in ℝ^(T × 33 × 3), where T is the number of frames, 33 is the number of pose landmarks, and each landmark contains (x, y, z) coordinates. This lets the model learn motion dynamics rather than static poses. Instead of averaging frames uniformly, I used an attention-based temporal model to learn which moments in a stroke contribute most to classification, which improves robustness against noisy keypoints and partial occlusions. The approach generalizes well because the defining features of these groundstrokes differ from stroke to stroke but are similar from person to person. It also mitigates MediaPipe producing incorrect keypoints when the player is obscured or occupies only a tiny portion of a high-resolution frame. I originally tried to find online datasets, but was met with poor quality and no luck. To train the models, I used the Adam optimizer with sparse categorical cross-entropy (SCCE) loss to ensure the model isn't just right, it's confident. Paired with shuffling, batching, and L2 (ridge) regularization, these choices gave the model exceptional performance.

Stroke representation


$$
X \in \mathbb{R}^{T \times 33 \times 3}
$$
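As a concrete illustration, here is a minimal NumPy sketch of how per-frame keypoints could be stacked into this tensor. The frame count and random keypoints are stand-ins for the real MediaPipe output, not the actual pipeline:

```python
import numpy as np

# Hypothetical sketch: stack per-frame MediaPipe-style landmarks into the
# stroke tensor X in R^(T x 33 x 3). Each frame contributes 33 (x, y, z) points.
T = 48  # number of frames in one stroke clip (illustrative)
frames = [np.random.rand(33, 3) for _ in range(T)]  # stand-in for real keypoints
X = np.stack(frames, axis=0)
print(X.shape)  # (48, 33, 3)
```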

Attention Mechanism


$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V
$$

Softmax activation


$$
\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}
$$
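The two formulas above can be sketched together in plain NumPy. The sizes below are illustrative, and the random Q, K, V stand in for the learned projections of the pose sequence:

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax over the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V, weights

T, d_k, d_v = 48, 16, 16  # illustrative frame count and dimensions
rng = np.random.default_rng(0)
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))
out, w = attention(Q, K, V)  # each row of w sums to 1
```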

SCCE


$$
\mathcal{L}_{\text{SCCE}} = - \sum_{i=1}^{N} \log \left( \frac{e^{z_{i, y_i}}}{\sum_{j=1}^{C} e^{z_{i,j}}} \right)
$$

L2 weight decay (ridge regularization)


$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{SCCE}} + \lambda \sum_{k} \| W_k \|_2^2
$$
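A NumPy sketch of this combined loss, working from raw logits exactly as the two formulas above do (the logits, labels, and weight matrix are toy values, not the trained model's):

```python
import numpy as np

def scce_l2_loss(logits, labels, weights, lam=1e-4):
    # Sparse categorical cross-entropy from raw logits, plus L2 weight penalty.
    z = logits - logits.max(axis=1, keepdims=True)         # stability shift
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    scce = -log_probs[np.arange(len(labels)), labels].sum()
    l2 = lam * sum(np.sum(W ** 2) for W in weights)
    return scce + l2

logits = np.array([[4.0, 0.1, 0.1, 0.1],   # confident, correct
                   [0.1, 0.1, 0.1, 4.0]])  # confident, correct
labels = np.array([0, 3])
W = [np.ones((3, 3))]                      # toy weight matrix
loss = scce_l2_loss(logits, labels, W, lam=0.0)  # small: confident and correct
```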

ADAM momentum equations

$$
\begin{align*} & m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, &\text{(first moment / momentum)} \\ & v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2, &\text{(second moment / variance)} \\ &\hat{m}_t = \frac{m_t}{1-\beta_1^t}, &\hat{v}_t = \frac{v_t}{1-\beta_2^t} \\ &\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t \end{align*}
$$
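The update equations above translate almost line for line into code. This is a minimal NumPy sketch of a single Adam step, demonstrated on a toy 1-D quadratic rather than the real network:

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # One Adam update, mirroring the momentum equations above.
    m = b1 * m + (1 - b1) * g            # first moment / momentum
    v = b2 * v + (1 - b2) * g ** 2       # second moment / variance
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy demo: minimize f(x) = x^2 starting from x = 1
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    g = 2 * theta                        # gradient of x^2
    theta, m, v = adam_step(theta, g, m, v, t)
```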

  2. Ball tracking: I originally attempted a TrackNet architecture, but ran into empty heatmaps and OOM errors due to the sheer data size and model complexity. Instead, I found a fine-tuned YOLO model on HuggingFace, which I augmented with a Savitzky-Golay filter and a gradient trail to smooth the trajectory. The Savitzky-Golay filter fits a low-degree polynomial over a sliding window via least-squares regression, reducing high-frequency noise while preserving trajectory curvature. This was critical for maintaining realistic ball motion without phase distortion. Raw detection outputs produce volatile jitter due to bounding-box variance; by minimizing local regression error over time, trajectory stability improved significantly. This was one of the most important parts, as it opens up a whole array of new features.

Savitzky-Golay filter


$$
\hat{y}_i = \sum_{k=-m}^{m} c_k y_{i+k}
$$
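The convolution coefficients c_k come from a least-squares polynomial fit, so the whole filter can be sketched in plain NumPy (in practice `scipy.signal.savgol_filter` does this, including edge handling; the parabolic trajectory below is synthetic, not real detection output):

```python
import numpy as np

def savgol_coeffs(window, polyorder):
    # Least-squares convolution coefficients c_k: fit a degree-`polyorder`
    # polynomial over the window and evaluate it at the window center (k = 0).
    m = window // 2
    k = np.arange(-m, m + 1)
    A = np.vander(k, polyorder + 1, increasing=True)  # columns: 1, k, k^2, ...
    return np.linalg.pinv(A)[0]  # row 0 = fitted value at k = 0

def savgol_smooth(y, window=11, polyorder=2):
    # y_hat_i = sum_k c_k * y_{i+k} over interior points; edges left unfiltered.
    c = savgol_coeffs(window, polyorder)
    m = window // 2
    y = np.asarray(y, float)
    out = y.copy()
    for i in range(m, len(y) - m):
        out[i] = c @ y[i - m:i + m + 1]
    return out

# Synthetic ball arc: a parabola plus bounding-box jitter
t = np.arange(60.0)
true_y = -0.02 * (t - 30) ** 2 + 20
noisy = true_y + np.random.default_rng(1).normal(0.0, 0.5, t.size)
smooth = savgol_smooth(noisy)
```

Because the fit is degree 2, a clean parabolic trajectory passes through the filter unchanged, which is exactly the "preserves curvature" property described above.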

  3. Court detection: I first attempted the Hough transform with a Canny edge detector to detect the major court lines, but it was extremely crude and did not generalize. Instead, I fine-tuned a YOLO model on images of tennis courts, with very promising results. One issue remained: YOLO outputs rectangles, not the trapezoids that courts appear as under the camera perspectives typical of match play. Because YOLO outputs axis-aligned bounding boxes, I derived a geometric scaling factor based on height-to-width ratios to approximate the projective compression of the far baseline. This allowed approximate correction of perspective distortion without requiring full homography estimation.

Homography


$$
\mathbf{x'} \sim H \mathbf{x}, \quad H \in \mathbb{R}^{3 \times 3}
$$

  4. 2D court modeling: True homography requires accurate planar correspondences, and monocular depth ambiguity introduced instability in real-world footage; my first attempt, homography with manual court labeling, caused a huge array of problems rooted in exactly that depth-perception issue. So I instead constructed a normalized coordinate mapping from the detected court vertices into a fixed minimap coordinate system: with the ball positions and all four court vertices, I created a "minimap" of the court with the ball traveling inside it. After a stark realization, I decided to attempt homography one more time, since I now had dynamically detected court coordinates and much smoother ball tracking. This worked out well: the perspective warps caused by using plain ratios were almost nonexistent, which drastically increased the accuracy of the ball position on the minimap.

Projection transformation


$$
(x, y) = \left(\frac{X}{Z}, \frac{Y}{Z}\right)
$$
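A minimal NumPy sketch of the court-to-minimap mapping: estimate H from the four detected court corners via the direct linear transform, then project ball positions with the perspective divide above. The pixel coordinates and minimap size are illustrative, and in practice OpenCV's homography routines would do this:

```python
import numpy as np

def homography_dlt(src, dst):
    # Direct Linear Transform: solve for H (up to scale) from 4 point pairs
    # by taking the null space of the stacked constraint matrix via SVD.
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.array(A, float))
    return Vt[-1].reshape(3, 3)

def project(H, pt):
    # x' ~ H x: apply H, then divide by the third (projective) coordinate.
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return np.array([x / w, y / w])

# Hypothetical detected court corners (pixels) -> fixed minimap rectangle
court_px = [(310, 120), (970, 118), (1180, 680), (95, 684)]
minimap = [(0, 0), (100, 0), (100, 200), (0, 200)]
H = homography_dlt(court_px, minimap)
ball_on_map = project(H, (640, 400))  # map a detected ball position
```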

  5. Ball speed estimation: The biggest problem with estimating ball speed is that the camera sees 2D motion, not 3D. To approach this, I first computed a pixels-to-meters conversion factor from the baseline's length in pixels and its known real-world dimensions. I then kept a 60-frame buffer, evaluated velocity over it, cleared the buffer, and repeated. This worked far better than my earlier approach of simply using v = (p_final − p_initial) / Δt.

Pixel to meter conversion


$$
\text{mpp} = \frac{L_{\text{real}}}{L_{\text{pixels}}}
$$

Velocity approximation


$$
v \approx \frac{\|\mathbf{p}_t - \mathbf{p}_{t-\Delta t}\|}{\Delta t}
$$
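Putting the two formulas together, here is a small NumPy sketch of the buffered speed estimate. The frame rate, baseline pixel length, and straight-line ball path are illustrative; the 10.97 m figure is the standard doubles-court baseline width:

```python
import numpy as np

FPS = 60
BASELINE_M = 10.97  # real-world baseline length (doubles court width)

def speed_from_buffer(positions_px, baseline_px, fps=FPS):
    # mpp from the known baseline, then total path length / elapsed time.
    mpp = BASELINE_M / baseline_px
    p = np.asarray(positions_px, float)
    path_px = np.linalg.norm(np.diff(p, axis=0), axis=1).sum()
    seconds = (len(p) - 1) / fps
    return path_px * mpp / seconds

# Illustrative 60-frame buffer: ball moving 5 px/frame along x,
# with the baseline spanning 1097 px (so 1 px = 1 cm).
buf = [(i * 5.0, 300.0) for i in range(60)]
v = speed_from_buffer(buf, baseline_px=1097.0)  # -> 3.0 m/s
```

Summing segment lengths over the buffer, rather than subtracting only the endpoints, keeps curved trajectories from being undercounted.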

What's next

I have been working on 3D reconstruction, but it has proved extremely difficult: only a few research papers have pulled it off, and even then with powerful hardware and teams of PhDs. Monocular 3D reconstruction is fundamentally ill-posed due to depth ambiguity: a 2D trajectory corresponds to infinitely many 3D trajectories without additional constraints. Solving this reliably requires multi-camera systems, stereo vision, or learned depth priors. I have begun building a frontend in React Native, but it is still far from complete.

What we learned

I learned a great deal about how CV systems work in production. They need to be able to handle bad lighting, weird camera angles, and a variety of different test cases. I learned about the necessity of speed over pure performance, as well as the security checks needed, especially when working with providers like Cloudflare which are notorious for high bills if limits are exceeded.

How it could be developed

Since the backend (built with FastAPI) is complete, all that's left is to create a visually appealing UI. The larger problem is servers. Most basic web apps don't require extensive processing, but my pipeline runs multiple YOLO instances, MediaPipe nets, and neural networks while processing 720p footage at 60 fps. Most free-tier hosting providers have extremely limited RAM and even slower CPUs, bottlenecking performance and degrading UX. Alternatives exist that could easily handle this, but they would cost a lot of money.

Targeted users

Anyone could use TennisTracker, from complete beginners to seasoned professionals.
