Qiwu Wen (qiwu.wen@umontreal.ca), 11:49 AM
Project Overview

Our project focuses on robust target tracking in medical videos. The goal: once a user selects and annotates a target in the video (e.g., an anatomical structure or region of interest), the system should continuously maintain accurate annotations across frames, even as the target undergoes translation, rotation, and appearance changes. This is designed for medical imaging settings where stable, consistent labeling over time is important.

To achieve this, we build our pipeline on Cutie, a transformer-based video object segmentation/tracking model introduced at CVPR 2024, and directly use its pretrained weights. This avoids the need for large-scale manual labeling and fine-tuning while still providing strong tracking capability.

Efficiency-Oriented Inference

A key challenge is inference speed: transformer-based models are computationally expensive, especially on high-resolution medical videos. To reach near real-time performance without a large drop in tracking quality, we combine three optimizations that reinforce one another.

1. ROI (Region of Interest) Inference. Instead of processing the full frame, we run Cutie only on a cropped region around the target. This reduces the number of pixels/tokens processed per frame and directly lowers compute cost.

2. Resolution Reduction (Internal Downscaling). We reduce Cutie's internal processing resolution, further decreasing the per-frame workload. This provides a strong speed boost while keeping the tracking output stable for the target region.

3. LK Optical Flow with Keyframe Correction. For additional speed, we use Lucas-Kanade (LK) optical flow to propagate the annotation between frames as a lightweight approximation. Optical flow alone often fails when the target leaves the field of view or after heavy occlusion, because it cannot "re-detect" the target.
To address this, we use Cutie-generated masks as keyframes: Cutie periodically re-estimates the target mask and re-anchors the tracking, allowing recovery after drift or after the target re-enters the scene. Additionally, the optical-flow-predicted location dynamically defines the ROI, so Cutie's keyframe inference runs on an even smaller, more focused region, further reducing Cutie's processing area and improving overall throughput.

By combining Cutie (pretrained transformer tracking), ROI cropping, internal downscaling, and LK optical flow with Cutie keyframe re-anchoring, our system achieves near real-time tracking while maintaining strong robustness to common motion and appearance changes in medical video.
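As an illustration, the ROI cropping and internal downscaling steps might look like the minimal NumPy sketch below. The function names (`roi_from_mask`, `downscale`), the margin value, and the stride-based resize are illustrative assumptions, not our actual implementation:

```python
import numpy as np

def roi_from_mask(mask: np.ndarray, margin: int = 32):
    """Bounding box of the current target mask, padded by a pixel margin."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:  # target lost: fall back to the full frame
        return 0, 0, mask.shape[0], mask.shape[1]
    y0 = max(int(ys.min()) - margin, 0)
    x0 = max(int(xs.min()) - margin, 0)
    y1 = min(int(ys.max()) + margin + 1, mask.shape[0])
    x1 = min(int(xs.max()) + margin + 1, mask.shape[1])
    return y0, x0, y1, x1

def downscale(frame: np.ndarray, factor: int = 2):
    """Naive strided downscaling, standing in for the model's internal resize."""
    return frame[::factor, ::factor]

# Crop first, then downscale, so the expensive model sees the fewest pixels.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
mask = np.zeros((480, 640), dtype=bool)
mask[200:260, 300:380] = True  # toy target region

y0, x0, y1, x1 = roi_from_mask(mask)
roi = downscale(frame[y0:y1, x0:x1])  # this small tensor is what Cutie would process
```

Cropping before downscaling compounds the savings: the model's per-frame cost scales with the pixel count, and both steps shrink it multiplicatively.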
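The LK-propagation-plus-keyframe loop can be sketched as follows. `lk_propagate` and `cutie_keyframe` are placeholders for the real components (in practice, `cv2.calcOpticalFlowPyrLK` and Cutie inference on the cropped ROI), and `KEYFRAME_EVERY` is an assumed tunable, so this shows only the scheduling logic:

```python
import numpy as np

KEYFRAME_EVERY = 10  # re-anchor with Cutie every N frames (assumed value)

def lk_propagate(prev_pts, prev_frame, frame):
    """Placeholder for Lucas-Kanade flow; here identity motion keeps the
    sketch dependency-free."""
    return prev_pts.copy()

def cutie_keyframe(frame, roi):
    """Placeholder for Cutie inference on the ROI; returns anchor points
    (here just the ROI centre)."""
    y0, x0, y1, x1 = roi
    return np.array([[(y0 + y1) / 2.0, (x0 + x1) / 2.0]], dtype=np.float32)

def roi_around(pts, shape, margin=48):
    """Dynamic ROI centred on the flow-predicted points."""
    y0 = max(int(pts[:, 0].min()) - margin, 0)
    x0 = max(int(pts[:, 1].min()) - margin, 0)
    y1 = min(int(pts[:, 0].max()) + margin, shape[0])
    x1 = min(int(pts[:, 1].max()) + margin, shape[1])
    return y0, x0, y1, x1

def track(frames, init_pts):
    pts, out, prev = init_pts, [], frames[0]
    for i, frame in enumerate(frames):
        if i % KEYFRAME_EVERY == 0:
            # Flow prediction defines a tight ROI; Cutie re-anchors inside it.
            roi = roi_around(pts, frame.shape[:2])
            pts = cutie_keyframe(frame, roi)
        else:
            pts = lk_propagate(pts, prev, frame)  # cheap frame-to-frame step
        prev = frame
        out.append(pts)
    return out
```

The key design point is that the cheap flow step runs every frame while the expensive Cutie step runs only on keyframes, and the flow prediction shrinks the region Cutie must process on those keyframes.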