Bug404 - UI Bug Detection Model

What it does

Bug404 is an object detection model that detects unexpected UI anomalies directly from screenshots. Unlike traditional pipelines that only validate against user-defined test cases, Bug404 automatically highlights defects that testers may not have anticipated. For example, it can flag a missing image elsewhere on the page even when the primary test case (e.g., adding an item to a cart) passes. By applying object detection to screenshots, Bug404 adds a layer of assurance on top of scripted UI automation tests.

How we built it

Data collection

We began with the publicly available OwlEye UI-bug dataset. Because OwlEye’s labeling scheme did not align with our target taxonomy, we re-annotated ~2,000 images in CVAT, drawing bounding boxes to match our definitions. The annotation effort took ~6 hours. We then exported the labels in Pascal VOC XML for our primary model and in YOLO format for the baseline.
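
To illustrate how the two export formats relate, here is a minimal sketch that converts one Pascal VOC annotation into YOLO label lines. CVAT can export both formats directly, so this is not necessarily how our labels were produced. The class names in `CLASSES` and the sample annotation are hypothetical placeholders, not Bug404's actual taxonomy.

```python
import xml.etree.ElementTree as ET

# Hypothetical class list; the real Bug404 categories are not listed here.
CLASSES = ["missing_image", "text_overlap"]


def voc_to_yolo(xml_text, classes=CLASSES):
    """Convert one Pascal VOC annotation (XML string) into YOLO label lines."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    img_w = float(size.find("width").text)
    img_h = float(size.find("height").text)

    lines = []
    for obj in root.iter("object"):
        class_id = classes.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin = float(box.find("xmin").text)
        ymin = float(box.find("ymin").text)
        xmax = float(box.find("xmax").text)
        ymax = float(box.find("ymax").text)
        # VOC stores absolute pixel corners; YOLO stores the box center and
        # size, each normalized by the image width or height.
        x_center = (xmin + xmax) / 2 / img_w
        y_center = (ymin + ymax) / 2 / img_h
        width = (xmax - xmin) / img_w
        height = (ymax - ymin) / img_h
        lines.append(
            f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"
        )
    return lines


# Made-up annotation for a 400x800 screenshot with a single box.
SAMPLE_XML = """<annotation>
  <size><width>400</width><height>800</height><depth>3</depth></size>
  <object>
    <name>missing_image</name>
    <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>300</xmax><ymax>600</ymax></bndbox>
  </object>
</annotation>"""

print(voc_to_yolo(SAMPLE_XML))
```

The key difference is that VOC keeps pixel corner coordinates while YOLO keeps normalized center coordinates, so the same re-annotated boxes can feed both the primary model and the baseline.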

Model training

Given time and compute constraints, we trained on the NUS PC Cluster rather than locally. All experiments ran on an NVIDIA A40 (48 GB of GDDR6 memory). With a batch size of 32, each epoch completed in ~1.5 minutes; training the four backbones consumed ~10 hours of total GPU time. We reduced wall-clock time by parallelizing execution and by downscaling inputs to 416x416.
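
Downscaling a screenshot also means rescaling its bounding boxes. The sketch below assumes a plain stretch to 416x416 with no letterboxing, which the write-up does not specify; `resize_boxes` and the example sizes are illustrative names and values only.

```python
def resize_boxes(boxes, orig_size, new_size=(416, 416)):
    """Scale (xmin, ymin, xmax, ymax) pixel boxes to match a resized image.

    orig_size and new_size are (width, height). This assumes the image is
    stretched to new_size, so x and y are scaled independently.
    """
    orig_w, orig_h = orig_size
    new_w, new_h = new_size
    return [
        (
            xmin * new_w / orig_w,
            ymin * new_h / orig_h,
            xmax * new_w / orig_w,
            ymax * new_h / orig_h,
        )
        for xmin, ymin, xmax, ymax in boxes
    ]


# A made-up box on a 1080x1920 phone screenshot.
print(resize_boxes([(270, 480, 540, 960)], orig_size=(1080, 1920)))
```

A stretch like this changes the aspect ratio of tall phone screenshots; padding to a square before resizing would be an alternative if distortion hurts detection.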

Architecture

We built Bug404 on top of Faster R-CNN, a proven object detection framework. To improve detection coverage while keeping inference speed practical, we:

Used four CNN backbones (ResNet-18, ResNet-34, ResNet-50, and ResNet-101) for multi-level feature extraction, and used ResNet-101 for the RoI head (classification).
Ran a Region Proposal Network (RPN) on each feature map in parallel to generate candidate bounding boxes efficiently.
Merged the proposals and applied Soft Non-Maximum Suppression (Soft-NMS), which lowers the scores of overlapping boxes instead of discarding them outright, to retain the most confident detections.
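
The merge-then-suppress step can be sketched as follows. This assumes the Gaussian variant of Soft-NMS and plain Python lists; the variant, the `sigma` and `score_thresh` values, and the branch names, boxes, and scores are all illustrative assumptions rather than Bug404's actual settings.

```python
import math


def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping boxes, drop only tiny ones."""
    candidates = list(zip(boxes, scores))
    kept = []
    while candidates:
        # Take the highest-scoring remaining box.
        best_idx = max(range(len(candidates)), key=lambda i: candidates[i][1])
        best_box, best_score = candidates.pop(best_idx)
        kept.append((best_box, best_score))
        # Decay every other score by how much its box overlaps the chosen one.
        decayed = []
        for box, score in candidates:
            score *= math.exp(-(iou(best_box, box) ** 2) / sigma)
            if score >= score_thresh:
                decayed.append((box, score))
        candidates = decayed
    return kept


# Made-up proposals from two hypothetical branches, as (box, score) pairs.
proposals_by_branch = {
    "resnet18": [((10, 10, 110, 110), 0.90)],
    "resnet50": [((12, 12, 112, 112), 0.80), ((200, 200, 260, 260), 0.70)],
}
merged = [p for branch in proposals_by_branch.values() for p in branch]
detections = soft_nms([b for b, _ in merged], [s for _, s in merged])
for box, score in detections:
    print(box, round(score, 3))
```

Compared with hard NMS, near-duplicate boxes from different backbones are down-weighted rather than deleted, so a final confidence threshold decides what is reported.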

This architecture lets Bug404 run inference relatively quickly and localize UI anomalies without requiring manual prompts or scripted test cases.
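
As a rough sketch of how the branches could run side by side, the example below uses a thread pool with stand-in functions in place of the real backbone and RPN models. The use of threads, the `make_branch` helper, and all proposals are assumptions for illustration; the write-up does not describe the actual parallelization mechanism.

```python
from concurrent.futures import ThreadPoolExecutor


def make_branch(name, proposals):
    """Build a hypothetical stand-in for one backbone + RPN branch."""
    def run(image):
        # A real branch would run the backbone and RPN on `image` here.
        return [(name, box, score) for box, score in proposals]
    return run


# Made-up branches and proposals; the real models are not part of this sketch.
BRANCHES = {
    "resnet18": make_branch("resnet18", [((10, 10, 110, 110), 0.90)]),
    "resnet34": make_branch("resnet34", [((11, 9, 108, 112), 0.85)]),
    "resnet50": make_branch("resnet50", [((12, 12, 112, 112), 0.80)]),
    "resnet101": make_branch("resnet101", [((200, 200, 260, 260), 0.70)]),
}


def propose_in_parallel(image, branches):
    """Run every branch on the same image concurrently and merge the proposals."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {name: pool.submit(run, image) for name, run in branches.items()}
        merged = []
        for name in branches:  # fixed iteration order keeps the output deterministic
            merged.extend(futures[name].result())
    return merged


merged = propose_in_parallel(image=None, branches=BRANCHES)
print(len(merged))
```

When the branches truly overlap in time, wall-clock latency approaches that of the slowest branch instead of the sum of all four. With Python threads this only holds if the underlying model code releases the interpreter lock, as native GPU kernels commonly do.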

Challenges we ran into

Data availability: There are very few open-source datasets dedicated to UI bugs, so we had to perform our own annotations. In fact, part of our dataset comes from re-annotating OwlEye’s data to better fit our task.
Time limitations: With only three days, ensuring everything ran smoothly was difficult. Each training run took several hours, even on a rented high-performance GPU, leaving limited room for iteration.
Generalization: We trained on a dataset of ~1,900 images with an unbalanced distribution of UI fault types. Most of the samples came from older Android apps, so our model may struggle to generalize to modern apps with updated UI designs or to other platforms such as iOS or web applications.

Accomplishments that we're proud of

Building a model that goes beyond scripted testing, catching anomalies testers did not explicitly define.
Surpassing strong baselines such as YOLOv11 and OwlEye in both precision and recall, despite limited time and resources:
- Compared to OwlEye, our model achieved better results even though we only used about half of their total faulty UI screenshot data.
- Compared to YOLOv11 trained on the exact same dataset, our model consistently delivered superior performance in both precision and recall.
Implementing parallel inference across the backbones, so the runtime is effectively bounded by the slowest branch (the largest backbone, ResNet-101) rather than by the sum of all four.
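
For context on the precision and recall comparison in this list, here is a minimal sketch of how the two metrics are commonly computed for detections on a single image and class. The greedy matching rule and the 0.5 IoU threshold are assumptions for illustration and may differ from the evaluation protocol behind the reported comparison; the boxes are made up.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predictions, ground_truth, iou_thresh=0.5):
    """predictions: list of (box, score); ground_truth: list of boxes."""
    matched = set()
    true_pos = 0
    # Match predictions in descending score order; each ground-truth box
    # can be claimed by at most one prediction.
    for box, _score in sorted(predictions, key=lambda p: p[1], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt_box in enumerate(ground_truth):
            if idx in matched:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_thresh:
            matched.add(best_idx)
            true_pos += 1
    precision = true_pos / len(predictions) if predictions else 0.0
    recall = true_pos / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Made-up example: one correct detection, one false alarm, one missed bug.
truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.8)]
print(precision_recall(preds, truth))
```

Precision penalizes false alarms and recall penalizes missed bugs, so a model has to improve on both to beat a baseline in the sense described above.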

What we learned

Extracting multi-scale features and aggregating RPN outputs significantly improves anomaly coverage while keeping inference time feasible.
UI bug detection is as much a data challenge as a technical one: high-quality datasets with clear guidelines on bug categories and bounding box definitions are crucial.

What's next for Bug404 - UI Bug Detection Model

We see several promising directions to extend Bug404:

Better datasets: Build or contribute to large-scale, high-quality UI anomaly datasets to strengthen training.
Ensemble learning: Explore model ensembles to push precision and recall even further.
Vision Transformers: Incorporate vision transformers for richer contextual understanding of UI layouts.
Industry application: Integrate Bug404 into TikTok’s QA infrastructure to accelerate testing and catch unexpected issues at scale.
