Inspiration
The task was 102-way fine-grained image classification, ranked by Macro F1 on a hidden test set. We were inspired by the fact that the evaluation metric is not accuracy: Macro F1 averages per-class F1, so every class counts equally. Formally, Macro F1 = (1/K) * sum over k of F1^(k), where F1^(k) = 2 * precision^(k) * recall^(k) / (precision^(k) + recall^(k)). That means minority classes matter as much as majority ones. We wanted to build a pipeline that explicitly targeted this metric and treated class imbalance as a first-class concern.
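As a quick illustration of why the metric choice matters, here is a toy comparison using scikit-learn's f1_score (toy labels only, not the actual hidden evaluation code):

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 2]   # toy labels with a minority class (1)
y_pred = [0, 0, 0, 0, 2]   # class 1 is missed entirely

# Macro F1 averages per-class F1, so the missed minority class drags the score down
print(f1_score(y_true, y_pred, average="macro"))  # ~0.62
# Micro F1 (accuracy-like) barely notices the miss
print(f1_score(y_true, y_pred, average="micro"))  # 0.8
```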
What it does
The project trains a fine-grained image classifier over 102 categories (Category 1: with pretrained models). Given an image, it predicts a single label in 1 to 102. The system is optimized for Macro F1 (the competition metric): we save the checkpoint with the best validation Macro F1 and use class-balanced sampling and loss weighting so that all classes contribute. At inference, predict.py loads the saved model, preprocesses images (Resize, CenterCrop, normalize), and optionally uses light test-time augmentation (e.g. center crop + horizontal flip, average logits) while staying within the 10-minute limit.
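A minimal sketch of that eval-time preprocessing (the resize and crop sizes here are our assumption based on EfficientNet-B3's 300x300 native resolution, not necessarily the exact values in predict.py):

```python
from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

eval_transform = transforms.Compose([
    transforms.Resize(320),       # resize the shorter side
    transforms.CenterCrop(300),   # EfficientNet-B3's native input resolution
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```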
How we built it
Baseline: EfficientNet-B3 pretrained on ImageNet, with the final layer replaced for 102 classes. Training used AdamW, cosine learning-rate decay, and standard augmentation (RandomResizedCrop, horizontal flip, ColorJitter, rotation). We added fixed random seeds and a pinned requirements.txt for reproducibility.
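A minimal sketch of this baseline with torchvision (the learning rate, weight decay, schedule length, and augmentation strengths shown here are illustrative placeholders, not our exact values):

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

NUM_CLASSES = 102

# Pretrained EfficientNet-B3 with the final layer replaced for 102 classes
model = models.efficientnet_b3(weights=models.EfficientNet_B3_Weights.IMAGENET1K_V1)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, NUM_CLASSES)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(300),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.2, 0.2, 0.2),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
```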
Metric-driven training: We compute validation Macro F1 every epoch and save the checkpoint with the best validation Macro F1 (not best accuracy). Early stopping triggers when Macro F1 does not improve for several epochs.
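A sketch of that checkpointing and early-stopping loop (train_one_epoch and evaluate are assumed helper functions, and the patience and epoch count are illustrative):

```python
import torch
from sklearn.metrics import f1_score

best_f1, epochs_without_improvement, PATIENCE = 0.0, 0, 5

for epoch in range(30):
    train_one_epoch(model, train_loader)                  # assumed helper
    val_preds, val_labels = evaluate(model, val_loader)   # assumed helper returning label lists

    val_f1 = f1_score(val_labels, val_preds, average="macro")
    if val_f1 > best_f1:
        best_f1 = val_f1
        torch.save(model.state_dict(), "best_model.pt")   # save by best Macro F1, not accuracy
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= PATIENCE:
            break                                          # early stopping
```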
Class balancing: We use sklearn's compute_class_weight("balanced", ...) for the loss and WeightedRandomSampler with compute_sample_weight("balanced", ...) so batches are more balanced across the 102 classes.
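A sketch of how those pieces fit together (train_labels and train_dataset are assumed to exist already; the batch size is illustrative):

```python
import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight
from torch.utils.data import DataLoader, WeightedRandomSampler

# Class-weighted cross-entropy: rare classes get larger loss weights
class_weights = compute_class_weight("balanced", classes=np.unique(train_labels), y=train_labels)
criterion = torch.nn.CrossEntropyLoss(weight=torch.tensor(class_weights, dtype=torch.float))

# Balanced sampling: rare-class images are drawn more often per epoch
sample_weights = compute_sample_weight("balanced", train_labels)
sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(sample_weights),
    replacement=True,
)
train_loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)
```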
Freeze–then–unfreeze: Phase 1 trains only the classifier (backbone frozen) for a few epochs with a higher LR; phase 2 unfreezes the backbone and uses discriminative learning rates (e.g. backbone LR = 1e-5, head LR = 1e-4) with a short warmup and cosine decay.
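A sketch of the two phases using torchvision's EfficientNet attribute names (model.features for the backbone, model.classifier for the head); the learning rates follow the example values above and the phase lengths are omitted:

```python
import torch

# model is the EfficientNet-B3 from the baseline sketch above

# Phase 1: freeze the backbone, train only the new classifier head with a higher LR
for p in model.features.parameters():
    p.requires_grad = False
head_optimizer = torch.optim.AdamW(model.classifier.parameters(), lr=1e-3)
# ... train a few epochs with head_optimizer ...

# Phase 2: unfreeze the backbone and use discriminative learning rates
for p in model.features.parameters():
    p.requires_grad = True
optimizer = torch.optim.AdamW([
    {"params": model.features.parameters(), "lr": 1e-5},    # backbone
    {"params": model.classifier.parameters(), "lr": 1e-4},  # head
])
```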
MixUp and CutMix: Each training batch is either MixUp or CutMix (e.g. 50/50). For MixUp, x_mix = lambda * x_i + (1 - lambda) * x_j with lambda from Beta(alpha, alpha); for CutMix, a random patch is pasted from another image. The loss is L = lambda * L(f(x_mix), y_a) + (1 - lambda) * L(f(x_mix), y_b). Inference stays standard (no mixing).
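A sketch of the MixUp branch for one batch (CutMix uses the same two-term loss with a pasted patch instead of a blend; alpha and the names images, labels, and criterion are assumptions, with criterion being the class-weighted cross-entropy from above):

```python
import numpy as np
import torch

def mixup_batch(x, y, alpha=0.4):
    lam = np.random.beta(alpha, alpha)                    # mixing coefficient from Beta(alpha, alpha)
    perm = torch.randperm(x.size(0), device=x.device)     # pair each image with a random partner
    x_mix = lam * x + (1 - lam) * x[perm]                 # convex combination of the two images
    return x_mix, y, y[perm], lam

x_mix, y_a, y_b, lam = mixup_batch(images, labels)
logits = model(x_mix)
loss = lam * criterion(logits, y_a) + (1 - lam) * criterion(logits, y_b)
```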
Inference: predict.py loads the saved state dict, applies the same preprocessing, and optionally uses light TTA (center + flip, average logits).
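A sketch of that TTA step (batch is an already-preprocessed image tensor; the final +1 assumes the dataset labels are 1-indexed while model outputs are 0-indexed):

```python
import torch

model.eval()
with torch.no_grad():
    logits = model(batch)                             # center-crop view
    logits_flip = model(torch.flip(batch, dims=[3]))  # horizontally flipped view (width dim)
    preds = ((logits + logits_flip) / 2).argmax(dim=1) + 1   # average logits, map to labels 1..102
```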
Challenges we ran into
Val accuracy vs Macro F1: Early on, validation accuracy was high but the best checkpoint by accuracy was not the best by Macro F1. We had to switch to saving by validation Macro F1 so the submitted model matched the leaderboard metric.
Overfitting after the best epoch: With more epochs, training loss kept decreasing while validation Macro F1 dropped. We addressed this with early stopping, stronger regularization (weight decay, label smoothing), and MixUp/CutMix so the best saved model is from the right point in training.
Balancing 102 classes: With possible class imbalance, we combined WeightedRandomSampler and class-weighted cross-entropy so that no class was effectively ignored. Tuning the freeze duration and phase-2 learning rates also mattered for stability and final Macro F1.
Reproducibility: The guidelines required fully reproducible runs. We set seeds for random, numpy, and torch (and CUDA when used) and pinned dependency versions in requirements.txt.
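A sketch of that seeding (the seed value itself is arbitrary):

```python
import random
import numpy as np
import torch

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
```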
Accomplishments that we're proud of
Reaching strong validation performance (e.g. ~92% validation accuracy in an earlier setup) and then explicitly optimizing for Macro F1 so the submitted model is aligned with the evaluation metric. Implementing a full pipeline with class-balanced sampling, a freeze–then–unfreeze schedule, MixUp/CutMix, metric-driven checkpointing, and early stopping, all while keeping inference simple and within the 10-minute limit. Delivering a reproducible submission (fixed seeds, pinned requirements.txt) that complies with Category 1 (pretrained models, no external image data, no LLMs).
What we learned
Optimize what you're evaluated on: saving the best model by validation Macro F1 (not accuracy) made the submitted model directly target the leaderboard metric.
Class balance helps Macro F1: class-weighted cross-entropy plus WeightedRandomSampler improved per-class recall and thus Macro F1.
Freeze–then–unfreeze is effective: training the head first with a higher LR (e.g. 1e-3), then unfreezing the backbone with lower LRs, gave faster and more stable improvement than training everything from the start.
MixUp and CutMix are strong regularizers: they reduced overfitting and improved generalization in our experiments.
What's next for Track 1 Classification
Try EfficientNet-B4 or B5 for more capacity if compute allows, and update predict.py to load the same architecture. Experiment with Focal Loss in addition to (or instead of) class-weighted cross-entropy for harder examples. Optionally switch to timm for more model and augmentation options, with matching changes in predict.py and requirements.txt.
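For reference, a minimal sketch of a class-weighted focal loss we might try (the gamma value is illustrative, and this is not part of the current submission):

```python
import torch
import torch.nn.functional as F

class FocalLoss(torch.nn.Module):
    def __init__(self, gamma=2.0, weight=None):
        super().__init__()
        self.gamma = gamma
        self.weight = weight  # optional per-class weights, as in weighted cross-entropy

    def forward(self, logits, targets):
        log_probs = F.log_softmax(logits, dim=1)
        pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1).exp()   # true-class probability
        ce = F.nll_loss(log_probs, targets, weight=self.weight, reduction="none")
        return ((1 - pt) ** self.gamma * ce).mean()   # down-weight easy, well-classified examples
```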
Built With
- adamw
- efficientnet-b3
- numpy
- pandas
- pillow
- python
- pytorch
- scikit-learn
- torchvision