Object Detection

Definition

The task of predicting axis-aligned bounding boxes and class labels for all instances of target object categories in an image; requires solving localization and classification jointly.

Intuition

Unlike classification (one label per image), detection must find where each object is. YOLO-style models predict boxes directly from grid cells; anchor boxes provide shape priors; IoU measures localization quality; NMS removes duplicate detections.

Formal Description

Bounding box parameterization: (x, y, w, h) = center coordinates plus width/height, normalized to image or grid-cell dimensions.

Intersection over Union (IoU):

IoU(A, B) = area(A ∩ B) / area(A ∪ B)

The threshold is typically 0.5 for a “correct” detection; used in both the loss and NMS.
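The IoU formula above can be sketched directly for corner-format boxes (a minimal implementation; production code would vectorize this over many boxes):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    # Intersection rectangle (clamped to zero when the boxes don't overlap)
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```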

Anchor boxes: predefined box shapes (aspect ratios) at each grid cell position; each grid cell predicts multiple boxes (one per anchor); box prediction is a delta from the anchor shape; improves detection of elongated objects (cars, pedestrians).

YOLO (You Only Look Once): divide the image into an S × S grid; each cell predicts B boxes (one per anchor), each box has 4 coordinates + an objectness confidence + C class probabilities; output tensor: S × S × B × (5 + C); single forward pass — fast enough for real-time.
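The output tensor shape follows directly from the grid layout; for example, a 13 × 13 grid with 5 anchors and 20 classes (the YOLOv2/Pascal VOC configuration) gives a 13 × 13 × 125 tensor:

```python
def yolo_output_shape(S, B, C):
    """Shape of a YOLO-style output tensor: one (5 + C)-vector per anchor
    per grid cell, where 5 = 4 box coordinates + 1 objectness score."""
    return (S, S, B * (5 + C))
```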

Non-Maximum Suppression (NMS):

  1. Discard boxes with objectness confidence below a threshold
  2. For each class, sort remaining by confidence
  3. Greedily keep the highest-confidence box
  4. Remove all remaining boxes with IoU ≥ threshold against the kept box
  5. Repeat
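The five steps above can be sketched as a greedy loop (a minimal single-class version; per-class NMS runs this once per class, and the thresholds here are typical defaults, not fixed values):

```python
def nms(boxes, scores, iou_threshold=0.5, score_threshold=0.25):
    """Greedy NMS over corner-format boxes (x1, y1, x2, y2); returns kept indices."""
    def iou(a, b):
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0
    # Steps 1-2: discard low-confidence boxes, sort the rest by confidence
    order = sorted((i for i, s in enumerate(scores) if s >= score_threshold),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    while order:                      # Step 5: repeat until no boxes remain
        best = order.pop(0)           # Step 3: keep the highest-confidence box
        keep.append(best)
        # Step 4: suppress boxes that overlap the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```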

Loss function (YOLO-style): a sum of three terms — localization loss (MSE on box coordinates), confidence loss (BCE on objectness), and classification loss (BCE or CE).
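The three terms can be sketched for a single object-containing cell (a simplified, scalar version with a hypothetical signature; real implementations vectorize over the whole output tensor and weight the terms, e.g. up-weighting localization and down-weighting no-object confidence):

```python
import math

def yolo_loss(pred_box, true_box, pred_obj, true_obj,
              pred_cls, true_cls, lambda_coord=5.0):
    """YOLO-style loss for one cell: weighted MSE localization
    + BCE objectness + BCE classification."""
    eps = 1e-7
    def bce(p, t):
        p = min(max(p, eps), 1.0 - eps)  # clamp for numerical stability
        return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    # Localization: MSE on the 4 box coordinates, weighted by lambda_coord
    loc = lambda_coord * sum((p - t) ** 2 for p, t in zip(pred_box, true_box))
    # Confidence: BCE on the objectness score
    conf = bce(pred_obj, true_obj)
    # Classification: BCE on the per-class probabilities
    cls = sum(bce(p, t) for p, t in zip(pred_cls, true_cls))
    return loc + conf + cls
```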

Applications

Autonomous driving (pedestrian/vehicle detection), surveillance, medical imaging (lesion detection), retail (inventory counting).

Trade-offs

  • YOLO trades accuracy for speed vs. two-stage detectors (Faster R-CNN)
  • NMS is sequential and hard to parallelize
  • Anchor box design is domain-specific (sizes must match typical object scales)
  • Recent anchor-free methods (FCOS, DETR) avoid anchor design