Auxiliary structures, namely a convolutional predictor and an additional convolutional feature extractor, are used following the work by Liu et al. [5].

1) Training objective: The loss function L is a weighted sum of two terms: (1) the localization loss Lloc, a Smooth L1 loss between the predicted and the ground-truth bounding-box parameters, and (2) the confidence loss Lconf, a softmax loss over the class confidences. For more rigorous details, refer to [5]; a minimal sketch of this objective appears below.

2) Data augmentation: To train a detector that is robust to various object sizes, we use random cropping (the size of each sampled patch is in [0.5, 1] of the original image size, with a fixed aspect ratio) and random flipping, which yield a consistent improvement. Following [5], we also sample patches so that the minimum jaccard overlap with the objects is one of {0.1, 0.3, 0.5, 0.7, 0.9}; a sketch of this crop sampling also appears below. Each sampled patch is then resized to a fixed size and subjected to photometric distortions in brightness, contrast, and saturation.
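The training objective described in 1) can be illustrated with a short sketch. This is not the authors' code: it assumes a PyTorch setting, and the function name multibox_loss, the weight alpha, and the normalization by the number of matched boxes follow the convention of [5]. Hard negative mining, which [5] applies to the confidence term, is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def multibox_loss(loc_preds, loc_targets, conf_preds, conf_targets, alpha=1.0):
    """Sketch of L = (Lconf + alpha * Lloc) / N, with N the number of
    matched (positive) boxes, following the convention of [5].

    loc_preds:    (B, P, 4) predicted box offsets
    loc_targets:  (B, P, 4) encoded ground-truth box offsets
    conf_preds:   (B, P, C) class scores
    conf_targets: (B, P)    class indices, 0 = background
    """
    pos = conf_targets > 0                        # positive (matched) boxes
    num_pos = pos.sum().clamp(min=1).float()      # avoid division by zero

    # Localization loss: Smooth L1 over positive boxes only.
    loc_loss = F.smooth_l1_loss(loc_preds[pos], loc_targets[pos], reduction="sum")

    # Confidence loss: softmax cross-entropy over all boxes
    # (hard negative mining from [5] is omitted here).
    conf_loss = F.cross_entropy(conf_preds.reshape(-1, conf_preds.size(-1)),
                                conf_targets.reshape(-1), reduction="sum")

    return (conf_loss + alpha * loc_loss) / num_pos
```

In practice, loc_targets would hold box coordinates encoded relative to the matched default boxes, as described in [5].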
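The crop sampling in 2) can be sketched as follows. This is an illustrative sketch, not the augmentation code of [5]: the names sample_crop and jaccard, the max_trials retry budget, and the fall-back to the whole image are assumptions, and the "minimum overlap with the objects" constraint is read here as every object overlapping the crop by at least the chosen threshold. The subsequent random flip, resize to a fixed size, and photometric distortions (brightness, contrast, saturation) are standard image operations and are not shown.

```python
import random

def jaccard(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def sample_crop(img_w, img_h, gt_boxes, max_trials=50):
    """Sample a crop whose side lengths are in [0.5, 1] of the image (fixed
    aspect ratio) and whose minimum jaccard overlap with the objects meets a
    threshold drawn from {0.1, 0.3, 0.5, 0.7, 0.9}."""
    min_iou = random.choice([0.1, 0.3, 0.5, 0.7, 0.9])
    for _ in range(max_trials):
        scale = random.uniform(0.5, 1.0)               # crop size in [0.5, 1] of the image
        w, h = int(scale * img_w), int(scale * img_h)  # same scale -> fixed aspect ratio
        x = random.randint(0, img_w - w)
        y = random.randint(0, img_h - h)
        crop = (x, y, x + w, y + h)
        # Accept the crop if every ground-truth box overlaps it by at least
        # min_iou; implementations of [5] vary in how they check this.
        if all(jaccard(crop, box) >= min_iou for box in gt_boxes):
            return crop
    return (0, 0, img_w, img_h)                        # fall back to the whole image
```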