A traffic light detector must show reliable performance in real time and work for both small (e.g., 3×9 pixels) and large objects, with low false positive and low false negative rates while maintaining high detection accuracy. For example, a falsely detected red light will cause the autonomous vehicle to stop abruptly while driving, whereas a missed red light will cause the vehicle to drive through an intersection against a red signal. In this coarse-grained traffic light detection step, we focus on reducing the false negative (FN) rate, i.e., on collecting as many true traffic lights as possible.

We utilize the Single Shot MultiBox Detector (SSD) [5], which has been shown to be an effective tool for object detection. Note that we use the SSD architecture, which has shown better detection accuracy on other benchmarks than the YOLO network architecture used in the existing work by Behrendt et al. [1]. A more modern architecture, such as Mask R-CNN [6], may provide better detection accuracy, but we leave this comparison for future work.

The SSD model is based on a convolutional network: it takes the whole image as input and predicts a fixed-size collection of bounding boxes and corresponding confidence scores for the presence of object instances in those boxes. The final detections are then produced by a non-maximum suppression step: all detection boxes are sorted by their predicted scores, the detection with the maximum score is selected, and other detections with significant overlap are suppressed. As depicted in Figure 2, we use a standard VGG-16 network architecture [7] as the base convolutional network, pre-trained on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset [8]. Auxiliary structures (a convolutional predictor and additional convolutional feature extractors) are used following the work by Liu et al. [5].

1) Training objective: The loss function L (= Lloc + Lconf) is a weighted sum of two types of loss: (1) the localization loss Lloc is a Smooth L1 loss between the predicted and the ground-truth bounding box parameters, and (2) the confidence loss Lconf is a softmax loss over multiple class confidences. For more rigorous details, refer to [5].

2) Data augmentation: To train a detector robust to various object sizes, we use random cropping (the size of each sampled patch is [0.5, 1] of the original image size with a fixed aspect ratio) and flipping, which yield consistent improvement. Following [5], we also sample patches so that the minimum Jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9. Each sampled patch is then resized to a fixed size, followed by photometric distortions of brightness, contrast, and saturation.
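The following is a minimal sketch of the non-maximum suppression step described above, assuming boxes are given as (x1, y1, x2, y2) NumPy arrays; the function name and the IoU threshold value are illustrative assumptions, not values reported in this work.

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.45):
    """Keep the highest-scoring boxes, suppressing overlapping detections."""
    order = np.argsort(scores)[::-1]          # sort detections by predicted score, descending
    keep = []
    while order.size > 0:
        best = order[0]                        # detection with the maximum score
        keep.append(best)
        rest = order[1:]
        # intersection of the selected box with all remaining boxes
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter)
        # suppress detections with significant overlap with the selected one
        order = rest[iou <= iou_threshold]
    return keep
```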
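The training objective L = Lloc + Lconf can be sketched as below, here in PyTorch; the weighting term alpha, the tensor layout, and the omission of SSD's hard negative mining are simplifications for illustration, not the exact formulation of [5].

```python
import torch
import torch.nn.functional as F

def ssd_loss(loc_preds, loc_targets, conf_preds, conf_targets, alpha=1.0):
    """loc_preds/loc_targets: (N, num_boxes, 4) box offsets;
    conf_preds: (N, num_boxes, num_classes) class scores;
    conf_targets: (N, num_boxes) class indices (0 = background)."""
    pos = conf_targets > 0                       # default boxes matched to a ground-truth object
    num_pos = pos.sum().clamp(min=1).float()

    # (1) localization loss: Smooth L1 between predicted and ground-truth box offsets,
    #     computed only over positive (matched) boxes
    l_loc = F.smooth_l1_loss(loc_preds[pos], loc_targets[pos], reduction='sum')

    # (2) confidence loss: softmax (cross-entropy) over the class confidences
    l_conf = F.cross_entropy(conf_preds.reshape(-1, conf_preds.size(-1)),
                             conf_targets.reshape(-1), reduction='sum')

    # weighted sum, normalized by the number of matched default boxes
    return (l_conf + alpha * l_loc) / num_pos
```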
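A minimal sketch of the patch sampling used for data augmentation is given below, assuming images are (H, W, 3) NumPy arrays and ground-truth boxes are (x1, y1, x2, y2) tuples; the helper names, the overlap check against any single object, and the fallback to the full image are assumptions for illustration. Box clipping to the patch boundary and the subsequent resizing, flipping, and photometric distortions are omitted for brevity.

```python
import random
import numpy as np

MIN_JACCARD_OPTIONS = [None, 0.1, 0.3, 0.5, 0.7, 0.9]   # None = keep the original image

def jaccard(box_a, box_b):
    """Jaccard overlap (IoU) between two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def sample_patch(image, gt_boxes, max_trials=50):
    """Randomly crop a patch whose size is in [0.5, 1] of the original image
    (fixed aspect ratio) and whose Jaccard overlap with at least one object
    meets a randomly chosen minimum; fall back to the full image otherwise."""
    h, w = image.shape[:2]
    min_overlap = random.choice(MIN_JACCARD_OPTIONS)
    if min_overlap is None:
        return image, gt_boxes
    for _ in range(max_trials):
        scale = random.uniform(0.5, 1.0)                 # [0.5, 1] of original size
        new_w, new_h = int(scale * w), int(scale * h)    # fixed aspect ratio
        x1 = random.randint(0, w - new_w)
        y1 = random.randint(0, h - new_h)
        patch = (x1, y1, x1 + new_w, y1 + new_h)
        if any(jaccard(patch, b) >= min_overlap for b in gt_boxes):
            cropped = image[y1:y1 + new_h, x1:x1 + new_w]
            shifted = [(bx1 - x1, by1 - y1, bx2 - x1, by2 - y1)
                       for bx1, by1, bx2, by2 in gt_boxes]
            return cropped, shifted
    return image, gt_boxes
```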