Differences
This shows you the differences between two versions of the page.
Previous revision: Member:sungbeanJo_tran [2021/01/17 14:40] sungbean
Current revision: Member:sungbeanJo_tran [2021/01/18 16:21] sungbean
Current revision (2021/01/18 16:21):

Choosing scales and aspect ratios for default boxes. To handle different object scales, some methods [4,9] suggest processing the image at different sizes and combining the results afterwards. However, by utilizing feature maps from several different layers in a single network for prediction, we can mimic the same effect while also sharing parameters across all object scales. Previous works [10,11] have shown that using feature maps from the lower layers can improve semantic segmentation quality, because the lower layers capture more fine details of the input objects. Similarly, [12] showed that adding global context pooled from a feature map can help smooth the segmentation results. Motivated by these methods, we use both the lower and upper feature maps for detection. Figure 1 shows two exemplar feature maps (8×8 and 4×4) which are used in the framework. In practice, we can use many more with small computational overhead.

Feature maps from different levels within a network are known to have different (empirical) receptive field sizes [13]. Fortunately, within the SSD framework, the default boxes do not necessarily need to correspond to the actual receptive fields of each layer. We design the tiling of default boxes so that specific feature maps learn to be responsive to particular scales of the objects. Suppose we want to use m feature maps for prediction. The scale of the default boxes for each feature map is computed as:
- | was utilized in the existing work by Behrendt et al. [1]. More | + | |
- | modern architecture, such as Mask R-CNN [6], may provide | + | |
- | better detection accuracy, but we leave this comparison for | + | |
- | future work. The SSD model is based on a convolutional | + | |
- | network and takes the whole image as an input and predicts | + | |
- | a fixed-size collection of bounding boxes and corresponding | + | |
- | confident scores for the presence of object instances in | + | |
- | those boxes. The final detections are then produced followed | + | |
- | by a non-maximum suppression step – all detection boxes | + | |
- | are sorted on the basis of their predicted scores, and the | + | |
- | detections with maximum score is then selected, while other | + | |
- | detections with a significant overlap are suppressed. As we | + | |
- | described in Figure 2, we use a standard VGG-16 network | + | |
- | architecture [7] as a base convolutional network, which is | + | |
- | pre-trained on ImageNet Large Scale Visual Recognition | + | |
- | Challenge (ILSVRC) dataset [8]. Auxiliary structures – convolutional predictor and the additional convolutional feature | + | |
- | extractor – are used following the work by Liu et al. [5]. | + | |
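The non-maximum suppression step described above can be sketched as follows (a minimal NumPy version; the IoU threshold of 0.45 is an illustrative value, not one specified in the text):

<code python>
import numpy as np

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy non-maximum suppression.

    boxes:  (N, 4) array of [x1, y1, x2, y2] detections
    scores: (N,) array of predicted confidence scores
    Returns the indices of the detections that are kept.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # sort by score, highest first
    keep = []
    while order.size > 0:
        i = order[0]                        # highest-scoring remaining detection
        keep.append(int(i))
        # IoU of the selected box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # suppress detections that overlap the selected box too much
        order = order[1:][iou <= iou_threshold]
    return keep
</code>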
1) Training objective: The loss function L (= Lloc + Lconf) is a weighted sum of two types of loss: (1) the localization loss Lloc measures a Smooth L1 loss between the predicted and the ground-truth bounding box in a feature space, and (2) the confidence loss Lconf is a softmax loss over the class confidences. For more rigorous details, refer to [5].
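For reference, the weighted sum as written in [5] is

$$ L(x, c, l, g) = \frac{1}{N}\left( L_\text{conf}(x, c) + \alpha \, L_\text{loc}(x, l, g) \right), $$

where N is the number of matched default boxes (the loss is set to 0 when N = 0) and the weight term α is set to 1 by cross validation.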
2) Data augmentation: To train a detector that is robust to various object sizes, we use random cropping (the size of each sampled patch is [0.5, 1] of the original image size with a fixed aspect ratio) and flipping, which yield a consistent improvement. Following [5], we also sample patches so that the minimum Jaccard overlap with the objects is in {0.1, 0.3, 0.5, 0.7, 0.9}. Note that each sampled patch is then resized to a fixed size, followed by photometric distortions with respect to brightness, contrast, and saturation.
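A minimal sketch of this patch-sampling scheme (illustrative only: the helper names, retry count, and 300-pixel output size are assumptions, and the ground-truth boxes are not re-clipped to the crop as a full implementation would require):

<code python>
import random
import numpy as np

def jaccard_overlap(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def sample_patch(image, gt_boxes, min_overlap):
    """Crop a patch whose Jaccard overlap with at least one object is >= min_overlap."""
    h, w = image.shape[:2]
    for _ in range(50):                               # bounded number of retries
        scale = random.uniform(0.5, 1.0)              # patch is [0.5, 1] of the image size
        pw, ph = int(w * scale), int(h * scale)       # fixed aspect ratio
        x, y = random.randint(0, w - pw), random.randint(0, h - ph)
        patch_box = [x, y, x + pw, y + ph]
        if any(jaccard_overlap(patch_box, b) >= min_overlap for b in gt_boxes):
            return image[y:y + ph, x:x + pw]
    return image                                      # fall back to the whole image

def augment(image, gt_boxes, out_size=300):
    min_overlap = random.choice([0.1, 0.3, 0.5, 0.7, 0.9])
    patch = sample_patch(image, gt_boxes, min_overlap)
    if random.random() < 0.5:                         # random horizontal flip
        patch = patch[:, ::-1]
    # resize to a fixed input size (nearest-neighbour indexing keeps the sketch dependency-free)
    ys = np.linspace(0, patch.shape[0] - 1, out_size).astype(int)
    xs = np.linspace(0, patch.shape[1] - 1, out_size).astype(int)
    patch = patch[ys][:, xs]
    # photometric distortions (brightness, contrast, saturation jitter) would follow here
    return patch
</code>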