Differences
This shows you the differences between two versions of the page.
Member:sungbeanJo_tran [2021/01/17 14:34] sungbean |
Member:sungbeanJo_tran [2021/01/18 16:21] (current) sungbean |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | A traffic light detector must perform reliably in real time and work for both small | + | Choosing scales and aspect ratios for default boxes. To handle different object scales, |
- | (i.e., 3x9 pixels) and large objects with low false positive and | + | some methods [4,9] suggest processing the image at different sizes and combining the |
- | low false negative rates, while maintaining a high detection | + | results afterwards. However, by utilizing feature maps from several different layers in a |
- | accuracy. For example, a false red traffic light will lead the | + | single network for prediction we can mimic the same effect, while also sharing parameters across all object scales. Previous works [10,11] have shown that using feature maps |
- | autonomous vehicle to abruptly stop while driving, while | + | from the lower layers can improve semantic segmentation quality because the lower |
- | a missed red light will cause the vehicle to go through an | + | layers capture more fine details of the input objects. Similarly, [12] showed that adding |
- | intersection on its route while the light is red. | + | global context pooled from a feature map can help smooth the segmentation results. |
- | In this coarse-grained traffic light detection step, we focus | + | Motivated by these methods, we use both the lower and upper feature maps for detection. Figure 1 shows two exemplar feature maps (8×8 and 4×4) which are used in the |
- | on reducing the false negative (FN) rate, i.e., on collecting as many true | + | framework. In practice, we can use many more with small computational overhead. |
- | traffic lights as possible. We utilize the Single-Shot multi-box | + | Feature maps from different levels within a network are known to have different |
- | Detector (SSD) [5], which has been shown to be an effective | + | (empirical) receptive field sizes [13]. Fortunately, within the SSD framework, the default boxes do not necessarily need to correspond to the actual receptive fields of each |
- | tool for an object detection task. Note that we use the SSD | + | layer. We design the tiling of default boxes so that specific feature maps learn to be |
- | architecture, which has shown better detection accuracy on | + | responsive to particular scales of the objects. Suppose we want to use m feature maps |
- | other benchmarks than the YOLO network architecture, which | + | for prediction. The scale of the default boxes for each feature map is computed as: |
- | was utilized in the existing work by Behrendt et al. [1]. More | + | |
- | modern architectures, such as Mask R-CNN [6], may provide | + | |
- | better detection accuracy, but we leave this comparison for | + | |
- | future work. The SSD model is based on a convolutional | + | |
- | network and takes the whole image as an input and predicts | + | |
- | a fixed-size collection of bounding boxes and corresponding | + | |
- | confidence scores for the presence of object instances in | + | |
- | those boxes. The final detections are then produced by | + | |
- | a non-maximum suppression step: all detection boxes | + | |
- | are sorted by their predicted scores, and the | + | |
- | detection with the maximum score is selected, while other | + | |
- | detections with a significant overlap are suppressed. As | + | |
- | shown in Figure 2, we use a standard VGG-16 network | + | |
- | architecture [7] as a base convolutional network, which is | + | |
- | pre-trained on ImageNet Large Scale Visual Recognition | + |