Differences

This shows you the differences between two versions of the page.


Member:sungbeanJo_tran [2021/01/17 14:40]
sungbean
Member:sungbeanJo_tran [2021/01/18 16:21] (current)
sungbean
Line 1: Line 1:
-A traffic light detector must show reliable performance in real time and work for both small (i.e., 3x9 pixels) and large objects with low false positive and low false negative rates, while maintaining high detection accuracy. For example, a false red traffic light will lead the autonomous vehicle to stop abruptly while driving, while a missed red light will cause the vehicle to drive through an intersection that has a red light on its course. In this coarse-grained traffic light detection step, we focus on reducing the false negative (FN) rate, i.e., on collecting as many true traffic lights as possible. We utilize the Single-Shot multi-box Detector (SSD) [5], which has been shown to be an effective tool for object detection. Note that we use the SSD architecture, which has shown better detection accuracy on other benchmarks than the YOLO architecture used in the existing work by Behrendt et al. [1]. More modern architectures, such as Mask R-CNN [6], may provide better detection accuracy, but we leave this comparison for future work. The SSD model is based on a convolutional network; it takes the whole image as input and predicts a fixed-size collection of bounding boxes and corresponding confidence scores for the presence of object instances in those boxes. The final detections are then produced by a non-maximum suppression step: all detection boxes are sorted by their predicted scores, the detection with the maximum score is selected, and other detections with significant overlap are suppressed. As described in Figure 2, we use a standard VGG-16 architecture [7] as the base convolutional network, pre-trained on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset [8]. Auxiliary structures, a convolutional predictor and an additional convolutional feature extractor, are used following the work by Liu et al. [5].
-1) Training objective: The loss function L (= Lloc + Lconf) is a weighted sum of two losses: (1) the localization loss Lloc measures a Smooth L1 loss between the predicted and the ground-truth bounding boxes in a feature space; (2) the confidence loss Lconf is a softmax loss over multiple class confidences. For more rigorous details, refer to [5].
-2) Data augmentation: To train a detector robust to various object sizes, we use random cropping (the size of each sampled patch is [0.5, 1] of the original image size with a fixed aspect ratio) and flipping, which yield a consistent improvement. Following [5], we also sample patches so that the minimum Jaccard overlap with the objects is one of {0.1, 0.3, 0.5, 0.7, 0.9}. Each sampled patch is then resized to a fixed size, followed by photometric distortions of brightness, contrast, and saturation.
+Choosing scales and aspect ratios for default boxes: To handle different object scales, some methods [4,9] suggest processing the image at different sizes and combining the results afterwards. However, by utilizing feature maps from several different layers in a single network for prediction, we can mimic the same effect while also sharing parameters across all object scales. Previous works [10,11] have shown that using feature maps from the lower layers can improve semantic segmentation quality, because the lower layers capture more fine details of the input objects. Similarly, [12] showed that adding global context pooled from a feature map can help smooth the segmentation results. Motivated by these methods, we use both the lower and upper feature maps for detection. Figure 1 shows two exemplar feature maps (8×8 and 4×4) which are used in the framework. In practice, we can use many more with small computational overhead.
+Feature maps from different levels within a network are known to have different (empirical) receptive field sizes [13]. Fortunately, within the SSD framework, the default boxes do not necessarily need to correspond to the actual receptive fields of each layer. We design the tiling of default boxes so that specific feature maps learn to be responsive to particular scales of the objects. Suppose we want to use m feature maps for prediction. The scale of the default boxes for feature map k is computed as s_k = s_min + (s_max - s_min)/(m - 1) * (k - 1), k in [1, m], where s_min is 0.2 and s_max is 0.9.
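
As a concrete illustration of the scale formula above, here is a minimal Python sketch, using the values s_min = 0.2, s_max = 0.9 and the aspect-ratio set {1, 2, 3, 1/2, 1/3} from [5]; the function names are hypothetical, not code from this page.

<code python>
import math

def default_box_scales(m, s_min=0.2, s_max=0.9):
    """Scale s_k of the default boxes for each of the m feature maps,
    spaced linearly between s_min and s_max as in Liu et al. [5]."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

def default_box_shapes(s_k, aspect_ratios=(1.0, 2.0, 3.0, 0.5, 1.0 / 3.0)):
    """Width and height of a default box for each aspect ratio a_r:
    w = s_k * sqrt(a_r), h = s_k / sqrt(a_r)."""
    return [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in aspect_ratios]

# With m = 6 feature maps this yields scales 0.2, 0.34, 0.48, 0.62, 0.76, 0.9.
for k, s in enumerate(default_box_scales(6), start=1):
    print(k, round(s, 2), [tuple(round(v, 2) for v in wh) for wh in default_box_shapes(s)[:2]])
</code>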
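The removed revision describes non-maximum suppression only in prose; the sketch below shows the greedy procedure it refers to, assuming boxes given as (x1, y1, x2, y2) tuples and an illustrative overlap threshold of 0.5, not the detector's actual implementation.

<code python>
def iou(a, b):
    """Jaccard overlap (intersection-over-union) of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and
    suppress the remaining boxes that overlap it significantly."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
</code>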
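The removed revision names the two loss terms without writing them out; as defined in Liu et al. [5], which it cites for details, the full training objective is

$$ L(x, c, l, g) = \frac{1}{N}\left(L_{\mathrm{conf}}(x, c) + \alpha\, L_{\mathrm{loc}}(x, l, g)\right), $$

where N is the number of matched default boxes, x the match indicators between default boxes and ground truth, c the class confidences, l the predicted box offsets, g the ground-truth boxes, and alpha a weight set to 1 by cross-validation in [5].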
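The patch-sampling strategy in the removed revision's data-augmentation paragraph can be sketched as follows, reusing the iou helper from the NMS sketch above; the retry limit and whole-image fallback are illustrative assumptions, not the authors' training code.

<code python>
import random

def sample_patch(img_w, img_h, min_overlap, objects, max_trials=50):
    """Sample a crop of [0.5, 1] of the image size (fixed aspect ratio)
    whose Jaccard overlap with at least one object box reaches
    min_overlap (one of {0.1, 0.3, 0.5, 0.7, 0.9}); uses iou() above."""
    for _ in range(max_trials):
        scale = random.uniform(0.5, 1.0)
        w, h = img_w * scale, img_h * scale  # same aspect ratio as the image
        x, y = random.uniform(0, img_w - w), random.uniform(0, img_h - h)
        patch = (x, y, x + w, y + h)
        if any(iou(patch, obj) >= min_overlap for obj in objects):
            return patch
    return (0.0, 0.0, float(img_w), float(img_h))  # fallback: whole image
</code>

A horizontal flip and the photometric distortions (brightness, contrast, saturation) would then be applied to the resized patch.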