Differences
This shows you the differences between two versions of the page.
Previous revision: Member:sungbeanJo_tran [2021/01/17 14:35] sungbean
Current revision: Member:sungbeanJo_tran [2021/01/18 16:21] sungbean
Removed (previous revision):

Auxiliary structures – convolutional predictor and the additional convolutional feature extractor – are used following the work by Liu et al. [5].

1) Training objective: The loss function L (= Lloc + Lconf) is a weighted sum of two types of loss: (1) the localization loss Lloc measures a Smooth L1 loss between the predicted and the ground-truth bounding box in a feature space, and (2) the confidence loss Lconf is a softmax loss over multiple class confidences. For more rigorous details, refer to [5].

2) Data augmentation: To train a detector robust to various object sizes, we use random cropping (the size of each sampled image is [0.5, 1] of the original image size with a fixed aspect ratio) and flipping, which yield consistent improvement. Following [5], we also sample an image so that the minimum Jaccard overlap with the objects is in {0.1, 0.3, 0.5, 0.7, 0.9}. Note that each sampled image is then resized to a fixed size, followed by photometric distortions of brightness, contrast, and saturation.

Added (current revision):

Choosing scales and aspect ratios for default boxes: To handle different object scales, some methods [4,9] suggest processing the image at different sizes and combining the results afterwards. However, by utilizing feature maps from several different layers in a single network for prediction, we can mimic the same effect while also sharing parameters across all object scales. Previous works [10,11] have shown that using feature maps from the lower layers can improve semantic segmentation quality, because the lower layers capture more fine details of the input objects. Similarly, [12] showed that adding global context pooled from a feature map can help smooth the segmentation results. Motivated by these methods, we use both the lower and upper feature maps for detection. Figure 1 shows two exemplar feature maps (8×8 and 4×4) which are used in the framework. In practice, we can use many more with small computational overhead.

Feature maps from different levels within a network are known to have different (empirical) receptive field sizes [13]. Fortunately, within the SSD framework, the default boxes do not necessarily need to correspond to the actual receptive fields of each layer. We design the tiling of default boxes so that specific feature maps learn to be responsive to particular scales of the objects. Suppose we want to use m feature maps for prediction. The scale of the default boxes for each feature map is computed as:
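The passage breaks off just before the scale equation. In the SSD paper this passage follows (Liu et al.), the per-map scale s_k is spaced linearly between a minimum and maximum scale, `s_k = s_min + (s_max − s_min) / (m − 1) · (k − 1)` for k in [1, m]; a minimal sketch, assuming the paper's default values s_min = 0.2 and s_max = 0.9:

```python
def default_box_scales(m, s_min=0.2, s_max=0.9):
    """Scale s_k of the default boxes for each of m feature maps,
    linearly spaced from s_min to s_max (SSD paper defaults)."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1)
            for k in range(1, m + 1)]
```

With m = 6 this gives the lowest-level map scale 0.2 and the highest-level map scale 0.9, so each feature map specializes to one band of object sizes.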
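The removed passage describes the objective L = Lloc + Lconf with a Smooth L1 localization term. A minimal sketch, assuming the standard elementwise Smooth L1 definition and a hypothetical weighting factor `alpha`:

```python
def smooth_l1(x):
    # Smooth L1: quadratic near zero, linear for |x| >= 1,
    # which is less sensitive to outliers than a plain L2 penalty.
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def detection_loss(l_loc, l_conf, alpha=1.0):
    # Weighted sum of the localization and confidence losses,
    # mirroring L = Lconf + alpha * Lloc in the passage above.
    return l_conf + alpha * l_loc
```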
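The sampling constraint in the removed data-augmentation passage is stated in terms of minimum Jaccard overlap (intersection over union); a minimal sketch for axis-aligned boxes given as (xmin, ymin, xmax, ymax):

```python
def jaccard(a, b):
    # Intersection-over-union of two axis-aligned boxes
    # in (xmin, ymin, xmax, ymax) form.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A sampled crop would be kept only if its overlap with some ground-truth box meets the chosen threshold from {0.1, 0.3, 0.5, 0.7, 0.9}.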