Assume that each observation $o = \langle i, m \rangle$ comprises an image $i$ and a low-dimensional vector $m$ that we refer to as measurements, following Dosovitskiy and Koltun [9]. The controller $F$ is represented by a deep network. The network takes the image $i$, the measurements $m$, and the command $c$ as inputs, and produces an action $a$ as its output. The action space can be discrete, continuous, or a hybrid of these. In our driving experiments, the action space is continuous and two-dimensional: steering angle and acceleration. The acceleration can be negative, which corresponds to braking or driving backwards. The command $c$ is a categorical variable represented by a one-hot vector.

We study two approaches to incorporating the command $c$ into the network. The first architecture is illustrated in Figure 3(a). The network takes the command as an input, alongside the image and the measurements. These three inputs are processed independently by three modules: an image module $I(i)$, a measurement module $M(m)$, and a command module $C(c)$. The image module is implemented as a convolutional network, and the other two modules as fully-connected networks. The outputs of these modules are concatenated into a joint representation:

$$j = J(i, m, c) = \langle I(i), M(m), C(c) \rangle.$$
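The following is a minimal sketch, in plain Python, of how the command-input joint representation $j$ could be assembled. The helper names `one_hot` and `joint_representation` and the stub modules in the usage example are hypothetical stand-ins: in an actual model, $I$ would be a convolutional network, $M$ and $C$ fully-connected networks, and the lists would be tensors.

```python
from typing import Callable, List

Vector = List[float]


def one_hot(command: int, num_commands: int) -> Vector:
    """Encode a categorical command index as a one-hot vector."""
    if not 0 <= command < num_commands:
        raise ValueError(f"command {command} is outside [0, {num_commands})")
    return [1.0 if k == command else 0.0 for k in range(num_commands)]


def joint_representation(
    image_module: Callable[[object], Vector],
    measurement_module: Callable[[Vector], Vector],
    command_module: Callable[[Vector], Vector],
    image: object,
    measurements: Vector,
    command: Vector,
) -> Vector:
    """j = J(i, m, c) = <I(i), M(m), C(c)>: each input is processed by its own
    module and the three outputs are concatenated into one flat vector."""
    return image_module(image) + measurement_module(measurements) + command_module(command)


if __name__ == "__main__":
    # Hypothetical stubs standing in for the learned modules.
    image_stub = lambda img: [0.5, -0.25]   # in place of the conv net I(i)
    identity = lambda v: list(v)            # in place of the FC nets M(m) and C(c)
    c = one_hot(2, 4)                       # e.g. command index 2 out of 4
    j = joint_representation(image_stub, identity, identity, None, [2.0], c)
    print(j)  # [0.5, -0.25, 2.0, 0.0, 0.0, 1.0, 0.0]
```

In this form the command contributes only a few extra coordinates of $j$; whatever consumes $j$ is free to weight them as much or as little as it likes.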
The control module, implemented as a fully-connected network, takes this joint representation and outputs an action $A(j)$. We refer to this architecture as command input. It is applicable to both continuous and discrete commands of arbitrary dimensionality. However, the network is not forced to take the commands into account, which can lead to suboptimal performance in practice.

We therefore designed an alternative architecture, shown in Figure 3(b). The image and measurement modules are as described above, but the command module is removed. Instead, we assume a discrete set of commands $\mathcal{C} = \{c_0, \ldots, c_K\}$ (including a default command $c_0$ corresponding to no specific command given) and introduce a specialist branch $A_k$ for each command $c_k$. The command $c$ acts as a switch that selects which branch is used at any given time. The output of the network is thus

$$F(i, m, c_k) = A_k(J(i, m)),$$

where $J(i, m) = \langle I(i), M(m) \rangle$ is the joint representation of the image and the measurements. We refer to this architecture as branched. The branches $A_k$ are forced to learn sub-policies that correspond to different commands. In a driving scenario, one branch might specialize in lane following, another in right turns, and a third in left turns. All branches share the perception stream.
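A minimal sketch of the branched forward pass, again in plain Python and under stated assumptions: each branch is a single hypothetical linear layer mapping $J(i, m)$ to a two-dimensional action (steering, acceleration), the command is passed as an integer index $k$ rather than a one-hot vector, and the weights in the usage example are made up for illustration rather than learned. The perception modules are shared, and only the selected branch is evaluated.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Module = Callable[[Vector], Vector]


def linear(weights: Sequence[Sequence[float]], bias: Sequence[float]) -> Module:
    """A single fully-connected layer with no activation (hypothetical branch head).

    No ReLU on the output, since steering and acceleration may be negative.
    """
    def layer(x: Vector) -> Vector:
        return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
    return layer


class BranchedPolicy:
    """F(i, m, c_k) = A_k(J(i, m)) with J(i, m) = <I(i), M(m)>."""

    def __init__(self, image_module: Callable[[object], Vector],
                 measurement_module: Module, branches: Dict[int, Module]) -> None:
        self.image_module = image_module
        self.measurement_module = measurement_module
        self.branches = branches  # one specialist head A_k per command index k

    def __call__(self, image: object, measurements: Vector, command: int) -> Vector:
        if command not in self.branches:
            raise KeyError(f"no branch for command {command}")
        # Shared perception stream; there is no command module.
        j = self.image_module(image) + self.measurement_module(measurements)
        # The command acts as a switch: only branch A_k is evaluated.
        return self.branches[command](j)


if __name__ == "__main__":
    # Hypothetical command indices: 0 = default (follow lane), 1 = turn left, 2 = turn right.
    # Hypothetical head weights on j = [feature_0, feature_1, speed];
    # each head outputs [steering, acceleration].
    throttle_row = [0.0, 0.0, -0.125]  # ease off the throttle as speed grows
    policy = BranchedPolicy(
        image_module=lambda img: [0.5, -0.25],   # stub for the conv net I(i)
        measurement_module=lambda m: list(m),    # stub for the FC net M(m)
        branches={
            0: linear([[0.0, 0.0, 0.0], throttle_row], [0.0, 0.5]),
            1: linear([[-1.0, 0.0, 0.0], throttle_row], [0.0, 0.5]),
            2: linear([[1.0, 0.0, 0.0], throttle_row], [0.0, 0.5]),
        },
    )
    for k in (0, 1, 2):
        print(k, policy(None, [2.0], k))
    # 0 [0.0, 0.25]
    # 1 [-0.5, 0.25]
    # 2 [0.5, 0.25]
```

Keeping the branches in a dictionary keyed by command index makes the switch an explicit lookup, so the action for command $c_k$ can only come from $A_k$, while all branches read the same perception features.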