Abstract—Deep networks trained on demonstrations of human driving have learned to follow roads and avoid obstacles. However, driving policies trained via imitation learning cannot be controlled at test time. A vehicle trained end-to-end to imitate an expert cannot be guided to take a specific turn at an upcoming intersection. This limits the utility of such systems. We propose to condition imitation learning on high-level command input. At test time, the learned driving policy functions as a chauffeur that handles sensorimotor coordination but continues to respond to navigational commands. We evaluate different architectures for conditional imitation learning in vision-based driving. We conduct experiments in realistic three-dimensional simulations of urban driving and on a 1/5 scale robotic truck that is trained to drive in a residential area. Both systems drive based on visual input yet remain responsive to high-level navigational commands.

Imitation learning is receiving renewed interest as a promising approach to training autonomous driving systems. Demonstrations of human driving are easy to collect at scale. Given such demonstrations, imitation learning can be used to train a model that maps perceptual inputs to control commands; for example, mapping camera images to steering and acceleration. This approach has been applied to lane following [27], [4] and to off-road obstacle avoidance. However, these systems have characteristic limitations. For example, the network trained by Bojarski et al. [4] was given control over lane and road following only. When a lane change or a turn from one road to another was required, the human driver had to take control.
Why has imitation learning not scaled up to fully autonomous urban driving? One limitation is in the assumption that the optimal action can be inferred from the perceptual input alone. This assumption often does not hold in practice: for instance, when a car approaches an intersection, the camera input is not sufficient to predict whether the car should turn left, right, or go straight. Mathematically, the mapping from the image to the control command is no longer a function. Fitting a function approximator is thus bound to run into difficulties. This had already been observed in the work of Pomerleau: “Currently upon reaching a fork, the network may output two widely discrepant travel directions, one for each choice. The result is often an oscillation in the dictated travel direction” [27]. Even if the network can resolve the ambiguity in favor of some course of action, it may not be the one desired by the passenger, who lacks a communication channel for controlling the network itself.
In this paper, we address this challenge with conditional imitation learning. At training time, the model is given not only the perceptual input and the control signal, but also a representation of the expert’s intention. At test time, the network can be given corresponding commands, which resolve the ambiguity in the perceptuomotor mapping and allow the trained model to be controlled by a passenger or a topological planner, just as mapping applications and passengers provide turn-by-turn directions to human drivers. The trained network is thus freed from the task of planning and can devote its representational capacity to driving. This enables scaling imitation learning to vision-based driving in complex urban environments.
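As a concrete illustration of this interface, the sketch below shows how a high-level navigational command could accompany the camera image at each control step. The `policy` callable, the command vocabulary, and the function names are hypothetical placeholders for illustration, not the actual implementation described in this paper.

```python
# Hypothetical command vocabulary supplied by a passenger or a topological planner.
COMMANDS = ["follow_lane", "turn_left", "turn_right", "go_straight"]

def drive_step(policy, camera_image, command):
    """One control step: the command resolves what the image alone cannot,
    e.g. which way to turn at an upcoming intersection."""
    assert command in COMMANDS
    # The learned policy handles low-level sensorimotor coordination;
    # the command only selects among high-level behaviors.
    steering, acceleration = policy(camera_image, command)
    return steering, acceleration
```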
We evaluate the presented approach in realistic simulations of urban driving and on a 1/5 scale robotic truck. Both systems are shown in Figure 1. Simulation allows us to thoroughly analyze the importance of different modeling decisions, carefully compare the approach to relevant baselines, and conduct detailed ablation studies. Experiments with the physical system demonstrate that the approach can be successfully deployed in the physical world. Recordings of both systems are provided in the supplementary video.
We begin by describing the standard imitation learning setup and then proceed to our command-conditional formulation. Consider a controller that interacts with the environment over discrete time steps. At each time step $t$, the controller receives an observation $o_t$ and takes an action $a_t$. The basic idea behind imitation learning is to train a controller that mimics an expert. The training data is a set of observation-action pairs $\mathcal{D} = \{\langle o_i, a_i \rangle\}_{i=1}^{N}$ generated by the expert. The assumption is that the expert is successful at performing the task of interest and that a controller trained to mimic the expert will also perform the task well. This is a supervised learning problem, in which the parameters $\theta$ of a function approximator $F(o; \theta)$ are optimized to fit the mapping from observations to actions:

$$\underset{\theta}{\text{minimize}} \; \sum_i \ell\big(F(o_i; \theta),\, a_i\big).$$
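The following is a minimal PyTorch sketch of this supervised setup. The feedforward architecture, feature dimensions, and L2 loss are illustrative assumptions, not the models evaluated in this paper.

```python
# Minimal sketch of the standard imitation learning objective:
# minimize over theta the sum of l(F(o_i; theta), a_i).
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Function approximator F(o; theta): maps an observation to an action."""
    def __init__(self, obs_dim=512, act_dim=2):  # act_dim: e.g. steering, acceleration
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs):
        return self.net(obs)

def imitation_loss(policy, obs_batch, act_batch):
    """Per-batch surrogate for sum_i l(F(o_i; theta), a_i), with an L2 loss."""
    return nn.functional.mse_loss(policy(obs_batch), act_batch)

# Usage with a batch of expert observation-action pairs (stand-in tensors).
policy = Policy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
obs = torch.randn(32, 512)   # stand-in for observation features o_i
act = torch.randn(32, 2)     # expert actions a_i
loss = imitation_loss(policy, obs, act)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```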
An implicit assumption behind this formulation is that the expert's actions are fully explained by the observations; that is, there exists a function $E$ that maps observations to the expert's actions: $a_i = E(o_i)$. If this assumption holds, a sufficiently expressive approximator will be able to fit the function $E$ given enough data. This explains the success of imitation learning on tasks such as lane following. However, in more complex scenarios the assumption that the mapping of observations to actions is a function breaks down. Consider a driver approaching an intersection. The driver's subsequent actions are not explained by the observations alone, but are additionally affected by the driver's internal state, such as the intended destination. The same observations could lead to different actions, depending on this latent state. This could be modeled as stochasticity, but a stochastic formulation misses the underlying causes of the behavior.

Moreover, even if a controller trained to imitate demonstrations of urban driving did learn to make turns and avoid collisions, it would still not constitute a useful driving system. It would wander the streets, making arbitrary decisions at intersections. A passenger in such a vehicle would not be able to communicate the intended direction of travel to the controller, or give it commands regarding which turns to take.

To address this, we begin by explicitly modeling the expert's internal state by a vector $h$, which together with the observation explains the expert's action: $a_i = E(o_i, h_i)$. The vector $h$ can include information about the expert's intentions, goals, and prior knowledge. The standard imitation learning objective can then be rewritten as

$$\underset{\theta}{\text{minimize}} \; \sum_i \ell\big(F(o_i, h_i; \theta),\, a_i\big).$$
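A corresponding sketch of this conditional objective is given below. It makes two illustrative assumptions that are not specified in the text above: the expert's intent is represented as a one-hot command vector standing in for $h_i$, and it is fed to the approximator by simple concatenation with the observation features.

```python
# Sketch of the conditional objective: the approximator now takes both the
# observation and a representation of the expert's intent, F(o_i, h_i; theta).
import torch
import torch.nn as nn

class ConditionalPolicy(nn.Module):
    def __init__(self, obs_dim=512, cmd_dim=4, act_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + cmd_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs, cmd):
        # Concatenate observation features with the intent/command encoding.
        return self.net(torch.cat([obs, cmd], dim=-1))

def conditional_imitation_loss(policy, obs, cmd, act):
    """Per-batch surrogate for sum_i l(F(o_i, h_i; theta), a_i), with an L2 loss."""
    return nn.functional.mse_loss(policy(obs, cmd), act)

# Usage: intents such as turning left/right, going straight, or following the
# lane encoded as one-hot vectors (an assumption for illustration).
policy = ConditionalPolicy()
obs = torch.randn(32, 512)
cmd = nn.functional.one_hot(torch.randint(0, 4, (32,)), num_classes=4).float()
act = torch.randn(32, 2)
loss = conditional_imitation_loss(policy, obs, cmd, act)
```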