

Reinforcement learning (RL) instead assumes that drivers in the real world follow an expert policy $\pi_E$ whose actions maximize the expected, global return $R(\pi, r) = \mathbb{E}_\pi\left[\sum_{t} \gamma^{t} r(s_t, a_t)\right]$, weighted by a discount factor $\gamma \in [0, 1)$. The local reward function $r(s_t, a_t)$ may be unknown, but it fully characterizes expert behavior, such that any policy optimizing $R(\pi, r)$ will perform indistinguishably from $\pi_E$.

Learning with respect to $R(\pi, r)$ has several advantages over maximum likelihood BC in the context of sequential decision making [21]. First, $r(s_t, a_t)$ is defined for all state-action pairs, allowing an agent to receive a learning signal even from unusual states; in contrast, BC only receives a learning signal for those states represented in a labeled, finite dataset. Second, unlike labels, rewards allow a learner to establish preferences between mildly undesirable behavior (e.g., hard braking) and extremely undesirable behavior (e.g., collisions). Finally, RL maximizes the global, expected return over a trajectory rather than fitting local instructions for each observation: once preferences are learned, a policy may take mildly undesirable actions now in order to avoid far worse situations later. As such, reinforcement learning algorithms provide robustness against cascading errors.
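The trade-off described above can be made concrete with a small sketch of the discounted return. The reward values and trajectories below are hypothetical (not from the source): a mild penalty for hard braking and a large penalty for a collision, showing how $R(\pi, r)$ prefers a trajectory that brakes hard now over one that collides a few steps later, a preference that per-step BC labels cannot express.

```python
import numpy as np

def discounted_return(rewards, gamma=0.95):
    """Compute R = sum_t gamma^t * r(s_t, a_t) for one trajectory of rewards."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# Hypothetical per-step rewards: mild penalty for hard braking,
# large penalty for a collision, zero otherwise.
R_BRAKE, R_COLLISION, R_OK = -1.0, -100.0, 0.0

# Trajectory A: brake hard now (mildly undesirable), then drive normally.
traj_a = [R_BRAKE] + [R_OK] * 9

# Trajectory B: avoid braking now, but collide a few steps later.
traj_b = [R_OK] * 3 + [R_COLLISION] + [R_OK] * 6

print("return A (hard brake):", discounted_return(traj_a))  # -1.0
print("return B (collision): ", discounted_return(traj_b))  # about -85.7
```

Under these assumed rewards, the return ranks trajectory A above trajectory B even though A's immediate action looks worse, which is the global, trajectory-level preference the paragraph contrasts with BC's per-observation labels.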
