Differences

This shows you the differences between two versions of the page.

Member:sungbeanJo_paper [2021/03/09 11:02]
sungbean
Member:sungbeanJo_paper [2021/04/21 22:08] (current)
sungbean
Line 1:

-We are interested in a specific setting of imitation learning—the problem of learning to perform a task from expert demonstrations—in which the learner is given only samples of trajectories from the expert, is not allowed to query the expert for more data while training, and is not provided reinforcement signal of any kind. There are two main approaches suitable for this setting: behavioral cloning [20], which learns a policy as a supervised learning problem over state-action pairs from expert trajectories; and inverse reinforcement learning [25, 18], which finds a cost function under which the expert is uniquely optimal.
-Behavioral cloning, while appealingly simple, only tends to succeed with large amounts of data, due to compounding error caused by covariate shift [23, 24]. Inverse reinforcement learning (IRL), on the other hand, learns a cost function that prioritizes entire trajectories over others, so compounding error, a problem for methods that fit single-timestep decisions, is not an issue. Accordingly, IRL has succeeded in a wide range of problems, from predicting behaviors of taxi drivers [31] to planning footsteps for quadruped robots [22].
-Unfortunately, many IRL algorithms are extremely expensive to run, requiring reinforcement learning in an inner loop. Scaling IRL methods to large environments has thus been the focus of much recent work [7, 14]. Fundamentally, however, IRL learns a cost function, which explains expert behavior but does not directly tell the learner how to act. Given that the learner's true goal often is to take actions imitating the expert—indeed, many IRL algorithms are evaluated on the quality of the optimal actions of the costs they learn—why, then, must we learn a cost function, if doing so possibly incurs significant computational expense yet fails to directly yield actions?
-We desire an algorithm that tells us explicitly how to act by directly learning a policy. To develop such an algorithm, we begin in Section 3, where we characterize the policy given by running reinforcement learning on a cost function learned by maximum causal entropy IRL [31, 32]. Our characterization introduces a framework for directly learning policies from data, bypassing any intermediate IRL step. Then, we instantiate our framework in Sections 4 and 5 with a new model-free imitation learning algorithm. We show that our resulting algorithm is intimately connected to generative adversarial networks [9], a technique from the deep learning community that has led to recent successes in modeling distributions of natural images: our algorithm harnesses generative adversarial training to fit distributions of states and actions defining expert behavior. We test our algorithm in Section 6, where we find that it outperforms competing methods by a wide margin in training policies for complex, high-dimensional physics-based control tasks over various amounts of expert data.
+get_config_param active timestamp_mode
+ TIME_FROM_INTERNAL_OSC
+get_config_param active multipurpose_io_mode
+ OUTPUT_OFF
+get_config_param active sync_pulse_in_polarity
+ ACTIVE_LOW
+get_config_param active nmea_in_polarity
+ ACTIVE_HIGH
+get_config_param active nmea_baud_rate
+ BAUD_9600
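The added lines record query/response pairs from the sensor's plaintext TCP configuration interface. As a reference, here is a minimal Python sketch that issues the same get_config_param commands shown above; the host address is a placeholder, and port 7501 is an assumption (the configuration port on Ouster OS1-class sensors), since the page does not state which port was used.

import socket

# Placeholder host; port 7501 is assumed to be the sensor's plaintext TCP
# configuration interface (as on Ouster OS1-class sensors) -- not stated on this page.
SENSOR_HOST = "192.168.1.100"
CONFIG_PORT = 7501

# Parameters queried in the revision above.
PARAMS = [
    "timestamp_mode",
    "multipurpose_io_mode",
    "sync_pulse_in_polarity",
    "nmea_in_polarity",
    "nmea_baud_rate",
]

def get_config_param(conn, param):
    """Send 'get_config_param active <param>' and return the one-line reply."""
    conn.write(f"get_config_param active {param}\n")
    conn.flush()
    return conn.readline().strip()

with socket.create_connection((SENSOR_HOST, CONFIG_PORT), timeout=2.0) as sock:
    conn = sock.makefile("rw", newline="\n")
    for param in PARAMS:
        print(f"{param}: {get_config_param(conn, param)}")

Running this against a reachable sensor should print value pairs matching the listing above (e.g. timestamp_mode: TIME_FROM_INTERNAL_OSC).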