Differences
This shows you the differences between two versions of the page.
Member:sungbeanJo_paper [2021/03/09 11:02] sungbean
Member:sungbeanJo_paper [2021/04/21 22:08] (current) sungbean

Line 1:
Removed (revision of 2021/03/09 11:02):

We are interested in a specific setting of imitation learning—the problem of learning to perform a task from expert demonstrations—in which the learner is given only samples of trajectories from the expert, is not allowed to query the expert for more data while training, and is not provided reinforcement signal of any kind. There are two main approaches suitable for this setting: behavioral cloning [20], which learns a policy as a supervised learning problem over state-action pairs from expert trajectories; and inverse reinforcement learning [25, 18], which finds a cost function under which the expert is uniquely optimal.

Behavioral cloning, while appealingly simple, only tends to succeed with large amounts of data, due to compounding error caused by covariate shift [23, 24]. Inverse reinforcement learning (IRL), on the other hand, learns a cost function that prioritizes entire trajectories over others, so compounding error, a problem for methods that fit single-timestep decisions, is not an issue. Accordingly, IRL has succeeded in a wide range of problems, from predicting behaviors of taxi drivers [31] to planning footsteps for quadruped robots [22].

Unfortunately, many IRL algorithms are extremely expensive to run, requiring reinforcement learning in an inner loop. Scaling IRL methods to large environments has thus been the focus of much recent work [7, 14]. Fundamentally, however, IRL learns a cost function, which explains expert behavior but does not directly tell the learner how to act. Given that the learner's true goal often is to take actions imitating the expert—indeed, many IRL algorithms are evaluated on the quality of the optimal actions of the costs they learn—why, then, must we learn a cost function, if doing so possibly incurs significant computational expense yet fails to directly yield actions?

We desire an algorithm that tells us explicitly how to act by directly learning a policy. To develop such an algorithm, we begin in Section 3, where we characterize the policy given by running reinforcement learning on a cost function learned by maximum causal entropy IRL [31, 32]. Our characterization introduces a framework for directly learning policies from data, bypassing any intermediate IRL step. Then, we instantiate our framework in Sections 4 and 5 with a new model-free imitation learning algorithm. We show that our resulting algorithm is intimately connected to generative adversarial networks [9], a technique from the deep learning community that has led to recent successes in modeling distributions of natural images: our algorithm harnesses generative adversarial training to fit distributions of states and actions defining expert behavior. We test our algorithm in Section 6, where we find that it outperforms competing methods by a wide margin in training policies for complex, high-dimensional physics-based control tasks over various amounts of expert data.
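The last removed paragraph describes the adversarial training only in words. As a rough illustration, not the paper's reference code, the step it alludes to (a discriminator trained to separate expert state-action pairs from the learner's, whose output then serves as a surrogate cost for an ordinary RL policy update) might look like the sketch below; the network sizes, labeling convention, and random stand-in batches are assumptions made only for this example.

# Rough sketch (assumed details, not the paper's reference code): a discriminator
# D(s, a) is trained to separate the learner's state-action pairs from the
# expert's, and log D(s, a) is then used as a surrogate cost for an ordinary RL
# policy update (TRPO in the paper).
import torch
import torch.nn as nn

obs_dim, act_dim, batch = 4, 2, 32  # assumed sizes for the example

disc = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()

# Stand-in batches; in practice these come from expert demonstrations and from
# rollouts of the current policy.
expert_sa = torch.randn(batch, obs_dim + act_dim)
policy_sa = torch.randn(batch, obs_dim + act_dim)

# Discriminator update: push D toward 1 on policy pairs and 0 on expert pairs.
logits = torch.cat([disc(policy_sa), disc(expert_sa)])
labels = torch.cat([torch.ones(batch, 1), torch.zeros(batch, 1)])
opt.zero_grad()
bce(logits, labels).backward()
opt.step()

# Surrogate cost for the policy step: log D(s, a). Minimizing it drives the
# learner's state-action distribution toward the expert's.
with torch.no_grad():
    cost = torch.log(torch.sigmoid(disc(policy_sa)) + 1e-8)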
Added (revision of 2021/04/21 22:08, current):

get_config_param active timestamp_mode
TIME_FROM_INTERNAL_OSC
get_config_param active multipurpose_io_mode
OUTPUT_OFF
get_config_param active sync_pulse_in_polarity
ACTIVE_LOW
get_config_param active nmea_in_polarity
ACTIVE_HIGH
get_config_param active nmea_baud_rate
BAUD_9600
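The added lines read as a request/response session with a lidar sensor's TCP command interface; the `get_config_param active <name>` form and these parameter names match the Ouster-style interface on TCP port 7501, though the page itself does not name the device. Assuming such a device, a minimal sketch for reproducing the session could look like this (the host address and port are assumptions, not taken from this page):

# Minimal sketch for reproducing the query session above over a raw TCP socket.
import socket

SENSOR_HOST = "192.168.1.100"  # placeholder address, not from this page
SENSOR_PORT = 7501             # assumed Ouster-style command port

PARAMS = [
    "timestamp_mode",
    "multipurpose_io_mode",
    "sync_pulse_in_polarity",
    "nmea_in_polarity",
    "nmea_baud_rate",
]

def get_config_param(stream, name):
    # Send one request line and return the sensor's single-line reply.
    stream.write(f"get_config_param active {name}\n")
    stream.flush()
    return stream.readline().strip()

with socket.create_connection((SENSOR_HOST, SENSOR_PORT), timeout=5.0) as sock:
    stream = sock.makefile("rw", newline="\n")
    for name in PARAMS:
        print(name, "=", get_config_param(stream, name))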