Differences
This shows you the differences between two versions of the page.
Member:sungbeanJo_paper [2021/03/09 11:02] sungbean
Member:sungbeanJo_paper [2021/04/21 22:08] (current) sungbean
Line 1:

Removed (revision 2021/03/09 11:02):

We are interested in a specific setting of imitation learning, the problem of learning to perform a task from expert demonstrations, in which the learner is given only samples of trajectories from the expert, is not allowed to query the expert for more data while training, and is not provided a reinforcement signal of any kind. There are two main approaches suitable for this setting: behavioral cloning [20], which learns a policy as a supervised learning problem over state-action pairs from expert trajectories; and inverse reinforcement learning [25, 18], which finds a cost function under which the expert is uniquely optimal.

Behavioral cloning, while appealingly simple, only tends to succeed with large amounts of data, due to compounding error caused by covariate shift [23, 24]. Inverse reinforcement learning (IRL), on the other hand, learns a cost function that prioritizes entire trajectories over others, so compounding error, a problem for methods that fit single-timestep decisions, is not an issue. Accordingly, IRL has succeeded in a wide range of problems, from predicting behaviors of taxi drivers [31] to planning footsteps for quadruped robots [22].

Unfortunately, many IRL algorithms are extremely expensive to run, requiring reinforcement learning in an inner loop. Scaling IRL methods to large environments has thus been the focus of much recent work [7, 14]. Fundamentally, however, IRL learns a cost function, which explains expert behavior but does not directly tell the learner how to act. Given that the learner's true goal often is to take actions imitating the expert (indeed, many IRL algorithms are evaluated on the quality of the optimal actions of the costs they learn), why, then, must we learn a cost function, if doing so possibly incurs significant computational expense yet fails to directly yield actions?

We desire an algorithm that tells us explicitly how to act by directly learning a policy. To develop such an algorithm, we begin in Section 3, where we characterize the policy given by running reinforcement learning on a cost function learned by maximum causal entropy IRL [31, 32]. Our characterization introduces a framework for directly learning policies from data, bypassing any intermediate IRL step. Then, we instantiate our framework in Sections 4 and 5 with a new model-free imitation learning algorithm. We show that our resulting algorithm is intimately connected to generative adversarial networks [9], a technique from the deep learning community that has led to recent successes in modeling distributions of natural images: our algorithm harnesses generative adversarial training to fit distributions of states and actions defining expert behavior. We test our algorithm in Section 6, where we find that it outperforms competing methods by a wide margin in training policies for complex, high-dimensional physics-based control tasks over various amounts of expert data.

Added (revision 2021/04/21 22:08, current):

get_config_param active timestamp_mode
TIME_FROM_INTERNAL_OSC
get_config_param active multipurpose_io_mode
OUTPUT_OFF
get_config_param active sync_pulse_in_polarity
ACTIVE_LOW
get_config_param active nmea_in_polarity
ACTIVE_HIGH
get_config_param active nmea_baud_rate
BAUD_9600
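The added lines read as a query/response transcript from a lidar sensor's plaintext TCP command interface; the `get_config_param active <name>` form matches Ouster's documented command set. Below is a minimal sketch of how such a transcript could be reproduced programmatically. It assumes the sensor answers newline-terminated commands on TCP port 7501 (the command port in Ouster's documentation); the hostname is a placeholder to replace with your sensor's address.

```python
# Minimal sketch: query the "active" config parameters shown above over the
# sensor's plaintext TCP command interface. Assumes an Ouster-style sensor that
# answers newline-terminated "get_config_param active <name>" queries on TCP
# port 7501; hostname and port are placeholders to adapt to your setup.
import socket

SENSOR_HOSTNAME = "os1-xxxx.local"  # placeholder; replace with your sensor's hostname or IP
TCP_PORT = 7501                     # command port per Ouster's documentation

PARAMS = [
    "timestamp_mode",
    "multipurpose_io_mode",
    "sync_pulse_in_polarity",
    "nmea_in_polarity",
    "nmea_baud_rate",
]

def get_config_param(sock_file, name):
    """Send one query and return the sensor's single-line reply."""
    sock_file.write(f"get_config_param active {name}\n")
    sock_file.flush()
    return sock_file.readline().strip()

def main():
    with socket.create_connection((SENSOR_HOSTNAME, TCP_PORT), timeout=5.0) as sock:
        # Wrap the socket in a text-mode file object for line-oriented I/O.
        with sock.makefile("rw", newline="\n") as f:
            for name in PARAMS:
                print(name, "=", get_config_param(f, name))

if __name__ == "__main__":
    main()
```

Run against a reachable sensor, each printed line should pair a parameter with its active value, for example `timestamp_mode = TIME_FROM_INTERNAL_OSC`, matching the responses recorded above.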