Deep Reinforcement Learning from Variance-Reduced
Policy-Dependent Human Feedback
Amath SOW∗
African Master in Machine Intelligence
Accra, Ghana

Pr. Matthew E. Taylor†
Director, Intelligent Robot Learning Lab
University of Alberta, Canada
ABSTRACT
Human-Centered Reinforcement Learning (HCRL) aims to integrate
human guidance with Reinforcement Learning (RL) algorithms in
order to improve performance. An example of human guidance is
real-time binary feedback ('good' or 'bad') based on the agent's states
and actions. One well-known HCRL algorithm is coach [10],
an actor-critic algorithm in which the advantage function is taken as
a good model of human feedback, helping an agent adapt to the
trainer's behavior. In this paper, we consider a slight modification
of COACH such that the human feedback can be interpreted as
the reward signal typically present in RL. We therefore propose
a new HCRL algorithm, Variance-Reduced COACH (VR-COACH),
which interprets the human feedback as a reward and applies a variance-reduction
technique to the policy gradient, as commonly done in RL. We
evaluate VR-COACH in the classic MountainCar environment
and demonstrate that it learns faster than COACH and TAMER [8].
Moreover, in order to learn complex tasks in a reasonable time,
we upgrade the original VR-COACH to a deep version (DEEP
VR-COACH), in which the agent's policy is represented by a deep neural
network and we apply a convolutional autoencoder, a feedback
replay buffer, and entropy regularization. We then demonstrate
the effectiveness of DEEP VR-COACH in the rich Malmo Minecraft
environment while comparing against DEEP COACH [3] and DEEP
TAMER [18].
KEYWORDS
Interactive learning; Reinforcement learning; Human-centered reinforcement learning; Robotics
1 INTRODUCTION
Human-centered reinforcement learning (HCRL) has recently drawn the attention of many reinforcement learning researchers. HCRL refers
to the problem that arises when an artificial agent interacts with
an environment that can be described as a modified Markov decision process (MDP), and the agent's goal is to adapt its behavior
to match the desires of a human trainer. More specifically, during
the agent-environment interaction, the human trainer can communicate with the agent through human feedback (e.g., numerical values) in an attempt to teach the agent how to perform the desired behavior.
Figure 1: Generic HCRL framework. The tuple (s, a, s') denotes a transition of the environment from state s to s' when
the agent takes action a. A human observes these transitions (s, s') and gives feedback h. The solid lines denote interactions at every timestep, while the dotted line is active
only when the human gives feedback.
The fundamental objective of HCRL is for the agent
to utilize the human's feedback to learn its goal. The benefits of
successful HCRL are numerous, including enabling the real-time
learning of unique, trainer-specific agent behaviors, learning from
non-expert trainers, and, even for tasks for which we already have
predefined reward functions, providing a mechanism by which we
can increase the speed of agent learning [8, 10, 18] compared to
traditional reinforcement learning (RL) [17].
As the problem name suggests, most work on algorithmic approaches capable of performing HCRL has focused on applying
and adapting techniques from classical RL. In doing so, algorithm
designers must decide what, exactly, the human feedback signal
means within the RL context. Thus far, this has been handled by
equating the feedback with specific RL quantities. To illustrate this
point, consider two popular examples of HCRL techniques: training
an agent manually via evaluative reinforcement (tamer) [8] and
convergent actor-critic by humans (coach) [10]. In tamer, agents
employ a supervised value-function learning technique. In this context, the human feedback is interpreted to be the value of recent
state-action pairs under the policy that the human desires (agent
behavior is based upon the learned value function). Agents that
use coach, on the other hand, take a policy-gradient approach to
behavior learning. In that context, the human feedback is interpreted to be the advantage (i.e., the difference between the expected
return for taking a particular action versus the expected return
when taking the optimal action) under the agent’s current policy.
It is important to note that, even though the discussion surrounding HCRL techniques uses terms such as reward, value, advantage,
and so on, these terms are used only to convey the interpretation
of the human feedback in the context of classical RL techniques.
In particular, because human feedback is typically non-stationary
and inconsistent, the usual ways in which the above terms are
understood do not apply in the HCRL setting. For example, since
a stationary underlying reward function does not exist in HCRL,
there are no notions of optimality with respect to reward, value,
policy, and so on.
Both HCRL techniques discussed above (tamer and coach) use
different learning algorithms and different interpretations of human feedback, yet both have been used to successfully solve HCRL
problems in certain domains. Therefore, it would seem that both
the question of what learning algorithm to use and the question of
how to interpret human feedback are still far from settled.
In this paper, we propose a new algorithm called variance-reduced
convergent actor-critic by humans (vr-coach), which includes a
critic in the learning framework and exhibits increased learning
speed and stability compared to other HCRL approaches. Our work
is divided into three distinct parts. First, we analyze the real-time
variant of coach and derive an alternative interpretation of that
algorithm in which the agent can be understood to be executing
the classical reinforce algorithm with the human feedback being
interpreted as the reward rather than the advantage. Second, motivated
by this interpretation, we add to coach a commonly used variance-reduction
technique from the RL literature and study whether or
not it can improve agent performance in the HCRL setting. Third,
we upgrade our original VR-COACH to its deep version in order
to learn complex tasks in a reasonable time.
The primary result of the paper is theoretical in nature and provides a basis for using variance-reduction techniques in HCRL. As a
proof-of-concept study, we experimentally evaluate our technique
in the context of the classical Mountain Car environment while
comparing to coach and tamer. We also conduct an evaluation
of our deep version, DEEP VR-COACH, against DEEP COACH and
DEEP TAMER in the Malmo Minecraft environment and the Bowling Atari
game, using a simulated human trainer and a real human trainer, respectively.
2 RELATED WORK
Our approach is directly inspired by the coach algorithm, an
actor-critic-based model in which the advantage function is considered a good model of human feedback. Through a series of
demonstrations, the authors show that human feedback has some
properties that are inconsistent with a traditional reward signal. For
example, a reward function will continuously give positive feedback
as long as the agent demonstrates good behaviors, whereas a human trainer
will reduce feedback once the agent starts following good behaviors.
A human trainer is therefore less likely to give redundant feedback,
so the human feedback can be considered an evaluation of the
agent's action choice in the context of its current behavior. That is
why, in COACH, the interaction between the learning agent and the human
trainer is framed as an actor-critic algorithm where the human
trainer represents the critic that evaluates the actor's policy. The authors
introduce real-time COACH to address issues with the sparseness of
human feedback; in their empirical results, the policy is learned with linear function approximation over hand-coded image feature detectors.
The TAMER framework [8] is another HCRL algorithm, which
allows an agent to learn from a human trainer through a series
of critiques. It uses a regression function to represent the reward
function consistent with the feedback signals provided by a human trainer. The TAMER framework has proven efficient on several
limited tasks. Recently, [18] proposed DEEP TAMER, which uses a deep
neural network on top of the TAMER framework. It integrates several
reinforcement learning techniques that allow the algorithm to
achieve satisfactory performance on the chosen Atari game of
Bowling. Beyond the use of deep neural network function
approximation, DEEP TAMER differs from the original TAMER in
several ways. First, in the loss function, DEEP TAMER minimizes a
weighted difference between the human reward and the predicted
value for each state-action pair. Second, the authors integrate a deep
autoencoder that reduces the number of parameters and the learning
time. The last difference is the frequency of learning: TAMER
learns once from each state-action pair, while DEEP TAMER can
learn from each pair multiple times thanks to a feedback replay buffer.
Another HCRL algorithm, inspired by the TAMER
framework, was proposed by [1]. Specifically, they collect data from a
human observing and noting preferences between agent trajectories and use these collected data to asynchronously learn a reward
model. However, even with thousands of samples collected from human feedback, the algorithm requires millions of steps to converge
to a satisfactory policy for a given task. Our approach differs in that
we apply reinforcement learning (respectively, deep reinforcement
learning) based on direct human feedback interpreted as reward,
instead of attempting to learn human preferences.
Other related works [13, 14] consider human feedback
as a label of action optimality. These policy shaping approaches
integrate information about the human trainer based on observed
feedback to improve learning.
Lastly, we make a distinction between the HCRL setting presented in this work and the learning-from-demonstration paradigm
[12], where the agent is provided with a dataset of demonstrations that
capture a desired behavior. Based on this dataset, the agent is meant
to find the policy that best describes the observed data.
The work we present here first studies whether or not using an
RL variance-reduction strategy in the policy update step of coach
results in better HCRL performance. In the context of variance
reduction in policy gradient techniques, there have been several
works since reinforce [19], such as introducing a control variate
[4], learning a Q-function (deterministic policy gradients) [9], and
expected policy gradients [2]. While any of these directions provides
an opportunity to reduce variance in HCRL, and especially in coach, in
this paper we study the effect of utilizing the control-variate approach. Second, we add a deep neural network on top of VR-COACH,
which we call DEEP VR-COACH, and we conduct our experiments using the
Malmo project platform, which allows for the creation and deployment of AI experiments within Minecraft, and the Bowling Atari
game, using simulated human feedback and real human
feedback, respectively.
3 PROBLEM STATEMENT
In this paper, we are concerned with answering the following two
questions: What is the impact of interpreting the human feedback as
reward in HCRL while applying a variance-reduction technique to the
agent's policy gradient? And how effective is it to use deep reinforcement learning techniques on top of the VR-COACH algorithm
in order to learn from high-dimensional observations?
We consider a Markov decision process (MDP) [11] to represent the underlying sequential decision-making problem. In
more detail, an MDP is a tuple $(S, A, P, F, \gamma)$, where $S$ is a set of states,
$A$ is a set of actions, $P : S \times A \to \mathcal{D}(S)$ (where $\mathcal{D}(S)$ is the space of
probability distributions over $S$) is a state transition distribution, $F$
is the human feedback, which replaces the reward function $R$, and $\gamma$ is a discount
factor. The agent selects an action at each timestep $t$ following a
policy $\pi$ in order to maximize the total discounted sum of reward. In
this context, we consider $\pi$ to be a stochastic policy parameterized by
$\theta_t$, denoted $\pi_{\theta_t}$, which defines a probability distribution over all
actions given the current state. In the VR-COACH algorithm, even if
an environment reward exists, we instead use the feedback signal of a
human trainer at each timestep, $f_t \in \{-1, 0, 1\}$.
In an MDP, two fundamental concepts are widely
used in the RL setting: the state value function $V^\pi$ and the state-action value function $Q^\pi$. The value function gives the expected
future discounted reward from each state when following some
policy, and the state-action value function gives the expected
future discounted reward when an agent takes some action in some
state and then follows some policy thereafter. They are defined recursively
through the Bellman equations:
$$V^\pi(s) = \sum_a \pi(s, a)\, Q^\pi(s, a)$$
and
$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^\pi(s')\right].$$
One of the most famous policy-based algorithms is the actor-critic algorithm [16], where the actor is a model parameterized by $\theta_t$ (the policy $\pi_{\theta_t}$) used for action selection, whereas the critic, parameterized by $V^{\pi_{\theta_t}}(s)$
or $Q^{\pi_{\theta_t}}(s, a)$, is another model that estimates the value function at
each timestep and provides critiques that are used to
update the policy parameters. The critic signal is the temporal-difference (TD) error, defined by
$$\delta_t = f_t + \gamma V(s_{t+1}) - V(s_t).$$
In VR-COACH we interpret $f_t$ as a reward.
4 VR-COACH

In this section, we first analyze an existing HCRL framework,
coach, an actor-critic-based algorithm that learns from policy-dependent feedback. We also present episodic COACH, a minor modification of COACH in which the gradient update is applied at the end of
each episode instead of at each timestep. Based on the similarity between
this variant of COACH and the classical reinforce algorithm from
the RL literature, we arrive at a new algorithm, Variance-Reduced
COACH, in which we interpret the human feedback as reward while
applying a variance-reduction technique to the policy gradient.

4.1 coach

COACH [10] is an actor-critic algorithm where the advantage function is considered a good model of human feedback:
$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t), \qquad (1)$$
where $\pi$ is the agent's current policy, and $a_t$ and $s_t$ are the action
and state observed at time $t$, respectively.

By the policy gradient theorem [17], the gradient for a
single timestep is given by:
$$\Delta\theta_t = \alpha \nabla_{\theta_t} \ln \pi_\theta(a_t \mid s_t)\, A^\pi(s_t, a_t). \qquad (2)$$

In coach, the agent's policy is directly modified by the human
feedback without using any critic component. The gradient
update then becomes:
$$\Delta\theta_t = \nabla_{\theta_t} \ln \pi_\theta(a_t \mid s_t)\, h_t, \qquad (3)$$
where $h_t$ is the human feedback observed at time $t$, which replaces
the advantage function.

Implementing the algorithm above for real-time
use is problematic due to the sparsity of human feedback. This
issue is addressed by real-time coach,
which uses an eligibility trace to help apply feedback to the relevant
transitions. An eligibility trace is a vector that keeps track of the
policy gradient, decays exponentially with a parameter $\lambda$, and
is updated at each timestep according to:
$$e_t = \lambda e_{t-1} + \nabla_{\theta_t} \ln \pi_\theta(a_t \mid s_t). \qquad (4)$$
Policy parameters are then updated in the direction of the trace,
giving more weight to recent actions than to older ones.
The policy gradient for real-time COACH becomes:
$$\Delta\theta_t = e_t h_t. \qquad (5)$$

4.2 Episodic coach

Let us now consider a slight modification of real-time coach and assume that $\Delta\theta_t$ is computed at each timestep but the gradient update
is applied at the end of the episode (i.e., after $T$ timesteps have elapsed):
$$\Delta\theta_{episode} = \sum_{t=0}^{T} e_t h_t, \qquad (6)$$
where $e_t = \lambda e_{t-1} + \nabla_{\theta} \ln \pi_\theta(a_t \mid s_t)$ (i.e., we now differentiate with
respect to the policy parameters $\theta$ at the start of the episode rather
than $\theta_t$).

Examining the eligibility trace itself, we see that it can be written
as
$$e_t = \sum_{t'=0}^{t} \lambda^{t-t'} \nabla_{\theta} \ln \pi_\theta(a_{t'} \mid s_{t'}), \qquad (7)$$
and the episodic update can be rewritten as
$$\Delta\theta_{episode} = \sum_{t=0}^{T} e_t h_t \qquad (8)$$
$$= \sum_{t=0}^{T} h_t \sum_{t'=0}^{t} \lambda^{t-t'} \nabla_{\theta} \ln \pi_\theta(a_{t'} \mid s_{t'}) \qquad (9)$$
$$= h_0 \nabla_{\theta} \ln \pi_\theta(a_0 \mid s_0) + h_1 \big[ \lambda \nabla_{\theta} \ln \pi_\theta(a_0 \mid s_0) + \nabla_{\theta} \ln \pi_\theta(a_1 \mid s_1) \big] + \cdots \qquad (10)$$
$$= \nabla_{\theta} \ln \pi_\theta(a_0 \mid s_0) \big[ h_0 + \lambda h_1 + \cdots \big] + \nabla_{\theta} \ln \pi_\theta(a_1 \mid s_1) \big[ h_1 + \lambda h_2 + \cdots \big] + \cdots \qquad (11)$$
$$= \sum_{t=0}^{T} \nabla_{\theta} \ln \pi_\theta(a_t \mid s_t) \big[ h_t + \lambda h_{t+1} + \lambda^2 h_{t+2} + \cdots \big]. \qquad (12)$$

Equation (12) is similar to the gradient update
in reinforce [19], except that the human feedback appears
in place of the reward and the eligibility decay $\lambda$ appears in place of the
discount factor.
We therefore arrive at a new interpretation of this episodic variant of real-time coach: it is the classical reinforce algorithm
where the human feedback signal is interpreted as the reward.
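To make the equivalence behind equation (12) concrete, here is a minimal NumPy sketch of our own (not code from the paper) that computes the episodic update both ways for a toy log-linear softmax policy: once via the eligibility trace of equations (6)-(7) and once via the reinforce-style form of equation (12). The two updates coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, state_dim, T, lam = 3, 4, 20, 0.95
theta = rng.normal(size=(state_dim, n_actions))        # log-linear policy parameters

def grad_log_pi(theta, s, a):
    """Gradient of ln pi_theta(a|s) for a softmax over linear scores."""
    logits = s @ theta
    p = np.exp(logits - logits.max()); p /= p.sum()
    g = -np.outer(s, p)
    g[:, a] += s
    return g

# A toy "episode": random states, actions, and {-1, 0, +1} human feedback.
states = rng.normal(size=(T, state_dim))
actions = rng.integers(n_actions, size=T)
h = rng.choice([-1, 0, 1], size=T)

# Episodic COACH update, eqs. (6)-(7): accumulate an eligibility trace.
e, update_trace = np.zeros_like(theta), np.zeros_like(theta)
for t in range(T):
    e = lam * e + grad_log_pi(theta, states[t], actions[t])
    update_trace += h[t] * e

# REINFORCE-style form, eq. (12): feedback-to-go discounted by lambda.
update_reinforce = np.zeros_like(theta)
for t in range(T):
    feedback_to_go = sum(lam ** (k - t) * h[k] for k in range(t, T))
    update_reinforce += grad_log_pi(theta, states[t], actions[t]) * feedback_to_go

assert np.allclose(update_trace, update_reinforce)     # the two updates coincide
```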
4.3 vr-coach
REINFORCE belongs to a special class of reinforcement learning
algorithms called policy gradient algorithms. Thanks to its similarity to episodic COACH, we expect it may provide benefits in the
HCRL setting. In particular, we are interested in whether reducing
the policy-gradient variance using a control variate can
improve the learning speed and stability of the COACH algorithm. We
thereby build another HCRL algorithm, called VR-COACH.

VR-COACH is an actor-critic-based algorithm in which the human feedback is interpreted as reward. Compared to COACH, VR-COACH reintegrates the critic, which maintains an approximation of the value function $V_w$, where $w$ represents the function
parameters, based on observations of the human's feedback. As in
[4], vr-coach uses this value function as a control-variate
baseline in the computation of return samples, which results in a
modified estimate of the policy gradient.
The actor and critic updates for VR-COACH are given by the
following equations:
$$\Delta\theta_t = \alpha \nabla_{\theta} \ln \pi_\theta(a_t \mid s_t) \big[ h_t + \gamma V_w(s_{t+1}) - V_w(s_t) \big], \qquad (13)$$
$$\Delta w_t = \big[ h_t + \gamma V_w(s_{t+1}) - V_w(s_t) \big] \nabla_w V_w(s_t). \qquad (14)$$
Figure 2: Illustration of the agents used in the HCRL framework (Fig. 1) for (a) coach and (b) vr-coach. The tuple (s, a, s') denotes a transition from state s due to action a that lands
in the new state s'. The human feedback is denoted by h. The
solid lines denote interactions at every timestep, while the
dotted lines are active only when the human gives
feedback.

Fig. 2 illustrates the difference between COACH and VR-COACH. In COACH, when no feedback is given, the actor
does not learn anything, while in VR-COACH the actor still learns
from the critic even when the human does not provide any feedback; that is,
the critic learns to predict the discounted sum of feedback given
by the human, and the actor learns from this critic.
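For concreteness, the following sketch (our own illustration, not the authors' implementation; the linear value function and softmax policy are assumptions for the example) performs a single VR-COACH step following equations (13)-(14), with the TD error $\delta_t = h_t + \gamma V_w(s_{t+1}) - V_w(s_t)$ scaling both the actor's score function and the critic update.

```python
import numpy as np

def vr_coach_step(theta, w, s, a, s_next, h, alpha=0.0025, beta=0.01, gamma=0.99):
    """One VR-COACH update (eqs. 13-14): human feedback h plays the role of reward.

    theta: (state_dim, n_actions) softmax policy parameters.
    w:     (state_dim,) weights of a linear value function V_w(s) = w @ s.
    """
    # Critic estimates before and after the transition.
    v_s, v_next = w @ s, w @ s_next
    delta = h + gamma * v_next - v_s                     # TD error with feedback as reward

    # Actor update (eq. 13): score function of a softmax policy scaled by delta.
    logits = s @ theta
    p = np.exp(logits - logits.max()); p /= p.sum()
    grad_log_pi = -np.outer(s, p)
    grad_log_pi[:, a] += s
    theta = theta + alpha * delta * grad_log_pi

    # Critic update (eq. 14): semi-gradient TD(0) on the value function.
    w = w + beta * delta * s                             # since grad_w (w @ s) = s
    return theta, w

# Tiny usage example on random data.
rng = np.random.default_rng(1)
theta, w = rng.normal(size=(4, 3)), np.zeros(4)
s, s_next = rng.normal(size=4), rng.normal(size=4)
theta, w = vr_coach_step(theta, w, s, a=1, s_next=s_next, h=+1)
```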
5 DEEP VR-COACH
In order to learn in environments with complex
state spaces, we upgrade VR-COACH to its deep version by applying
a series of modifications on top of the algorithm. The
actor's and critic's policies are represented by deep neural networks.
To reduce the number of parameters and the training time, we use a
convolutional autoencoder whose encoder output is used as input to
the two policy networks (actor and critic networks). We also add
a feedback replay buffer that stores experience as a dataset, which
helps the agent learn the policy from a limited amount of feedback.
Finally, we integrate an entropy regularization strategy that increases
the policy's uncertainty (more exploration) and prevents the agent from getting stuck in
local minima.
Algorithm 1 VR-COACH($\pi_{\theta_0}$, $V_{w_0}$, $\gamma$, $\alpha$, $\beta$, $T$)

Input:
$\pi_{\theta_0}$ → the initial policy with parameters $\theta_0$
$V_{w_0}$ → the initial value function with parameters $w_0$
$\gamma$ → discount factor
$\alpha$ → actor learning rate
$\beta$ → critic learning rate
$T$ → maximum timesteps
Output:
$\pi_{\theta_T}$ → the learned policy with parameters $\theta_T$, available after $T$ timesteps

1: Observe initial state $s_0$
2: for $t = 0 : T$ do
3:   Sample and execute action $a_t \sim \pi_{\theta_t}(\cdot \mid s_t)$
4:   Observe next state $s_{t+1}$ and human feedback $h_t$   ▷ $h_t = 0$ if no feedback
5:   $\theta_{t+1} \leftarrow \theta_t + \alpha\,[h_t + \gamma V(s_{t+1}) - V(s_t)]\,\nabla_{\theta_t} \ln \pi_{\theta_t}(a_t \mid s_t)$
6:   $w_{t+1} \leftarrow w_t + \beta\,[h_t + \gamma V(s_{t+1}) - V(s_t)]\,\nabla_{w_t} V(s_t)$

Algorithm 2 DEEP VR-COACH

Input: pretrained convolutional encoder parameters $\theta$ and $w$,
$\pi_{\theta_0}$ → the initial policy with parameters $\theta_0$
$V_{w_0}$ → the initial value function with parameters $w_0$
$d$ → human delay
$L$ → window size
$m$ → minibatch size
$\gamma$ → discount factor
$\alpha$ → actor learning rate
$\beta$ → critic learning rate
$T$ → maximum timesteps
$\rho$ → entropy regularization coefficient
$W \leftarrow \{\}$ → initialize window
$B \leftarrow \emptyset$ → initialize replay buffer
Output:
$\pi_{\theta_T}$ → the learned policy with parameters $\theta_T$, available after $T$ timesteps

1: Observe initial state $s_0$
2: for $t = 0 : T$ do
3:   Sample and execute action $a_t \sim \pi_{\theta_t}(\cdot \mid s_t)$
4:   Record $p_t \leftarrow \pi_{\theta_t}(a_t \mid s_t)$
5:   Observe next state $s_{t+1}$ and human feedback $h_t$   ▷ $h_t = 0$ if no feedback
6:   Append $(s_{t-d}, a_{t-d}, p_{t-d}, s_{t+1-d}, h_t)$ to the end of $W$
7:   if $h_t$ is not null then
8:     Take the $L$ most recent entries of $W$ and append them to $B$
9:     $W \leftarrow \{\}$
10:    Randomly sample a minibatch $N$ of $m$ windows from $B$
11:    for $n \in N$ do
12:      $g \leftarrow 0$
13:      for $(s, a, p, s', h) \in n$ do
14:        Compute $V(s')$ and $V(s)$
15:        $\delta = h + \gamma V(s') - V(s)$
16:        $g \leftarrow g + \delta\,\frac{\pi_{\theta_t}(a \mid s)}{p}\,\nabla_{\theta_t} \ln \pi_{\theta_t}(a \mid s)$
17:        $w_{t+1} \leftarrow w_t + \beta\,\delta\,\nabla_{w_t} V(s)$
18:    $g \leftarrow \frac{1}{m} g + \rho\,\nabla_{\theta_t} H(\pi_{\theta_t}(\cdot \mid s_t))$
19:    $\theta_{t+1} \leftarrow \theta_t + \alpha g$
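The inner loop of Algorithm 2 could be rendered in PyTorch roughly as follows. This is an illustrative sketch under our own assumptions, not the authors' code: the actor/critic network objects, the squared-TD critic loss, and the use of the recorded probability p as an importance weight are all our reading of the pseudocode.

```python
import random
import torch
import torch.nn.functional as F

def deep_vr_coach_update(actor, critic, buffer, actor_opt, critic_opt, s_t,
                         m=16, gamma=0.99, rho=1.5):
    """One Algorithm 2-style update from the feedback replay buffer.

    actor(s) -> 1-D action logits; critic(s) -> scalar value estimate.
    `buffer` holds windows: lists of (s, a, p, s_next, h) tuples, where p is the
    probability the policy assigned to a when it was taken (used here as an
    importance weight). `s_t` is the current state, used for the entropy bonus.
    """
    windows = random.sample(buffer, min(m, len(buffer)))
    actor_loss = torch.tensor(0.0)
    critic_loss = torch.tensor(0.0)
    for window in windows:
        for s, a, p, s_next, h in window:
            v, v_next = critic(s), critic(s_next).detach()
            delta = (h + gamma * v_next - v).detach()          # TD error, feedback as reward
            log_pi = F.log_softmax(actor(s), dim=-1)[a]
            ratio = (log_pi.exp() / p).detach()                # importance weight pi(a|s) / p
            actor_loss = actor_loss - ratio * delta * log_pi   # ascend  delta * grad log pi
            critic_loss = critic_loss + (h + gamma * v_next - v).pow(2)

    # Average the policy-gradient term over the minibatch and add the entropy
    # bonus on the current state (step 18 of Algorithm 2).
    probs = F.softmax(actor(s_t), dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum()
    actor_loss = actor_loss / len(windows) - rho * entropy

    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
```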
5.1 Convolutional Autoencoder

Given the desire to learn quickly from high-dimensional observations, we use a convolutional autoencoder (CAE) [5] for
our policy networks. A CAE is a pair of functions
$(f_{\theta_m}, g_{\theta_n})$ where $f_{\theta_m}$ is an encoder mapping a raw observation $x$
to a low-dimensional representation and $g_{\theta_n}$ is the corresponding decoder, used
to reconstruct the original observation $x$.

The parameters of the encoder, $\theta_m$, and of the decoder, $\theta_n$, are found
by minimizing the reconstruction loss over a minibatch of $k$ samples:
$$L(x) = \frac{1}{k} \sum_{i=1}^{k} \big( g_{\theta_n}(f_{\theta_m}(x_i)) - x_i \big)^2. \qquad (15)$$

In practice, the encoder allows us to pass from high-dimensional
observations to low-dimensional ones while preserving the relevant features of the input that are important for reconstructing it
with high fidelity.
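A minimal PyTorch sketch of such a convolutional autoencoder trained with the reconstruction loss of equation (15) is shown below; the layer sizes and the 84×84 grayscale input are placeholder choices of ours, since the paper does not specify the architecture here.

```python
import torch
import torch.nn as nn

class ConvAutoEncoder(nn.Module):
    """Encoder f_theta_m maps an 84x84 frame to a low-dimensional code;
    decoder g_theta_n reconstructs the frame from that code."""
    def __init__(self, code_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),   # 84 -> 20
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),  # 20 -> 9
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, code_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 32 * 9 * 9), nn.ReLU(),
            nn.Unflatten(1, (32, 9, 9)),
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2), nn.ReLU(),  # 9 -> 20
            nn.ConvTranspose2d(16, 1, kernel_size=8, stride=4),              # 20 -> 84
        )

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

# Reconstruction loss (eq. 15) on a random minibatch of k = 16 frames.
cae = ConvAutoEncoder()
x = torch.rand(16, 1, 84, 84)
recon, _ = cae(x)
loss = ((recon - x) ** 2).mean()
loss.backward()
```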
5.2 Feedback Replay Buffer

Even though we initialize our policy with encoded observations, due to
the fully connected layers that comprise the second half of $f_{\theta_m}$ we
still have a substantial number of parameters to learn. In order to
learn quickly with a minimal amount of feedback, we construct
a dataset of experience D = (s, a, f, s', p(a|s)), where s is the current
state, a is the action after which the agent transitions to the next state s', f is
the human feedback, and p(a|s) is the probability of taking action a
in state s.

In DEEP VR-COACH, we use an experience window, which is the sequence of
transitions between two non-zero feedbacks; that is, each time the human
gives feedback to the agent, it completes an entire window that is
then stored in the replay buffer for future training updates. For each uniformly sampled window from the buffer, we apply a gradient update and perform the training, as sketched below.
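The window-and-buffer mechanics (steps 6-9 of Algorithm 2) can be sketched as follows; this is our own illustration, and the bounded buffer capacity is an assumed detail.

```python
from collections import deque

class FeedbackReplayBuffer:
    """Accumulates transitions in a window; when a non-zero human feedback
    arrives, the L most recent transitions are flushed into the replay buffer
    as one window (steps 6-9 of Algorithm 2)."""
    def __init__(self, window_size=10, capacity=1000):
        self.L = window_size
        self.window = []                      # current open window W
        self.buffer = deque(maxlen=capacity)  # stored windows B

    def add(self, s, a, p, s_next, h):
        self.window.append((s, a, p, s_next, h))
        if h != 0:                            # human gave feedback: close the window
            self.buffer.append(self.window[-self.L:])
            self.window = []

# Usage: call buffer.add(s_t, a_t, p_t, s_next, h_t) at every timestep; the
# windows stored in buffer.buffer are then sampled uniformly for the update.
```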
5.3 Entropy Regularization

Here, our agent greedily selects the action that has the highest
probability under the current policy at each timestep instead of sampling randomly from the distribution. As a result, the agent may
keep choosing the same action because it produces some
positive reward: another action might yield a higher
reward, but the agent will never try it, since it only exploits
what it has already learned. In this situation, the agent gets stuck in a
local optimum and never reaches the global one. To avoid this, we
use entropy regularization to encourage exploration. We employ entropy regularization [6] of
the form $\rho \nabla_{\theta_t} H(\pi_{\theta_t}(\cdot \mid s_t))$, where $\rho$ is the regularization coefficient
used to maintain a high-entropy policy.
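As a quick numerical illustration of the regularized quantity (our own example), the entropy $H(\pi(\cdot \mid s)) = -\sum_a \pi(a \mid s)\ln \pi(a \mid s)$ is near zero for an almost deterministic policy and maximal for a uniform one, so adding $\rho \nabla_\theta H$ to the gradient pushes the policy away from premature determinism.

```python
import numpy as np

def entropy(probs):
    """H(pi(.|s)) = -sum_a pi(a|s) ln pi(a|s)."""
    probs = np.clip(probs, 1e-12, 1.0)
    return -np.sum(probs * np.log(probs))

print(entropy(np.array([0.98, 0.01, 0.01])))  # ~0.11: nearly deterministic policy
print(entropy(np.array([1/3, 1/3, 1/3])))     # ~1.10 (= ln 3): uniform policy
```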
Figure 3: Classic MountainCar
Figure 4: Comparison between VR-COACH, COACH, and TAMER with a simulated human, showing the mean and standard error over 30 runs of each algorithm.

6 EXPERIMENTS
In this section, we demonstrate the effectiveness of
VR-COACH in the HCRL setting. We first perform a small
proof of concept of VR-COACH in the classical MountainCar domain
to see whether adding a variance-reduction strategy to the policy gradient offers any advantage over other HCRL algorithms (COACH, TAMER [8]).
We also test DEEP VR-COACH in the rich 3D Malmo Minecraft
environment and the Bowling Atari game, using a simulated human trainer and a real human trainer respectively, and compare against
DEEP COACH [3] and DEEP TAMER [18].
6.1 VR-COACH Experiment
6.1.1 Classical MountainCar: This is a classic RL problem in which
the objective is to build an agent that learns to climb a steep
hill to reach the goal marked by a yellow flag (Fig. 3). This is not an
easy task because the car's engine is not powerful enough to drive
up the hill without a head start, so the car must first drive up the hill on the left
to gain enough momentum to scale the steeper hill to the right
and reach the goal.
6.1.2 Proxy Human Feedback Strategy: We build a program
that can replace the human (called a proxy human) and gives feedback based on the agent's behavior. In the context of MountainCar, it
consists of giving positive feedback (+1) if the action taken is in the direction of
the current velocity, and negative feedback (-1) otherwise, as sketched below.
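A minimal sketch of such a proxy trainer for the MountainCar observation (position, velocity) is given below; treating a no-push action or zero velocity as neutral feedback is our own assumption.

```python
def proxy_feedback(observation, action):
    """Simulated trainer for MountainCar: +1 if the push is aligned with the
    car's current velocity, -1 if it opposes it, 0 otherwise.

    observation = (position, velocity);
    action in {0: push left, 1: no push, 2: push right}.
    """
    _, velocity = observation
    if action == 1 or velocity == 0.0:
        return 0                       # neutral: no push, or no momentum to judge against
    pushing_right = action == 2
    return 1 if pushing_right == (velocity > 0) else -1

# Example: car moving right (velocity > 0), agent pushes right -> +1.
print(proxy_feedback((-0.4, 0.02), 2))   # 1
```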
6.1.3 Experiment details and results: The inputs to both the
policy and value networks are normalized over a window of the 200
previous data points. Both the eligibility trace decay term ($\lambda$) and
the discount factor ($\gamma$) were set to 0.95.

For this experiment, both coach and vr-coach in this domain
use the same artificial neural network to represent the policy; vr-coach additionally uses a critic network. Our policy networks
are fully connected with 16 hidden units and ReLU activation
functions, and we take a softmax over the 3-dimensional output to get the
probability of picking each action. The weights of the last layer of
the policy are initialized with small values in order to obtain an initial
policy with high entropy, as sketched below. We use learning rates of $\alpha_{actor} = 0.0025$
and $\alpha_{critic} = 0.01$, a discount factor of $\gamma = 0.99$, and an eligibility
trace decay of 0.95. For TAMER's reward function approximation,
we use the same parameters as in coach.
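A PyTorch sketch of a policy network matching this description (fully connected, 16 hidden ReLU units, softmax over the 3 actions, small final-layer initialization) follows; the input dimension of 2 for the MountainCar observation and the 0.01 initialization scale are our own placeholder choices.

```python
import torch
import torch.nn as nn

class MountainCarPolicy(nn.Module):
    """Fully connected policy: 16 hidden ReLU units, softmax over 3 actions."""
    def __init__(self, obs_dim=2, n_actions=3, hidden=16):
        super().__init__()
        self.hidden = nn.Linear(obs_dim, hidden)
        self.out = nn.Linear(hidden, n_actions)
        # Small final-layer weights give a near-uniform (high-entropy) initial policy.
        nn.init.normal_(self.out.weight, std=0.01)
        nn.init.zeros_(self.out.bias)

    def forward(self, obs):
        return torch.softmax(self.out(torch.relu(self.hidden(obs))), dim=-1)

policy = MountainCarPolicy()
print(policy(torch.tensor([-0.4, 0.02])))   # roughly uniform action probabilities
```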
Fig. 4 presents a comparison of all three HCRL algorithms over 30 trials. We can observe that VR-COACH solves the
MountainCar problem faster than COACH and TAMER while reaching
a mean reward of around -120. In fact, from the beginning
the VR-COACH agent is able to reach the goal within around 120
steps, whereas the COACH agent is still searching for the right behavior and the TAMER agent solves the problem within about 150 steps. After
around 15 minutes of training, all three algorithms manage to reach the
goal, with mean rewards of -120, -140, and -150 for
VR-COACH, COACH, and TAMER respectively. This shows how, in the case
of VR-COACH, variance-reduction techniques from the reinforcement learning setting can help adjust the policy gradient and
accelerate learning.
6.2 DEEP VR-COACH Experiment
6.2.1 Goal Navigation Task on Malmo/Minecraft: The agent
is randomly placed in a 10 × 10 grid room facing a single gold block
in the center of the room (Fig. 5) and should navigate from its start
location to the gold block. The environment's reward structure for
the task provides a reward of +200 to the agent for reaching
the gold block, while each step taken by the agent has a cost of 1.
Each episode runs until the agent finds the goal or until the agent
reaches a quota of 200 steps.
A. Proxy Human Feedback Strategy: For the goal navigation
task, we build a simulated human (a program) that gives feedback
based on the behavior of the agent. To do so, we compare the
distance from the goal in the agent's current state against that of the previous
state, as sketched below.
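A minimal sketch of this distance-based simulated trainer is given below (our own illustration; the Euclidean distance and the neutral feedback when the distance is unchanged are assumptions).

```python
import math

def goal_navigation_feedback(prev_pos, curr_pos, goal_pos):
    """Simulated trainer for the Malmo goal-navigation task: +1 if the agent got
    closer to the gold block than it was in the previous state, -1 if it moved
    away, 0 if the distance is unchanged."""
    dist = lambda p: math.dist(p, goal_pos)
    if dist(curr_pos) < dist(prev_pos):
        return 1
    if dist(curr_pos) > dist(prev_pos):
        return -1
    return 0

print(goal_navigation_feedback(prev_pos=(0, 0), curr_pos=(1, 0), goal_pos=(5, 5)))  # 1
```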
B. Experiment details and results: For this experiment, we
first train a convolutional autoencoder (CAE) on a dataset of 20,000
images collected by executing a random policy, and then use
that CAE to initialize our actor and critic networks. All observations are represented as 84×84 images.
Figure 5: Malmo: Goal Navigation Task

Figure 6: Mean episodic reward for the goal navigation task

Figure 7: Bowling Atari game
For policy optimization, we use RMSProp [15] for the actor network and Adam [7] for the
critic network. We use learning rates of $\alpha_{actor} = 0.00025$ and
$\alpha_{critic} = 0.001$, a minibatch size of 16, a discount factor $\gamma = 0.99$,
an entropy regularization coefficient $\rho = 1.5$, and a window size L = 10.
We apply gradient clipping to avoid exploding gradients and
share some training variables between the actor and critic networks.
For DEEP COACH, we use the same CAE to initialize the policy
network and the same hyperparameters as the DEEP VR-COACH
actor network. The DEEP TAMER reward network is identical to the
DEEP VR-COACH policy network, with the only difference being the
use of a final softmax activation function. The Adam optimizer is
used to optimize the network with the same learning rate as DEEP
COACH, a buffer update interval of 10, and a credit assignment
interval of [0.2, 4.0].
Fig. 6 presents the mean episodic environment reward obtained by all algorithms over ten trials of the goal navigation task. We
observe that in the early episodes, both DEEP TAMER and DEEP
COACH show a large degree of variance while acquiring the simulated human feedback, whereas DEEP VR-COACH starts settling
into the desired behavior. After a few episodes of learning, all agents arrive at a policy that consistently reaches
the target (the gold block). Furthermore, we observe that DEEP VR-COACH
always reaches the goal at a lower cost than DEEP
TAMER and DEEP COACH. It turns out that in DEEP VR-COACH,
since the agent can learn through the critic even when no feedback is
given, the agent's policy converges faster to the desired behavior.
6.2.2 Atari Environment: Bowling game. This is an Atari game
(Fig. 7) with pixel-level state, where the bowler can take actions
(throw, no-op, move up, and move down) and the goal is to
knock down as many pins as possible. The bowler has two throws
to knock down all the pins. Possible outcomes are a strike (the bowler
knocks down all pins on the first throw), a spare (the bowler knocks
down all remaining pins on the second throw), and an open frame (the bowler
knocks down fewer than 10 pins).
A. Real human feedback strategy on Bowling: We assume
that the human trainer is familiar with the game of Bowling. The human can therefore train the agent to play the game by giving binary
feedback (+1 for "good", -1 for "bad") based on the behavior of the
agent.
B. Experiment details and results on Bowling: We
use the same hyperparameters as in the goal navigation
task on Malmo.
Fig. 8 presents the mean episodic reward on the game of Bowling
obtained by all algorithms over 10 human trainers. At the beginning, all algorithms show high variance when receiving human
feedback. After a few episodes of training, DEEP TAMER is able to
achieve a score greater than 100, whereas DEEP VR-COACH and DEEP
COACH are still improving slowly. This behavior is due to the sparsity
of the reward in Bowling, which prevents the agent from converging to a
good policy. After 15 minutes of training, we observe that DEEP TAMER
performs better than DEEP VR-COACH and DEEP COACH. The
sparse and delayed reward in Bowling makes it somewhat difficult
for DEEP VR-COACH and DEEP COACH, and actor-critic algorithms
in general, to learn quickly. To improve the results, we would need to add some
RL techniques (reward clipping/scaling, more exploration, etc.) to
compensate for these two aspects.
Figure 8: Mean episodic reward on Bowling Atari Game
7 CONCLUSION AND FUTURE WORK
In this paper, we presented a new HCRL algorithm, an extension of
episodic COACH, that interprets human feedback as reward and incorporates variance-reduction policy gradient techniques commonly
used in RL. We then experimentally validated whether VR-COACH
leads to performance improvements in the HCRL setting by
performing a proof-of-concept study in the classic MountainCar environment with a simulated human trainer. We found that it
improves learning speed compared to COACH
and TAMER. In order to learn complex tasks, we implemented
deep reinforcement learning techniques on top of VR-COACH and
demonstrated the effectiveness of DEEP VR-COACH compared to
DEEP TAMER and DEEP COACH using simulated human feedback.
Finally, we evaluated the algorithms on the Atari Bowling game using real
human feedback; in this case, DEEP TAMER outperforms DEEP
VR-COACH and DEEP COACH.

One possible direction for future work is to study deep
reinforcement learning techniques that can help deal with sparse and delayed rewards, specifically on Atari games, and to compare
against DEEP TAMER, which has proven able to achieve high scores
after 15 minutes of training.
REFERENCES
[1] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario
Amodei. 2017. Deep reinforcement learning from human preferences. In Neural
Information Processing Systems.
[2] Kamil Ciosek and Shimon Whiteson. 2018. Expected policy gradients. AAAI
Conference on Artificial Intelligence.
[3] Dilip Arumugam, Jun Ki Lee, Sophie Saskin, and Michael Littman. 2018. Deep
reinforcement learning from policy-dependent human feedback. arXiv preprint.
[4] Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. 2004. Variance reduction
techniques for gradient estimates in reinforcement learning. Journal of Machine
Learning Research 5 (2004).
[5] G. E. Hinton and R. R. Salakhutdinov. 2006. Reducing the dimensionality of data with
neural networks. Science 313, 5786 (2006).
[6] Ronald J. Williams and Jing Peng. 1991. Function optimization using connectionist
reinforcement learning algorithms. Connection Science (1991).
[7] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR (2014).
[8] W Bradley Knox and Peter Stone. 2008. TAMER: Training an agent manually via
evaluative reinforcement. In IEEE International Conference on Development and
Learning.
[9] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez,
Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep
reinforcement learning. In International Conference on Learning Representations.
[10] James MacGlashan, Mark K Ho, Robert Loftin, Bei Peng, Guan Wang, David L
Roberts, Matthew E Taylor, and Michael L Littman. 2017. Interactive learning
from policy-dependent human feedback. In International Conference on Machine
Learning.
[11] Martin L. Puterman. 1994. Markov Decision Processes: Discrete Stochastic
Dynamic Programming. John Wiley & Sons, Inc., New York, NY.
[12] Brenna Argall, Sonia Chernova, Manuela M. Veloso, and Brett Browning. 2009. A
survey of robot learning from demonstration. Robotics and Autonomous Systems (2009).
[13] Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles Lee Isbell, and Andrea Lockerd Thomaz. 2013. Policy shaping: Integrating human feedback with reinforcement
learning. In Advances in Neural Information Processing Systems.
[14] Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles Lee Isbell, and Andrea Lockerd Thomaz. 2015. Learning behaviors via human-delivered discrete
feedback. In Autonomous Agents and Multi-Agent Systems.
[15] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov.
2017. Proximal policy optimization algorithms. CoRR (2017).
[16] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour.
1999. Policy gradient methods for reinforcement learning with function approximation. In NIPS.
[17] Richard S Sutton and Andrew G Barto. 1998. Reinforcement Learning: An Introduction. Vol. 1.
[18] Garrett Warnell, Nicholas Waytowich, Vernon Lawhern, and Peter Stone. 2018.
Deep TAMER: Interactive agent shaping in high-dimensional state spaces. AAAI
Conference on Artificial Intelligence.
[19] Ronald J Williams. 1992. Simple statistical gradient-following algorithms for
connectionist reinforcement learning. In Reinforcement Learning. Springer, 5–32.