Reinforcement learning - Episodic Semi-gradient Sarsa with Neural Network
While trying to implement episodic semi-gradient Sarsa with a neural network as the approximator, I wondered how to choose the optimal action based on the learned weights of the network. If the action space is discrete, I can calculate the estimated value of the different actions in the current state and choose the one that gives the maximum. But this does not seem to be the best way of solving the problem. Furthermore, it does not work if the action space can be continuous (like the acceleration of a self-driving car, for example).
So, basically, I am wondering how to solve the 10th line, "Choose A' as a function of q(S', ·, w)", in Sutton's pseudo-code for this algorithm.
How are these problems typically solved? Can someone recommend an example of this algorithm using Keras?
Edit: Do I need to modify the pseudo-code when using a network as the approximator, so that I minimize the MSE between the prediction of the network and the reward R, for example?
"I wondered how to choose the optimal action based on the learned weights of the network"
You have three basic choices:
1. Run the network multiple times, once for each possible value of A' to go with the S' value that you are considering. Take the action with the maximum predicted value as the optimum action (with probability 1-ε, otherwise choose randomly, as in the ε-greedy policy typically used in Sarsa).
2. Design the network to estimate all action values at once, i.e. to have |A(s)| outputs (perhaps padded to cover "impossible" actions that you need to filter out). This alters the gradient calculations slightly: zero gradient should be applied to the inactive outputs of the last layer (i.e. any output not matching the A of (S,A)). Again, take the maximum valid output as the estimated optimum action. This can be more efficient than running the network multiple times. It is also the approach used by the recent DQN Atari game-playing bot, and by AlphaGo's policy networks.
3. Use a policy-gradient method, which works by using samples to estimate the gradient that would improve a policy estimator. See chapter 15 of Sutton and Barto's current draft of Reinforcement Learning: An Introduction for more details. Policy-gradient methods become attractive when there are large numbers of possible actions, and they can cope with continuous action spaces (by estimating the distribution function for the optimal policy, e.g. choosing the mean and standard deviation of a normal distribution, which you can then sample from to take your action). You can also combine policy-gradient with a state-value approach in actor-critic methods, which can be more efficient learners than pure policy-gradient approaches.
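For the two value-based choices, the action selection step itself is the same once you have one estimate per discrete action. A minimal sketch, assuming the estimates are already available as a plain list (from one multi-output forward pass, or from one forward pass per action). The function name `epsilon_greedy` and its parameters are illustrative and not taken from the pseudo-code or from any library:

```python
import random

def epsilon_greedy(q_values, epsilon, valid_actions=None, rng=random):
    """Pick a discrete action index from estimated action values.

    q_values:      list with one estimate per action for the current state
    epsilon:       probability of choosing a random valid action
    valid_actions: optional list of allowed indices, used to filter out
                   padded "impossible" outputs (assumption: all are valid
                   when this is omitted)
    """
    if valid_actions is None:
        valid_actions = list(range(len(q_values)))
    if rng.random() < epsilon:
        return rng.choice(valid_actions)      # explore
    best = max(q_values[a] for a in valid_actions)
    # Break ties at random so equal estimates do not always favour index 0.
    ties = [a for a in valid_actions if q_values[a] == best]
    return rng.choice(ties)                   # exploit

# Usage: greedy choice over three actions, then the same with action 1 masked.
print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))
print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, valid_actions=[0, 2]))
```

In Sarsa the same function would be used for both A and A', so that the action used in the update is the one actually taken next.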
Note that if your action space is continuous, you don't have to use a policy-gradient method; you could just quantise the action. Also, in some cases, even when actions are continuous in theory, you may find that the optimal policy only involves using extreme values (the classic mountain car example falls into this category: the only useful actions are maximum acceleration and maximum backwards acceleration).
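Quantising could look something like the following. This is only a sketch: the helper name `quantise_actions` and the range [-1, 1] are made up for illustration, and how many levels you need depends on the problem.

```python
def quantise_actions(low, high, n):
    """Return n evenly spaced action values covering [low, high]."""
    if n < 2:
        raise ValueError("need at least two actions")
    step = (high - low) / (n - 1)
    return [low + i * step for i in range(n)]

# Usage: three acceleration levels (full reverse, coast, full forward).
# The network then gets one output per entry, as in a discrete problem.
actions = quantise_actions(-1.0, 1.0, 3)
print(actions)  # [-1.0, 0.0, 1.0]
```

With n = 2 this reduces to the two extreme values, which matches the mountain car case.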
"Do I need to modify the pseudo-code when using a network as the approximator, so that I minimize the MSE between the prediction of the network and the reward R, for example?"
No. There is no separate loss function in the pseudo-code, such as the MSE you would see used in supervised learning. The error term (often called the TD error) is given by the part in square brackets, and achieves a similar effect. Literally, the term ∇q(S,A,w) (sorry for the missing hat, no LaTeX on SO) means the gradient of the estimator itself - not the gradient of any loss function.
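To make the roles of the TD error and the gradient term concrete, here is a sketch of a single update step. It uses a linear estimator purely to keep the example short, because the gradient of a linear q with respect to w is just the feature vector; with a neural network that gradient would come from backpropagation instead. The names `q_hat`, `sarsa_update`, `x` and `x_next` are illustrative choices, not part of the pseudo-code:

```python
def q_hat(w, x):
    """Linear action-value estimate: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sarsa_update(w, x, reward, x_next, alpha, gamma, terminal=False):
    """One episodic semi-gradient Sarsa step; returns the new weights.

    x and x_next are feature vectors for (S, A) and (S', A').
    """
    # On a terminal transition there is no next action value to bootstrap from.
    target = reward if terminal else reward + gamma * q_hat(w, x_next)
    td_error = target - q_hat(w, x)  # the part in square brackets
    # "Semi-gradient": the target is treated as a constant, so only the
    # gradient of q_hat(w, x) appears, which for a linear estimator is x.
    return [wi + alpha * td_error * xi for wi, xi in zip(w, x)]

# Usage: zero weights, reward 1, so the TD error is 1 and only the weight
# of the active feature moves.
w = sarsa_update([0.0, 0.0], [1.0, 0.0], 1.0, [0.0, 1.0], alpha=0.5, gamma=0.9)
print(w)  # [0.5, 0.0]
```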