
PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO) and Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR).


tttor/pytorch-a2c-ppo-acktr

 
 


study-ppo

fact

  • the return modulator in the loss, psi, is set to the advantage, A = Q - V
  • filter both reward and observ using VecNormalize()
  • most network params are shared between the actor and critic nets
  • do clip the gradient (see the clipping sketch at the end of this section)
  • use the state value, NOT the state-action value
  • reset is NOT called during rollout;
    • this is NOT the same as in openai-baselines
    • in /home/tor/ws/baselines/baselines/acktr/acktr_cont_kfac.py
      • reset is called at the beginning of every run_one_episode()
  • does NOT use concat_observ
  • plot return vs nstep, using
    • smoothing: smooth_reward_curve(x, y)
    • fix_point(x, y, interval)
  • use Monitor(): baselines/baselines/bench/monitor.py
# in Monitor, when an episode ends: total reward, length, and elapsed wall-clock time
eprew = sum(self.rewards)
eplen = len(self.rewards)
epinfo = {"r": round(eprew, 6), "l": eplen, "t": round(time.time() - self.tstart, 6)}

question

  • why squeeze at dim=1: observ, reward, done, info = envs.step(action.squeeze(1).cpu().numpy())
  • why does this become old_action_log_probs? there has been no update yet, has there? (see the sketch after the snippet below)
data_generator = rollouts.feed_forward_generator(
    advantages, self.num_mini_batch)

for sample in data_generator:
    observations_batch, states_batch, actions_batch, \
        return_batch, masks_batch, old_action_log_probs_batch, \
        adv_targ = sample
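In PPO generally, the log-probs saved at rollout time become the "old" log-probs; they only start to differ from the freshly evaluated ones once the policy has been updated at least once (on the very first mini-batch the ratio is exactly 1). A hedged sketch of the standard clipped surrogate, with dummy tensors standing in for the mini-batch entries above and clip_param as an assumed hyperparameter:

import torch

# dummy stand-ins for the mini-batch entries unpacked above
action_log_probs = torch.randn(32, 1)            # log pi(a|s) under the CURRENT policy
old_action_log_probs_batch = torch.randn(32, 1)  # log-probs stored at rollout time
adv_targ = torch.randn(32, 1)                    # advantage targets
clip_param = 0.2                                 # assumed PPO clipping hyperparameter

ratio = torch.exp(action_log_probs - old_action_log_probs_batch)
surr1 = ratio * adv_targ
surr2 = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param) * adv_targ
action_loss = -torch.min(surr1, surr2).mean()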

question: answered

  • why set return[-1] = next_value? why not set the last return to (0 if done, else a copy of the last value), like: vpred_t = np.append(vpred_t, 0.0 if path["terminated"] else vpred_t[-1]) at /home/tor/ws/baselines/baselines/acktr/acktr_cont.py
    • it seems more stable to bootstrap with the predicted value than to set the last return to 0 at terminal states (which would be valid, since they are absorbing) or to copy the previous value
    • this is also related to the rollout design, which runs non-stop (contiguous) across episodes; see the toy example after the snippet below
def compute_returns(self, next_value, use_gae, gamma, tau):
    ...
    else:
        self.returns[-1] = next_value
        for step in reversed(range(self.rewards.size(0))):
            self.returns[step] = self.returns[step + 1] * \
                gamma * self.masks[step + 1] + self.rewards[step]
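A small self-contained toy version of the recursion above (single process, made-up numbers), showing how next_value bootstraps the tail of the returns and how a zero mask stops the propagation at an episode boundary:

import torch

# toy version: single process, 4 steps, dimensions flattened for readability
gamma = 0.99
rewards = torch.tensor([1.0, 1.0, 1.0, 1.0])        # r_0 .. r_3
# masks[t + 1] = 0.0 if the episode terminated at step t, else 1.0
masks = torch.tensor([1.0, 1.0, 0.0, 1.0, 1.0])
next_value = 5.0                                    # critic's V(s_T), not 0

returns = torch.zeros(rewards.size(0) + 1)
returns[-1] = next_value                            # bootstrap the tail
for step in reversed(range(rewards.size(0))):
    returns[step] = returns[step + 1] * gamma * masks[step + 1] + rewards[step]

print(returns[:-1])  # tensor([1.9900, 1.0000, 6.8905, 5.9500])
# the zero at masks[2] stops the bootstrapped value from leaking
# across the episode boundary into steps 0 and 1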
  • why does act() return pred_state_value, in addition to action and action_log_prob: action, action_log_prob, pred_state_value = actor_critic_net.act(observ)
    • it is used to compute advantage: pred_advs = rollouts.returns[:-1] - rollouts.pred_state_values[:-1]
  • states? cf. observations
    • they seem to be used only for Atari / image inputs
  • max_grad_norm?
    • for clipping the grad, before optim.step()
  • why is adv computed this way? Q from empirical returns; V from the critic's prediction
    • thus we have a predicted advantage; only Q can be estimated empirically
    • the true V is an expectation over all actions
def update(self, rollouts, eps=1e-5):
    # Compute advantages: $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$
    advantages = rollouts.returns[:-1] - rollouts.value_preds[:-1]
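A guess at what eps is for: a common PPO trick is to standardize the advantages before building mini-batches, with a small epsilon guarding against division by zero; a minimal sketch under that assumption (the tensor is a random stand-in):

import torch

eps = 1e-5
advantages = torch.randn(128, 1)   # stand-in for returns[:-1] - value_preds[:-1]
# zero-mean, unit-variance advantages make the update scale less sensitive
# to the raw reward magnitude; eps guards against a near-zero std
advantages = (advantages - advantages.mean()) / (advantages.std() + eps)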
  • what does this do? from openai-baselines: envs = VecNormalize(envs, gamma=args.gamma)
    • normalizes and clips both observ and reward (see the filtering sketch below)
    • filters observ using running mean/std statistics
def __init__(self, venv, ob=True, ret=True, clipob=10., cliprew=10., gamma=0.99, epsilon=1e-8):
    VecEnvWrapper.__init__(self, venv)
    self.ob_rms = RunningMeanStd(shape=self.observation_space.shape) if ob else None
    self.ret_rms = RunningMeanStd(shape=()) if ret else None
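On "normalize and clip observ and reward": a stripped-down paraphrase, for study purposes only (not baselines' exact code), of the kind of filtering VecNormalize applies with its running statistics:

import numpy as np

def filter_obs(obs, mean, var, clipob=10.0, epsilon=1e-8):
    # standardize with the running observation statistics, then clip
    return np.clip((obs - mean) / np.sqrt(var + epsilon), -clipob, clipob)

def filter_rew(rew, ret_var, cliprew=10.0, epsilon=1e-8):
    # rewards are scaled by the std of the running discounted return, then clipped
    return np.clip(rew / np.sqrt(ret_var + epsilon), -cliprew, cliprew)

print(filter_obs(np.array([0.3, -2.0, 50.0]), mean=0.0, var=4.0))  # [ 0.15 -1.   10.  ]
print(filter_rew(1.0, ret_var=2.5))                                # ~0.632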
  • this VecNormalize setup does not allow a reset at every episode, unless the env is wrapped with Monitor(allow_early_resets=True)
  • mask used for?
    • masks = torch.FloatTensor([[0.0] if done_ else [1.0] for done_ in done])
    • needed because the rollout is nstep-based, rather than organized into episodes delimited by done
    • there are multiple processes that may have different episode lengths
  • num_steps? for learning batch?
    • nstep per update
    • see: num_updates = int(args.num_frames) // args.num_steps // args.num_processes
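To make the nstep bookkeeping concrete, a quick sketch with made-up numbers (not the repo's defaults):

# made-up values, only to illustrate the batching arithmetic above
num_frames, num_steps, num_processes = 1_000_000, 128, 8

frames_per_update = num_steps * num_processes            # 1024 transitions per update
num_updates = num_frames // num_steps // num_processes   # 976 updates in total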

todo

  • reset per episode
  • entropy info of action distrib
  • nprocess > 1
  • GAE
  • recurrent net
  • gym robotic env
  • cuda compatibility

setup

  • visdom
    • pip install visdom
    • python -m visdom.server
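A minimal visdom usage sketch (assumes the server started above is reachable at the default localhost:8097; the curve data is made up):

import numpy as np
import visdom

viz = visdom.Visdom()                              # connects to localhost:8097 by default
nsteps = np.arange(0, 10_000, 1_000)
returns = np.random.randn(len(nsteps)).cumsum()    # made-up learning curve
viz.line(X=nsteps, Y=returns,
         opts=dict(title='return vs nstep', xlabel='nstep', ylabel='return'))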
