
PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO) and Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR).


tttor/pytorch-a2c-ppo-acktr

 
 


study-ppo

fact

  • the return modulator in the loss, psi, is set to the advantage, A = Q - V
  • filter both reward and observ using VecNormalize()
  • most network params are shared between the actor and critic nets
  • do clip the gradient (see the clipping sketch at the end of this section)
  • use the state value, NOT the state-action value
  • reset is NOT called during rollout;
    • this is NOT the same as in openai-baselines
    • in /home/tor/ws/baselines/baselines/acktr/acktr_cont_kfac.py
      • reset is called at the beginning of every run_one_episode()
  • does NOT use concat_observ
  • plot return vs nstep, using
    • smoothing: smooth_reward_curve(x, y)
    • fix_point(x, y, interval)
  • use Monitor(): baselines/baselines/bench/monitor.py
# in Monitor, when an episode ends: total reward, length, and elapsed wall-clock time
eprew = sum(self.rewards)
eplen = len(self.rewards)
epinfo = {"r": round(eprew, 6), "l": eplen, "t": round(time.time() - self.tstart, 6)}

question

  • why squeeze at dim=1: observ, reward, done, info = envs.step(action.squeeze(1).cpu().numpy())
  • why does this become old_action_log_probs? there has been no update yet, has there? (see the sketch after the snippet below)
data_generator = rollouts.feed_forward_generator(
    advantages, self.num_mini_batch)

for sample in data_generator:
    observations_batch, states_batch, actions_batch, \
        return_batch, masks_batch, old_action_log_probs_batch, \
        adv_targ = sample
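In PPO generally, the log-probs saved at rollout time become the "old" log-probs; they only start to differ from the freshly evaluated ones once the policy has been updated at least once (on the very first mini-batch the ratio is exactly 1). A hedged sketch of the standard clipped surrogate, with dummy tensors standing in for the mini-batch entries above and clip_param as an assumed hyperparameter:

import torch

# dummy stand-ins for the mini-batch entries unpacked above
action_log_probs = torch.randn(32, 1)            # log pi(a|s) under the CURRENT policy
old_action_log_probs_batch = torch.randn(32, 1)  # log-probs stored at rollout time
adv_targ = torch.randn(32, 1)                    # advantage targets
clip_param = 0.2                                 # assumed PPO clipping hyperparameter

ratio = torch.exp(action_log_probs - old_action_log_probs_batch)
surr1 = ratio * adv_targ
surr2 = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param) * adv_targ
action_loss = -torch.min(surr1, surr2).mean()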

question: answered

  • why set return[-1] = next_value? why not set the last return to (0 if done, else a copy of the last value), like: vpred_t = np.append(vpred_t, 0.0 if path["terminated"] else vpred_t[-1]) at /home/tor/ws/baselines/baselines/acktr/acktr_cont.py
    • it seems more stable to bootstrap with the predicted value than to set the last return to 0 at terminal states (which would be valid, since they are absorbing) or to copy the previous value
    • this is also related to the rollout design, which runs non-stop (contiguous) across episodes; see the toy example after the snippet below
def compute_returns(self, next_value, use_gae, gamma, tau):
    ...
    else:
        self.returns[-1] = next_value
        for step in reversed(range(self.rewards.size(0))):
            self.returns[step] = self.returns[step + 1] * \
                gamma * self.masks[step + 1] + self.rewards[step]
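A small self-contained toy version of the recursion above (single process, made-up numbers), showing how next_value bootstraps the tail of the returns and how a zero mask stops the propagation at an episode boundary:

import torch

# toy version: single process, 4 steps, dimensions flattened for readability
gamma = 0.99
rewards = torch.tensor([1.0, 1.0, 1.0, 1.0])        # r_0 .. r_3
# masks[t + 1] = 0.0 if the episode terminated at step t, else 1.0
masks = torch.tensor([1.0, 1.0, 0.0, 1.0, 1.0])
next_value = 5.0                                    # critic's V(s_T), not 0

returns = torch.zeros(rewards.size(0) + 1)
returns[-1] = next_value                            # bootstrap the tail
for step in reversed(range(rewards.size(0))):
    returns[step] = returns[step + 1] * gamma * masks[step + 1] + rewards[step]

print(returns[:-1])  # tensor([1.9900, 1.0000, 6.8905, 5.9500])
# the zero at masks[2] stops the bootstrapped value from leaking
# across the episode boundary into steps 0 and 1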
  • why does act() return pred_state_value, in addition to action and action_log_prob: action, action_log_prob, pred_state_value = actor_critic_net.act(observ)
    • it is used to compute advantage: pred_advs = rollouts.returns[:-1] - rollouts.pred_state_values[:-1]
  • states? cf. observations
    • they seem to be used only for Atari / image inputs
  • max_grad_norm?
    • for clipping the grad, before optim.step()
  • why is adv computed this way? Q from empirical returns; V from the critic's prediction
    • thus we have a predicted advantage; only Q can be estimated empirically
    • the true V is an expectation over all actions
def update(self, rollouts, eps=1e-5):
    # Compute advantages: $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$
    advantages = rollouts.returns[:-1] - rollouts.value_preds[:-1]
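A guess at what eps is for: a common PPO trick is to standardize the advantages before building mini-batches, with a small epsilon guarding against division by zero; a minimal sketch under that assumption (the tensor is a random stand-in):

import torch

eps = 1e-5
advantages = torch.randn(128, 1)   # stand-in for returns[:-1] - value_preds[:-1]
# zero-mean, unit-variance advantages make the update scale less sensitive
# to the raw reward magnitude; eps guards against a near-zero std
advantages = (advantages - advantages.mean()) / (advantages.std() + eps)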
  • what does this do? from openai-baselines: envs = VecNormalize(envs, gamma=args.gamma)
    • normalizes and clips both observ and reward (see the filtering sketch below)
    • filters observ using running mean/std statistics
def __init__(self, venv, ob=True, ret=True, clipob=10., cliprew=10., gamma=0.99, epsilon=1e-8):
    VecEnvWrapper.__init__(self, venv)
    self.ob_rms = RunningMeanStd(shape=self.observation_space.shape) if ob else None
    self.ret_rms = RunningMeanStd(shape=()) if ret else None
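On "normalize and clip observ and reward": a stripped-down paraphrase, for study purposes only (not baselines' exact code), of the kind of filtering VecNormalize applies with its running statistics:

import numpy as np

def filter_obs(obs, mean, var, clipob=10.0, epsilon=1e-8):
    # standardize with the running observation statistics, then clip
    return np.clip((obs - mean) / np.sqrt(var + epsilon), -clipob, clipob)

def filter_rew(rew, ret_var, cliprew=10.0, epsilon=1e-8):
    # rewards are scaled by the std of the running discounted return, then clipped
    return np.clip(rew / np.sqrt(ret_var + epsilon), -cliprew, cliprew)

print(filter_obs(np.array([0.3, -2.0, 50.0]), mean=0.0, var=4.0))  # [ 0.15 -1.   10.  ]
print(filter_rew(1.0, ret_var=2.5))                                # ~0.632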
  • this VecNormalize setup does not allow a reset at every episode, unless the env is wrapped with Monitor(allow_early_resets=True)
  • mask used for?
    • masks = torch.FloatTensor([[0.0] if done_ else [1.0] for done_ in done])
    • needed because the rollout is nstep-based, rather than organized into episodes delimited by done
    • there are multiple processes that may have different episode lengths
  • num_steps? for learning batch?
    • nstep per update
    • see: num_updates = int(args.num_frames) // args.num_steps // args.num_processes
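To make the nstep bookkeeping concrete, a quick sketch with made-up numbers (not the repo's defaults):

# made-up values, only to illustrate the batching arithmetic above
num_frames, num_steps, num_processes = 1_000_000, 128, 8

frames_per_update = num_steps * num_processes            # 1024 transitions per update
num_updates = num_frames // num_steps // num_processes   # 976 updates in total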

todo

  • reset per episode
  • entropy info of action distrib
  • nprocess > 1
  • GAE
  • recurrent net
  • gym robotic env
  • cuda compatibility

setup

  • visdom
    • pip install visdom
    • python -m visdom.server
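A minimal visdom usage sketch (assumes the server started above is reachable at the default localhost:8097; the curve data is made up):

import numpy as np
import visdom

viz = visdom.Visdom()                              # connects to localhost:8097 by default
nsteps = np.arange(0, 10_000, 1_000)
returns = np.random.randn(len(nsteps)).cumsum()    # made-up learning curve
viz.line(X=nsteps, Y=returns,
         opts=dict(title='return vs nstep', xlabel='nstep', ylabel='return'))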
