
Deep Deterministic Policy Gradient (DDPG)

Overview

DDPG is a popular DRL algorithm for continuous control. It extends the ideas behind DQN, such as the replay buffer and target networks, to continuous action spaces by training a deterministic actor alongside a Q-function critic. As an off-policy algorithm, it also has good sample efficiency compared to on-policy algorithms such as PPO.

Original paper:

  • Continuous control with deep reinforcement learning (Lillicrap et al., 2016)1, https://arxiv.org/abs/1509.02971

Reference resources:

  • OurDDPG.py in sfujim/TD3 (Fujimoto et al., 2018)2

Implemented Variants

| Variants Implemented | Description |
| --- | --- |
| ddpg_continuous_action.py, docs | For continuous action spaces. Also implements MuJoCo-specific code-level optimizations. |

Below are our single-file implementations of DDPG:

ddpg_continuous_action.py

The ddpg_continuous_action.py has the following features:

  • For continuous action spaces; also implements MuJoCo-specific code-level optimizations
  • Works with the Box observation space of low-level features
  • Works with the Box (continuous) action space

Usage

poetry install
poetry install -E pybullet
python cleanrl/ddpg_continuous_action.py --help
python cleanrl/ddpg_continuous_action.py --env-id HopperBulletEnv-v0
poetry install -E mujoco # only works in Linux
python cleanrl/ddpg_continuous_action.py --env-id Hopper-v3

Explanation of the logged metrics

Running python cleanrl/ddpg_continuous_action.py will automatically record various metrics, such as the training losses, in TensorBoard. Below is the documentation for these metrics:

  • charts/episodic_return: episodic return of the game
  • charts/SPS: number of steps per second
  • losses/qf1_loss: the mean squared error (MSE) between the Q values at timestep \(t\) and the Bellman target formed from the reward and the target network's Q values at timestep \(t+1\); minimizing it reduces the temporal-difference error (see the sketch after this list).
  • losses/actor_loss: implemented as -qf1(data.observations, actor(data.observations)).mean(); it is the negative mean Q value computed from 1) the sampled observations and 2) the actions the actor outputs for those observations. By minimizing actor_loss, the optimizer updates the actor's parameters using the following gradient (Lillicrap et al., 2016, Algorithm 1)1:
\[ \nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_i\left.\left.\nabla_{a} Q\left(s, a \mid \theta^{Q}\right)\right|_{s=s_{i}, a=\mu\left(s_{i}\right)} \nabla_{\theta^{\mu}} \mu\left(s \mid \theta^{\mu}\right)\right|_{s_{i}} \]
  • losses/qf1_values: implemented as qf1(data.observations, data.actions).view(-1); it is the average Q value of the sampled data in the replay buffer; useful for gauging whether under- or over-estimation happens.
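
For reference, the following self-contained sketch shows how these logged quantities relate to each other. The networks and the replay-buffer batch are stand-ins (randomly initialized modules and random tensors), not the script's exact code, but the loss computations mirror the descriptions above.

import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in networks and a fake replay-buffer batch, only for illustration;
# the real script uses the QNetwork/Actor classes shown further below.
obs_dim, act_dim, gamma = 17, 6, 0.99
qf1 = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
qf1_target, target_actor = copy.deepcopy(qf1), copy.deepcopy(actor)

obs, next_obs = torch.randn(256, obs_dim), torch.randn(256, obs_dim)
actions, rewards, dones = torch.rand(256, act_dim) * 2 - 1, torch.randn(256, 1), torch.zeros(256, 1)

# Bellman target: reward plus the discounted target-network Q value at t+1
with torch.no_grad():
    next_q = qf1_target(torch.cat([next_obs, target_actor(next_obs)], 1))
    td_target = rewards.flatten() + (1 - dones.flatten()) * gamma * next_q.view(-1)

qf1_values = qf1(torch.cat([obs, actions], 1)).view(-1)    # losses/qf1_values (its mean is logged)
qf1_loss = F.mse_loss(qf1_values, td_target)               # losses/qf1_loss
actor_loss = -qf1(torch.cat([obs, actor(obs)], 1)).mean()  # losses/actor_loss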

Implementation details

Our ddpg_continuous_action.py is based on OurDDPG.py from sfujim/TD3, which presents the following implementation differences from (Lillicrap et al., 2016)1:

  1. ddpg_continuous_action.py uses Gaussian exploration noise \(\mathcal{N}(0, 0.1)\), while (Lillicrap et al., 2016)1 uses an Ornstein-Uhlenbeck process with \(\theta=0.15\) and \(\sigma=0.2\).
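
    A minimal, self-contained sketch contrasting the two noise schemes (the hyperparameters follow the text above; this is illustrative and not the script's exact code):

    import numpy as np

    rng = np.random.default_rng(0)
    act_dim = 6

    # Gaussian exploration noise N(0, 0.1): drawn independently at every environment step
    gaussian_noise = rng.normal(0.0, 0.1, size=act_dim)

    # Ornstein-Uhlenbeck process (theta=0.15, sigma=0.2): temporally correlated noise that
    # drifts back toward zero; one common discretization with timestep dt
    theta, sigma, dt = 0.15, 0.2, 1e-2
    ou_state = np.zeros(act_dim)
    for _ in range(3):
        ou_state = ou_state + theta * (0.0 - ou_state) * dt + sigma * np.sqrt(dt) * rng.standard_normal(act_dim)

    # In both cases the noise is added to the actor's deterministic action, and the sum is
    # clipped to the environment's action bounds before being sent to the environment.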

  2. ddpg_continuous_action.py runs the experiments using the openai/gym MuJoCo environments, while (Lillicrap et al., 2016)1 uses their proprietary MuJoCo environments.

  3. ddpg_continuous_action.py uses the following architecture:

    class QNetwork(nn.Module):
        def __init__(self, env):
            super(QNetwork, self).__init__()
            self.fc1 = nn.Linear(np.array(env.single_observation_space.shape).prod() + np.prod(env.single_action_space.shape), 256)
            self.fc2 = nn.Linear(256, 256)
            self.fc3 = nn.Linear(256, 1)
    
        def forward(self, x, a):
            x = torch.cat([x, a], 1)
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            x = self.fc3(x)
            return x
    
    
    class Actor(nn.Module):
        def __init__(self, env):
            super(Actor, self).__init__()
            self.fc1 = nn.Linear(np.array(env.single_observation_space.shape).prod(), 256)
            self.fc2 = nn.Linear(256, 256)
            self.fc_mu = nn.Linear(256, np.prod(env.single_action_space.shape))
    
        def forward(self, x):
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            return torch.tanh(self.fc_mu(x))
    
    while (Lillicrap et al., 2016, see Appendix 7 EXPERIMENT DETAILS)1 uses the following architecture, where the hidden layers are 400 and 300 units wide and the action enters the Q-network at the second layer rather than the first:

    class QNetwork(nn.Module):
        def __init__(self, env):
            super(QNetwork, self).__init__()
            self.fc1 = nn.Linear(np.array(env.single_observation_space.shape).prod(), 400)
            self.fc2 = nn.Linear(400 + np.prod(env.single_action_space.shape), 300)
            self.fc3 = nn.Linear(300, 1)
    
        def forward(self, x, a):
            x = F.relu(self.fc1(x))
            x = torch.cat([x, a], 1)
            x = F.relu(self.fc2(x))
            x = self.fc3(x)
            return x
    
    
    class Actor(nn.Module):
        def __init__(self, env):
            super(Actor, self).__init__()
            self.fc1 = nn.Linear(np.array(env.single_observation_space.shape).prod(), 400)
            self.fc2 = nn.Linear(400, 300)
            self.fc_mu = nn.Linear(300, np.prod(env.single_action_space.shape))
    
        def forward(self, x):
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            return torch.tanh(self.fc_mu(x))
    
  4. ddpg_continuous_action.py uses the following learning rates:

    q_optimizer = optim.Adam(list(qf1.parameters()), lr=3e-4)
    actor_optimizer = optim.Adam(list(actor.parameters()), lr=3e-4)
    
    while (Lillicrap et al., 2016, see Appendix 7 EXPERIMENT DETAILS)1 uses the following learning rates:

    q_optimizer = optim.Adam(list(qf1.parameters()), lr=1e-4)
    actor_optimizer = optim.Adam(list(actor.parameters()), lr=1e-3)
    
  5. ddpg_continuous_action.py uses --batch-size=256 --tau=0.005, while (Lillicrap et al., 2016, see Appendix 7 EXPERIMENT DETAILS)1 uses --batch-size=64 --tau=0.001, where tau is the coefficient of the soft target-network update (see the sketch below).
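
For completeness, below is a self-contained sketch of the soft (Polyak) target-network update that --tau controls; the critic here is a stand-in module, not the script's exact class.

import copy

import torch
import torch.nn as nn

# Stand-in critic and its target copy
qf1 = nn.Sequential(nn.Linear(23, 256), nn.ReLU(), nn.Linear(256, 1))
qf1_target = copy.deepcopy(qf1)
tau = 0.005  # 0.001 in (Lillicrap et al., 2016)

# After each gradient step, the target network slowly tracks the online network
with torch.no_grad():
    for param, target_param in zip(qf1.parameters(), qf1_target.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

A smaller tau makes the target network change more slowly, which stabilizes the TD target at the cost of slower propagation of improvements from the online network.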

Experiment results

PR vwxyzjn/cleanrl#137 tracks our effort to conduct experiments, and the reproduction instructions can be found at vwxyzjn/cleanrl/benchmark/ddpg.

Below are the average episodic returns for ddpg_continuous_action.py (3 random seeds). To ensure the quality of the implementation, we compared the results against (Fujimoto et al., 2018)2.

| Environment | ddpg_continuous_action.py | OurDDPG.py (Fujimoto et al., 2018, Table 1)2 | DDPG.py using settings from (Lillicrap et al., 2016)1 in (Fujimoto et al., 2018, Table 1)2 |
| --- | --- | --- | --- |
| HalfCheetah | 9260.485 ± 643.088 | 8577.29 | 3305.60 |
| Walker2d | 1728.72 ± 758.33 | 3098.11 | 1843.85 |
| Hopper | 1404.44 ± 544.78 | 1860.02 | 2020.46 |
Info

Note that ddpg_continuous_action.py uses the gym MuJoCo v2 environments while OurDDPG.py (Fujimoto et al., 2018)2 uses the gym MuJoCo v1 environments. According to openai/gym#834, the gym MuJoCo v2 environments should be equivalent to the gym MuJoCo v1 environments.

Also note that our ddpg_continuous_action.py seems to perform worse than the reference implementation on Walker2d and Hopper. This is likely due to openai/gym#938. We would have a hard time reproducing the gym MuJoCo v1 results because those environments have long been deprecated.

Learning curves:

Tracked experiments and gameplay videos:


  1. Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N.M., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous control with deep reinforcement learning. arXiv preprint, abs/1509.02971. https://arxiv.org/abs/1509.02971

  2. Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. arXiv preprint, abs/1802.09477. https://arxiv.org/abs/1802.09477
