OBAC

Abstract

Off-policy reinforcement learning (RL) has achieved notable success in tackling many complex real-world tasks, by leveraging previously collected data for policy learning. However, most existing off-policy RL algorithms fail to maximally exploit the information in the replay buffer, limiting sample efficiency and policy performance. In this work, we discover that concurrently training an offline RL policy based on the shared online replay buffer can sometimes outperform the original online learning policy, though the occurrence of such performance gains remains uncertain. This motivates a new possibility of harnessing the emergent outperforming offline optimal policy to improve online policy learning. Based on this insight, we present Offline-Boosted Actor-Critic (OBAC), a model-free online RL framework that elegantly identifies the outperforming offline policy through value comparison, and uses it as an adaptive constraint to guarantee stronger policy learning performance. Our experiments demonstrate that OBAC outperforms other popular model-free RL baselines and rivals advanced model-based RL methods in terms of sample efficiency and asymptotic performance across 53 tasks spanning 6 task suites.
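To make the abstract's central idea concrete, the following is a minimal sketch of a value-comparison gate. It is not the authors' implementation or exact objective. It assumes, purely for illustration, that we already have per-state value estimates for the online policy (`v_online`) and the concurrently trained offline policy (`v_offline`), and that the constraint is a squared-distance penalty pulling the online action toward the offline action. The function names, the penalty form, and the `weight` parameter are all hypothetical.

```python
from typing import List, Sequence


def constraint_mask(v_online: Sequence[float], v_offline: Sequence[float]) -> List[float]:
    """Return 1.0 for states where the offline policy's value estimate is
    strictly higher than the online policy's, and 0.0 elsewhere."""
    return [1.0 if v_off > v_on else 0.0 for v_on, v_off in zip(v_online, v_offline)]


def actor_loss(
    q_values: Sequence[float],
    online_actions: Sequence[Sequence[float]],
    offline_actions: Sequence[Sequence[float]],
    v_online: Sequence[float],
    v_offline: Sequence[float],
    weight: float = 1.0,  # hypothetical trade-off coefficient
) -> float:
    """Illustrative actor loss: maximize Q everywhere, and add a penalty
    toward the offline action only where the offline policy looks better."""
    mask = constraint_mask(v_online, v_offline)
    total = 0.0
    for q, a_on, a_off, m in zip(q_values, online_actions, offline_actions, mask):
        # Assumed constraint form: squared distance between the two actions.
        distance = sum((x - y) ** 2 for x, y in zip(a_on, a_off))
        total += -q + weight * m * distance
    return total / len(q_values)


# Two states: the offline policy is better in the first, worse in the second.
mask = constraint_mask([1.0, 3.0], [2.0, 2.5])
loss = actor_loss(
    q_values=[0.5, 1.0],
    online_actions=[[0.0, 0.0], [1.0, 1.0]],
    offline_actions=[[1.0, 0.0], [0.0, 0.0]],
    v_online=[1.0, 3.0],
    v_offline=[2.0, 2.5],
)
print(mask, loss)
```

In this toy batch the constraint is active only for the first state, so the second state is updated as in an ordinary unconstrained actor step. This is the sense in which the constraint is adaptive.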

Policy Behavior

We visualize OBAC's behavior on a range of tasks to show its effectiveness.

OBAC in Mujoco Benchmarks

Ant

HumanoidStandup

Swimmer

OBAC in DM Control Benchmarks

AcrobotSwingup

DogWalk

DogRun

HumanoidRun

Swimmer15

WalkerRun

OBAC in Meta-World Benchmarks

Assembly

Coffee Push

Disassemble

Hammer

Pick Place

Soccer

OBAC in Adroit Benchmarks

Baoding-4th

Door

Pen

OBAC in ManiSkill2 Benchmarks

PickYCB

StackCube

TurnFaucet

OBAC in Myosuite Benchmarks

myoHandKey

myoHandPose

myoHandReach

Benchmark Results

We evaluate OBAC across 53 continuous control tasks spanning 6 domains: Mujoco, DM Control, Meta-World, Adroit, Myosuite, and ManiSkill2. These tasks cover a wide range of challenges, including high-dimensional states and actions (up to $\mathcal{S}\in\mathbb{R}^{375}$ and $\mathcal{A}\in\mathbb{R}^{39}$), sparse rewards, multi-object and delicate manipulation, musculoskeletal control, and complex locomotion.

OBAC in Mujoco Benchmarks

We evaluate OBAC on 6 continuous control tasks from the Mujoco suite.

OBAC in DM Control Benchmarks

We evaluate OBAC on 17 continuous control tasks from the DM Control suite.

OBAC in Meta-World Benchmarks

We evaluate OBAC on 17 continuous control tasks from the Meta-World suite.

OBAC in Adroit Benchmarks

We evaluate OBAC on 4 continuous control tasks from the Adroit suite.

OBAC in ManiSkill2 Benchmarks

We evaluate OBAC on 5 continuous control tasks from the ManiSkill2 suite.

OBAC in Myosuite Benchmarks

We evaluate OBAC on 4 continuous control tasks from the Myosuite suite.
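As a quick consistency check, the per-suite task counts listed above can be summed and compared against the headline figure of 53 tasks across 6 domains. The dictionary below simply restates the counts from this page; the variable names are illustrative.

```python
# Task counts per suite, as stated in the subsections above.
tasks_per_suite = {
    "Mujoco": 6,
    "DM Control": 17,
    "Meta-World": 17,
    "Adroit": 4,
    "ManiSkill2": 5,
    "Myosuite": 4,
}

total_tasks = sum(tasks_per_suite.values())
num_suites = len(tasks_per_suite)
print(f"{total_tasks} tasks across {num_suites} suites")
```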

Offline-Boosted Actor-Critic: Adaptively Blending Optimal Historical Behaviors in Deep Off-Policy RL
