When Life Gives You BC, Make Q-functions:
Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

1Robotics and AI Institute 2Brown University 3Northeastern University
Q2RL

Abstract

Behavior Cloning (BC) has emerged as a highly effective paradigm for robot learning. However, BC lacks a self-guided mechanism for online improvement after demonstrations have been collected. Existing offline-to-online learning methods often cause policies to discard previously learned good actions due to a distribution mismatch between offline data and online learning.

In this work, we propose Q2RL, Q-Estimation and Q-Gating from BC for Reinforcement Learning, an algorithm for efficient offline-to-online learning. Our method consists of two parts: (1) Q-Estimation extracts a Q function from a BC policy using a few interaction steps with the environment, followed by online RL with (2) Q-Gating, which switches between BC and RL policy actions based on their respective Q values to collect samples for RL policy training.

Across manipulation tasks from the D4RL and robomimic benchmarks, Q2RL outperforms SOTA offline-to-online learning baselines on success rate and time to convergence. Q2RL is efficient enough to be applied in an on-robot RL setting, learning robust policies for contact-rich, high-precision manipulation tasks such as pipe assembly and kitting in 1-2 hours of online interaction, achieving success rates of up to 100% and up to 3.75x improvement over the original BC policy.


Q2RL: Two-Phase Framework

Phase 1

Q-Estimation

We estimate the Q-function of the pretrained BC policy by assuming its action distribution approximates a Boltzmann distribution. Using only the BC policy's action log-probabilities and entropy (no training data required), we derive:

$\hat{Q}_{\rm BC}(s, a) = V_{\rm BC}(s) + \log\pi_{\rm BC}(a \mid s) + \mathcal{H}[\pi_{\rm BC}(\cdot \mid s)].$
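One way to see where this identity comes from (a sketch, assuming a unit-temperature Boltzmann policy and writing $V_{\rm BC}(s) = \mathbb{E}_{a\sim\pi_{\rm BC}}[Q_{\rm BC}(s,a)]$):

```latex
\begin{align}
\pi_{\rm BC}(a \mid s) = \frac{\exp Q_{\rm BC}(s,a)}{Z(s)}
\;&\Rightarrow\;
Q_{\rm BC}(s,a) = \log \pi_{\rm BC}(a \mid s) + \log Z(s), \\
-\mathcal{H}[\pi_{\rm BC}(\cdot \mid s)]
= \mathbb{E}_{a\sim\pi_{\rm BC}}\!\left[\log \pi_{\rm BC}(a \mid s)\right]
= V_{\rm BC}(s) - \log Z(s)
\;&\Rightarrow\;
\log Z(s) = V_{\rm BC}(s) + \mathcal{H}[\pi_{\rm BC}(\cdot \mid s)].
\end{align}
```

Substituting the second line's expression for $\log Z(s)$ into the first recovers the estimator above.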

The value function $V_{\rm BC}$ is estimated via Monte Carlo returns from a small number of initial rollouts.
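A minimal sketch of Q-Estimation, assuming a Gaussian BC policy head; `GaussianBCPolicy`, `monte_carlo_value`, and `q_estimate` are hypothetical stand-ins (the real BC policy would be a learned network exposing `log_prob` and `entropy`):

```python
import numpy as np

class GaussianBCPolicy:
    """Toy stand-in for a pretrained BC policy with a Gaussian action head."""

    def __init__(self, mean, std):
        self.mean, self.std = mean, std

    def log_prob(self, a):
        # log N(a; mean, std^2)
        return -0.5 * ((a - self.mean) / self.std) ** 2 \
               - np.log(self.std * np.sqrt(2 * np.pi))

    def entropy(self):
        # Closed-form entropy of a 1-D Gaussian.
        return 0.5 * np.log(2 * np.pi * np.e * self.std ** 2)

def monte_carlo_value(returns):
    """Estimate V_BC(s) as the mean return of a few BC rollouts from s."""
    return float(np.mean(returns))

def q_estimate(policy, v_bc, action):
    """Q_hat_BC(s, a) = V_BC(s) + log pi_BC(a|s) + H[pi_BC(.|s)]."""
    return v_bc + policy.log_prob(action) + policy.entropy()

# Usage: a few initial rollouts give Monte Carlo returns for V_BC.
policy = GaussianBCPolicy(mean=0.2, std=0.1)
v_bc = monte_carlo_value([4.8, 5.1, 5.0])  # toy rollout returns
q = q_estimate(policy, v_bc, action=0.2)   # Q at the BC mean action
```

Note that no demonstration data appears anywhere above: the estimate needs only the policy's log-probabilities, its entropy, and a handful of environment rollouts.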

Phase 2

Q-Gating

During online RL, Q-Gating maintains two Q-functions: a frozen $\hat{Q}_{BC}$ (preserving BC performance) and a trainable $Q_{RL}$ (enabling improvement). At each step, the action with the higher Q-value is executed:

$a = \begin{cases} a_{\rm BC} \sim \pi_{\rm BC}(\cdot \mid s), & \hat{Q}_{\rm BC}(s, a_{\rm BC}) > Q_{\rm RL}(s, a_{\rm RL}) \\ a_{\rm RL} \sim \pi_{\rm RL}(\cdot \mid s), & \text{otherwise.} \end{cases}$

This mechanism prevents catastrophic forgetting of good BC actions while allowing the RL policy to explore and improve in states where BC is suboptimal. An auxiliary BC loss further stabilizes training for safe on-robot deployment.
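The gating rule itself is a one-line comparison per step. A sketch, with toy callables standing in for the learned policies and Q-functions (all names here are hypothetical):

```python
def q_gated_action(state, pi_bc, pi_rl, q_bc_hat, q_rl):
    """Q-Gating step: execute whichever action has the higher Q-value.

    pi_bc / q_bc_hat are frozen (preserving BC performance);
    pi_rl / q_rl are trainable (enabling improvement).
    Returns the executed action and which policy produced it.
    """
    a_bc = pi_bc(state)  # sample from the frozen BC policy
    a_rl = pi_rl(state)  # sample from the trainable RL policy
    if q_bc_hat(state, a_bc) > q_rl(state, a_rl):
        return a_bc, "bc"
    return a_rl, "rl"

# Usage with toy stand-ins: the BC estimate is high near state 0,
# while the RL critic values its own action more elsewhere.
pi_bc = lambda s: 0.1
pi_rl = lambda s: 0.9
q_bc_hat = lambda s, a: 1.0 - abs(s)  # frozen estimate, peaks at s = 0
q_rl = lambda s, a: 0.5               # trainable critic (constant toy value)

a_near, src_near = q_gated_action(0.0, pi_bc, pi_rl, q_bc_hat, q_rl)  # BC wins
a_far, src_far = q_gated_action(2.0, pi_bc, pi_rl, q_bc_hat, q_rl)    # RL wins
```

Every executed transition, whichever policy produced it, is added to the replay buffer used to train the RL policy.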

Q2RL Overview Figure

Q2RL consists of Q-Estimation and Q-Gating. Q-Estimation extracts a Q-function from a BC policy using its value function, action log-probabilities, and entropy. During RL policy training, Q-Gating selects and executes whichever of the BC and RL actions has the higher Q-value, updating the RL policy on the collected interactions.


BC actions for shared behavior

Q2RL uses BC actions for behaviors common across task settings, such as moving between bins.


RL actions for shifted parts

RL takes over for grasping and placement when parts are in new locations outside the BC training distribution.

No offline data needed

Q2RL does not require access to the BC training demonstrations during the online RL phase.

On-Robot RL Results

We evaluate Q2RL on a Franka Panda arm with a Robotiq 2F-85 gripper, using workspace and wrist RGB cameras. All tasks involve contact-rich, high-precision manipulation.

Peg Insertion

We compare Q2RL against IBRL, a SOTA BC-to-RL method.

BC Policy: 70%
Q2RL (Ours): 100% ✓
IBRL: 95%

All videos are shown at 1x speed.

Pipe Assembly

Prior SOTA methods such as IBRL fail entirely on this longer-horizon, contact-rich task. Q2RL learns to grasp, align, and insert within 2.5 hours.

BC Policy: 20%
Q2RL (Ours): 75% ✓
IBRL: 0%


Adapting Beyond BC Training Distribution

The BC policy for Kitting was trained with only a single object in each bin. Q2RL adapts to the modified setting with two objects per bin, a task distribution the BC policy was never trained on.

BC policy achieves 95% success on Kitting-Original, but only 35% success on Kitting-Modified (two objects per bin). Q2RL recovers to 70% success on the harder modified task, adapting to the distribution shift through online RL.

BC Policy — Kitting-Modified: 35% success
Q2RL — Kitting-Modified: 70% success ✓
IBRL — Kitting-Modified: 0% success


Overcoming BC Failure Modes

Q2RL strategically switches to RL actions to recover from common BC failure modes.

BC gets stuck → RL takes over
Using RL for regrasping

Safe Exploration via Q-Guided Policy

A core challenge in real-world RL is aggressive exploration that causes unsafe robot behavior. Q2RL enables safer exploration from the start by grounding policy improvement in estimated BC Q-values.

IBRL — Aggressive Exploration: 2 safety violations
Q2RL — Safe Exploration: 0 safety violations
Blue border = BC actions
Yellow border = RL actions

On-Robot RL Timelapse

Timelapse of learning pipe assembly using Q2RL, recorded every ten episodes (approx. 2 hrs total)

Blue border = BC actions
Yellow border = RL actions

Simulation Benchmark Results

Q2RL outperforms SOTA offline-to-online baselines on D4RL and robomimic benchmarks across state-based and image-based observations.

D4RL Results

Fig. 3: Results on D4RL (Kitchen, Pen, Door). Q2RL starts with strong initial performance from BC and continuously improves, outperforming WSRL, CalQL, CQL, and IBRL baselines.

Robomimic Results

Fig. 4: Results on robomimic (Lift, Can, Square) — both with and without access to offline training data. Q2RL is the only method that reliably learns without offline data, while remaining competitive with data access.

BibTeX

If you find Q2RL useful in your research, please cite our paper:

@inproceedings{dodeja2026q2rl,
  title     = {When Life Gives You BC, Make Q-functions:
               Extracting Q-values from Behavior Cloning
               for On-Robot Reinforcement Learning},
  author    = {Dodeja, Lakshita and Biza, Ondrej and Vats, Shivam and
               Hart, Stephen and Tellex, Stefanie and Walters, Robin and
               Schmeckpeper, Karl and Weng, Thomas},
  booktitle = {Robotics: Science and Systems (RSS)},
  year      = {2026},
}