Streaming Flow Policy

Simplifying diffusion/flow-matching policies by
treating action trajectories as flow trajectories


Massachusetts Institute of Technology

Overview video (5 min)
Why streaming flow policy?


Left: Conventional diffusion/flow policies are a "trajectory of trajectories"

Diffusion policies and flow-matching policies take a history of observations (not shown) as input and predict a "chunk" of future robot actions. The \(x\)-axis represents the robot action space, and the \(y\)-axis represents increasing diffusion timesteps. Conventional diffusion/flow policies sample a "trajectory of trajectories", i.e., a diffusion/flow trajectory of action trajectories. This framework discards all intermediate action trajectories that are computed, and must wait for the diffusion/flow process to complete before any actions can be executed on the robot.
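As a point of reference, here is a minimal Python sketch (our own illustration, not any particular implementation) of conventional chunked inference; denoise_step and the hyperparameter values are assumed placeholders:

import numpy as np

def conventional_chunked_inference(denoise_step, history, horizon, action_dim,
                                   n_diffusion_steps=50):
    # A "trajectory of trajectories": the sample being refined is an entire
    # action chunk (horizon x action_dim), and every diffusion/flow step
    # produces an intermediate chunk that is ultimately discarded.
    chunk = np.random.randn(horizon, action_dim)  # start from pure noise
    for s in range(n_diffusion_steps):
        chunk = denoise_step(chunk, s, history)   # refine the whole chunk
    return chunk  # only now can any action be executed on the robot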

Right: Streaming flow policy treats action trajectories as flow trajectories

We introduce an imitation learning framework that simplifies diffusion/flow policies by treating "action trajectories as flow trajectories". We develop a novel flow-matching algorithm in action space, instead of action trajectory space. First, we sample from a narrow Gaussian centered at the most recently generated/executed action. Then, we iteratively sample a sequence of actions that constitutes a single action trajectory. This aligns the "diffusion time" of the flow sampling process with the "execution time" of the action trajectory. Importantly, the computed actions can be streamed to the robot's actuators on the fly during the flow sampling process, enabling significantly faster and more reactive policies.
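To make the contrast concrete, here is a minimal Python sketch of streaming inference (our own illustration; velocity_net, execute_fn, and the hyperparameter values are assumed placeholders):

import numpy as np

def stream_actions(velocity_net, history, a_prev, execute_fn,
                   n_steps=100, sigma_0=0.01):
    # Sample one action trajectory by Euler-integrating the learned velocity
    # field, sending every intermediate action to the robot as soon as it is
    # computed. velocity_net(a, t, history) returns the predicted velocity.
    dt = 1.0 / n_steps
    # Initialize from a narrow Gaussian centered at the last executed action.
    a = a_prev + sigma_0 * np.random.randn(*np.shape(a_prev))
    for i in range(n_steps):
        a = a + dt * velocity_net(a, i * dt, history)  # one Euler step
        execute_fn(a)  # flow time equals execution time: act immediately
    return a  # seeds the next chunk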

Constructing conditional flows

To illustrate our method, consider the toy example shown in (a): a 1-D action space (\(x\)-axis) with two demonstration trajectories shown in blue and red. The \(y\)-axis represents time. Given a demonstration trajectory from the training set (e.g., the blue one), we first construct a conditional flow in (b) that samples trajectories from a thin Gaussian tube around the demonstration. Let us represent the trajectory by \(\xi: [0, 1] \rightarrow \mathcal{A}\). We construct a velocity field \(v_\xi(a, t)\) as follows:

$$ \begin{align} p_\xi^0(a) &:= \mathcal{N}\left(a ~\big\vert~ \xi(0) \,,\, \sigma_0^2\right)\\ v_\xi(a, t) &:= \underbrace{\dot{\xi}(t)}_{\text{Demonstration velocity}} - \underbrace{k\left(a - \xi(t)\right)}_{\text{Stabilization term}} \end{align} $$

The initial distribution \(p_\xi^0(a)\) is a narrow Gaussian centered at the starting action \(\xi(0)\) of the demonstration, with a small standard deviation \(\sigma_0\). The velocity field \(v_\xi(a, t)\) is a combination of two terms. The demonstration velocity is the time derivative of the demonstration trajectory at time \(t\); this term drives the flow along the demonstration. The stabilization term is a negative proportional error feedback that corrects deviations and guides the flow back toward the demonstration trajectory. Block et al. have shown that controllers that stabilize around demonstration trajectories reduce distribution shift and improve theoretical imitation learning guarantees.
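As a concrete illustration, the conditional flow above can be written in a few lines of Python (a minimal sketch with our own naming; the demonstration functions and the values of k and sigma_0 are illustrative placeholders, not values from the paper):

import numpy as np

def make_conditional_flow(xi, xi_dot, k=5.0, sigma_0=0.01):
    # Conditional flow around one demonstration xi: [0, 1] -> A.
    # xi(t) is the demonstrated action at time t, xi_dot(t) its derivative.
    def sample_initial_action():
        # p_xi^0(a) = N(a | xi(0), sigma_0^2): narrow Gaussian at the start.
        return xi(0.0) + sigma_0 * np.random.randn()

    def velocity(a, t):
        # v_xi(a, t) = demonstration velocity - proportional error feedback.
        return xi_dot(t) - k * (a - xi(t))

    return sample_initial_action, velocity

# Rolling out the conditional flow for a toy 1-D demonstration stays inside
# a thin Gaussian tube around xi(t).
sample_a0, v = make_conditional_flow(lambda t: 0.5 * t, lambda t: 0.5)
a, dt = sample_a0(), 0.01
for i in range(100):
    a += v(a, i * dt) * dt  # Euler step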

Learning marginal flows

Then, we learn a marginal velocity field that expresses multi-modal trajectories, as shown in (c). The marginal distribution over actions at each time slice induced by the learned velocity field matches the training distribution! The sampled trajectories shown in (e) indeed bifurcate and cover both modes of the demonstrated data.

How does this magic work? The answer is flow matching!

Let \(p_\mathcal{D}(h, \xi)\) denote the training distribution over histories \(h\) and future action trajectories \(\xi\). Using the constructed conditional velocity fields \(v_\xi(a, t)\) as targets, we train a neural network \(v_\theta(a, t \mid h)\) using a finite-sample estimate of the conditional flow matching loss: \begin{align} \mathcal{L}_{\text{CFM}}(v_\theta, p_\mathcal{D}) := ~ &\mathbb{E}_{(h, \xi) \sim p_\mathcal{D}}~\mathbb{E}_{t \sim {U}[0,1]}~\mathbb{E}_{a \sim p_\xi\left(a \mid t\right)} ~ \big\|v_\theta(a, t \mid h) - v_\xi(a, t)\big\|_2^2 \end{align} This is simply an \(L_2\) loss between the predicted velocity field \(v_\theta(a, t \mid h)\) and the conditional velocity field \(v_\xi(a, t)\) used as target. The expectation is taken over histories and trajectories in the training distribution \(p_\mathcal{D}(h, \xi)\), time \(t\) sampled uniformly from \([0, 1]\), and actions \(a\) sampled from the constructed conditional flow.

The flow matching theorem states that if the velocity network is trained well, then the per-timestep marginal action distribution induced by the learned flow matches the average of the per-timestep marginal distributions \(p_\xi(a \mid t)\) of the constructed conditional flows, taken over the distribution \(p_\mathcal{D}(\xi \mid h)\) of future trajectories in the training set that share the given observation history \(h\): \begin{align} p^*(a \mid t, h) = \int_\xi p_\xi(a \mid t)\,p_\mathcal{D}(\xi \mid h) \, d\xi \end{align}

Streaming flow policy is therefore able to ✅ represent multi-modal distributions over action trajectories like diffusion/flow policies, while also being able to ✅ stream actions during the flow sampling process.
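For concreteness, here is a minimal PyTorch-style sketch of one loss evaluation (our own illustration; the network interface velocity_net(a, t, h) and the uniform-grid discretization of demonstrations are assumptions). It uses the fact that integrating the conditional ODE above gives \(a(t) = \xi(t) + e^{-kt}\,(a(0) - \xi(0))\), so \(p_\xi(a \mid t)\) is a Gaussian centered at \(\xi(t)\) with standard deviation \(\sigma_0 e^{-kt}\):

import torch

def cfm_loss(velocity_net, histories, xi, xi_dot, k=5.0, sigma_0=0.01):
    # Finite-sample estimate of the conditional flow matching loss.
    # histories: (B, ...) encoded observation histories.
    # xi, xi_dot: (B, T, A) demonstrations and their time derivatives,
    # discretized on a uniform time grid over [0, 1].
    B, T, A = xi.shape
    t = torch.rand(B, 1)                          # t ~ U[0, 1]
    idx = (t.squeeze(1) * (T - 1)).long()         # grid index at/below t
    xi_t = xi[torch.arange(B), idx]               # xi(t),     shape (B, A)
    xi_dot_t = xi_dot[torch.arange(B), idx]       # xi_dot(t), shape (B, A)

    # a ~ p_xi(a | t) = N(xi(t), sigma_0^2 e^{-2kt})
    a = xi_t + sigma_0 * torch.exp(-k * t) * torch.randn(B, A)

    v_target = xi_dot_t - k * (a - xi_t)          # conditional velocity
    v_pred = velocity_net(a, t, histories)        # learned velocity
    return ((v_pred - v_target) ** 2).sum(dim=-1).mean()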

Decoupling stochasticity via latent variables
Constructing a conditional flow using stochastic latent variables instead of adding noise to actions.

In this toy example, the \(x\)-axis represents a 1-D action space, and the \(y\)-axis represents action execution time, which is equal to the flow progression time. (a) The toy bi-modal training set, same as the previous example, contains two trajectories shown in red and blue. Given a demonstration trajectory from the training set (e.g., the demonstration in blue), we design a velocity field \(v(a, z, t \mid h)\) that takes as input time \(t \in [0, 1]\), the action \(a\) at time \(t\), as well as an additional latent variable \(z \in \mathcal{A}\). The latent variable is responsible for injecting noise into the flow sampling process, allowing the initial action \(a(0)\) to be deterministically set to the initial action of the demonstration. The latent variable \(z(0) \sim \mathcal{N}(0, 1)\) is sampled from the standard normal distribution at the beginning of the flow process, similar to conventional diffusion/flow policies. The velocity field \(v(a, z, t \mid h)\) generates trajectories in an extended sample space \([0, 1] \rightarrow \mathcal{A}^2\) where \(a\) and \(z\) are correlated and co-evolve with time. (b) shows the marginal distribution of the actions \(a(t)\), and (c) shows the marginal distribution of the latent variable \(z(t)\), at each time step. Overlaid in red are the \(a\)- and \(z\)-projections, respectively, of trajectories sampled from the velocity field. The action's trajectory evolves in a narrow Gaussian tube around the demonstration, while the latent variable's trajectory starts from \(\mathcal{N}(0, 1)\) at \(t=0\) and converges to the demonstration trajectory at \(t=1\); see the paper for a full description of the velocity field.

Learned marginal velocity field under the decoupled auxiliary stochastic latent variables.

The marginal velocity field \(v_{\theta}(a, z, t \mid h)\) learned using the flow construction above. (a, b) show the marginal distributions of the actions \(a(t)\) and the latent variable \(z(t)\), respectively, at each time step under the learned velocity field. (c, d) show the \(a\)- and \(z\)-projections, respectively, of trajectories sampled from the learned velocity field. By construction, \(a(0)\) deterministically starts from the most recently executed action, whereas \(z(0)\) is sampled from \(\mathcal{N}(0, 1)\). Trajectories starting with \(z(0) < 0\) are shown in blue, and those with \(z(0) > 0\) are shown in red.

The main takeaway is that in (c), even though all samples deterministically start from the same initial action (i.e., the most recently executed action), they evolve in a stochastic manner that covers both modes of the training distribution. This is possible because the stochastic latent variable \(z\) is correlated with \(a\), and the initial random sample \(z(0) \sim \mathcal{N}(0, 1)\) informs the direction \(a\) evolves in.
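A minimal sampling sketch in the extended \((a, z)\) space (our own illustration; the joint network interface velocity_net(a, z, t, h), returning both time derivatives, is an assumption):

import torch

@torch.no_grad()
def sample_with_latent(velocity_net, history, a_prev, n_steps=100):
    # a(0) is set deterministically to the most recently executed action;
    # all stochasticity enters through the latent z(0) ~ N(0, 1).
    dt = 1.0 / n_steps
    a = a_prev.clone()
    z = torch.randn_like(a_prev)
    actions = []
    for i in range(n_steps):
        da, dz = velocity_net(a, z, i * dt, history)  # assumed joint field
        a, z = a + dt * da, z + dt * dz               # a and z co-evolve
        actions.append(a.clone())                     # streamable on the fly
    return torch.stack(actions)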

Caveats
Streaming flow policy only matches per-timestep marginal distributions, not necessarily the joint distribution
Samples from the deterministic variant of streaming flow policy, with shapes "Ɛ" and "3".

A toy example illustrating how streaming flow policy matches the marginal distribution of actions at every time step, but not necessarily their joint distribution. The \(x\)-axis represents a 1-D action space, and the \(y\)-axis represents action execution time, which equals the flow progression time. (a) The bi-modal training set contains two intersecting demonstration trajectories, shown in blue and red, with shapes "S" and "Ƨ" respectively. (b) The learned velocity field \(v_{\theta}(a, t \mid h)\) and the induced marginal action distribution at each time step. The marginal distributions perfectly match the training data. (c) Trajectories sampled from the learned velocity field. Trajectories that start from \(a < 0\) are shown in blue, and those that start from \(a > 0\) are shown in red. The sampled trajectories have shapes "Ɛ" and "3", with equal probability. These shapes differ from the shapes "S" and "Ƨ" in the training distribution, even though their per-timestep marginal distributions are identical.
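To see why this happens, consider a simplified analogue of this figure (our own illustrative example, not from the paper) with two straight-line demonstrations that cross at \(t = \tfrac{1}{2}\): \begin{align} \xi_1(t) = t - \tfrac{1}{2}, \qquad \xi_2(t) = \tfrac{1}{2} - t, \qquad t \in [0, 1]. \end{align} At every time \(t\), the marginal distribution over actions places equal mass near \(\pm\left(t - \tfrac{1}{2}\right)\), and a flow can match these marginals exactly. But the learned velocity field \(v_\theta(a, t \mid h)\) defines an ODE in action space, and solutions of an ODE cannot cross one another; so a sampled trajectory that approaches \(a = 0\) from below at \(t = \tfrac{1}{2}\) must turn back toward \(a < 0\) (and symmetrically from above), which is exactly how the "S" and "Ƨ" demonstrations turn into "Ɛ" and "3" samples.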

Samples from the stochastic variant of streaming flow policy.

Different variants of streaming flow policy can produce different joint distributions over actions that are consistent with the per-timestep marginal distributions of the training data. This example is produced using the stochastic variant of streaming flow policy. (a) The marginal action distribution at each time step learned by the stochastic streaming flow policy matches the training data. (b) Samples from the trained policy produce four modes with shapes "Ƨ", "S", "Ɛ" and "3", whereas the training data contains only two modes with shapes "S" and "Ƨ". Due to the stochasticity introduced by the decoupled latent variables, trajectories in blue (respectively, red) that start from \(a < 0\) (respectively, \(a > 0\)) are able to split in either direction at the point of intersection.

While streaming flow policy cannot capture global constraints across timesteps, it can still capture local constraints such as arbitrary position constraints and convex velocity constraints; see the paper for more details.

Real-world demo

We train streaming flow policy on a 7-DOF Franka FR3 robot arm to perform the task of picking up a green apple in the presence of clutter. We provide human demonstrations via kinesthetic teaching. The robot's observations are raw RGB images, and the action space is the pose of the gripper. We find that streaming flow policy produces smooth, fast, and reactive motions while performing the task.

Acknowledgements

This material is based upon work supported by the MIT Quest for Intelligence initiative. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of MIT or the U.S. Government.

