Research · February 2026

What if video editing had a volume knob?

Text-driven video editing lets you describe what to change. But what about how much? We introduce Adaptive-Origin Guidance (AdaOr), a method that turns edit intensity into a smooth, continuous slider.

"Replace the white bunny with a chocolate bunny made of brown chocolate"
α = 0.00
Original Full Edit
Edit Intensity: 0%

The "take it or leave it" problem

Modern video editing models are remarkably powerful. You type "turn the coat into a furry coat" and the model delivers. But here's the catch — there's no precise way to express how much to edit. "Make it a little furry" or "make it extremely furry" is a fuzzy and discrete choice, not a precise and continuous one.

Want to turn a sunny scene into a snowy one? Great. But what if you want just a light dusting of snow, not a full blizzard? Or what if you want to add a beard to a portrait, from light stubble to a full lumberjack? Today's models give you the lumberjack and nothing in between. Your only lever is rewording the prompt — but that gives you discrete, unpredictable jumps, not a smooth dial:

[Comparison: the same input edited with "Make her slightly blonde", "Make her blonde", and "Make her very blonde". Each rewording produces a discrete jump, not a smooth progression.]

This feels like it should be a solved problem. After all, diffusion models already have a knob that controls how much the model follows your prompt: Classifier-Free Guidance (CFG). Turn CFG up and the model follows your instructions more closely. Turn it down and… well, that's where things get interesting.

Why the obvious solution doesn't work

You might expect that lowering the CFG scale would smoothly reduce the edit, from the full edit down to the original, untouched video. But it doesn't. Instead of getting "less editing," you get weird, arbitrary edits.

Here's the intuition. CFG works by combining two predictions: one with your prompt and one without. The formula is deceptively simple:

Classifier-Free Guidance output = ε(∅) + w · ( ε(prompt) − ε(∅) )
ε(∅) = prediction without prompt (the "origin")
ε(prompt) = prediction with your prompt
w = guidance scale (the "volume knob")

Read it this way: start at the origin (the no-prompt prediction), then steer toward the prompt by some amount w. When w is large, you get strong prompt adherence. When w approaches zero, the output collapses back to that origin.
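In code, the combination step is a single line. Here is a minimal NumPy sketch of it; the ε values are toy arrays standing in for a real denoiser's noise predictions:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_prompt, w):
    """Classifier-Free Guidance: start at the origin (the no-prompt
    prediction) and steer toward the prompt prediction by scale w."""
    return eps_uncond + w * (eps_prompt - eps_uncond)

# Toy stand-ins for the two noise predictions.
eps_uncond = np.array([0.0, 0.0])
eps_prompt = np.array([1.0, 1.0])

cfg_combine(eps_uncond, eps_prompt, 0.0)  # w = 0: collapses to the origin
cfg_combine(eps_uncond, eps_prompt, 5.0)  # w = 5: strong prompt adherence
```

At w = 0 the output is exactly the origin, which is precisely why the choice of origin matters so much in what follows.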

In standard text-to-image generation, the origin ε(∅) represents "a random natural image." That works fine. But in editing models, the model is also conditioned on an input image, so the origin represents "a random edit of the input." It's not "no edit"; it's "some arbitrary, uncontrolled edit."

So when you turn w down to get less editing, you're actually drifting toward a random edit, not toward the original image. The origin is wrong.

Standard CFG: [sweep from low to high CFG scale]. Low values produce arbitrary, random edits (gold face paint, headpiece), not the original.

AdaOr (Ours): [sweep from original to full edit]. A smooth, continuous transition from the untouched face to the masquerade mask.

A new origin: teaching the model to do nothing

Our key insight is simple but powerful: the guidance origin needs to change. Instead of starting from "random edit," we need to start from "no edit", the identity transformation.

Building on Lucy Edit, Decart's video editing model, we introduce a special instruction token, ⟨id⟩, that we train the model to associate with doing nothing. When the model sees this token, it learns to reproduce the input exactly. This gives us a new direction in the diffusion space, one that represents faithful reconstruction.

With this identity direction in hand, we define Adaptive-Origin Guidance (AdaOr). Instead of using the unconditional prediction as a fixed origin, we interpolate between our identity prediction and the standard unconditional prediction, based on the desired edit strength α:

Adaptive Origin origin(α) = s(α) · ε(∅) + (1 − s(α)) · ε(⟨id⟩)
where s(α) = √α in practice.
At α = 0, the origin is pure identity → faithful reconstruction.
At α = 1, the origin is the standard unconditional → full editing power.

The beauty of this formulation is its boundary conditions. When the edit strength is zero, the model faithfully reconstructs the input. When it is one, AdaOr recovers the standard CFG behavior (the model's full editing capability), exactly as if AdaOr weren't there. Everything in between is a smooth, continuous transition.
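The interpolation itself is two lines of code. A minimal sketch, assuming ε(⟨id⟩) and ε(∅) arrive as arrays from the denoiser and using the s(α) = √α schedule from the formula above:

```python
import numpy as np

def adaptive_origin(eps_uncond, eps_id, alpha):
    """AdaOr origin: interpolate from the identity prediction
    (alpha = 0) to the standard unconditional prediction (alpha = 1)."""
    s = np.sqrt(alpha)  # schedule s(alpha) = sqrt(alpha)
    return s * eps_uncond + (1.0 - s) * eps_id

# Boundary conditions:
# alpha = 0 -> pure identity prediction (faithful reconstruction)
# alpha = 1 -> standard unconditional prediction (full CFG behavior)
```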

The full AdaOr guidance combines our adaptive origin with the standard CFG steering direction, scaled by the edit strength:

output = origin(α) + α · w · ( ε(prompt) − ε(∅) )

1. origin(α): the adaptive origin. Interpolates between ε(⟨id⟩), the identity prediction (faithful reconstruction), and ε(∅), the standard unconditional prediction. At α = 0, it's pure identity. At α = 1, it recovers standard CFG.

2. α: the edit strength. Scales both the origin interpolation and the steering direction. At α = 0, there's no edit. At α = 1, you get the model's full editing power.

3. ε(prompt) − ε(∅): the steering direction. Points from the unconditional prediction toward the prompt-conditioned prediction. Scaled by w (the CFG scale) and α (the edit strength).
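Putting the pieces together, here is a minimal sketch of the full guidance step. Scalar stand-ins replace the real ε tensors, and this is an illustration of the formula, not Lucy Edit's actual implementation:

```python
import numpy as np

def adaor_guidance(eps_uncond, eps_prompt, eps_id, alpha, w):
    """Full AdaOr: adaptive origin plus the standard CFG steering
    direction, both scaled by the edit strength alpha."""
    s = np.sqrt(alpha)
    origin = s * eps_uncond + (1.0 - s) * eps_id
    return origin + alpha * w * (eps_prompt - eps_uncond)

# Boundary checks with toy scalars:
# alpha = 0 reproduces the identity prediction;
# alpha = 1 recovers standard CFG exactly.
eps_u, eps_p, eps_i = 0.5, 1.5, 0.0
adaor_guidance(eps_u, eps_p, eps_i, alpha=0.0, w=5.0)  # identity prediction
adaor_guidance(eps_u, eps_p, eps_i, alpha=1.0, w=5.0)  # standard CFG output
```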

Why not just replace the null with ⟨id⟩ everywhere?

A natural question: if ⟨id⟩ represents "no edit," why not simply swap it in for the unconditional term in standard CFG? It turns out this creates instability. The identity prediction tries to pull the latent back to the input image, and as the denoising process approaches its final steps (where noise levels drop to near zero), this pulling force diverges to infinity, causing the generation to blow up.

Our adaptive origin avoids this by gracefully transitioning away from the identity term as edit strength increases, ensuring stable behavior throughout the denoising process.

Why you need both knobs

To really understand why AdaOr matters, let's look at what happens when you vary the CFG scale versus the edit strength α independently. The horizontal axis controls the CFG scale (0–5), while the vertical axis controls our edit strength α (0–1), which determines the origin used for guidance. Drag the knob to explore.

"Change the apple to red and place it on a glass table"
← Alpha (Origin)
1 0 CFG AdaOr
0 CFG Scale → 5
Source Source
Output
CFG Scale 0.0
Alpha (Origin) 1.00

CFG path (top edge): Origin stays at ε(∅). Low guidance produces arbitrary edits (swirly apple), not the source.

AdaOr path (diagonal): Origin moves from ε(⟨id⟩) to ε(∅). Starts as the source, smoothly transitions to the full edit — both paths converge at the top-right.
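The two paths can also be traced numerically with the AdaOr formula. A self-contained toy sketch, with scalars standing in for the real ε predictions:

```python
import numpy as np

# Toy scalar stand-ins for the three noise predictions.
eps_uncond, eps_prompt, eps_id = 0.5, 1.5, 0.0

def adaor(alpha, w):
    s = np.sqrt(alpha)
    origin = s * eps_uncond + (1 - s) * eps_id
    return origin + alpha * w * (eps_prompt - eps_uncond)

# CFG path (top edge): alpha fixed at 1, only w varies.
cfg_path = [adaor(1.0, w) for w in np.linspace(0.0, 5.0, 6)]
# AdaOr path (diagonal): alpha and w rise together.
ada_path = [adaor(a, 5.0 * a) for a in np.linspace(0.0, 1.0, 6)]

# At the low end, the CFG path sits at eps_uncond (an arbitrary edit)
# while the AdaOr path starts at eps_id (the source); both paths
# converge at the top-right corner (alpha = 1, w = 5).
```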

Try it yourself

Each example below has its own slider. Drag to explore how AdaOr smoothly controls the edit intensity across different prompts and content types.

"Add snow covering the ground, making it look like winter"
α = 0.00
"Add full thick long hair to the man"
α = 0.00
"Replace the dog with a lion"
α = 0.00
"Change the yellow dress to a pink kimono"
α = 0.00
"Replace the helmet with a carved pumpkin"
α = 0.00
"Transform the man into a medieval knight"
α = 0.00

Editing should be expressive

Text-driven editing unlocked the ability to say what to change. AdaOr completes the picture by giving you control over how much. A simple identity token, a rethought guidance origin, and suddenly every edit becomes a smooth, explorable spectrum.

We believe this principle, teaching models explicit representations of "no change", has implications beyond editing. Anywhere diffusion models use guidance, there's an opportunity to build better, more controllable knobs.