SteerFlow

visualizing how large language models steer through goal-space


Select model & goal-space

Don't see your favorite model? Request it here!

Goal-space flow visualizer


Controls

What does this diagram show?

This vector flow diagram visualizes how language model outputs shift in goal-space: a space defined by measurable dimensions of text such as reading level, formality, and more.

More specifically, imagine text-rewriting requests such as:

  • "Make this more formal, but don't make it any longer"
  • "Make this a little easier to read"
Imagine these requests as "vectors" along the "formality" or "reading difficulty" dimensions, respectively. We gave an LLM multiple prompts similar to those shown above, and some texts to rewrite (white circles).

The LLM's output also lives in goal-space: it has its own measured level of formality or reading difficulty, so the move from the source text to the output also "traces" out a vector.

Interpolating many of these vectors gives us a flow diagram. We plot only the requests that ask for a change along the x-axis dimension and no change along the y-axis dimension. Arrow direction indicates how the model's output actually moved, and color shows the magnitude of that movement (from small to large). Ideally, the vectors are short (mostly cool colors) or point along the x-axis, reflecting that the model generally moves in the direction that we asked for.
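As a rough illustration of the raw arrows behind such a diagram, here is a minimal Python sketch. It assumes each rewrite is recorded as three 2-D goal-space points (source text, requested target, model output); the record layout and the function name movement_vectors are hypothetical and not taken from the SteerFlow code. The interpolation onto a grid is left out.

```python
import math

def movement_vectors(records):
    """Build one arrow per rewrite that asked for an x-only change.

    Each record is a dict with hypothetical keys 'source', 'target',
    and 'output', each an (x, y) point in a 2-D slice of goal-space.
    """
    arrows = []
    for record in records:
        sx, sy = record["source"]
        tx, ty = record["target"]
        ox, oy = record["output"]
        # Keep only requests that change x and leave y untouched.
        if tx == sx or ty != sy:
            continue
        dx, dy = ox - sx, oy - sy  # how the model actually moved
        arrows.append({
            "origin": (sx, sy),
            "delta": (dx, dy),
            "magnitude": math.hypot(dx, dy),  # would drive arrow color
        })
    return arrows

example_records = [
    # Asked for more x only; the output also drifted in y.
    {"source": (0.2, 0.5), "target": (0.6, 0.5), "output": (0.5, 0.6)},
    # Asked for a y change, so it is excluded from this plot.
    {"source": (0.2, 0.5), "target": (0.2, 0.9), "output": (0.2, 0.8)},
]
example_arrows = movement_vectors(example_records)
```

A nonzero second component in an arrow's delta is the off-axis drift that shows up as arrows tilting away from the x-axis.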

Learn more about how we measure steerability in our paper!

Steerability dashboard


Run info


Aggregate steerability metrics

What are these metrics?

This plot shows the three steerability metrics proposed in our steerability measurement framework. In summary, we propose modeling user requests as multi-dimensional vectors in goal-space, and measuring steerability in terms of goal-space "distance."

For example, in text-rewriting tasks, we often ask for changes in multiple dimensions of text (e.g., "make this longer, but simplify the language"). When we ask an LLM to rewrite text in these ways, the model's output also changes text in multiple dimensions.

To evaluate performance, we need to take into account changes in all of these dimensions. In our work, we motivated three main steerability metrics. Informally:

  • Steering error (yellow): how far outputs deviate from the target,
  • Miscalibration (cyan): errors in the direction of intended movement, and
  • Orthogonality (magenta): errors outside the direction of intended movement.
Lower values are better across all three (minimum: 0). The box and violin shapes show distributional spread across our steerability probe (see panel), and the numbers on top indicate the median and interquartile range (IQR; 75th - 25th percentile).
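To make the informal descriptions above concrete, here is a minimal Python sketch of one way to split an error vector into a part along the requested direction and a part outside it, plus the median and IQR summary. This is an illustrative assumption only: the paper's formal definitions (including any normalization) may differ, and the names decompose_error, along_request, off_request, and median_and_iqr are hypothetical.

```python
import math
import statistics

def _norm(vector):
    return math.sqrt(sum(c * c for c in vector))

def decompose_error(source, target, output):
    """Split the error (output - target) relative to the request (target - source).

    Illustrative only; not the paper's exact metric definitions.
    """
    request = [t - s for s, t in zip(source, target)]
    error = [o - t for o, t in zip(output, target)]
    request_norm = _norm(request)
    if request_norm == 0:
        raise ValueError("the request must ask for some change")
    unit = [c / request_norm for c in request]
    # Signed length of the error along the requested direction.
    along = sum(e * u for e, u in zip(error, unit))
    # Whatever remains is perpendicular to the requested direction.
    off_axis = [e - along * u for e, u in zip(error, unit)]
    return {
        "steering_error": _norm(error),  # distance from the target
        "along_request": abs(along),     # over- or undershoot
        "off_request": _norm(off_axis),  # drift in unrequested directions
    }

def median_and_iqr(values):
    """Median and interquartile range (75th minus 25th percentile)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return statistics.median(values), q3 - q1

parts = decompose_error(source=(0.0, 0.0), target=(1.0, 0.0), output=(0.5, 0.5))
summary = median_and_iqr([1, 2, 3, 4, 5])
```

In this toy case the model moved only halfway toward the target and also drifted in a dimension that was not requested, so both the along-request and off-request parts are nonzero.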

For a more formal definition of these metrics, how they're motivated, and why they differ from how we evaluate LLM performance currently, check out our paper!