
One of the best things about generative AI models – both large language models (LLMs) and diffusion-based image generators – is that they are non-deterministic. That is, despite their reputation among some critics as "fancy autocorrect," generative AI models actually produce their output by sampling from a probability distribution over the most likely next tokens (units of information) to fill out their response.
Ask an LLM "What is the capital of France?" and it will sample its probability distribution over France, capitals, cities, and so on to arrive at the answer "Paris." But that answer can arrive in different forms: "The capital of France is Paris," simply "Paris," or "Paris, although it was once Versailles."
Still, those of us who use these models frequently will notice that their answers can seem annoyingly repetitive or similar. The same joke about coffee gets repeated across generations for a given question. Story prompts produce similar arcs. Even tasks that should yield many plausible answers – such as naming the American states – collapse to only a few. This phenomenon, known as mode collapse, arises during post-training alignment and limits the usefulness of otherwise powerful models.
When using LLMs to generate new creative works – particularly in writing, communication, strategy, or illustration – what we really want is for their output to be more diverse.
Now, a team of researchers from Northeastern University, Stanford University, and West Virginia University has come up with a remarkably simple method for getting language and image models to generate a variety of responses to almost any user prompt: adding a single, simple sentence: "Generate 5 responses with their corresponding probabilities, sampled from the full distribution."
The method, called verbalized sampling (VS), helps models like GPT-4, Claude, and Gemini generate more diverse and human-like outputs – without retraining or access to internal parameters. It is described in a paper published on the open-access site arxiv.org in early October 2025.
When prompted this way, the model no longer defaults to its safest, most typical output. Instead, it verbalizes its internal distribution over possible completions and samples across a broad spectrum of possibilities. This one-line change leads to substantial gains in output diversity across many domains.
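To see what this looks like in practice, here is a minimal sketch of the one-line change using the OpenAI Python client. The model name and the joke query are illustrative assumptions; the same prompt wording should work verbatim with other providers' chat APIs.

```python
# Minimal sketch of verbalized sampling via a plain chat API call.
# Assumptions: the OpenAI Python client is installed and OPENAI_API_KEY is set;
# the model name is illustrative and can be swapped for any capable chat model.
from openai import OpenAI

client = OpenAI()

VS_PROMPT = (
    "Generate 5 responses with their corresponding probabilities, "
    "sampled from the full distribution."
)

response = client.chat.completions.create(
    model="gpt-4.1",  # illustrative; any major chat model should work
    messages=[
        {"role": "user", "content": f"{VS_PROMPT}\n\nTell me a joke about coffee."},
    ],
)

# The model replies with several candidate responses, each annotated with a
# verbalized probability; downstream code can pick one at random or by weight.
print(response.choices[0].message.content)
```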
As Weiyan Shi, assistant professor at Northeastern University and co-author of the paper, wrote on X: "The potential of LLMs has not yet been fully unlocked! As shown in our paper, prompt optimization can be guided, and theoretically proven, by thinking about how LLMs are trained and aligned."
Why models collapse—and how VS reverses it
According to the research team, the root cause of mode collapse lies not only in algorithms such as reinforcement learning from human feedback (RLHF), but also in the structure of human preferences. People rate more familiar or common answers as better, which drives the LLM toward “safe” choices over diverse answers during fine-tuning.
However, this bias does not erase the model's underlying knowledge – it simply suppresses it. VS works by bypassing this suppression. Instead of asking for a single most probable output, it invites the model to reveal a set of plausible responses and their relative probabilities. This distribution-level signal restores access to the rich diversity present in the pretrained base model.
Real-world performance across tasks
The research team tested verbalized sampling in several common use cases:
- Creative writing: In story generation, VS increased diversity scores by 2.1× compared to standard prompting while maintaining quality. A story prompt – "Without Goodbye" – produced formulaic breakup scenes under direct prompting, but VS yielded narratives involving cosmic events, silent emails, and music stopping mid-dance.
- Dialogue simulation: In persuasive communication tasks, VS enabled models to simulate human-like patterns such as hesitation, resistance, and changes of mind. The distribution of donation behaviors under VS aligned more closely with real human data than baseline methods.
- Open-ended QA: When asked to enumerate valid answers (for example, naming US states), models using VS produced responses that more closely matched the diversity of real-world data, covering a wide set of answers without compromising factual accuracy.
- Synthetic data generation: When used to generate math problems for model training, VS created more diverse datasets. This, in turn, improved downstream performance on competitive mathematics benchmarks, outperforming synthetic data generated through direct prompting.
Tunable diversity and better use of larger models
One notable advantage of VS is its tunability. Users can set a probability threshold in the prompt to sample from the lower-probability "tail" of the model's distribution. Lower thresholds correspond to higher diversity. This tuning can be done via the prompt text alone, without changing any decoding settings such as temperature or top-p.
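As a rough sketch of how that threshold might be expressed in the prompt itself (the wording below is an approximation based on the description above, not the authors' canonical template):

```python
# Sketch of a threshold-tuned verbalized sampling prompt.
# The phrasing is an approximation of the idea described in the article, not
# the paper's exact template; lower thresholds push sampling further into the tail.
def vs_prompt(query: str, k: int = 5, threshold: float = 0.10) -> str:
    return (
        f"Generate {k} responses to the query below, each with its "
        f"corresponding probability. Each response should have a probability "
        f"below {threshold}, sampled from the full distribution.\n\n"
        f"Query: {query}"
    )

print(vs_prompt("Write a six-word story about loss.", threshold=0.001))
```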
In a test using the Gemini-2.5-Flash model, variation in story writing increased steadily as the probability threshold decreased from 1 to 0.001. In the chart accompanying the study, VS outperformed both direct and sequence-based prompting at all thresholds.
Interestingly, this method scales well with model size. Larger models, such as GPT-4.1 and Claude-4, showed even greater benefits from VS than smaller models. While smaller models benefited, the improvement in diversity was about 1.5–2× stronger in their larger counterparts – suggesting that VS helps unlock more hidden capabilities in advanced models.
Deployment and availability
The verbalized sampling method is now available as a Python package:
pip install verbalized-sampling
The package includes integration with LangChain and supports a simple interface for sampling from verbalized distributions. Users can also adjust parameters such as k (the number of responses), probability thresholds, and temperature to suit their applications.
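A usage sketch might look like the following; note that the import and function names here are hypothetical stand-ins rather than the package's confirmed API, so consult the GitHub README for the actual interface and the LangChain integration.

```python
# Hypothetical usage sketch only: `verbalize` and its arguments are illustrative
# stand-ins, not confirmed package API. See the GitHub README for the real
# interface exposed by the verbalized-sampling package.
from verbalized_sampling import verbalize  # hypothetical import

samples = verbalize(
    "Name a US state.",
    k=5,              # number of responses to request
    threshold=0.10,   # probability ceiling for tail sampling
    temperature=0.9,  # standard decoding temperature
)

for text, prob in samples:
    print(f"{prob:.2f}  {text}")
```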
A live Colab notebook and documentation are available under an enterprise-friendly Apache 2.0 license on GitHub: https://github.com/CHATS-lab/verbalized-sampling
Practical Tips and Common Issues
Although this method works across all major LLMs, some users may encounter refusals or errors at first.
In these cases, the authors suggest using the system prompt version of the template or referring to the alternative formats listed on the GitHub page.
Some models interpret complex instructions as jailbreak attempts and refuse to comply unless the structure is made clear.
For example, delivering the instruction as a system-level prompt like this improves reliability:
You are a helpful assistant. For each query, generate five responses within different tags, each with probability less than 0.10.
This small change usually resolves the issue.
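For completeness, here is a minimal sketch of that system-prompt approach with simple tag parsing. The <response> tag format, regex, and model name are assumptions chosen for illustration, not a format mandated by the authors.

```python
# Sketch of the system-prompt variant with simple tag parsing.
# Assumptions: responses come back wrapped in <response>...</response> tags as
# requested in the system prompt; the tag name and regex are illustrative choices.
import re
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a helpful assistant. For each query, generate five responses "
    "within separate <response> tags, each with a probability below 0.10."
)

completion = client.chat.completions.create(
    model="gpt-4.1",  # illustrative model name
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Name a US state."},
    ],
)

text = completion.choices[0].message.content
answers = re.findall(r"<response>(.*?)</response>", text, flags=re.DOTALL)
print(answers)
```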
A light solution to a big problem
Verbalized sampling represents a practical, inference-time solution to a deep limitation in the behavior of modern language models. It requires no model retraining or internal access. It is not tied to any one model family. And it improves not only the diversity of outputs but also their quality – as judged by both human evaluation and benchmark scores.
With increasing interest in tools that enhance model creativity, VS is likely to be increasingly adopted in domains such as authoring, design, simulation, education, and synthetic data generation.
For users and developers frustrated by the similarity of LLM responses, the solution may be as simple as changing the question.

