Controlling LLM Outputs

Introduction

In this lesson, we will look into various methods and parameters that can be used to control the outputs of Large Language Models. We will discuss different decoding strategies and how they influence the generation process. We will also explore how certain parameters can be adjusted to fine-tune the output.

Decoding Methods

Decoding methods are fundamental strategies used by LLMs to generate text. Each method has its unique advantages and limitations.

At each decoding step, the LLM gives a score to each of its vocabulary tokens. A high score is related to a high probability of that token being the next token, according to the patterns learned by the model during training.

However, is the token with the highest probability always the best token to predict? By greedily picking the best token at step 1, the model may be left with only low-probability options at step 2, so the two consecutive tokens end up with a low joint probability. Picking a slightly less probable token at step 1 might instead open up a high-probability token at step 2, giving the sequence a higher joint probability overall. For example, if token A has probability 0.6 but its best continuation has probability 0.2 (joint probability 0.12), while token B has probability 0.4 with a continuation of probability 0.5 (joint probability 0.20), the greedy choice of A is worse over two steps. Ideally, we’d want to run this computation over all the tokens in the model vocabulary and a large number of future steps. However, this can’t be done in practice because it would require heavy computations.

All the decoding methods in this lesson try to find the right balance between:

  • Being “greedy” and immediately selecting the next token with the highest probability.
  • Exploring a bit, by keeping several candidate continuations in play before committing to one.

Greedy Search

Greedy Search is the simplest of all the decoding methods.

With Greedy Search, the model selects the token with the highest probability as its next output token. While this method is computationally efficient, it can often result in repetitive or less optimal responses due to its focus on immediate reward rather than long-term outcomes.
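
As a minimal sketch of how this looks in practice (assuming the Hugging Face transformers library and the publicly available gpt2 checkpoint; the prompt is just an example), greedy decoding is what generate() does when sampling is disabled and a single beam is used:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The weather today is", return_tensors="pt")

# With do_sample=False (and the default num_beams=1), generate() picks the
# single highest-probability token at every step, i.e. greedy search.
output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```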

Sampling

Sampling introduces randomness into the text generation process: the model samples the next token from the probability distribution it predicts, rather than always taking the most probable one. This method allows for more diverse and varied output but can sometimes produce less coherent or logical text.
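
Reusing the model, tokenizer, and inputs from the greedy sketch above, pure sampling only requires turning on do_sample:

```python
# Reusing the model, tokenizer, and inputs from the greedy sketch above.
# do_sample=True samples the next token from the full probability
# distribution instead of always taking the most probable one.
output_ids = model.generate(**inputs, do_sample=True, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```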

Beam Search

Beam Search is a more sophisticated method. At each step, it keeps the N (with N, the beam width, being a parameter) candidate sequences with the highest joint probabilities, extending each of them with its most probable next tokens, up to a certain number of steps. In the end, the model returns the sequence of tokens (i.e., the beam) with the highest joint probability.

Compared to scoring every possible sequence, this significantly reduces the search space and produces more consistent results. However, this method is slower than greedy search and can still lead to suboptimal outputs, as it may miss a high-probability token hidden behind a low-probability one.
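
Here is a hedged sketch of beam search with the same setup as the greedy example, using a beam width of 5 (an arbitrary illustrative choice):

```python
# Reusing the model, tokenizer, and inputs from the greedy sketch above.
# num_beams=5 keeps the 5 most probable partial sequences at each step and
# returns the one with the highest overall score.
output_ids = model.generate(
    **inputs,
    num_beams=5,
    do_sample=False,
    early_stopping=True,
    max_new_tokens=30,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```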

Top-K Sampling

Top-K Sampling is a variant of the sampling method in which the model narrows the sampling pool to the K (with K being a parameter) most probable tokens. This method provides a balance between diversity and relevance by limiting the sampling space, thus offering more control over the generated text.
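
With the same setup as before, top-K sampling is a single extra argument (K=50 is just an illustrative value):

```python
# Reusing the model, tokenizer, and inputs from the greedy sketch above.
# top_k=50 restricts sampling at each step to the 50 most probable tokens.
output_ids = model.generate(**inputs, do_sample=True, top_k=50, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```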

Top-p (Nucleus) Sampling

Top-p, or Nucleus Sampling, samples from the smallest possible set of tokens whose cumulative probability exceeds a threshold P (with P being a parameter). This method offers fine-grained control and avoids the inclusion of rare, low-probability tokens. However, because the size of the shortlist is determined dynamically at each step, it can vary widely, which can sometimes be a limitation.
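
Again with the same setup, a sketch of nucleus sampling with P=0.9 (an illustrative threshold); setting top_k=0 disables the top-K filter so only the nucleus threshold applies:

```python
# Reusing the model, tokenizer, and inputs from the greedy sketch above.
# top_p=0.9 samples from the smallest set of tokens whose cumulative
# probability reaches 0.9.
output_ids = model.generate(
    **inputs, do_sample=True, top_p=0.9, top_k=0, max_new_tokens=30
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```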

Parameters That Influence Text Generation

Apart from the decoding methods, several parameters can be adjusted to influence text generation with LLMs. These include temperature, stop sequences, and frequency and presence penalties.

These parameters can be adjusted with the most popular LLM APIs and Hugging Face models.

Temperature

The temperature parameter influences the randomness or determinism of the generated text. A lower value makes the output more deterministic and focused, while a higher value increases the randomness, leading to more diverse outputs.

It controls the randomness of predictions by scaling the logits before applying softmax during the text generation process. It's a crucial factor in the trade-off between diversity and quality of the generated text.

Here's a more technical explanation:

  1. Logits: When a language model makes a prediction, it produces a vector of logits, one for each token in its vocabulary. These logits represent the raw, unnormalized prediction scores for each candidate next token.
  2. Softmax: The softmax function is applied to these logits to convert them into probabilities. The softmax function also ensures that these probabilities sum up to 1.
  3. Temperature: The temperature parameter is used to control the randomness of the model's output. It does this by dividing the logits by the temperature value before the softmax step.
    • High Temperature (e.g., > 1): The logits are scaled down, which makes the softmax output more uniform. This means the model is more likely to pick less likely words, resulting in more diverse and "creative" outputs, but potentially with more mistakes or nonsensical phrases.
    • Low Temperature (e.g., < 1): The logits are scaled up, which makes the softmax output more peaked. This means the model is more likely to pick the most likely word. The output will be more focused and conservative, sticking closer to the most probable outputs but potentially less diverse.
    • Temperature = 1: The logits are not scaled, preserving the original probabilities. This is a kind of "neutral" setting.

In summary, the temperature parameter is a knob for controlling the trade-off between diversity (high temperature) and accuracy (low temperature) in the generated text.
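
To make the scaling concrete, here is a small self-contained sketch (using NumPy, with made-up logits for three tokens) of how dividing the logits by the temperature reshapes the softmax distribution:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    # Divide the logits by the temperature, then apply softmax.
    scaled = np.array(logits) / temperature
    exps = np.exp(scaled - np.max(scaled))  # subtract max for numerical stability
    return exps / exps.sum()

logits = [2.0, 1.0, 0.5]  # hypothetical raw scores for three tokens

for t in (0.5, 1.0, 2.0):
    print(f"temperature={t}: {softmax_with_temperature(logits, t).round(3)}")

# temperature=0.5 sharpens the distribution toward the top token,
# temperature=1.0 leaves it unchanged, and temperature=2.0 flattens it.
```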

Stop Sequences

Stop sequences are specific sets of character sequences that halt the text generation process once they appear in the output. They offer a way to guide the length and structure of the generated text, providing a form of control over the output.
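
For example, with an OpenAI-style chat completions API (the parameter is called stop there; other providers use similar but not identical names), a request might look like this sketch:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

# Generation halts as soon as any of the listed stop sequences appears,
# here a blank line.
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "List three fruits, one per line."}],
    stop=["\n\n"],
)
print(response.choices[0].message.content)
```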

Frequency and Presence Penalties

Frequency and presence penalties are used to discourage (or, with negative values, encourage) the repetition of tokens in the generated text. A frequency penalty lowers the probability of a token in proportion to how many times it has already appeared, while a presence penalty applies a flat penalty to any token that has appeared at least once, regardless of how often.
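
As a sketch with the OpenAI chat completions API (which exposes both penalties directly; Hugging Face's generate() instead offers a related repetition_penalty argument), where the 0.5 values are just illustrative:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

# frequency_penalty scales with how many times a token has already appeared;
# presence_penalty is a flat penalty on any token that has appeared at all.
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write a short paragraph about the ocean."}],
    frequency_penalty=0.5,
    presence_penalty=0.5,
)
print(response.choices[0].message.content)
```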

Conclusion

This lesson provided an overview of the various decoding methods and parameters that can be used to control the outputs of Large Language Models.

We've explored decoding strategies such as Greedy Search, Sampling, Beam Search, Top-K Sampling, and Top-p (Nucleus) Sampling, each with its own way of balancing immediate reward against long-term sequence quality and diversity.

We've also discussed parameters like temperature, stop sequences, and frequency and presence penalties, which offer additional control over text generation.

Adjusting these parameters can help in guiding the model to produce the desired results, whether deterministic, focused outputs or more diverse, creative ones.