What are Large Language Models

Introduction

Welcome to our introductory module on Large Language Models, or LLMs.

LLMs, or Large Language Models, are a specific category of neural network models characterized by having an exceptionally high number of parameters, often in the billions. These parameters are essentially the variables within the model that allow it to process and generate text. They are trained on vast quantities of textual data, which provides them with a broad understanding of language patterns and structures. The main goal of LLMs is to comprehend and produce text that closely resembles human-written language, enabling them to capture the subtle complexities of both syntax (the arrangement of words in a sentence) and semantics (the meaning conveyed by those words).

These models undergo training with a simple objective: predicting the subsequent word in a sentence. However, they develop a range of emergent abilities during this training process. For example, they can perform tasks such as arithmetic calculations and word unscrambling and even achieve remarkable feats like successfully passing professional-level exams such as the US Medical Licensing Exam.

They generate text in an autoregressive manner, producing tokens one at a time, with each new token conditioned on the tokens generated so far.

The attention mechanism plays a key role in enabling these models to establish connections between words and produce coherent and contextually relevant text.

LLMs have significantly advanced the natural language processing (NLP) field, revolutionizing our approach to tasks like machine translation, natural language generation, part-of-speech tagging, parsing, information retrieval, and more.

As we dive further into this module, we will explore the capabilities of these models, their practical applications, and their exciting future possibilities.

Language Modeling

Language modeling is a fundamental task in Natural Language Processing (NLP). It involves explicitly learning the probability distribution of the words in a language. This is generally learned by predicting the next token in a sequence. This task is typically approached using statistical methods or deep learning techniques.

LLMs are trained to predict the next token (word, punctuation, etc.) based on the previous tokens in the text. The models achieve this by learning the distribution of tokens in the training data.

Tokenization

The first step in this process is tokenization, where the input text is broken down into smaller units called tokens. Tokens can be as small as individual characters or as large as whole words. The choice of token size can significantly affect the model's performance. Some models even use subword tokenization, where words are broken down into smaller units that capture meaningful linguistic information.

For example, let’s consider the sentence "The child’s book."

We could split the text whenever we find white space characters. The output would be:

["The", "child's", "book."]

As you can see, the punctuation is still attached to the words "child’s" and "book."

Alternatively, we could split the text on both white spaces and punctuation. The output would be:

["The", "child", "'", "s", "book", "."]

Importantly, tokenization is model-specific, meaning different models require different tokenization processes, which can complicate pre-processing and multi-modal modeling.

Model Architecture and Attention

The core of a language model is its architecture. Recurrent Neural Networks (RNNs) were traditionally used for this task, as they are capable of processing sequential data by maintaining an internal state that captures the information from previous tokens. However, they struggle with long sequences due to the vanishing gradient problem.

To overcome these limitations, transformer-based models have become the standard for language modeling tasks. These models use a mechanism called attention, which allows them to weigh the importance of different tokens when making predictions. As a result, they can capture long-range dependencies between tokens and generate high-quality text.
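
To give a rough idea of what attention computes, here is a minimal NumPy sketch of scaled dot-product attention, the core operation inside transformers. The vectors and dimensions are toy values used purely for illustration; this is not a full transformer implementation.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # similarity between tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over tokens
    return weights @ V                                        # weighted sum of value vectors

# Toy example: 4 tokens, each represented by an 8-dimensional vector
np.random.seed(0)
Q = K = V = np.random.randn(4, 8)
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)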

Training

The model is trained on a large corpus of text to correctly predict the next token in a sequence. The goal is to adjust the model's parameters to maximize the probability of the observed data.

Typically, a model is trained on a very large, general dataset of texts from the Internet, such as The Pile or CommonCrawl. Sometimes, more specific datasets are also used, such as the Stackoverflow Posts dataset.

The model learns to predict the next token in a sequence by adjusting its parameters to maximize the probability of outputting the correct next token from the training data.
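
The sketch below illustrates this objective with PyTorch (an assumption, since the lesson itself only uses the OpenAI API): the model's logits at each position are compared against the actual next token, and training minimizes the resulting cross-entropy loss, which is equivalent to maximizing the probability of the correct next token.

import torch
import torch.nn.functional as F

vocab_size = 10                                 # toy vocabulary size
token_ids = torch.tensor([[2, 5, 7, 1, 3]])     # a toy training sequence (batch of 1)

# Pretend these are the model's logits for each position: (batch, seq_len, vocab)
logits = torch.randn(1, token_ids.shape[1], vocab_size, requires_grad=True)

# The target for each position is simply the next token in the sequence
shift_logits = logits[:, :-1, :]
shift_targets = token_ids[:, 1:]

loss = F.cross_entropy(shift_logits.reshape(-1, vocab_size), shift_targets.reshape(-1))
loss.backward()   # in real training, the gradients update the model's parameters
print(loss.item())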

Prediction

Once the model is trained, it can be used to generate text by predicting the next token in a sequence. This is done by feeding the sequence into the model, which outputs a probability distribution over the possible subsequent tokens. The next token is then chosen based on this distribution. This process can be repeated to generate sequences of arbitrary length.
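
Here is a minimal sketch of that generation loop. The toy_model function is a hypothetical stand-in for a real LLM: it returns a probability distribution over a tiny vocabulary, and we repeatedly sample from that distribution to extend the sequence.

import numpy as np

vocab = ["<eos>", "the", "cat", "sat", "on", "mat"]

def toy_model(tokens):
    # A real LLM would compute this distribution from the whole context;
    # here we simply return a fixed, made-up distribution for illustration.
    return np.array([0.1, 0.2, 0.2, 0.2, 0.15, 0.15])

tokens = ["the"]
for _ in range(5):
    probs = toy_model(tokens)
    next_id = np.random.choice(len(vocab), p=probs)   # sample the next token
    if vocab[next_id] == "<eos>":                     # stop at the end-of-sequence token
        break
    tokens.append(vocab[next_id])

print(" ".join(tokens))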

Fine-Tuning

The model is often fine-tuned on a specific task after pre-training. This involves continuing the training process on a smaller, task-specific dataset. It allows the model to adapt its learned knowledge to a specific task (e.g., text translation) or specialized domain (e.g., biomedical or finance), improving its performance.
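
As a sketch of what fine-tuning can look like in practice, the snippet below continues training a small open-source causal language model on a domain-specific text corpus using the Hugging Face transformers and datasets libraries. These libraries, the gpt2 model, and the imdb dataset are all assumptions made for illustration; the lesson itself only uses the OpenAI API.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"                                   # a small base model, for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token             # GPT-2 has no padding token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Any task- or domain-specific text corpus works; "imdb" is just a placeholder
dataset = load_dataset("imdb", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # next-token objective
)
trainer.train()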

This is a brief explanation, but the actual process can be much more complex, especially for state-of-the-art models like GPT-4. These models use advanced techniques and large amounts of data to achieve impressive results.

Context Size

The context size, or context window, in LLMs is the maximum number of tokens that the model can handle in one go. The context size is significant because it determines the length of the text that can be processed at once, which can impact the model's performance and the results it generates.

Different LLMs have different context sizes. For instance, the OpenAI “gpt-3.5-turbo-16k” model has a context window of about 16,000 tokens. There is a natural limit to the number of tokens a model can handle in a single pass: smaller models may be limited to around 1k tokens, while larger models, like GPT-4, can go up to 32k tokens.
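
A practical way to reason about context size is to count the tokens in your text before sending it to the model. The sketch below does this with the tiktoken library (assumed to be installed) and the tokenizer used by gpt-3.5-turbo.

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
prompt = "The context window limits how much text the model can see at once."

num_tokens = len(encoding.encode(prompt))
context_window = 16_000   # approximate context size of gpt-3.5-turbo-16k

print(f"The prompt uses {num_tokens} of the {context_window} available tokens.")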

Let’s Generate Some Text

Let’s try generating some text with LLMs. You must first generate an API key to use OpenAI’s models in your Python environment. You can follow the steps below to generate the API key:

  1. After creating an OpenAI account, log in.
  2. After logging in, choose Personal from the top-right menu, then choose “View API keys.”
  3. Once step 2 is complete, you will see the “Create new secret key” button on the page containing your API keys. Clicking on it generates a secret key. Save this key, because it will be required in later lessons.

After that, you can save your key in a .env file like this:

OPENAI_API_KEY="<YOUR-OPENAI-API-KEY>"

Every time you start a Python script with the following lines, your key will be loaded into an environment variable called OPENAI_API_KEY. This environment variable will then be used by the openai library whenever you want to generate text.

from dotenv import load_dotenv
load_dotenv()

We are now ready to generate some text! In the following example, we ask the model to translate a sentence from English to French.

from dotenv import load_dotenv
load_dotenv()
import os
import openai

# English text to translate
english_text = "Hello, how are you?"

response = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": f'Translate the following English text to French: "{english_text}"'}
  ],
)

print(response['choices'][0]['message']['content'])
Code to run
Bonjour, comment ça va?
The output

By using dotenv, you can safely store sensitive information, such as API keys, in a separate file and avoid accidentally exposing it in your code. This is particularly important when working with open-source projects or sharing your code with others, as it ensures that the sensitive information remains secure.

Few-Shot Learning

Few-shot learning in the context of LLMs refers to providing the model with a few examples before making predictions. These examples "teach" the model how to reason and act as "filters" to help the model search for relevant patterns in the dataset.

The idea of few-shot learning is fascinating, as it suggests that the model can be quickly reprogrammed for new tasks. While LLMs like GPT-3 excel at language modeling tasks like machine translation, they may struggle with more complex reasoning tasks.

The few-shot examples help the model search for relevant patterns in the dataset. The dataset, which is effectively compressed into the model's weights, can be searched for patterns that strongly respond to these provided examples. These patterns are then used to generate the model's output. The more examples provided, the more precise the output becomes.

Here’s an example of few-shot learning:

from dotenv import load_dotenv
load_dotenv()
import os
import openai

# Prompt template: describe a movie using emojis
prompt = """
Describe the following movie using emojis.

{movie}: """

examples = [
	{ "input": "Titanic", "output": "🛳️🌊❤️🧊🎶🔥🚢💔👫💑" },
	{ "input": "The Matrix", "output": "🕶️💊💥👾🔮🌃👨🏻‍💻🔁🔓💪" }
]

movie = "Toy Story"
response = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt.format(movie=examples[0]["input"])},
        {"role": "assistant", "content": examples[0]["output"]},
        {"role": "user", "content": prompt.format(movie=examples[1]["input"])},
        {"role": "assistant", "content": examples[1]["output"]},
        {"role": "user", "content": prompt.format(movie=movie)},
  ]
)

print(response['choices'][0]['message']['content'])
Code to run
🧸🤠👦🧒🎢🌈🌟👫🚁👽🐶🚀
The output

Scaling Laws

Scaling laws describe the relationship between a model's performance and factors such as the number of parameters, the size of the training dataset, the compute budget, and the network architecture. They were derived from extensive experiments and are described in the Chinchilla paper. These laws provide insights into how to optimally allocate resources when training these models.

The main elements characterizing a language model are:

  1. The number of parameters (N) reflects the model's capacity to learn from data. More parameters allow the model to capture complex patterns in the data.
  2. The size of the training dataset (D) is measured in the number of tokens (small pieces of text, ranging from single characters to whole words).
  3. FLOPs (floating point operations) measure the total compute budget used for training.

The researchers trained the Chinchilla model, which has 70B parameters, on 1.4 trillion tokens. This aligns with the rule of thumb proposed in the paper: for a model with X parameters, it is optimal to train it on approximately X * 20 tokens. For example, in the context of this rule, a model with 100 billion parameters would be optimally trained on approximately 2 trillion tokens.
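
Since this rule of thumb is just a multiplication, it is easy to express in code. The small sketch below applies the roughly 20-tokens-per-parameter relationship to the two model sizes mentioned above.

def chinchilla_optimal_tokens(num_parameters: float) -> float:
    # Rough Chinchilla rule of thumb: ~20 training tokens per model parameter
    return 20 * num_parameters

print(chinchilla_optimal_tokens(70e9))    # ~1.4e12 tokens (Chinchilla: 70B parameters)
print(chinchilla_optimal_tokens(100e9))   # ~2.0e12 tokens (a 100B-parameter model)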

Applying this rule, the Chinchilla model, though smaller, performed better than other LLMs. It showed gains in language modeling and task performance and needed less memory and computing power. You can read more about Chinchilla in its paper “Training Compute-Optimal Large Language Models”.

Emergent Abilities in LLMs

Emergent abilities in LLMs refer to the sudden appearance of new capabilities as the size of the model increases. These abilities, which include performing arithmetic, answering questions, summarizing passages, and more, are not explicitly trained in the model. Instead, they seem to arise spontaneously as the model scales, hence the term "emergent."

LLMs are probabilistic models that learn patterns in natural language. When these models are scaled up, they not only improve quantitatively in their ability to learn patterns, but they also exhibit qualitative changes in their behavior.

Traditionally, models required task-specific fine-tuning and architectural modifications to perform specific tasks. However, when scaled up, LLMs can perform many of these tasks without any architectural modifications or task-specific training, simply by phrasing the tasks in natural language. This capability of LLMs to perform tasks without fine-tuning is remarkable in itself.

What's even more intriguing is how these abilities appear. As LLMs grow, they rapidly and unpredictably transition from near-zero to sometimes state-of-the-art performance. This phenomenon suggests that these abilities are emergent properties of the model's scale rather than being explicitly programmed into the model.

This concept of emergent abilities in LLMs has significant implications for the field of AI, as it suggests that scaling up models can lead to the spontaneous development of new capabilities.

Prompts

The text containing the instructions that we pass to an LLM is commonly known as a prompt.

Prompts are instructions given to AI systems like OpenAI's GPT-3 and GPT-4, providing context to generate human-like text. The more detailed the prompt, the better the model's output.

Concise yet descriptive prompts tend to yield better results, as they give the model clear direction while leaving room for the LLM's creativity. Specific words or phrases can help narrow down potential outcomes and ensure relevant content generation.

Writing effective prompts requires a clear goal, simplicity, strategic use of keywords, and actionability. Testing the prompts before publishing ensures the output is relevant and error-free.

Here are some prompting tips:

  1. Use precise language when crafting a prompt – this will help ensure accuracy in the generated output:
    • Less Precise Prompt: "Write about dogs."
    • More Precise Prompt: "Write a 500-word informative article about the dietary needs of adult Golden Retrievers."
  2. Provide enough context around each prompt – this will give a better understanding of what kind of output should be produced:
    • Less Contextual Prompt: "Write a story."
    • More Contextual Prompt: "Write a short story set in Victorian England featuring a young detective solving his first major case."
  3. Test different variations of each prompt – this allows you to experiment with different approaches until you find one that works best (a short code sketch follows this list):
    • Initial Prompt: "Write a blog post about the benefits of yoga."
    • Variation 1: "Compose a 1000-word blog post detailing the physical and mental benefits of regular yoga practice."
    • Variation 2: "Create an engaging blog post that highlights the top 10 benefits of incorporating yoga into daily routine."
  4. Review generated outputs before publishing them – while most automated systems produce accurate results, mistakes occasionally occur, so it’s always wise to double-check everything before releasing any content into production environments:
    • Before Review: "Yoga is a great way to improve your flexibility and strength. It can also help reduce stress and improve mental clarity. However, it's important to remember that all yoga poses are suitable for everyone."
    • After Review (correcting inaccurate information): "Yoga is a great way to improve your flexibility and strength. It can also help reduce stress and improve mental clarity. However, it's important to remember that not all yoga poses are suitable for everyone. Always consult with a healthcare professional before starting any new exercise regimen."
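
As a sketch of tip 3, the snippet below reuses the same OpenAI setup from earlier in this lesson to run each variation of the yoga prompt and print a preview of every output, so the variations can be compared side by side.

from dotenv import load_dotenv
load_dotenv()
import openai

variations = [
    "Write a blog post about the benefits of yoga.",
    "Compose a 1000-word blog post detailing the physical and mental benefits of regular yoga practice.",
    "Create an engaging blog post that highlights the top 10 benefits of incorporating yoga into daily routine.",
]

for prompt in variations:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    print(prompt)
    print(response['choices'][0]['message']['content'][:200], "...\n")   # preview of each output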

Hallucinations and Biases in LLMs

The term hallucination refers to instances where AI systems generate outputs, such as text or images, that don't align with real-world facts or the given input. For example, ChatGPT might generate a plausible-sounding but entirely incorrect answer to a factual question.

In LLMs, such hallucinations can lead to the propagation of misinformation, especially in critical sectors like healthcare and education, where the accuracy of information is of utmost importance. Similarly, bias in LLMs can result in outputs that favor certain perspectives over others, potentially leading to the reinforcement of harmful stereotypes and discrimination.

Consider an interaction where a user asks, "Who won the World Series in 2025?" If the LLM responds with, "The New York Yankees won the World Series in 2025," it's a clear case of hallucination. As of now (July 2023), the 2025 World Series hasn't taken place, so any claim about its outcome is a fabrication.

Bias in AI and LLMs is another significant issue. It refers to these models' inclination to favor specific outputs or decisions based on their training data. If the training data is predominantly from a specific region, the model might show a bias toward that region's language, culture, or perspectives. If the training data contains inherent biases, such as gender or racial bias, the AI system might produce skewed or discriminatory outputs.

For example, if a user asks an LLM, "Who is a nurse?" and it responds with, "She is a healthcare professional who cares for patients in a hospital," it shows a gender bias. The model automatically associates nursing with women, which doesn't accurately reflect the reality where both men and women can be nurses.

Mitigating hallucinations and bias in AI systems involves refining model training, using verification techniques, and ensuring the training data is diverse and representative. Finding a balance between maximizing the model's potential and avoiding these issues remains challenging.

Interestingly, in creative domains like media and fiction writing, these "hallucinations" can be beneficial, enabling the generation of unique and innovative content.

The ultimate goal is to develop LLMs that are not only powerful and efficient but also reliable, fair, and trustworthy. By doing so, we can maximize the potential of LLMs while minimizing their risks, ensuring that the benefits of this technology are accessible to all.

Conclusion

In this introductory module, we explored the fascinating world of LLMs. These powerful models, trained on vast amounts of text data, can understand and generate human-like text. They're built on transformer architectures, allowing them to capture long-range dependencies in language and generate text in an autoregressive manner.

We covered the capabilities of LLMs, discussing their impact on the field of NLP. We've learned about few-shot learning, scaling laws, and the emergent abilities of these models.

We also acknowledged the challenges that come with these models, including hallucinations and biases, emphasizing the importance of mitigating these issues.

In the next lesson, we’ll see a timeline of machine learning models used for language modeling up to the beginning of Large Language Models.