What is LLMOps

Introduction

As LLMs continue to revolutionize various applications, managing their lifecycle has become important. In this lesson, we will explore the concept of LLMOps, its origins, and its significance in today's AI industry. We will also discuss the steps involved in building an LLM-powered application, the differences between LLMOps and MLOps, and the challenges and solutions associated with each step.

The Emergence of LLMOps

In recent years, the world of AI has witnessed the rise of large language models. These models have billions of parameters and are trained on billions of words, hence the term “large.” The advent of LLMs has led to the emergence of a new term, LLMOps, which stands for Large Language Model Operations. This lesson aims to provide a comprehensive understanding of LLMOps, its origins, and its significance in the AI industry.

LLMOps is essentially a set of tools and best practices designed to manage the GenAI lifecycle, from development and deployment to maintenance.

LLMOps has gained traction with the rise of LLMs, particularly after the release of OpenAI's ChatGPT, which led to a surge in LLM-powered applications, such as chatbots, writing assistants, and programming assistants.

However, the process of building production-ready LLM-powered applications presents unique challenges that differ from those encountered when building AI products with traditional machine learning models. This has necessitated the development of new tools and practices, giving birth to the term “LLMOps.”

Steps Involved in LLMOps and Differences with MLOps

While LLMOps can be considered a subset of MLOps (Machine Learning Operations), there are key differences between the two, primarily due to the differences in building AI products with classical ML models and LLMs.

The process of building an LLM-powered application involves several key steps.

1. Selection of a Foundation Model

Foundation models are pre-trained LLMs that can be adapted for various downstream tasks. Training these models from scratch is complex, time-consuming, and costly. Hence, developers usually opt for either proprietary models owned by large companies or open-source models hosted on community platforms like Hugging Face.

This differs from standard MLOps, where a model is typically trained from scratch with a smaller architecture or on different data, especially for tabular classification and regression tasks (the exception is computer vision, where most applications start from a model pre-trained on general datasets like ImageNet or COCO). Typically, the dataset is split into training and evaluation sets, with around 70% of the data going into the training set, or other evaluation techniques like cross-validation are used. When working with LLMs, this is not feasible due to the high costs involved in pretraining. Classical ML models are also data-hungry and require a large amount (thousands at least) of labeled examples to be trained on.
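For contrast, here is a minimal sketch of that classical MLOps workflow, using scikit-learn; the dataset and model choice are illustrative rather than taken from the lesson.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Classical MLOps: train a small model from scratch on labeled, task-specific data.
X, y = load_breast_cancer(return_X_y=True)

# 70% of the data goes into the training set, the rest is held out for evaluation.
X_train, X_eval, y_train, y_eval = train_test_split(X, y, train_size=0.7, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("eval accuracy:", accuracy_score(y_eval, model.predict(X_eval)))
```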

Consequently, choosing a suitable foundation model is a crucial step in LLMOps, as is the decision between a proprietary and an open-source model. Proprietary LLMs are usually larger and more performant than open-source alternatives (thanks to the investments that large corporations can make) and may also be more cost-effective for the end user, since there's no need to set up an expensive infrastructure to host the model (something providers can do efficiently because they serve many customers and amortize the costs). On the other hand, open-source models are generally more customizable and can be improved by anyone in the open-source community; indeed, they have quickly approached the quality of many proprietary LLMs.
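To make the two options concrete, here is a minimal sketch of consuming a foundation model either through a proprietary API or by loading an open-source model from Hugging Face. It assumes the openai (v1+) and transformers packages; the model names and prompt are placeholders, not recommendations from the lesson.

```python
# Option 1: proprietary foundation model served via an API (no hosting needed).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize LLMOps in one sentence."}],
)
print(completion.choices[0].message.content)

# Option 2: open-source foundation model hosted on your own infrastructure.
from transformers import pipeline

generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.1")  # placeholder
print(generator("Summarize LLMOps in one sentence.", max_new_tokens=64)[0]["generated_text"])
```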

Another aspect to consider is the knowledge cutoff of an LLM: the date of the most recent documents the model was trained on. For example, the model used in ChatGPT is currently limited to data up until September 2021. Consequently, it can easily talk about anything that happened before that date but struggles with anything that happened afterward. For example, ChatGPT doesn't know about the latest startups or products released and may therefore hallucinate when talking about them.

2. Adaptation to Downstream Tasks

After selecting a foundation model, it can be customized for specific tasks through techniques such as prompt engineering. This involves carefully crafting the input prompt so that the model produces the desired output, without changing the model's weights.
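As a hedged illustration of prompt engineering (the task and prompt wording are made up for this example), the same general-purpose model is steered toward a sentiment-classification task purely by changing the input, here with an instruction and a few-shot example:

```python
from openai import OpenAI

# Prompt engineering: adapt a general-purpose LLM to a downstream task by shaping
# the input; no weights change, only the prompt differs between tasks.
few_shot_prompt = """Classify the sentiment of the support ticket as positive, negative, or neutral.

Ticket: "The new release fixed my login issue, thanks!"
Sentiment: positive

Ticket: "I've been waiting three days for a reply."
Sentiment:"""

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=[{"role": "user", "content": few_shot_prompt}],
)
print(response.choices[0].message.content)
```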

When using prompt engineering, it's important to keep track of the prompts used, since they will likely be improved over time and can impact performance on specific tasks. This way, if a new prompt in production performs worse than the previous one in some respect, reverting to the old prompt is easy.
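One lightweight way to do this, sketched below as an illustration rather than a specific tool from the lesson, is to store each prompt under an explicit version so that production can pin a version and roll back when needed:

```python
# Minimal prompt registry: every change gets a new version, so production can
# pin a specific version and revert if a new prompt underperforms.
PROMPTS = {
    "summarize-video/v1": "Summarize the following transcript in 3 bullet points:\n{transcript}",
    "summarize-video/v2": (
        "Summarize the following transcript in 3 concise bullet points, "
        "mentioning the main topic first:\n{transcript}"
    ),
}

ACTIVE_PROMPT = "summarize-video/v2"  # revert by switching back to .../v1

def build_prompt(transcript: str) -> str:
    return PROMPTS[ACTIVE_PROMPT].format(transcript=transcript)

print(build_prompt("..."))
```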

Additionally, fine-tuning can be used to enhance the model's performance on a specific task; it requires a high-quality dataset (and thus a data collection step). There are different fine-tuning approaches, such as full fine-tuning of the model, instruction fine-tuning, or using soft prompts. Fine-tuning is challenging because of the large size of the model, and deploying the newly fine-tuned model on new infrastructure can also be difficult. To address this, there are now fine-tuning techniques that train only a small set of additional parameters on top of the existing foundation model, such as LoRA. With LoRA, the same foundation model can stay deployed on the infrastructure, and the additional fine-tuned parameters are loaded only when needed. Recently, popular proprietary models like GPT-3.5 and PaLM can also be fine-tuned directly on the provider's platform.
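As an illustration of the LoRA idea, here is a minimal sketch using the Hugging Face peft library; the base model name and hyperparameters are placeholders, not values from the lesson. Only the small LoRA adapter matrices are trained, while the foundation model's weights stay frozen.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the (frozen) foundation model; placeholder model name.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# LoRA adds small trainable low-rank matrices to selected layers;
# the original weights are not updated.
lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters is trainable
```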

When fine-tuning a model, it's essential to keep track of the dataset used and the metrics achieved. It can be helpful to use a tool like Weights and Biases, which tracks experiments and provides a dashboard where you can monitor the metrics of your fine-tuned model on an evaluation set as it is trained. This provides insights into whether the training is progressing well or not. See this page to learn more about how W&B experiment tracking works. It will be used in the following lessons to train and fine-tune language models.
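A minimal sketch of what that tracking looks like with the wandb client is shown below; the project name, configuration, and metric values are placeholders for illustration.

```python
import wandb

# Start a run that records the fine-tuning configuration for reproducibility.
run = wandb.init(
    project="llm-finetuning",  # placeholder project name
    config={"base_model": "meta-llama/Llama-2-7b-hf", "lr": 2e-4, "epochs": 3},
)

# Inside the training loop, log metrics; they appear live on the W&B dashboard.
for step in range(100):
    train_loss, eval_loss = 1.0 / (step + 1), 1.2 / (step + 1)  # dummy values
    wandb.log({"train/loss": train_loss, "eval/loss": eval_loss}, step=step)

run.finish()
```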

3. Evaluation

Evaluating the performance of an LLM is more complex than evaluating traditional ML models. The main reason for this is that the output of an LLM is usually free text, and it’s harder to devise metrics that can be computed via code and that work well on free text. For example, try thinking about how you could evaluate the quality of an answer given by an LLM assistant whose job is to summarize YouTube videos, for which you don’t have reference summaries written by humans. Currently, organizations often resort to A/B testing to assess the effectiveness of their models, checking whether the user’s satisfaction is the same or better after the change in production.
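As a hedged illustration of that A/B testing idea (the feedback counts and the choice of significance test are mine, not from the lesson), one can compare a user-satisfaction signal such as thumbs-up rates between the current model or prompt and a candidate replacement:

```python
from scipy.stats import chi2_contingency

# Hypothetical feedback counts collected during an A/B test:
# variant A = current model/prompt, variant B = candidate replacement.
thumbs_up = {"A": 420, "B": 465}
thumbs_down = {"A": 180, "B": 155}

table = [
    [thumbs_up["A"], thumbs_down["A"]],
    [thumbs_up["B"], thumbs_down["B"]],
]
chi2, p_value, _, _ = chi2_contingency(table)

rate_a = thumbs_up["A"] / (thumbs_up["A"] + thumbs_down["A"])
rate_b = thumbs_up["B"] / (thumbs_up["B"] + thumbs_down["B"])
print(f"satisfaction A={rate_a:.1%}, B={rate_b:.1%}, p={p_value:.3f}")
# Roll out B only if user satisfaction is the same or better (and the difference is not noise).
```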

Another aspect to consider is hallucinations. How can we measure, with a metric implemented in code, whether the answer of our LLM assistant contains hallucinations? This is another open challenge where organizations mainly rely on A/B testing.

4. Deployment and Monitoring

Deploying and monitoring LLMs is very important as their completions can change significantly between releases. Tools for monitoring LLMs are emerging to address this need.

Another concern in LLMOps is the latency of the model. Since the model is autoregressive (i.e., it produces the output one token at a time), it may take some time to output a complete paragraph. This clashes with the most popular use of LLMs as assistants, which should be able to output text at a throughput similar to a user's reading speed.
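A simple, hedged way to quantify this (the model name and the reading-speed figure are assumptions for illustration) is to measure generation throughput in tokens per second and compare it with a typical reading speed of roughly 3-5 words per second:

```python
import time
from transformers import AutoTokenizer, pipeline

model_name = "gpt2"  # small placeholder model so the sketch runs anywhere
tokenizer = AutoTokenizer.from_pretrained(model_name)
generator = pipeline("text-generation", model=model_name, tokenizer=tokenizer)

prompt = "LLMOps is"
start = time.perf_counter()
output = generator(prompt, max_new_tokens=128)[0]["generated_text"]
elapsed = time.perf_counter() - start

# Count how many tokens were generated beyond the prompt.
new_tokens = len(tokenizer(output)["input_ids"]) - len(tokenizer(prompt)["input_ids"])
print(f"{new_tokens / elapsed:.1f} tokens/s "
      f"(users read roughly 3-5 words per second, so much slower output feels laggy)")
```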

One of the emerging tools in the LLMOps landscape is W&B Prompts, a suite designed specifically for the development of LLM-powered applications. W&B Prompts offers a comprehensive set of features that allow developers to visualize and inspect the execution flow of LLMs, analyze the inputs and outputs, view intermediate results, and securely manage prompts and LLM chain configurations.

A key component of W&B Prompts is Trace, a tool that tracks and visualizes the inputs, outputs, execution flow, and model architecture of LLM chains. Trace is particularly useful for LLM chaining, plug-in, or pipelining use cases. It provides a Trace Table for an overview of the inputs and outputs of a chain, a Trace Timeline that displays the execution flow of the chain color-coded according to component types, and a Model Architecture view that provides details about the structure of the chain and the parameters used to initialize each component.
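Below is a minimal sketch of logging a single span with Trace, assuming the Trace API exposed by the wandb SDK at the time of writing; the project name, chain name, timestamps, and inputs/outputs are placeholders.

```python
import time
import wandb
from wandb.sdk.data_types.trace_tree import Trace

wandb.init(project="llmops-prompts")  # placeholder project name

start_ms = round(time.time() * 1000)
# ... here you would call your LLM or chain and capture its inputs and outputs ...
end_ms = round(time.time() * 1000)

root_span = Trace(
    name="summarize-video-chain",  # placeholder chain name
    kind="chain",
    status_code="success",
    start_time_ms=start_ms,
    end_time_ms=end_ms,
    inputs={"query": "Summarize this video transcript"},
    outputs={"response": "..."},
)
root_span.log(name="trace")  # appears in the Trace Table / Trace Timeline views
wandb.finish()
```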

LLMOps is a rapidly evolving field, and it's hard to predict its future trajectory. However, it's clear that as LLMs become more prevalent, so will the tools and practices associated with LLMOps. The rise of LLMs and LLMOps signifies a major shift in building and maintaining AI-powered products.

Conclusion

In conclusion, LLMOps, or Large Language Model Operations, is a critical aspect of managing the lifecycle of applications powered by LLMs. This lesson has provided an overview of the origins and significance of LLMOps, the steps involved in building an LLM-powered application, and the differences between LLMOps and MLOps.

We studied the process of selecting a foundation model, adapting it to downstream tasks, evaluating its performance, and deploying and monitoring the model. We've also highlighted the unique challenges posed by LLMs, such as the complexity of evaluating free text outputs and the need for prompt versioning and efficient deployment strategies.

The emergence of tools like W&B Prompts and practices like A/B testing are indicative of the rapid evolution of LLMOps. As LLMs continue to revolutionize various applications, the tools and practices associated with LLMOps will undoubtedly become increasingly important in AI.