Training on Generated Data and Model Collapse

Introduction

In this lesson, we examine the phenomenon of model collapse: its stages, its causes, and its implications for the future of Large Language Models. We also draw parallels with related concepts in machine learning, such as catastrophic forgetting and data poisoning. Finally, we consider the value of human-generated content in an era of dominant LLMs and the risk of widespread model collapse.

Understanding Model Collapse

Model collapse, defined in the paper “The Curse of Recursion: Training on Generated Data Makes Models Forget,” is a degenerative process affecting generations of learned generative models. It occurs when the data generated by a model ends up contaminating the training set of subsequent models. As a result, these models start to misinterpret reality, reinforcing their own beliefs instead of learning from real data.

Here’s an image exemplifying model collapse.

Image from the paper “The Curse of Recursion: Training on Generated Data Makes Models Forget.” Model Collapse refers to a degenerative learning process where models start forgetting improbable events over time as the model becomes poisoned with its projection of reality.

There are two distinct stages of model collapse: early and late.

  • In the early stage, the model begins to lose information about the tails of the distribution.
  • As the process progresses to the late stage, the model starts to entangle different modes of the original distributions, eventually converging to a distribution that bears little resemblance to the original one, often with very small variance (the toy simulation after this list illustrates both stages).
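
To build intuition, here is a minimal toy simulation of this progression; it is an illustrative sketch with arbitrarily chosen settings, not an experiment from the paper. Each “generation” is a Gaussian fitted to a finite sample drawn from the previous generation's fitted Gaussian: tail events are under-sampled almost immediately (early stage), and over many generations the fitted variance drifts toward zero (late stage).

```python
# A toy simulation of model collapse: each "generation" is a Gaussian fitted by
# maximum likelihood to a finite sample drawn from the previous generation's
# fitted Gaussian. All settings here are arbitrary illustrative choices.
import math
import numpy as np

rng = np.random.default_rng(0)

n_samples = 100          # finite training sample used at every generation
n_generations = 2_000

mu, sigma = 0.0, 1.0     # generation 0: the "real" data distribution N(0, 1)

for gen in range(1, n_generations + 1):
    # Only a finite sample of the previous generation's output is available,
    # so rare (tail) events are often never drawn at all.
    data = rng.normal(mu, sigma, size=n_samples)

    # Fit the next generation to that purely synthetic sample.
    mu, sigma = data.mean(), data.std()

    if gen in (1, 10, 100, 1_000, 2_000):
        # Probability the *current model* assigns to |x| > 3 (about 0.0027
        # under the original N(0, 1)); it shrinks as the tails are forgotten.
        tail = (0.5 * math.erfc((3.0 - mu) / (sigma * math.sqrt(2)))
                + 0.5 * math.erfc((3.0 + mu) / (sigma * math.sqrt(2))))
        print(f"gen {gen:5d}: sigma = {sigma:.4f}, model P(|x| > 3) ~ {tail:.2e}")

# Typical outcome: early on, sigma still hovers near 1 while tail events are
# already under-represented; over many generations sigma drifts toward 0 and
# the model assigns vanishing probability to events the original data contained.
```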

Here’s an example of text outputs from successive generations of 125M-parameter LLMs, where each generation is trained on data produced by the previous one.
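
To make the generational setup concrete, the sketch below shows one way such a loop could be wired up with the Hugging Face transformers library. It is a minimal illustration: the model checkpoint (facebook/opt-125m), the prompt, the sampling settings, and the training hyperparameters are assumptions made for the example, not the configuration used in the paper's experiments.

```python
# Sketch of generational training: generation k is fine-tuned only on text
# sampled from generation k-1. Model choice, prompts, and hyperparameters are
# illustrative placeholders, not the configuration used in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"            # any small causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)

def sample_corpus(model, n_docs=256, max_new_tokens=128):
    """Generate a synthetic corpus from the current generation."""
    model.eval()
    prompt = tokenizer("The", return_tensors="pt")   # trivial prompt; a real run would vary prompts
    docs = []
    for _ in range(n_docs):
        out = model.generate(**prompt, do_sample=True, top_p=0.9,
                             max_new_tokens=max_new_tokens,
                             pad_token_id=tokenizer.eos_token_id)
        docs.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return docs

def finetune(model, corpus, epochs=1, lr=5e-5):
    """Minimal causal-LM fine-tuning loop on the (synthetic) corpus."""
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for text in corpus:
            batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model

model = AutoModelForCausalLM.from_pretrained(model_name)     # generation 0: trained on real data
for generation in range(1, 6):
    synthetic_corpus = sample_corpus(model)                   # data produced by the previous generation
    model = AutoModelForCausalLM.from_pretrained(model_name)  # start the next generation from the base checkpoint...
    model = finetune(model, synthetic_corpus)                 # ...and train it only on synthetic text
    print(f"finished generation {generation}; inspect samples for degradation")
```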

Related Work on Model Collapse

Model collapse shares similarities with two concepts in machine learning literature: catastrophic forgetting and data poisoning.

  • Catastrophic forgetting, a challenge in continual learning, refers to a model's tendency to forget previously learned samples when learning new information. This is particularly relevant in task-free continual learning, where data distributions gradually change without the notion of separate tasks. In the context of model collapse, however, the shift in the data distribution comes from the model itself, since each generation is trained on data generated in the previous iteration.
  • On the other hand, data poisoning involves the insertion of malicious data during training to degrade the model’s performance. This concept becomes increasingly relevant with the rise of contrastive learning and LLMs trained on untrustworthy web sources.

Yet neither catastrophic forgetting nor data poisoning fully explains model collapse, as neither accounts for the self-reinforcing distortion of reality that characterizes it. Still, understanding these related concepts provides additional insight into the mechanisms of model collapse and potential mitigation strategies.

Causes of Model Collapse

Model collapse primarily results from two types of errors: statistical approximation error and functional approximation error.

  • The statistical approximation error is the primary cause. It arises because each generation is trained on a finite sample of the previous generation's output: even with a large number of training points, there is a non-zero probability that low-probability (tail) events are never drawn, so information can be lost at every step of re-sampling.
  • The functional approximation error is a secondary cause. It stems from the limitations of our function approximators: even though neural networks are theoretically capable of approximating any function, in practice they can assign non-zero likelihood outside the support of the original distribution, introducing errors of their own (the sketch after this list separates the two error sources on a toy example).
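
The following toy example separates the two error sources; the bimodal mixture, the single-Gaussian model, and all sample sizes are illustrative assumptions rather than anything from the paper. Fitting a single Gaussian to a bimodal distribution shows a limited function approximator placing likelihood where the true distribution has almost none, while a chain of finite resampling steps shows information being lost even when no function approximator is involved at all.

```python
# Toy illustration of the two error sources. True data: an equal mixture of
# N(-4, 1) and N(+4, 1). The "model" is a single Gaussian, a deliberately
# limited function approximator. All numbers are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

def sample_true(n):
    """Draw n points from the bimodal ground-truth mixture."""
    modes = rng.choice([-4.0, 4.0], size=n)
    return rng.normal(modes, 1.0)

# 1) Functional approximation error: even with abundant data, a single Gaussian
#    cannot represent two modes. It merges them and places much of its
#    likelihood around x = 0, where the true distribution has almost no mass.
big = sample_true(1_000_000)
mu, sigma = big.mean(), big.std()
model_mass_near_zero = np.mean(np.abs(rng.normal(mu, sigma, 100_000)) < 1.0)
print(f"best single-Gaussian fit: mu = {mu:+.2f}, sigma = {sigma:.2f}")
print(f"mass the fitted model puts in |x| < 1: {model_mass_near_zero:.2%} "
      f"(the true mixture puts ~0.3% there)")

# 2) Statistical approximation error: even with no function approximator at all,
#    repeatedly re-sampling a finite dataset keeps dropping rare values, so
#    information about the original distribution is lost generation by generation.
data = sample_true(1_000)
for _ in range(50):
    data = rng.choice(data, size=1_000, replace=True)   # "train" on the previous generation's sample
print(f"after 50 resampling generations, {np.unique(data).size} distinct values "
      f"survive out of the original 1000")
```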

The Future of the Web with Dominant LLMs

As LLMs become more prevalent in the online text and image ecosystem, they will inevitably train on data produced by their predecessors. This could lead to a cycle where each model generation learns more from previous models' output and less from original human-generated content. The result is a risk of widespread model collapse, with models progressively losing touch with the true underlying data distribution.

Model collapse has far-reaching implications. As models start to misinterpret reality, the quality of generated content could degrade over time. This could profoundly affect many applications of LLMs, from content creation to decision-making systems.

The Value of Human-Generated Content

In the face of model collapse, preserving and accessing data collected from genuine human interactions becomes increasingly valuable. Real human-produced data preserves access to the original data distribution, which is crucial for tasks where the tails of the underlying distribution matter. As LLMs generate an ever-larger share of online content, data from human interactions with these models will become an increasingly valuable resource for training future models.
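
To see why even a modest amount of genuine data helps, here is a small continuation of the earlier Gaussian toy simulation; the 10% and 50% mixing ratios are arbitrary illustrative choices, not recommendations from the paper. Each generation's training set mixes a fixed fraction of fresh samples from the original distribution with synthetic samples from the previous fit.

```python
# Continuation of the Gaussian toy simulation above (illustrative settings only):
# each generation's training set mixes a fixed fraction of fresh samples from the
# original N(0, 1) with synthetic samples drawn from the previous generation's fit.
import numpy as np

rng = np.random.default_rng(0)

def run_chain(real_fraction, n_samples=100, n_generations=2_000):
    """Fit-sample-refit chain with a share of genuine data in every generation."""
    mu, sigma = 0.0, 1.0
    n_real = int(real_fraction * n_samples)
    for _ in range(n_generations):
        real = rng.normal(0.0, 1.0, size=n_real)                    # fresh human/original data
        synthetic = rng.normal(mu, sigma, size=n_samples - n_real)  # previous generation's output
        data = np.concatenate([real, synthetic])
        mu, sigma = data.mean(), data.std()
    return mu, sigma

for frac in (0.0, 0.1, 0.5):
    mu, sigma = run_chain(frac)
    print(f"real-data fraction {frac:.0%}: final fit mu = {mu:+.3f}, sigma = {sigma:.3f}")

# Typical outcome: with no real data the fitted sigma collapses toward 0, while
# even a modest share of genuine samples keeps the fit anchored near N(0, 1).
```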

Conclusion

In this lesson, we've explored the phenomenon of model collapse, a degenerative process that can affect generative models when they are trained on data produced by other models.

We've examined the stages of model collapse, from the early loss of information about the tails of the distribution to the late-stage entanglement of different modes. We've drawn parallels with related concepts in machine learning: catastrophic forgetting and data poisoning. We've also dissected the leading causes of model collapse, namely statistical and functional approximation errors.

As Large Language Models become more dominant in the digital landscape, the risk of widespread model collapse increases, potentially leading to a degradation in generated content quality. In this context, we've underscored the importance of preserving and accessing human-generated content, which provides a crucial link to the original data distribution and serves as a valuable resource for training future models.

As we continue to harness the power of LLMs, understanding and mitigating model collapse will be essential in ensuring the quality and reliability of their outputs.