Open-Source LLMs

Introduction

In this lesson, we will discuss several open-source LLMs and their features, capabilities, and licenses. This overview covers LLaMA 2, Falcon, Dolly (by Databricks), Open Assistant, and Mistral, some of the most widely used open-source models. We will also explore the licenses and potential commercial usage of these models, along with any limitations or restrictions present in their licenses.

LLaMA 2

LLaMA 2 is a cutting-edge large language model developed by Meta, released on July 18, 2023, under a community license that permits both research and commercial use.

The architecture of LLaMA 2 is described in great detail in the 77-page paper, making it easier for data scientists to recreate and fine-tune the models for their specific needs. The model's training data comprises an impressive 2 trillion tokens. Trained at this massive scale, it outperforms other open-source models on benchmarks and demonstrates performance comparable to GPT-3.5 in human evaluations.

LLaMA 2 is available in three parameter sizes: 7B, 13B, and 70B, and there are also instruction-tuned versions known as LLaMA 2-Chat.

The fine-tuning process is done through Supervised Fine-Tuning (SFT) and Reinforcement Learning with Human Feedback (RLHF), using a novel approach to segment data based on helpfulness and safety prompts.

The reward models are crucial to LLaMA 2's performance, allowing it to balance safety and helpfulness effectively. The safety reward model and helpfulness reward model are trained to evaluate the quality of generated responses.

The impact of LLaMA 2 on generative AI is substantial, outperforming other open-source models such as Falcon and Vicuna.

You can find the LLaMA 2 models on the Hugging Face Hub. Here, we test the meta-llama/Llama-2-7b-chat-hf model. To use it, you'll first have to request access to the model on its model page.

First, let’s download the model. It takes some time as the model weighs about 14GB.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# download model
model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
	model_id,
	trust_remote_code=True,
	torch_dtype=torch.bfloat16
)
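
If the full-precision weights don't fit in your GPU memory, one optional approach (not part of the original example) is to load the model in 4-bit precision using the bitsandbytes integration in transformers. This is a minimal sketch, assuming bitsandbytes and accelerate are installed and a CUDA GPU is available:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# optional: load the model in 4-bit precision to reduce memory usage
bnb_config = BitsAndBytesConfig(
	load_in_4bit=True,
	bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
	model_id,
	quantization_config=bnb_config,
	device_map="auto"  # requires the accelerate package
)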

Then, we generate a completion with it. This step will take a long time if you’re generating text on a CPU instead of a GPU!

# generate answer
prompt = "Translate English to French: Configuration files are easy to use!"
inputs = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False)
outputs = model.generate(**inputs, max_new_tokens=100)

# print answer
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
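
Since meta-llama/Llama-2-7b-chat-hf is a chat-tuned model, it usually produces better answers when the prompt is wrapped in the Llama 2 chat template described in the model card. Here is a minimal sketch; the system message is just an illustrative example:

# wrap the request in the Llama 2 chat format ([INST] and <<SYS>> tags)
system_prompt = "You are a helpful assistant that translates English to French."
user_message = "Translate English to French: Configuration files are easy to use!"
chat_prompt = f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message} [/INST]"

inputs = tokenizer(chat_prompt, return_tensors="pt", return_token_type_ids=False)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])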

Falcon

The Falcon models, developed and trained by the Technology Innovation Institute (TII) of Abu Dhabi, have gained significant attention since their release in May 2023. These models are causal large language models (LLM), similar to GPT, and are also known as "decoder-only" models. They excel in predicting the next token in a sequence of tokens with their attention focused solely on the left context during training, while the right context remains masked.

The Falcon models are distributed under the Apache 2.0 License, allowing even commercial use.

The largest of these models, Falcon-40B, has shown great performance, outperforming other causal LLMs like LLaMA-65B and MPT-7B. Falcon-7B, a smaller version, was designed to be fine-tuned on consumer hardware and has half the number of layers and embedding dimensions compared to Falcon-40B.

The training data for Falcon models primarily comes from the “Falcon RefinedWeb dataset,” which is meticulously curated and multimodal-friendly, preserving links and alt texts of images. This dataset and curated corpora make up 75% of the pre-training data for the Falcon models. While it primarily covers English, additional versions like "RefinedWeb-Europe" have been prepared to include several European languages.

The instruct versions of Falcon-40B and Falcon-7B perform even better, with fine-tuning done on a mixture of chat/instruct datasets sourced from various places, including GPT4all and GPTeacher.

You can find the Falcon models on the Hugging Face Hub. Here, we test the tiiuae/falcon-7b-instruct model. You can reuse the code from the LLaMA example by changing the model_id.

model_id = "tiiuae/falcon-7b-instruct"

Dolly

Dolly is an open-source LLM introduced by Databricks. It was first unveiled as Dolly 1.0, a language model that showcased ChatGPT-like human interactivity. The team has now released Dolly 2.0, a better instruction-following LLM.

One of the critical features of Dolly 2.0 is that it is built on a new, high-quality human-generated instruction dataset called "databricks-dolly-15k". This dataset consists of 15,000 prompt/response pairs designed explicitly for instruction tuning large language models. Unlike many instruction-following models, Dolly 2.0's dataset is open-source and licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. This means that anyone can use, modify, or extend the dataset for any purpose, including commercial applications.
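
The dataset itself is hosted on the Hugging Face Hub, so you can inspect it directly. Here is a minimal sketch using the datasets library (the column names come from the dataset card):

from datasets import load_dataset

# load the human-generated instruction dataset behind Dolly 2.0
dolly_dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

print(dolly_dataset)                    # number of rows and column names
print(dolly_dataset[0]["instruction"])  # an example prompt
print(dolly_dataset[0]["response"])     # its human-written response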

The Dolly 2.0 model is based on the EleutherAI Pythia-12b architecture, comprising 12 billion parameters, which makes it capable of high-quality instruction-following behavior. Despite its instruction dataset being smaller than those used for models such as Alpaca, Dolly 2.0 has demonstrated strong performance thanks to its reliance on real-world, human-generated training records rather than synthesized data.

You can find the Databricks models on the Hugging Face Hub. Here, we test the databricks/dolly-v2-3b model. Again, you can reuse the code from the LLaMA example by changing the model_id.

model_id = "databricks/dolly-v2-3b"

Open Assistant

The Open Assistant project is an initiative aiming to make high-quality large language models accessible to everyone through an open-source and collaborative approach. Unlike some other open-source ChatGPT alternatives with restrictive licenses, Open Assistant seeks to provide a versatile chat-based language model, comparable to ChatGPT and GPT-4, that can be used for commercial purposes.

The heart of the project lies in its commitment to openness and inclusivity. They have collected a substantial dataset from over 13,000 volunteers, comprising more than 600,000 interactions, 150,000 messages, and 10,000 fully annotated conversation trees on various topics and in multiple languages. This dataset serves as the foundation for training various models hosted on platforms like Hugging Face.
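
A large part of this data has been released as the OpenAssistant/oasst1 dataset on the Hugging Face Hub. Here is a minimal sketch for loading and inspecting it (the field names are taken from the dataset card):

from datasets import load_dataset

# load the Open Assistant conversations dataset (individual messages)
oasst_dataset = load_dataset("OpenAssistant/oasst1", split="train")

print(oasst_dataset)             # dataset size and columns
print(oasst_dataset[0]["role"])  # "prompter" or "assistant"
print(oasst_dataset[0]["text"])  # the message content
print(oasst_dataset[0]["lang"])  # language code of the message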

Users can explore the potential of Open Assistant by interacting with the model through the Hugging Face demo or the official chat interface, both designed to solicit user feedback to help improve the chatbot's responses. The project encourages community involvement and contributions, allowing users to participate in data collection and ranking tasks to enhance the capabilities of the language model.

As with most open-source large language models, Open Assistant has some limitations, particularly in answering math and coding questions, as it has been trained on fewer interactions in these domains. However, the model is generally adept at generating interesting and human-like responses, though occasional inaccuracies may occur.

Mistral

In September 2023, Mistral AI released its language model Mistral 7B under the Apache 2.0 license. With 7.3 billion parameters, the model outperforms Llama 2 13B on all benchmarks and Llama 1 34B on many of them, and it approaches the performance of CodeLlama 7B on code while maintaining proficiency in English tasks.

Mistral 7B uses Grouped-query attention (GQA) for faster inference and Sliding Window Attention (SWA) to handle longer sequences more cost-effectively. This, along with modifications to FlashAttention and xFormers, has led to a 2x speed improvement for sequence lengths of 16k with a window of 4k.

The model can be downloaded and used anywhere, including locally, with the team's reference implementation. It can also be deployed on any cloud (AWS/GCP/Azure) using the vLLM inference server, or used on Hugging Face.

Mistral 7B is easily fine-tuned for any task. As a demonstration, the team has provided Mistral 7B Instruct, a version fine-tuned for chat that outperforms the Llama 2 13B chat model, outperforms all other 7B models on MT-Bench, and is comparable to 13B chat models.
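
You can try the instruction-tuned model with the same code used for the LLaMA example by changing the model_id, assuming your transformers version already supports the Mistral architecture. According to its model card, the instruct model expects prompts wrapped in [INST] ... [/INST] tags:

model_id = "mistralai/Mistral-7B-Instruct-v0.1"

# Mistral 7B Instruct expects prompts wrapped in [INST] ... [/INST] tags
prompt = "[INST] Translate English to French: Configuration files are easy to use! [/INST]"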

The Hugging Face Open LLM Leaderboard

Hugging Face hosts an LLM leaderboard. This leaderboard is created by evaluating community-submitted models on text-generation benchmarks on Hugging Face’s clusters. It’s an excellent resource for finding the best-performing open-source LLMs.

You can also filter the models to find the one that meets your specific requirements.

Conclusion

In this lesson, we explored several open-source LLMs and their features, capabilities, and licenses. We discussed LLaMA 2, Falcon, Dolly, Open Assistant, and Mistral as some of the most prominent open-source LLMs available.

  • LLaMA 2, developed by Meta, is a cutting-edge language model available in several parameter sizes. It has been trained on a massive scale and demonstrates performance comparable to GPT-3.5.
  • Falcon models, developed and trained by the Technology Innovation Institute (TII) of Abu Dhabi, have gained attention for their decoder-only approach and have shown great performance, especially the Falcon-40B model.
  • Dolly, introduced by Databricks, is an open-source LLM with a focus on instruction following. Its high-quality, human-generated instruction dataset is licensed under Creative Commons, allowing for versatile use, including commercial applications.
  • Open Assistant is an ambitious project aiming to make high-quality LLMs accessible to everyone through openness and inclusivity. It encourages community involvement and contributions to enhance the capabilities of the language model.
  • Mistral 7B, released by Mistral AI under the Apache 2.0 license, uses grouped-query and sliding-window attention for efficient inference and outperforms the larger Llama 2 13B model on benchmarks.

It is essential to acknowledge the importance of open-source LLMs in advancing the field of natural language processing and enabling wider access to state-of-the-art language models for research and commercial purposes.

In the next lesson, we will explore an equally important aspect of LLMs - hallucinations and bias. Hallucinations refer to the generation of fake or incorrect information by LLMs, while bias entails the perpetuation of prejudiced or discriminatory content. Understanding and addressing these challenges are crucial to ensuring the responsible and ethical use of large language models in various applications.