Benchmarking Your Own LLM

Introduction

In a previous lesson, we touched on several methodologies for assessing the effectiveness of language models. Even with these methodologies available, evaluating a large language model remains a significant challenge.

The primary challenge stems from the inherent subjectivity in determining what constitutes a good answer. Consider a generative task such as summarization: there is no single summary that can be designated as the definitive correct answer, which makes quality hard to quantify. The same issue arises in virtually every generative task, making it a pervasive source of difficulty.

In this lesson, we’ll see how benchmarks are a candidate solution to this problem.

Benchmarks Over Several Tasks

The suggested solution is to curate a set of benchmarks that evaluate a model's performance across various tasks. These benchmarks include assessments of world knowledge, following complex instructions, arithmetic, programming, and more.

Several leaderboards exist to monitor the progress of LLMs, including the “Open LLM Leaderboard” and the “InstructEval Leaderboard.” In some cases, these benchmarks share similar metrics.

Here are some examples of tasks tested in these benchmarks:

  • AI2 Reasoning Challenge (ARC): The dataset comprises natural, grade-school science questions originally written for human tests.
  • HumanEval: Measures program synthesis from docstrings. It includes 164 original programming problems assessing language comprehension, algorithms, and simple mathematics, some resembling software interview questions.
  • HellaSwag: A benchmark for commonsense inference that remains difficult for state-of-the-art models: while the questions are trivial for humans (>95% accuracy), models struggle to exceed 48% accuracy.
  • Measuring Massive Multitask Language Understanding (MMLU): A benchmark of text models' multitask accuracy, covering 57 tasks, including math, US history, computer science, law, and more. High accuracy requires extensive world knowledge and problem-solving ability.
  • TruthfulQA: A benchmark designed to assess how truthful language models are when generating answers to questions. It consists of 817 questions across 38 categories, encompassing topics such as health, law, finance, and politics.

Language Model Evaluation Harness

EleutherAI has released a benchmarking script capable of evaluating any language model, whether it is a proprietary model accessible via API or an open-source one. It is designed to work with generative tasks, utilizing publicly available prompts to enable fair comparisons across papers. As of writing this lesson, the script supports over 200 evaluation tasks spanning a wide array of capabilities.

Reproducibility is a vital aspect of any evaluation process, especially with generative models, where numerous inference parameters introduce varying degrees of randomness. The harness employs a task-versioning feature that keeps results comparable even after a task's implementation is updated. Let's look at the process of using the library to benchmark an open-source model.

To begin, you need to fetch the script from GitHub. The following code pins the script to version 0.3.0. If you are working in a Google Colab or notebook environment, it is advisable to use the sample code provided at the end of this lesson.

git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness && git checkout e2eb966 # Pin to version 0.3.0
The sample script.
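
After cloning and checking out the pinned commit, the harness also needs to be installed as a Python package so that its modules can be imported. A minimal sketch, assuming the repository's standard setup files support an editable pip install (run from the cloned directory):

pip install -e .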

Now, we can access the library files and list all the available tasks it supports. By printing the internal variable named ALL_TASKS, you can obtain a list of all the tasks that have been implemented.

from lm_eval import tasks

print(tasks.ALL_TASKS)
The sample code.
['Ceval-valid-accountant', 'Ceval-valid-advanced_mathematics', 'Ceval-valid-art_studies', 'Ceval-valid-basic_medicine', 'Ceval-valid-business_administration', 'Ceval-valid-chinese_language_and_literature', 'Ceval-valid-civil_servant', 'Ceval-valid-clinical_medicine', 'Ceval-valid-college_chemistry', 'Ceval-valid-college_economics', 'Ceval-valid-college_physics', 'Ceval-valid-college_programming', 'Ceval-valid-computer_architecture', 'Ceval-valid-computer_network', 'Ceval-valid-discrete_mathematics', 'Ceval-valid-education_science', 'Ceval-valid-electrical_engineer', 'Ceval-valid-environmental_impact_assessment_engineer', 'Ceval-valid-fire_engineer', 'Ceval-valid-high_school_biology', 'Ceval-valid-high_school_chemistry', 'Ceval-valid-high_school_chinese', 'Ceval-valid-high_school_geography', 'Ceval-valid-high_school_history', 'Ceval-valid-high_school_mathematics', 'Ceval-valid-high_school_physics', 'Ceval-valid-high_school_politics', 'Ceval-valid-ideological_and_moral_cultivation', 'Ceval-valid-law', 'Ceval-valid-legal_professional', 'Ceval-valid-logic', 'Ceval-valid-mao_zedong_thought', 'Ceval-valid-marxism', 'Ceval-valid-metrology_engineer', 'Ceval-valid-middle_school_biology', 'Ceval-valid-middle_school_chemistry', 'Ceval-valid-middle_school_geography', 'Ceval-valid-middle_school_history', 'Ceval-valid-middle_school_mathematics', 'Ceval-valid-middle_school_physics', 'Ceval-valid-middle_school_politics', 'Ceval-valid-modern_chinese_history', 'Ceval-valid-operating_system', 'Ceval-valid-physician', 'Ceval-valid-plant_protection', 'Ceval-valid-probability_and_statistics', 'Ceval-valid-professional_tour_guide', 'Ceval-valid-sports_science', 'Ceval-valid-tax_accountant', 'Ceval-valid-teacher_qualification', 'Ceval-valid-urban_and_rural_planner', 'Ceval-valid-veterinary_medicine', 'anagrams1', 'anagrams2', 'anli_r1', 'anli_r2', 'anli_r3', 'arc_challenge', 'arc_easy', 'arithmetic_1dc', 'arithmetic_2da', 'arithmetic_2dm', 'arithmetic_2ds', 'arithmetic_3da', 'arithmetic_3ds', 'arithmetic_4da', 'arithmetic_4ds', 'arithmetic_5da', 'arithmetic_5ds', 'babi', 'bigbench___init__', 'bigbench_causal_judgement', 'bigbench_date_understanding', 'bigbench_disambiguation_qa', 'bigbench_dyck_languages', 'bigbench_formal_fallacies_syllogisms_negation', 'bigbench_geometric_shapes', 'bigbench_hyperbaton', 'bigbench_logical_deduction_five_objects', 'bigbench_logical_deduction_seven_objects', 'bigbench_logical_deduction_three_objects', 'bigbench_movie_recommendation', 'bigbench_navigate', 'bigbench_reasoning_about_colored_objects', 'bigbench_ruin_names', 'bigbench_salient_translation_error_detection', 'bigbench_snarks', 'bigbench_sports_understanding', 'bigbench_temporal_sequences', 'bigbench_tracking_shuffled_objects_five_objects', 'bigbench_tracking_shuffled_objects_seven_objects', 'bigbench_tracking_shuffled_objects_three_objects', 'blimp_adjunct_island', 'blimp_anaphor_gender_agreement', 'blimp_anaphor_number_agreement', 'blimp_animate_subject_passive', 'blimp_animate_subject_trans', 'blimp_causative', 'blimp_complex_NP_island', 'blimp_coordinate_structure_constraint_complex_left_branch', 'blimp_coordinate_structure_constraint_object_extraction', 'blimp_determiner_noun_agreement_1', 'blimp_determiner_noun_agreement_2', 'blimp_determiner_noun_agreement_irregular_1', 'blimp_determiner_noun_agreement_irregular_2', 'blimp_determiner_noun_agreement_with_adj_2', 'blimp_determiner_noun_agreement_with_adj_irregular_1', 'blimp_determiner_noun_agreement_with_adj_irregular_2', 
'blimp_determiner_noun_agreement_with_adjective_1', 'blimp_distractor_agreement_relational_noun', 'blimp_distractor_agreement_relative_clause', 'blimp_drop_argument', 'blimp_ellipsis_n_bar_1', 'blimp_ellipsis_n_bar_2', 'blimp_existential_there_object_raising', 'blimp_existential_there_quantifiers_1', 'blimp_existential_there_quantifiers_2', 'blimp_existential_there_subject_raising', 'blimp_expletive_it_object_raising', 'blimp_inchoative', 'blimp_intransitive', 'blimp_irregular_past_participle_adjectives', 'blimp_irregular_past_participle_verbs', 'blimp_irregular_plural_subject_verb_agreement_1', 'blimp_irregular_plural_subject_verb_agreement_2', 'blimp_left_branch_island_echo_question', 'blimp_left_branch_island_simple_question', 'blimp_matrix_question_npi_licensor_present', 'blimp_npi_present_1', 'blimp_npi_present_2', 'blimp_only_npi_licensor_present', 'blimp_only_npi_scope', 'blimp_passive_1', 'blimp_passive_2', 'blimp_principle_A_c_command', 'blimp_principle_A_case_1', 'blimp_principle_A_case_2', 'blimp_principle_A_domain_1', 'blimp_principle_A_domain_2', 'blimp_principle_A_domain_3', 'blimp_principle_A_reconstruction', 'blimp_regular_plural_subject_verb_agreement_1', 'blimp_regular_plural_subject_verb_agreement_2', 'blimp_sentential_negation_npi_licensor_present', 'blimp_sentential_negation_npi_scope', 'blimp_sentential_subject_island', 'blimp_superlative_quantifiers_1', 'blimp_superlative_quantifiers_2', 'blimp_tough_vs_raising_1', 'blimp_tough_vs_raising_2', 'blimp_transitive', 'blimp_wh_island', 'blimp_wh_questions_object_gap', 'blimp_wh_questions_subject_gap', 'blimp_wh_questions_subject_gap_long_distance', 'blimp_wh_vs_that_no_gap', 'blimp_wh_vs_that_no_gap_long_distance', 'blimp_wh_vs_that_with_gap', 'blimp_wh_vs_that_with_gap_long_distance', 'boolq', 'cb', 'cola', 'copa', 'coqa', 'crows_pairs_english', 'crows_pairs_english_age', 'crows_pairs_english_autre', 'crows_pairs_english_disability', 'crows_pairs_english_gender', 'crows_pairs_english_nationality', 'crows_pairs_english_physical_appearance', 'crows_pairs_english_race_color', 'crows_pairs_english_religion', 'crows_pairs_english_sexual_orientation', 'crows_pairs_english_socioeconomic', 'crows_pairs_french', 'crows_pairs_french_age', 'crows_pairs_french_autre', 'crows_pairs_french_disability', 'crows_pairs_french_gender', 'crows_pairs_french_nationality', 'crows_pairs_french_physical_appearance', 'crows_pairs_french_race_color', 'crows_pairs_french_religion', 'crows_pairs_french_sexual_orientation', 'crows_pairs_french_socioeconomic', 'csatqa_gr', 'csatqa_li', 'csatqa_rch', 'csatqa_rcs', 'csatqa_rcss', 'csatqa_wr', 'cycle_letters', 'drop', 'ethics_cm', 'ethics_deontology', 'ethics_justice', 'ethics_utilitarianism', 'ethics_utilitarianism_original', 'ethics_virtue', 'gsm8k', 'haerae_hi', 'haerae_kgk', 'haerae_lw', 'haerae_rc', 'haerae_rw', 'haerae_sn', 'headqa', 'headqa_en', 'headqa_es', 'hellaswag', 'hendrycksTest-abstract_algebra', 'hendrycksTest-anatomy', 'hendrycksTest-astronomy', 'hendrycksTest-business_ethics', 'hendrycksTest-clinical_knowledge', 'hendrycksTest-college_biology', 'hendrycksTest-college_chemistry', 'hendrycksTest-college_computer_science', 'hendrycksTest-college_mathematics', 'hendrycksTest-college_medicine', 'hendrycksTest-college_physics', 'hendrycksTest-computer_security', 'hendrycksTest-conceptual_physics', 'hendrycksTest-econometrics', 'hendrycksTest-electrical_engineering', 'hendrycksTest-elementary_mathematics', 'hendrycksTest-formal_logic', 'hendrycksTest-global_facts', 
'hendrycksTest-high_school_biology', 'hendrycksTest-high_school_chemistry', 'hendrycksTest-high_school_computer_science', 'hendrycksTest-high_school_european_history', 'hendrycksTest-high_school_geography', 'hendrycksTest-high_school_government_and_politics', 'hendrycksTest-high_school_macroeconomics', 'hendrycksTest-high_school_mathematics', 'hendrycksTest-high_school_microeconomics', 'hendrycksTest-high_school_physics', 'hendrycksTest-high_school_psychology', 'hendrycksTest-high_school_statistics', 'hendrycksTest-high_school_us_history', 'hendrycksTest-high_school_world_history', 'hendrycksTest-human_aging', 'hendrycksTest-human_sexuality', 'hendrycksTest-international_law', 'hendrycksTest-jurisprudence', 'hendrycksTest-logical_fallacies', 'hendrycksTest-machine_learning', 'hendrycksTest-management', 'hendrycksTest-marketing', 'hendrycksTest-medical_genetics', 'hendrycksTest-miscellaneous', 'hendrycksTest-moral_disputes', 'hendrycksTest-moral_scenarios', 'hendrycksTest-nutrition', 'hendrycksTest-philosophy', 'hendrycksTest-prehistory', 'hendrycksTest-professional_accounting', 'hendrycksTest-professional_law', 'hendrycksTest-professional_medicine', 'hendrycksTest-professional_psychology', 'hendrycksTest-public_relations', 'hendrycksTest-security_studies', 'hendrycksTest-sociology', 'hendrycksTest-us_foreign_policy', 'hendrycksTest-virology', 'hendrycksTest-world_religions', 'iwslt17-ar-en', 'iwslt17-en-ar', 'lambada_openai', 'lambada_openai_cloze', 'lambada_openai_mt_de', 'lambada_openai_mt_en', 'lambada_openai_mt_es', 'lambada_openai_mt_fr', 'lambada_openai_mt_it', 'lambada_standard', 'lambada_standard_cloze', 'logiqa', 'math_algebra', 'math_asdiv', 'math_counting_and_prob', 'math_geometry', 'math_intermediate_algebra', 'math_num_theory', 'math_prealgebra', 'math_precalc', 'mathqa', 'mc_taco', 'mgsm_bn', 'mgsm_de', 'mgsm_en', 'mgsm_es', 'mgsm_fr', 'mgsm_ja', 'mgsm_ru', 'mgsm_sw', 'mgsm_te', 'mgsm_th', 'mgsm_zh', 'mnli', 'mnli_mismatched', 'mrpc', 'multirc', 'mutual', 'mutual_plus', 'openbookqa', 'pawsx_de', 'pawsx_en', 'pawsx_es', 'pawsx_fr', 'pawsx_ja', 'pawsx_ko', 'pawsx_zh', 'pile_arxiv', 'pile_bookcorpus2', 'pile_books3', 'pile_dm-mathematics', 'pile_enron', 'pile_europarl', 'pile_freelaw', 'pile_github', 'pile_gutenberg', 'pile_hackernews', 'pile_nih-exporter', 'pile_opensubtitles', 'pile_openwebtext2', 'pile_philpapers', 'pile_pile-cc', 'pile_pubmed-abstracts', 'pile_pubmed-central', 'pile_stackexchange', 'pile_ubuntu-irc', 'pile_uspto', 'pile_wikipedia', 'pile_youtubesubtitles', 'piqa', 'prost', 'pubmedqa', 'qa4mre_2011', 'qa4mre_2012', 'qa4mre_2013', 'qasper', 'qnli', 'qqp', 'race', 'random_insertion', 'record', 'reversed_words', 'rte', 'sciq', 'scrolls_contractnli', 'scrolls_govreport', 'scrolls_narrativeqa', 'scrolls_qasper', 'scrolls_qmsum', 'scrolls_quality', 'scrolls_summscreenfd', 'squad2', 'sst', 'swag', 'toxigen', 'triviaqa', 'truthfulqa_gen', 'truthfulqa_mc', 'webqs', 'wic', 'wikitext', 'winogrande', 'wmt14-en-fr', 'wmt14-fr-en', 'wmt16-de-en', 'wmt16-en-de', 'wmt16-en-ro', 'wmt16-ro-en', 'wmt20-cs-en', 'wmt20-de-en', 'wmt20-de-fr', 'wmt20-en-cs', 'wmt20-en-de', 'wmt20-en-iu', 'wmt20-en-ja', 'wmt20-en-km', 'wmt20-en-pl', 'wmt20-en-ps', 'wmt20-en-ru', 'wmt20-en-ta', 'wmt20-en-zh', 'wmt20-fr-de', 'wmt20-iu-en', 'wmt20-ja-en', 'wmt20-km-en', 'wmt20-pl-en', 'wmt20-ps-en', 'wmt20-ru-en', 'wmt20-ta-en', 'wmt20-zh-en', 'wnli', 'wsc', 'wsc273', 'xcopa_et', 'xcopa_ht', 'xcopa_id', 'xcopa_it', 'xcopa_qu', 'xcopa_sw', 'xcopa_ta', 'xcopa_th', 'xcopa_tr', 'xcopa_vi', 'xcopa_zh', 
'xnli_ar', 'xnli_bg', 'xnli_de', 'xnli_el', 'xnli_en', 'xnli_es', 'xnli_fr', 'xnli_hi', 'xnli_ru', 'xnli_sw', 'xnli_th', 'xnli_tr', 'xnli_ur', 'xnli_vi', 'xnli_zh', 'xstory_cloze_ar', 'xstory_cloze_en', 'xstory_cloze_es', 'xstory_cloze_eu', 'xstory_cloze_hi', 'xstory_cloze_id', 'xstory_cloze_my', 'xstory_cloze_ru', 'xstory_cloze_sw', 'xstory_cloze_te', 'xstory_cloze_zh', 'xwinograd_en', 'xwinograd_fr', 'xwinograd_jp', 'xwinograd_pt', 'xwinograd_ru', 'xwinograd_zh']
The output.
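
Since the full list is long, it can help to filter it for a particular benchmark family before running anything. The one-liner below is a small sketch that relies only on the tasks.ALL_TASKS variable shown above; the "truthfulqa" substring is just an example filter and can be swapped for any keyword.

python -c "from lm_eval import tasks; print([t for t in tasks.ALL_TASKS if 'truthfulqa' in t])"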

To execute the evaluation, use the main.py file from the repository cloned earlier, and make sure you run the command from the same directory as that file. The command below loads the facebook/opt-1.3b model and evaluates its performance on the hellaswag task using GPU acceleration. (Note that the displayed output is truncated; for the complete output, feel free to explore the attached notebook.)

python main.py \
    --model hf-causal \
    --model_args pretrained=facebook/opt-1.3b \
    --tasks hellaswag \
    --device cuda:0
The sample code.
Running loglikelihood requests
100% 40145/40145 [29:44<00:00, 22.50it/s]
{
  "results": {
    "hellaswag": {
      "acc": 0.4146584345747859,
      "acc_stderr": 0.00491656121359129,
      "acc_norm": 0.5368452499502091,
      "acc_norm_stderr": 0.004976214989483508
    }
  },
  "versions": {
    "hellaswag": 0
  },
  "config": {
    "model": "hf-causal",
    "model_args": "pretrained=facebook/opt-1.3b",
    "num_fewshot": 0,
    "batch_size": null,
    "batch_sizes": [],
    "device": "cuda:0",
    "no_cache": false,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}
hf-causal (pretrained=facebook/opt-1.3b), limit: None, provide_description: False, num_fewshot: 0, batch_size: None
|  Task   |Version| Metric |Value |   |Stderr|
|---------|------:|--------|-----:|---|-----:|
|hellaswag|      0|acc     |0.4147|±  |0.0049|
|         |       |acc_norm|0.5368|±  |0.0050|
The truncated output.
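
A full HellaSwag pass runs more than 40,000 log-likelihood requests and took roughly half an hour in the run above, so a quick trial run can be useful before committing to a full evaluation. The sketch below assumes that the limit and batch_size fields shown in the config output correspond to --limit and --batch_size command-line flags; note that restricting the number of examples with --limit is only a sanity check, and its scores should not be reported as benchmark results.

python main.py \
    --model hf-causal \
    --model_args pretrained=facebook/opt-1.3b \
    --tasks hellaswag \
    --device cuda:0 \
    --limit 100 \
    --batch_size 8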

For a more in-depth exploration, the --model argument offers three options: hf-causal for standard decoder-only (causal) language models, hf-causal-experimental for running on multiple GPUs, and hf-seq2seq for evaluating encoder-decoder models.
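
As an illustration of the hf-seq2seq option, the sketch below swaps in an encoder-decoder checkpoint; google/flan-t5-base and the arc_easy task are assumptions chosen purely for the example, and any Hugging Face seq2seq model or supported task could be used instead.

python main.py \
    --model hf-seq2seq \
    --model_args pretrained=google/flan-t5-base \
    --tasks arc_easy \
    --device cuda:0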

Additionally, the --model_args parameter can be used to pass extra arguments to the model. For instance, to use a specific revision of the model with the float data type, pass the following input: --model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype="float". The accepted arguments vary based on the chosen model and the inputs accepted by Hugging Face, as the library leverages Hugging Face to load open-source models. Alternatively, when evaluating OpenAI models, you can specify the engine type with --model_args engine=davinci.
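
Assembled into a full command, the Pythia example from the text would look roughly like the sketch below; the hellaswag task is an arbitrary choice for illustration.

python main.py \
    --model hf-causal \
    --model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype="float" \
    --tasks hellaswag \
    --device cuda:0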

Lastly, a combined evaluation over several tasks is possible. To achieve this, simply pass a comma-separated string of task names, such as --tasks hellaswag,arc_challenge, which runs both HellaSwag and the ARC challenge in a single evaluation, as sketched below.
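
A sketch of such a combined run, reusing the same model and device as the earlier command with only the --tasks value changed:

python main.py \
    --model hf-causal \
    --model_args pretrained=facebook/opt-1.3b \
    --tasks hellaswag,arc_challenge \
    --device cuda:0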

To evaluate proprietary models from OpenAI, you need to set the OPENAI_API_SECRET_KEY environment variable to your secret key, which you can obtain from the OpenAI dashboard.

export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE

python main.py \
    --model gpt3 \
    --model_args engine=davinci \
    --tasks hellaswag
The sample code.

InstructEval

There are other efforts to evaluate language models, such as the InstructEval leaderboard, which combines automated evaluation with GPT-4-based scoring of model outputs. It is worth mentioning that InstructEval focuses mainly on instruction-tuned models.

Image from the InstructEval paper.

The evaluation is broken down into three distinct tasks.

1. Problem-Solving Evaluation

This category uses the following tests: world knowledge with Massive Multitask Language Understanding (MMLU), complex instructions with BIG-Bench Hard (BBH), comprehension and arithmetic with Discrete Reasoning Over Paragraphs (DROP), programming with HumanEval, and causality with the Counterfactual Reasoning Assessment (CRASS). These automated evaluations assess the model's performance across various problem-solving tasks.

2. Writing Evaluation

This category evaluates models on the following subjective qualities: informative, professional, argumentative, and creative writing. GPT-4 is used as the judge: it is presented with a rubric and asked to score each model's output on a Likert scale from 1 to 5.

3. Alignment to Human Values

Finally, a crucial aspect of instruction-tuned models is their alignment with human values. We expect these models to uphold values such as helpfulness, honesty, and harmlessness. The leaderboard evaluates a model by presenting a dialogue together with a pair of candidate responses and asking it to choose the more appropriate one.

We won't dive into an extensive explanation of the evaluation process as it closely resembles the previous benchmark, involving the execution of a Python script. Please follow the link to their GitHub repository, where they provide sample usage.

Conclusion

Having standardized metrics for evaluating different models is essential; otherwise, comparing the capabilities of various models would become impractical.

In this lesson, we introduced several widely used benchmarks along with a script that facilitates the evaluation of LLMs. It is important to keep track of the latest leaderboards and of the evaluation metrics relevant to your specific use case. In most instances, a model that excels at every task is not necessary, so staying up to date with the relevant metrics helps identify the most suitable model for your requirements.

>> Notebook.