Leaked benchmarks for Meta AI's Llama 3.1 405B suggest the open-source LLM could outscore OpenAI's GPT-4o on several key tests.
Leaked: Meta AI Llama 3.1 405B benchmarks
Meta introduced Llama 3 in April 2024 as a new generation of cutting-edge, open-source large language models. The initial release included Llama 3 8B and Llama 3 70B, both of which set new performance standards for models of their size. Within just three months, however, several other models surpassed those marks, a sign of how rapidly the field of artificial intelligence is advancing.
Meta has announced that the most ambitious model in the Llama 3 series will boast over 400 billion parameters, a massive leap in scale, and that this model is still in training. In a dramatic turn of events, early benchmark data for the forthcoming Llama 3.1 models (the 8B, the 70B, and the colossal 405B) were leaked on the LocalLLaMA subreddit today. The preliminary results suggest that the Llama 3.1 405B model could surpass the current industry leader, OpenAI's GPT-4o, across several critical AI benchmarks.
Should the Llama 3.1 405B model indeed surpass GPT-4o, it would represent the first instance of an open-source model eclipsing a leading closed-source LLM.
| Benchmark | GPT-4o | Meta Llama 3.1 405B | Meta Llama 3.1 70B | Meta Llama 3 70B | Meta Llama 3.1 8B | Meta Llama 3 8B |
| --- | --- | --- | --- | --- | --- | --- |
| BoolQ | 0.905 | 0.921 | 0.909 | 0.892 | 0.871 | 0.820 |
| GSM8K | 0.942 | 0.968 | 0.948 | 0.833 | 0.844 | 0.572 |
| HellaSwag | 0.891 | 0.920 | 0.908 | 0.874 | 0.768 | 0.462 |
| HumanEval | 0.921 | 0.854 | 0.793 | 0.390 | 0.683 | 0.341 |
| MMLU Humanities | 0.802 | 0.818 | 0.795 | 0.706 | 0.619 | 0.560 |
| MMLU Other | 0.872 | 0.875 | 0.852 | 0.825 | 0.740 | 0.709 |
| MMLU Social Sciences | 0.913 | 0.898 | 0.878 | 0.872 | 0.761 | 0.741 |
| MMLU STEM | 0.696 | 0.831 | 0.771 | 0.696 | 0.595 | 0.561 |
| OpenBookQA | 0.882 | 0.908 | 0.936 | 0.928 | 0.852 | 0.802 |
| PIQA | 0.844 | 0.874 | 0.862 | 0.894 | 0.801 | 0.764 |
| Social IQa | 0.790 | 0.797 | 0.813 | 0.789 | 0.734 | 0.667 |
| TruthfulQA MC1 | 0.825 | 0.800 | 0.769 | 0.520 | 0.606 | 0.327 |
| Winogrande | 0.822 | 0.867 | 0.845 | 0.776 | 0.650 | 0.560 |
As the table shows, the leaked numbers have Meta's Llama 3.1 405B outscoring OpenAI's GPT-4o on the majority of the listed tests, setting a new bar in several crucial areas of AI performance. Notably, it leads on GSM8K, HellaSwag, BoolQ, MMLU Humanities, MMLU Other, MMLU STEM, and Winogrande. However, it trails GPT-4o on HumanEval, MMLU Social Sciences, and TruthfulQA MC1, indicating areas where further refinement is needed.
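The leak does not say how these numbers were produced, but the lowercase task names in the original post (boolq, gsm8k, hellaswag) match the naming conventions of EleutherAI's lm-evaluation-harness, a standard tool for this kind of evaluation. Below is a minimal sketch of how anyone can score an openly available checkpoint on a few of the same tasks; the few-shot setting and model arguments are illustrative assumptions, since the leaked configuration is unknown.

```python
# Hedged sketch: scoring a model with EleutherAI's lm-evaluation-harness
# (pip install lm-eval). The leak does not state its evaluation settings,
# so the values below are illustrative assumptions, not the leaked setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # evaluate a local Hugging Face transformers model
    model_args="pretrained=meta-llama/Meta-Llama-3-8B,dtype=bfloat16",
    tasks=["boolq", "hellaswag", "winogrande"],  # harness task names
    num_fewshot=0,   # assumption; the leaked few-shot count is unknown
    batch_size=8,
)

# Print the per-task metric dictionaries (accuracy, stderr, etc.).
for task, metrics in results["results"].items():
    print(task, metrics)
```

Scores produced this way depend heavily on the few-shot count, prompt format, and decoding settings, which is one more reason to read leaked numbers with caution until Meta publishes official results.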
It is critical to recognize that these benchmarks reflect the performance of the base models of Llama 3.1. The true potential of these models can be realized through instruction-tuning, a process that can significantly enhance their capabilities. The forthcoming Instruct versions of the Llama 3.1 models are expected to yield even better results, showcasing improvements across various benchmarks.
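To make the base-versus-Instruct distinction concrete, here is a hedged sketch using Hugging Face transformers and the already-released Llama 3 8B checkpoints as stand-ins (the Llama 3.1 Instruct weights were not public at the time of the leak): a base model simply continues raw text, while an Instruct model is queried through the chat template it was tuned on.

```python
# Hedged sketch: base model vs. instruction-tuned model. Uses the released
# Llama 3 8B checkpoints as stand-ins, since the Llama 3.1 Instruct weights
# were not public at the time of the leak.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "Explain what the GSM8K benchmark measures."

# Base model: plain next-token continuation of the raw prompt.
base_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
inputs = base_tok(prompt, return_tensors="pt").to(base.device)
print(base_tok.decode(base.generate(**inputs, max_new_tokens=64)[0]))

# Instruct model: the same question, wrapped in the chat template that the
# instruction-tuning was performed with.
it_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
it = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
chat = it_tok.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(it.device)
print(it_tok.decode(it.generate(chat, max_new_tokens=64)[0]))
```

The Instruct model will typically answer the question directly, while the base model may simply continue the sentence; that gap is what instruction-tuning, and the benchmark gains it brings, is about.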
Stressing the importance of open-source initiatives
While GPT-5 may yet challenge Llama 3.1's emerging dominance, the impressive showing of Llama 3.1 against GPT-4o underscores the growing influence and capability of open-source AI initiatives.
“We are embracing the open source ethos of releasing early and often to enable the community to get access to these models while they are still in development. The text-based models we are releasing today are the first in the Llama 3 collection of models. Our goal in the near future is to make Llama 3 multilingual and multimodal, have longer context, and continue to improve overall performance across core LLM capabilities such as reasoning and coding,” stated Meta in a blog post when launching Llama 3.
The significance of open-source AI cannot be overstated. By making their advanced models accessible to the public, Meta not only democratizes technology but also taps into the collective intelligence and diverse perspectives of the global developer community. This approach contrasts sharply with closed-source models, which are typically accessible only to a select group of users and researchers, thereby limiting the potential for widespread innovation and enhancement.
Featured image credit: Penfer/Unsplash