Evaluation with MMLU dataset
How to run
To run the MMLU benchmark with a specific model, execute the following command from the recipe directory (recipes/mmlu):
eole tools run_mmlu -c <model_inference_config>
For instance, following the llama3 recipe:
eole tools run_mmlu -c ../llama3/llama-mmlu.yaml
Results
Note: Below are the legacy OpenNMT-py results. We might re-run and update everything at some point.
All evaluations below have been computed with the OpenNMT-py converted models.
The evaluation script is taken from the https://github.com/FranxYao/chain-of-thought-hub repository and modified to use the OpenNMT-py models.
There is one difference compared to the original MMLU script from Hendrycks et al.:
we do not compare the log probabilities of A, B, C, D to determine the answer; instead, we actually decode the next token after the prompt.
When the model is SentencePiece-based, that next token can be 'A', 'B', 'C', 'D' or any other token. When the model is BPE-based, the tokens will be ' A', ' B', ' C', ' D' because the leading space is encoded together with the letter; we strip that space to compute the metric.
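To make the metric concrete, here is a minimal sketch of that scoring rule. It is purely illustrative, not the actual eole or chain-of-thought-hub code; the function names and inputs (normalize, mmlu_accuracy, decoded_tokens, references) are assumptions for the example.

```python
# Minimal sketch of the scoring rule described above (illustrative only,
# not the actual eole script). `decoded_tokens` holds the single token string
# decoded after each prompt; `references` holds the gold letters "A".."D".

def normalize(token: str) -> str:
    # BPE tokenizers encode the leading space together with the letter
    # (e.g. " A"), so strip it; SentencePiece models already yield "A".
    return token.strip()

def mmlu_accuracy(decoded_tokens: list[str], references: list[str]) -> float:
    # Any decoded token that is not exactly the gold letter counts as wrong,
    # including tokens that are not A/B/C/D at all.
    correct = sum(normalize(t) == ref for t, ref in zip(decoded_tokens, references))
    return correct / len(references)

if __name__ == "__main__":
    # A BPE model answering " B" to a question whose gold answer is "B" is correct.
    print(mmlu_accuracy([" B", "C", "The"], ["B", "C", "A"]))  # -> 0.666...
```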
For 7B-parameter models:
- Llama7B's score (35.25) matches both the Llama paper and the score reported by chain-of-thought-hub.
- Falcon7B is a little higher than the score reported by chain-of-thought-hub (0.2641).
- I ran MPT7B with chain-of-thought-hub and found 28.46; again, ours is a little higher.
- There are major discrepancies between these scores and the Hugging Face Open LLM Leaderboard for MPT, Falcon and Redpajama, which score much higher on the leaderboard.
For the 13B, 33B and 40B models, we score with the 4-bit loading option, hence Llama13B scores slightly under the paper value (46.9), and the same holds for 33B (the paper reports 57.8).
| | MPT7B | Redpajama7B | Open Llama7B | Falcon7B | xgen7B | Flan-T5-3B | Llama7B | Llama-2-7B | Llama-2-chat-7B | Open Llama13B | Llama13B | Llama-2-13B | Llama-2-chat-13B | Falcon40B | Llama33B | Llama-2-70B |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ACC-all | 0.2958 | 0.2745 | 0.3007 | 0.2765 | 0.3468 | 0.4929 | 0.3525 | 0.4587 | 0.4569 | 0.4148 | 0.4472 | 0.5429 | 0.5217 | 0.5499 | 0.5701 | 0.6875 |
ACC-abstract_algebra | 0.2200 | 0.2500 | 0.3000 | 0.2400 | 0.2900 | 0.2700 | 0.2500 | 0.3000 | 0.3100 | 0.3200 | 0.2800 | 0.3100 | 0.3500 | 0.3200 | 0.3700 | 0.3900 |
ACC-anatomy | 0.2963 | 0.2667 | 0.3333 | 0.2444 | 0.3185 | 0.4296 | 0.3852 | 0.4815 | 0.4222 | 0.4667 | 0.4889 | 0.5037 | 0.5037 | 0.5111 | 0.5185 | 0.6296 |
ACC-astronomy | 0.2961 | 0.2763 | 0.2500 | 0.2434 | 0.3355 | 0.4737 | 0.3487 | 0.4079 | 0.4803 | 0.4737 | 0.4671 | 0.5263 | 0.5461 | 0.5658 | 0.6118 | 0.7895 |
ACC-business_ethics | 0.2900 | 0.2900 | 0.3200 | 0.1900 | 0.3200 | 0.6800 | 0.4100 | 0.5300 | 0.4200 | 0.4100 | 0.4300 | 0.5500 | 0.5000 | 0.5500 | 0.5800 | 0.6900 |
ACC-clinical_knowledge | 0.2943 | 0.3208 | 0.3887 | 0.3019 | 0.3057 | 0.5245 | 0.3585 | 0.4604 | 0.5208 | 0.4113 | 0.4189 | 0.5811 | 0.5698 | 0.6113 | 0.5547 | 0.7019 |
ACC-college_biology | 0.3056 | 0.3125 | 0.3264 | 0.2153 | 0.3958 | 0.4444 | 0.3819 | 0.4722 | 0.5417 | 0.4167 | 0.4722 | 0.5694 | 0.5347 | 0.6319 | 0.5833 | 0.8333 |
ACC-college_chemistry | 0.2800 | 0.2700 | 0.2400 | 0.2300 | 0.2500 | 0.3400 | 0.2900 | 0.3400 | 0.2500 | 0.2800 | 0.2400 | 0.3900 | 0.3600 | 0.4100 | 0.3800 | 0.5200 |
ACC-college_computer_science | 0.3100 | 0.3100 | 0.3100 | 0.3000 | 0.3300 | 0.3600 | 0.2900 | 0.3400 | 0.3600 | 0.4000 | 0.3700 | 0.4600 | 0.5100 | 0.4700 | 0.4400 | 0.6000 |
ACC-college_mathematics | 0.2900 | 0.2500 | 0.2800 | 0.2900 | 0.3200 | 0.2900 | 0.3400 | 0.3800 | 0.3400 | 0.3200 | 0.2500 | 0.3000 | 0.2800 | 0.3500 | 0.3600 | 0.3700 |
ACC-college_medicine | 0.2890 | 0.2659 | 0.3179 | 0.2659 | 0.3410 | 0.4277 | 0.3237 | 0.4220 | 0.4104 | 0.3699 | 0.4220 | 0.5318 | 0.4451 | 0.4798 | 0.5376 | 0.6532 |
ACC-college_physics | 0.2157 | 0.2451 | 0.1863 | 0.2157 | 0.2353 | 0.2941 | 0.2451 | 0.2255 | 0.2451 | 0.2549 | 0.1863 | 0.2647 | 0.3137 | 0.3333 | 0.3137 | 0.3333 |
ACC-computer_security | 0.3100 | 0.3600 | 0.3800 | 0.2800 | 0.3900 | 0.6400 | 0.4500 | 0.6200 | 0.5400 | 0.5400 | 0.6300 | 0.6900 | 0.6700 | 0.6500 | 0.6800 | 0.8100 |
ACC-conceptual_physics | 0.3362 | 0.2723 | 0.3064 | 0.3149 | 0.3489 | 0.4085 | 0.3702 | 0.4170 | 0.3872 | 0.3574 | 0.3915 | 0.4511 | 0.3787 | 0.4170 | 0.4723 | 0.6723 |
ACC-econometrics | 0.2895 | 0.2368 | 0.2895 | 0.2632 | 0.2632 | 0.2807 | 0.2632 | 0.2632 | 0.3333 | 0.3070 | 0.2719 | 0.2895 | 0.3158 | 0.3246 | 0.3333 | 0.4123 |
ACC-electrical_engineering | 0.2897 | 0.3034 | 0.3034 | 0.2828 | 0.3862 | 0.4552 | 0.2483 | 0.4759 | 0.4345 | 0.4966 | 0.3862 | 0.5172 | 0.5103 | 0.5034 | 0.4690 | 0.6276 |
ACC-elementary_mathematics | 0.2698 | 0.2646 | 0.2698 | 0.2593 | 0.2725 | 0.3148 | 0.2646 | 0.2672 | 0.2857 | 0.2487 | 0.2487 | 0.3360 | 0.3333 | 0.3413 | 0.3413 | 0.4180 |
ACC-formal_logic | 0.2540 | 0.4048 | 0.2381 | 0.1905 | 0.2619 | 0.3333 | 0.2619 | 0.2698 | 0.2381 | 0.3016 | 0.3889 | 0.3492 | 0.2857 | 0.3413 | 0.3571 | 0.5000 |
ACC-global_facts | 0.2700 | 0.3200 | 0.3200 | 0.3100 | 0.3300 | 0.3600 | 0.3000 | 0.3200 | 0.3100 | 0.2900 | 0.3400 | 0.3200 | 0.2900 | 0.3300 | 0.3900 | 0.4500 |
ACC-high_school_biology | 0.3097 | 0.2484 | 0.2968 | 0.2645 | 0.3290 | 0.5645 | 0.3387 | 0.5065 | 0.5258 | 0.4290 | 0.5065 | 0.6742 | 0.6194 | 0.6516 | 0.6419 | 0.8194 |
ACC-high_school_chemistry | 0.2020 | 0.2660 | 0.2512 | 0.2512 | 0.2611 | 0.3300 | 0.2956 | 0.3744 | 0.3547 | 0.3350 | 0.2660 | 0.4286 | 0.4138 | 0.4187 | 0.3793 | 0.5468 |
ACC-high_school_computer_science | 0.3400 | 0.2700 | 0.2800 | 0.3200 | 0.3200 | 0.5100 | 0.3300 | 0.4000 | 0.4500 | 0.2700 | 0.4500 | 0.5500 | 0.5800 | 0.6000 | 0.5800 | 0.7700 |
ACC-high_school_european_history | 0.3455 | 0.2848 | 0.3455 | 0.2909 | 0.3879 | 0.7333 | 0.4667 | 0.6121 | 0.5818 | 0.4727 | 0.6121 | 0.6545 | 0.6667 | 0.6667 | 0.7152 | 0.8121 |
ACC-high_school_geography | 0.3737 | 0.3283 | 0.3333 | 0.1667 | 0.3636 | 0.6414 | 0.3333 | 0.4899 | 0.5960 | 0.4899 | 0.5000 | 0.6616 | 0.6616 | 0.7121 | 0.7273 | 0.8636 |
ACC-high_school_government_and_politics | 0.3782 | 0.2124 | 0.3575 | 0.2591 | 0.4352 | 0.6632 | 0.4611 | 0.6736 | 0.6632 | 0.5959 | 0.6425 | 0.8135 | 0.7617 | 0.7927 | 0.8187 | 0.9430 |
ACC-high_school_macroeconomics | 0.3821 | 0.2718 | 0.3564 | 0.2615 | 0.3359 | 0.5359 | 0.3410 | 0.4513 | 0.4103 | 0.4282 | 0.4256 | 0.4923 | 0.4744 | 0.5641 | 0.5590 | 0.7308 |
ACC-high_school_mathematics | 0.2778 | 0.2667 | 0.2407 | 0.2481 | 0.2333 | 0.3074 | 0.2630 | 0.2963 | 0.2556 | 0.2667 | 0.2593 | 0.2889 | 0.3037 | 0.3111 | 0.2741 | 0.3630 |
ACC-high_school_microeconomics | 0.2941 | 0.3067 | 0.2941 | 0.2899 | 0.3697 | 0.5168 | 0.3319 | 0.4412 | 0.4328 | 0.4370 | 0.4454 | 0.5630 | 0.5042 | 0.5504 | 0.5588 | 0.7605 |
ACC-high_school_physics | 0.2583 | 0.2649 | 0.2517 | 0.3179 | 0.2450 | 0.2980 | 0.2649 | 0.3179 | 0.3046 | 0.2980 | 0.2517 | 0.3444 | 0.3245 | 0.2914 | 0.3311 | 0.3907 |
ACC-high_school_psychology | 0.2844 | 0.3229 | 0.3505 | 0.2440 | 0.4752 | 0.6771 | 0.4789 | 0.6312 | 0.6477 | 0.5486 | 0.5835 | 0.7413 | 0.7229 | 0.7541 | 0.7596 | 0.8752 |
ACC-high_school_statistics | 0.4028 | 0.2454 | 0.3981 | 0.1852 | 0.1620 | 0.3657 | 0.3241 | 0.2778 | 0.3241 | 0.2546 | 0.2685 | 0.4722 | 0.3611 | 0.4630 | 0.4676 | 0.6157 |
ACC-high_school_us_history | 0.2892 | 0.2255 | 0.3137 | 0.2892 | 0.4167 | 0.6863 | 0.3284 | 0.5245 | 0.6765 | 0.5490 | 0.5343 | 0.7108 | 0.6863 | 0.7108 | 0.7696 | 0.9069 |
ACC-high_school_world_history | 0.2489 | 0.2785 | 0.2869 | 0.2996 | 0.3966 | 0.6667 | 0.4262 | 0.6245 | 0.6667 | 0.5105 | 0.6287 | 0.7089 | 0.7215 | 0.6835 | 0.7637 | 0.8608 |
ACC-human_aging | 0.3274 | 0.1659 | 0.2870 | 0.4215 | 0.4260 | 0.5650 | 0.3991 | 0.5695 | 0.5695 | 0.5157 | 0.5112 | 0.6502 | 0.6816 | 0.7130 | 0.6861 | 0.7848 |
ACC-human_sexuality | 0.3511 | 0.2519 | 0.2748 | 0.2901 | 0.3359 | 0.5802 | 0.3435 | 0.5649 | 0.4885 | 0.4962 | 0.5649 | 0.6031 | 0.5878 | 0.6794 | 0.6718 | 0.8550 |
ACC-international_law | 0.3802 | 0.2231 | 0.3636 | 0.2479 | 0.5041 | 0.6860 | 0.5207 | 0.6529 | 0.5620 | 0.5207 | 0.6860 | 0.6860 | 0.7851 | 0.6612 | 0.7603 | 0.8595 |
ACC-jurisprudence | 0.3704 | 0.2315 | 0.3426 | 0.3426 | 0.4074 | 0.6204 | 0.4167 | 0.5370 | 0.5833 | 0.4444 | 0.4722 | 0.6852 | 0.7037 | 0.6667 | 0.6574 | 0.8148 |
ACC-logical_fallacies | 0.2945 | 0.2638 | 0.2883 | 0.2638 | 0.3558 | 0.6319 | 0.4172 | 0.5092 | 0.5399 | 0.4847 | 0.5031 | 0.6564 | 0.6319 | 0.6503 | 0.6994 | 0.7975 |
ACC-machine_learning | 0.3125 | 0.2232 | 0.2321 | 0.3750 | 0.2589 | 0.3571 | 0.2768 | 0.3839 | 0.3393 | 0.3571 | 0.3304 | 0.3036 | 0.3482 | 0.3036 | 0.3750 | 0.5089 |
ACC-management | 0.3301 | 0.2816 | 0.2524 | 0.2816 | 0.3010 | 0.6796 | 0.3301 | 0.5631 | 0.6699 | 0.5243 | 0.6311 | 0.7379 | 0.7184 | 0.7184 | 0.7573 | 0.8252 |
ACC-marketing | 0.3120 | 0.2735 | 0.3761 | 0.2949 | 0.5385 | 0.7906 | 0.4615 | 0.6795 | 0.7265 | 0.5897 | 0.7094 | 0.8077 | 0.7821 | 0.7949 | 0.8333 | 0.8932 |
ACC-medical_genetics | 0.3100 | 0.2400 | 0.2700 | 0.2800 | 0.3600 | 0.4800 | 0.3700 | 0.5500 | 0.5000 | 0.5100 | 0.5100 | 0.5500 | 0.5700 | 0.6200 | 0.6100 | 0.7400 |
ACC-miscellaneous | 0.3001 | 0.2899 | 0.3678 | 0.2976 | 0.5326 | 0.6782 | 0.4278 | 0.6450 | 0.6692 | 0.5900 | 0.6296 | 0.7407 | 0.7458 | 0.7471 | 0.7752 | 0.8557 |
ACC-moral_disputes | 0.2977 | 0.2659 | 0.3295 | 0.3092 | 0.3613 | 0.5983 | 0.4133 | 0.5116 | 0.5145 | 0.4798 | 0.4566 | 0.6272 | 0.5809 | 0.6503 | 0.6503 | 0.7572 |
ACC-moral_scenarios | 0.2436 | 0.2469 | 0.2469 | 0.2492 | 0.2425 | 0.2436 | 0.2425 | 0.2380 | 0.2145 | 0.2715 | 0.2480 | 0.3464 | 0.2927 | 0.2615 | 0.3855 | 0.4413 |
ACC-nutrition | 0.2810 | 0.2908 | 0.3301 | 0.2582 | 0.3431 | 0.4804 | 0.3922 | 0.4902 | 0.5098 | 0.3758 | 0.5163 | 0.6144 | 0.5980 | 0.6405 | 0.6471 | 0.7778 |
ACC-philosophy | 0.3183 | 0.2830 | 0.2830 | 0.2830 | 0.3151 | 0.5177 | 0.4051 | 0.6013 | 0.5659 | 0.4662 | 0.5145 | 0.6656 | 0.6077 | 0.6399 | 0.6656 | 0.7781 |
ACC-prehistory | 0.3056 | 0.3210 | 0.3210 | 0.3117 | 0.3488 | 0.5216 | 0.3519 | 0.4907 | 0.5679 | 0.5216 | 0.5093 | 0.6451 | 0.5926 | 0.5988 | 0.6667 | 0.8272 |
ACC-professional_accounting | 0.2447 | 0.2872 | 0.2553 | 0.2979 | 0.3050 | 0.3723 | 0.2730 | 0.3582 | 0.3475 | 0.3050 | 0.3227 | 0.3830 | 0.3759 | 0.4255 | 0.4326 | 0.5780 |
ACC-professional_law | 0.2784 | 0.2705 | 0.2523 | 0.2497 | 0.2647 | 0.3990 | 0.2973 | 0.3553 | 0.3266 | 0.3064 | 0.3566 | 0.4068 | 0.3722 | 0.4296 | 0.4342 | 0.5404 |
ACC-professional_medicine | 0.2206 | 0.2059 | 0.2500 | 0.3125 | 0.4375 | 0.4412 | 0.4265 | 0.5184 | 0.3529 | 0.3860 | 0.5000 | 0.5221 | 0.4706 | 0.6176 | 0.5441 | 0.7390 |
ACC-professional_psychology | 0.2876 | 0.2925 | 0.2696 | 0.2647 | 0.3203 | 0.4526 | 0.3546 | 0.4428 | 0.4739 | 0.3693 | 0.4575 | 0.5392 | 0.5065 | 0.5539 | 0.6144 | 0.7500 |
ACC-public_relations | 0.3455 | 0.3182 | 0.4091 | 0.3364 | 0.4182 | 0.5909 | 0.4091 | 0.5273 | 0.5182 | 0.5273 | 0.5545 | 0.6364 | 0.6091 | 0.6364 | 0.6818 | 0.7273 |
ACC-security_studies | 0.3796 | 0.2816 | 0.2939 | 0.3102 | 0.2531 | 0.6531 | 0.3306 | 0.4980 | 0.4571 | 0.4245 | 0.5224 | 0.6122 | 0.6531 | 0.6735 | 0.6367 | 0.8082 |
ACC-sociology | 0.2239 | 0.2587 | 0.2488 | 0.3532 | 0.4826 | 0.7363 | 0.4726 | 0.6318 | 0.5771 | 0.5473 | 0.6418 | 0.7264 | 0.7214 | 0.7761 | 0.7761 | 0.8955 |
ACC-us_foreign_policy | 0.3500 | 0.3200 | 0.3900 | 0.4200 | 0.5100 | 0.6600 | 0.4300 | 0.6500 | 0.6700 | 0.6100 | 0.7200 | 0.8500 | 0.7700 | 0.8000 | 0.8300 | 0.9100 |
ACC-virology | 0.3494 | 0.2530 | 0.3494 | 0.3554 | 0.3735 | 0.4819 | 0.3253 | 0.4217 | 0.4277 | 0.4398 | 0.4096 | 0.4458 | 0.4940 | 0.4639 | 0.5000 | 0.5361 |
ACC-world_religions | 0.3158 | 0.3041 | 0.4035 | 0.3333 | 0.6140 | 0.5614 | 0.4912 | 0.6842 | 0.6842 | 0.6550 | 0.6491 | 0.7602 | 0.7427 | 0.7719 | 0.7953 | 0.8538 |
| | 0.3022 | 0.2747 | 0.3053 | 0.2818 | 0.3515 | 0.5018 | 0.3569 | 0.4682 | 0.4662 | 0.4258 | 0.4559 | 0.5482 | 0.5340 | 0.5580 | 0.5741 | 0.6932 |