
Evaluation with MMLU dataset

How to run

To run the MMLU benchmark with a specific model, execute the following command from the recipe directory (recipes/mmlu):

eole tools run_mmlu -c <model_inference_config>

For instance, following the llama3 recipe:

eole tools run_mmlu -c ../llama3/llama-mmlu.yaml

Results

Note: Below are the legacy OpenNMT-py results. We might re-run and update everything at some point.

All evaluations below have been computed with the OpenNMT-py converted models.

The evaluation script is adapted from the https://github.com/FranxYao/chain-of-thought-hub repository and modified to run the OpenNMT-py models.

There is one difference from the original Hendrycks MMLU script: we do not compare the log-probabilities of A, B, C and D to pick the answer. Instead, we decode the next token after the prompt.

When the model is SentencePiece-based, the next token can be 'A', 'B', 'C', 'D' or any other token. When the model is BPE-based, the tokens will be ' A', ' B', ' C', ' D' because the leading space is encoded with the letter; we strip that space before computing the metric.
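The scoring rule described above can be sketched as follows. This is a minimal illustration, not the actual evaluation script: the function names are hypothetical, and it assumes that a decoded token other than A, B, C or D simply counts as a wrong answer.

```python
from typing import Iterable


def normalize_prediction(token: str) -> str:
    """Strip surrounding whitespace, e.g. the leading space of a BPE token like ' A'."""
    return token.strip()


def accuracy(predicted_tokens: Iterable[str], gold_answers: Iterable[str]) -> float:
    """Fraction of questions whose first decoded token equals the gold letter.

    Assumption: any token other than the gold letter (including tokens
    outside A-D) is scored as incorrect.
    """
    preds = [normalize_prediction(t) for t in predicted_tokens]
    golds = list(gold_answers)
    if len(preds) != len(golds):
        raise ValueError("predictions and gold answers must have the same length")
    if not golds:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)


# BPE-style ' A' and SentencePiece-style 'B' are handled the same way.
print(accuracy([" A", "B", " the", "D"], ["A", "C", "B", "D"]))  # 0.5
```

Because the answer is read from a single decoded token rather than from a comparison restricted to the four letters, a model that starts its reply with any other token is penalized, which may explain part of the gap with other harnesses.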

For 7B params models:

  • Llama7B score (35.25) matches both the Llama paper and the score reported by chain-of-thought-hub

  • Falcon7B is a little higher than the score reported by chain-of-thought-hub (0.2641)

  • I ran MPT7B with chain-of-thought-hub and got 28.46; again, our score is a little higher.

  • There are major discrepancies between these scores and the Hugging Face Open LLM Leaderboard for MPT, Falcon and Redpajama, which all score much higher on the leaderboard.

For the 13B, 33B and 40B models, we score with the 4-bit loading option. As a result, Llama13B scores slightly below the paper (46.9), and so does Llama33B (paper: 57.8).

| | MPT7B | Redpajama7B | Open Llama7B | Falcon7B | xgen7B | Flan-T5-3B | Llama7B | Llama-2-7B | Llama-2-chat-7B | Open Llama13B | Llama13B | Llama-2-13B | Llama-2-chat-13B | Falcon40B | Llama33B | Llama-2-70B |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ACC-all | 0.2958 | 0.2745 | 0.3007 | 0.2765 | 0.3468 | 0.4929 | 0.3525 | 0.4587 | 0.4569 | 0.4148 | 0.4472 | 0.5429 | 0.5217 | 0.5499 | 0.5701 | 0.6875 |
| ACC-abstract_algebra | 0.2200 | 0.2500 | 0.3000 | 0.2400 | 0.2900 | 0.2700 | 0.2500 | 0.3000 | 0.3100 | 0.3200 | 0.2800 | 0.3100 | 0.3500 | 0.3200 | 0.3700 | 0.3900 |
| ACC-anatomy | 0.2963 | 0.2667 | 0.3333 | 0.2444 | 0.3185 | 0.4296 | 0.3852 | 0.4815 | 0.4222 | 0.4667 | 0.4889 | 0.5037 | 0.5037 | 0.5111 | 0.5185 | 0.6296 |
| ACC-astronomy | 0.2961 | 0.2763 | 0.2500 | 0.2434 | 0.3355 | 0.4737 | 0.3487 | 0.4079 | 0.4803 | 0.4737 | 0.4671 | 0.5263 | 0.5461 | 0.5658 | 0.6118 | 0.7895 |
| ACC-business_ethics | 0.2900 | 0.2900 | 0.3200 | 0.1900 | 0.3200 | 0.6800 | 0.4100 | 0.5300 | 0.4200 | 0.4100 | 0.4300 | 0.5500 | 0.5000 | 0.5500 | 0.5800 | 0.6900 |
| ACC-clinical_knowledge | 0.2943 | 0.3208 | 0.3887 | 0.3019 | 0.3057 | 0.5245 | 0.3585 | 0.4604 | 0.5208 | 0.4113 | 0.4189 | 0.5811 | 0.5698 | 0.6113 | 0.5547 | 0.7019 |
| ACC-college_biology | 0.3056 | 0.3125 | 0.3264 | 0.2153 | 0.3958 | 0.4444 | 0.3819 | 0.4722 | 0.5417 | 0.4167 | 0.4722 | 0.5694 | 0.5347 | 0.6319 | 0.5833 | 0.8333 |
| ACC-college_chemistry | 0.2800 | 0.2700 | 0.2400 | 0.2300 | 0.2500 | 0.3400 | 0.2900 | 0.3400 | 0.2500 | 0.2800 | 0.2400 | 0.3900 | 0.3600 | 0.4100 | 0.3800 | 0.5200 |
| ACC-college_computer_science | 0.3100 | 0.3100 | 0.3100 | 0.3000 | 0.3300 | 0.3600 | 0.2900 | 0.3400 | 0.3600 | 0.4000 | 0.3700 | 0.4600 | 0.5100 | 0.4700 | 0.4400 | 0.6000 |
| ACC-college_mathematics | 0.2900 | 0.2500 | 0.2800 | 0.2900 | 0.3200 | 0.2900 | 0.3400 | 0.3800 | 0.3400 | 0.3200 | 0.2500 | 0.3000 | 0.2800 | 0.3500 | 0.3600 | 0.3700 |
| ACC-college_medicine | 0.2890 | 0.2659 | 0.3179 | 0.2659 | 0.3410 | 0.4277 | 0.3237 | 0.4220 | 0.4104 | 0.3699 | 0.4220 | 0.5318 | 0.4451 | 0.4798 | 0.5376 | 0.6532 |
| ACC-college_physics | 0.2157 | 0.2451 | 0.1863 | 0.2157 | 0.2353 | 0.2941 | 0.2451 | 0.2255 | 0.2451 | 0.2549 | 0.1863 | 0.2647 | 0.3137 | 0.3333 | 0.3137 | 0.3333 |
| ACC-computer_security | 0.3100 | 0.3600 | 0.3800 | 0.2800 | 0.3900 | 0.6400 | 0.4500 | 0.6200 | 0.5400 | 0.5400 | 0.6300 | 0.6900 | 0.6700 | 0.6500 | 0.6800 | 0.8100 |
| ACC-conceptual_physics | 0.3362 | 0.2723 | 0.3064 | 0.3149 | 0.3489 | 0.4085 | 0.3702 | 0.4170 | 0.3872 | 0.3574 | 0.3915 | 0.4511 | 0.3787 | 0.4170 | 0.4723 | 0.6723 |
| ACC-econometrics | 0.2895 | 0.2368 | 0.2895 | 0.2632 | 0.2632 | 0.2807 | 0.2632 | 0.2632 | 0.3333 | 0.3070 | 0.2719 | 0.2895 | 0.3158 | 0.3246 | 0.3333 | 0.4123 |
| ACC-electrical_engineering | 0.2897 | 0.3034 | 0.3034 | 0.2828 | 0.3862 | 0.4552 | 0.2483 | 0.4759 | 0.4345 | 0.4966 | 0.3862 | 0.5172 | 0.5103 | 0.5034 | 0.4690 | 0.6276 |
| ACC-elementary_mathematics | 0.2698 | 0.2646 | 0.2698 | 0.2593 | 0.2725 | 0.3148 | 0.2646 | 0.2672 | 0.2857 | 0.2487 | 0.2487 | 0.3360 | 0.3333 | 0.3413 | 0.3413 | 0.4180 |
| ACC-formal_logic | 0.2540 | 0.4048 | 0.2381 | 0.1905 | 0.2619 | 0.3333 | 0.2619 | 0.2698 | 0.2381 | 0.3016 | 0.3889 | 0.3492 | 0.2857 | 0.3413 | 0.3571 | 0.5000 |
| ACC-global_facts | 0.2700 | 0.3200 | 0.3200 | 0.3100 | 0.3300 | 0.3600 | 0.3000 | 0.3200 | 0.3100 | 0.2900 | 0.3400 | 0.3200 | 0.2900 | 0.3300 | 0.3900 | 0.4500 |
| ACC-high_school_biology | 0.3097 | 0.2484 | 0.2968 | 0.2645 | 0.3290 | 0.5645 | 0.3387 | 0.5065 | 0.5258 | 0.4290 | 0.5065 | 0.6742 | 0.6194 | 0.6516 | 0.6419 | 0.8194 |
| ACC-high_school_chemistry | 0.2020 | 0.2660 | 0.2512 | 0.2512 | 0.2611 | 0.3300 | 0.2956 | 0.3744 | 0.3547 | 0.3350 | 0.2660 | 0.4286 | 0.4138 | 0.4187 | 0.3793 | 0.5468 |
| ACC-high_school_computer_science | 0.3400 | 0.2700 | 0.2800 | 0.3200 | 0.3200 | 0.5100 | 0.3300 | 0.4000 | 0.4500 | 0.2700 | 0.4500 | 0.5500 | 0.5800 | 0.6000 | 0.5800 | 0.7700 |
| ACC-high_school_european_history | 0.3455 | 0.2848 | 0.3455 | 0.2909 | 0.3879 | 0.7333 | 0.4667 | 0.6121 | 0.5818 | 0.4727 | 0.6121 | 0.6545 | 0.6667 | 0.6667 | 0.7152 | 0.8121 |
| ACC-high_school_geography | 0.3737 | 0.3283 | 0.3333 | 0.1667 | 0.3636 | 0.6414 | 0.3333 | 0.4899 | 0.5960 | 0.4899 | 0.5000 | 0.6616 | 0.6616 | 0.7121 | 0.7273 | 0.8636 |
| ACC-high_school_government_and_politics | 0.3782 | 0.2124 | 0.3575 | 0.2591 | 0.4352 | 0.6632 | 0.4611 | 0.6736 | 0.6632 | 0.5959 | 0.6425 | 0.8135 | 0.7617 | 0.7927 | 0.8187 | 0.9430 |
| ACC-high_school_macroeconomics | 0.3821 | 0.2718 | 0.3564 | 0.2615 | 0.3359 | 0.5359 | 0.3410 | 0.4513 | 0.4103 | 0.4282 | 0.4256 | 0.4923 | 0.4744 | 0.5641 | 0.5590 | 0.7308 |
| ACC-high_school_mathematics | 0.2778 | 0.2667 | 0.2407 | 0.2481 | 0.2333 | 0.3074 | 0.2630 | 0.2963 | 0.2556 | 0.2667 | 0.2593 | 0.2889 | 0.3037 | 0.3111 | 0.2741 | 0.3630 |
| ACC-high_school_microeconomics | 0.2941 | 0.3067 | 0.2941 | 0.2899 | 0.3697 | 0.5168 | 0.3319 | 0.4412 | 0.4328 | 0.4370 | 0.4454 | 0.5630 | 0.5042 | 0.5504 | 0.5588 | 0.7605 |
| ACC-high_school_physics | 0.2583 | 0.2649 | 0.2517 | 0.3179 | 0.2450 | 0.2980 | 0.2649 | 0.3179 | 0.3046 | 0.2980 | 0.2517 | 0.3444 | 0.3245 | 0.2914 | 0.3311 | 0.3907 |
| ACC-high_school_psychology | 0.2844 | 0.3229 | 0.3505 | 0.2440 | 0.4752 | 0.6771 | 0.4789 | 0.6312 | 0.6477 | 0.5486 | 0.5835 | 0.7413 | 0.7229 | 0.7541 | 0.7596 | 0.8752 |
| ACC-high_school_statistics | 0.4028 | 0.2454 | 0.3981 | 0.1852 | 0.1620 | 0.3657 | 0.3241 | 0.2778 | 0.3241 | 0.2546 | 0.2685 | 0.4722 | 0.3611 | 0.4630 | 0.4676 | 0.6157 |
| ACC-high_school_us_history | 0.2892 | 0.2255 | 0.3137 | 0.2892 | 0.4167 | 0.6863 | 0.3284 | 0.5245 | 0.6765 | 0.5490 | 0.5343 | 0.7108 | 0.6863 | 0.7108 | 0.7696 | 0.9069 |
| ACC-high_school_world_history | 0.2489 | 0.2785 | 0.2869 | 0.2996 | 0.3966 | 0.6667 | 0.4262 | 0.6245 | 0.6667 | 0.5105 | 0.6287 | 0.7089 | 0.7215 | 0.6835 | 0.7637 | 0.8608 |
| ACC-human_aging | 0.3274 | 0.1659 | 0.2870 | 0.4215 | 0.4260 | 0.5650 | 0.3991 | 0.5695 | 0.5695 | 0.5157 | 0.5112 | 0.6502 | 0.6816 | 0.7130 | 0.6861 | 0.7848 |
| ACC-human_sexuality | 0.3511 | 0.2519 | 0.2748 | 0.2901 | 0.3359 | 0.5802 | 0.3435 | 0.5649 | 0.4885 | 0.4962 | 0.5649 | 0.6031 | 0.5878 | 0.6794 | 0.6718 | 0.8550 |
| ACC-international_law | 0.3802 | 0.2231 | 0.3636 | 0.2479 | 0.5041 | 0.6860 | 0.5207 | 0.6529 | 0.5620 | 0.5207 | 0.6860 | 0.6860 | 0.7851 | 0.6612 | 0.7603 | 0.8595 |
| ACC-jurisprudence | 0.3704 | 0.2315 | 0.3426 | 0.3426 | 0.4074 | 0.6204 | 0.4167 | 0.5370 | 0.5833 | 0.4444 | 0.4722 | 0.6852 | 0.7037 | 0.6667 | 0.6574 | 0.8148 |
| ACC-logical_fallacies | 0.2945 | 0.2638 | 0.2883 | 0.2638 | 0.3558 | 0.6319 | 0.4172 | 0.5092 | 0.5399 | 0.4847 | 0.5031 | 0.6564 | 0.6319 | 0.6503 | 0.6994 | 0.7975 |
| ACC-machine_learning | 0.3125 | 0.2232 | 0.2321 | 0.3750 | 0.2589 | 0.3571 | 0.2768 | 0.3839 | 0.3393 | 0.3571 | 0.3304 | 0.3036 | 0.3482 | 0.3036 | 0.3750 | 0.5089 |
| ACC-management | 0.3301 | 0.2816 | 0.2524 | 0.2816 | 0.3010 | 0.6796 | 0.3301 | 0.5631 | 0.6699 | 0.5243 | 0.6311 | 0.7379 | 0.7184 | 0.7184 | 0.7573 | 0.8252 |
| ACC-marketing | 0.3120 | 0.2735 | 0.3761 | 0.2949 | 0.5385 | 0.7906 | 0.4615 | 0.6795 | 0.7265 | 0.5897 | 0.7094 | 0.8077 | 0.7821 | 0.7949 | 0.8333 | 0.8932 |
| ACC-medical_genetics | 0.3100 | 0.2400 | 0.2700 | 0.2800 | 0.3600 | 0.4800 | 0.3700 | 0.5500 | 0.5000 | 0.5100 | 0.5100 | 0.5500 | 0.5700 | 0.6200 | 0.6100 | 0.7400 |
| ACC-miscellaneous | 0.3001 | 0.2899 | 0.3678 | 0.2976 | 0.5326 | 0.6782 | 0.4278 | 0.6450 | 0.6692 | 0.5900 | 0.6296 | 0.7407 | 0.7458 | 0.7471 | 0.7752 | 0.8557 |
| ACC-moral_disputes | 0.2977 | 0.2659 | 0.3295 | 0.3092 | 0.3613 | 0.5983 | 0.4133 | 0.5116 | 0.5145 | 0.4798 | 0.4566 | 0.6272 | 0.5809 | 0.6503 | 0.6503 | 0.7572 |
| ACC-moral_scenarios | 0.2436 | 0.2469 | 0.2469 | 0.2492 | 0.2425 | 0.2436 | 0.2425 | 0.2380 | 0.2145 | 0.2715 | 0.2480 | 0.3464 | 0.2927 | 0.2615 | 0.3855 | 0.4413 |
| ACC-nutrition | 0.2810 | 0.2908 | 0.3301 | 0.2582 | 0.3431 | 0.4804 | 0.3922 | 0.4902 | 0.5098 | 0.3758 | 0.5163 | 0.6144 | 0.5980 | 0.6405 | 0.6471 | 0.7778 |
| ACC-philosophy | 0.3183 | 0.2830 | 0.2830 | 0.2830 | 0.3151 | 0.5177 | 0.4051 | 0.6013 | 0.5659 | 0.4662 | 0.5145 | 0.6656 | 0.6077 | 0.6399 | 0.6656 | 0.7781 |
| ACC-prehistory | 0.3056 | 0.3210 | 0.3210 | 0.3117 | 0.3488 | 0.5216 | 0.3519 | 0.4907 | 0.5679 | 0.5216 | 0.5093 | 0.6451 | 0.5926 | 0.5988 | 0.6667 | 0.8272 |
| ACC-professional_accounting | 0.2447 | 0.2872 | 0.2553 | 0.2979 | 0.3050 | 0.3723 | 0.2730 | 0.3582 | 0.3475 | 0.3050 | 0.3227 | 0.3830 | 0.3759 | 0.4255 | 0.4326 | 0.5780 |
| ACC-professional_law | 0.2784 | 0.2705 | 0.2523 | 0.2497 | 0.2647 | 0.3990 | 0.2973 | 0.3553 | 0.3266 | 0.3064 | 0.3566 | 0.4068 | 0.3722 | 0.4296 | 0.4342 | 0.5404 |
| ACC-professional_medicine | 0.2206 | 0.2059 | 0.2500 | 0.3125 | 0.4375 | 0.4412 | 0.4265 | 0.5184 | 0.3529 | 0.3860 | 0.5000 | 0.5221 | 0.4706 | 0.6176 | 0.5441 | 0.7390 |
| ACC-professional_psychology | 0.2876 | 0.2925 | 0.2696 | 0.2647 | 0.3203 | 0.4526 | 0.3546 | 0.4428 | 0.4739 | 0.3693 | 0.4575 | 0.5392 | 0.5065 | 0.5539 | 0.6144 | 0.7500 |
| ACC-public_relations | 0.3455 | 0.3182 | 0.4091 | 0.3364 | 0.4182 | 0.5909 | 0.4091 | 0.5273 | 0.5182 | 0.5273 | 0.5545 | 0.6364 | 0.6091 | 0.6364 | 0.6818 | 0.7273 |
| ACC-security_studies | 0.3796 | 0.2816 | 0.2939 | 0.3102 | 0.2531 | 0.6531 | 0.3306 | 0.4980 | 0.4571 | 0.4245 | 0.5224 | 0.6122 | 0.6531 | 0.6735 | 0.6367 | 0.8082 |
| ACC-sociology | 0.2239 | 0.2587 | 0.2488 | 0.3532 | 0.4826 | 0.7363 | 0.4726 | 0.6318 | 0.5771 | 0.5473 | 0.6418 | 0.7264 | 0.7214 | 0.7761 | 0.7761 | 0.8955 |
| ACC-us_foreign_policy | 0.3500 | 0.3200 | 0.3900 | 0.4200 | 0.5100 | 0.6600 | 0.4300 | 0.6500 | 0.6700 | 0.6100 | 0.7200 | 0.8500 | 0.7700 | 0.8000 | 0.8300 | 0.9100 |
| ACC-virology | 0.3494 | 0.2530 | 0.3494 | 0.3554 | 0.3735 | 0.4819 | 0.3253 | 0.4217 | 0.4277 | 0.4398 | 0.4096 | 0.4458 | 0.4940 | 0.4639 | 0.5000 | 0.5361 |
| ACC-world_religions | 0.3158 | 0.3041 | 0.4035 | 0.3333 | 0.6140 | 0.5614 | 0.4912 | 0.6842 | 0.6842 | 0.6550 | 0.6491 | 0.7602 | 0.7427 | 0.7719 | 0.7953 | 0.8538 |
| Average over subjects | 0.3022 | 0.2747 | 0.3053 | 0.2818 | 0.3515 | 0.5018 | 0.3569 | 0.4682 | 0.4662 | 0.4258 | 0.4559 | 0.5482 | 0.5340 | 0.5580 | 0.5741 | 0.6932 |
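
The table ends with two different aggregates: ACC-all in the first row, and a last row that, for MPT7B, equals the unweighted mean of the 57 subject rows (0.3022). The sketch below shows why the two can differ. It assumes that ACC-all pools every question across subjects, which is not confirmed by the text above; the function name and the question counts are illustrative and not taken from the evaluation script.

```python
from typing import Dict, Tuple


def aggregate(per_subject: Dict[str, Tuple[int, int]]) -> Tuple[float, float]:
    """Return (pooled accuracy, unweighted mean of per-subject accuracies).

    per_subject maps a subject name to (num_correct, num_questions).
    """
    total_correct = sum(correct for correct, _ in per_subject.values())
    total_questions = sum(count for _, count in per_subject.values())
    pooled = total_correct / total_questions
    unweighted = sum(c / n for c, n in per_subject.values()) / len(per_subject)
    return pooled, unweighted


# Illustrative counts only: a small subject and a much larger one.
counts = {
    "abstract_algebra": (22, 100),
    "professional_law": (427, 1534),
}
pooled, unweighted = aggregate(counts)
print(f"pooled={pooled:.4f} unweighted={unweighted:.4f}")
```

Under this assumption, large subjects dominate the pooled figure, while every subject counts equally in the unweighted mean, so the two rows need not agree.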