| Benchmark | Model | AGIEval | BigBench | GPT4All | TruthfulQA | Average |
|---|---|---|---|---|---|---|
| nous | bert-tiny-finetuned-sms-spam-detection | 22.95 | 28.75 | 36.07 | 48.09 | 33.97 |

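These are Nous-suite averages in the per-task format produced by EleutherAI's lm-evaluation-harness; the full breakdowns follow. Most scores sit near the random-choice baseline (roughly 20-25% for 4- and 5-way multiple choice, roughly 50% for binary tasks such as boolq and winogrande), which is expected for a tiny encoder model fine-tuned for spam detection. As a hedged sketch only (the harness version, full Hugging Face repo id, and run arguments are not recorded here, so everything below is an assumption), a comparable run can be driven from Python:

```python
# Sketch only: assumes EleutherAI's lm-evaluation-harness (lm-eval >= 0.4)
# and a zero-shot setup; the actual settings for this run are not recorded.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    # full hub id assumed; substitute the real path to the checkpoint
    model_args="pretrained=<org>/bert-tiny-finetuned-sms-spam-detection",
    tasks=["arc_challenge", "arc_easy", "boolq", "hellaswag",
           "openbookqa", "piqa", "winogrande"],  # the GPT4All group
    num_fewshot=0,
    device="cuda:0",
)
print(results["results"])  # per-task acc / acc_norm, as tabulated below
```
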
AGIEval:

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 20.87 | ± 2.55 |
| | | acc_norm | 21.26 | ± 2.57 |
| agieval_logiqa_en | 0 | acc | 22.43 | ± 1.64 |
| | | acc_norm | 26.88 | ± 1.74 |
| agieval_lsat_ar | 0 | acc | 19.13 | ± 2.60 |
| | | acc_norm | 20.87 | ± 2.69 |
| agieval_lsat_lr | 0 | acc | 14.90 | ± 1.58 |
| | | acc_norm | 21.57 | ± 1.82 |
| agieval_lsat_rc | 0 | acc | 21.19 | ± 2.50 |
| | | acc_norm | 15.61 | ± 2.22 |
| agieval_sat_en | 0 | acc | 26.70 | ± 3.09 |
| | | acc_norm | 26.21 | ± 3.07 |
| agieval_sat_en_without_passage | 0 | acc | 25.24 | ± 3.03 |
| | | acc_norm | 25.73 | ± 3.05 |
| agieval_sat_math | 0 | acc | 24.09 | ± 2.89 |
| | | acc_norm | 25.45 | ± 2.94 |

Average: 22.95%
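
The 22.95% figure is the unweighted mean of the acc_norm rows above (acc is reported but not averaged); a quick check using the table's own numbers:

```python
# acc_norm values from the AGIEval table, in row order
acc_norm = [21.26, 26.88, 20.87, 21.57, 15.61, 26.21, 25.73, 25.45]
print(sum(acc_norm) / len(acc_norm))  # ≈ 22.95
```
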
BigBench:

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 47.89 | ± 3.63 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 9.49 | ± 1.53 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 30.23 | ± 2.86 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 10.03 | ± 1.59 |
| | | exact_str_match | 0.00 | ± 0.00 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 19.60 | ± 1.78 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 14.57 | ± 1.33 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 33.67 | ± 2.73 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 27.80 | ± 2.01 |
| bigbench_navigate | 0 | multiple_choice_grade | 48.90 | ± 1.58 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 13.45 | ± 0.76 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 54.24 | ± 2.36 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 16.83 | ± 1.18 |
| bigbench_snarks | 0 | multiple_choice_grade | 45.86 | ± 3.71 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 49.70 | ± 1.59 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 27.10 | ± 1.41 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 20.16 | ± 1.14 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 14.29 | ± 0.84 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 33.67 | ± 2.73 |

Average: 28.75%
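
Likewise, 28.75% is the plain mean of the 18 multiple_choice_grade values; the exact_str_match row under bigbench_geometric_shapes is reported but excluded from the average (including it would give ≈27.24% instead):

```python
# multiple_choice_grade values from the BigBench table, in row order
mcg = [47.89, 9.49, 30.23, 10.03, 19.60, 14.57, 33.67, 27.80, 48.90,
       13.45, 54.24, 16.83, 45.86, 49.70, 27.10, 20.16, 14.29, 33.67]
print(sum(mcg) / len(mcg))  # ≈ 28.75
```
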
GPT4All:

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| arc_challenge | 0 | acc | 22.18 | ± 1.21 |
| | | acc_norm | 25.85 | ± 1.28 |
| arc_easy | 0 | acc | 25.08 | ± 0.89 |
| | | acc_norm | 24.03 | ± 0.88 |
| boolq | 1 | acc | 47.86 | ± 0.87 |
| hellaswag | 0 | acc | 25.69 | ± 0.44 |
| | | acc_norm | 26.31 | ± 0.44 |
| openbookqa | 0 | acc | 17.40 | ± 1.70 |
| | | acc_norm | 29.60 | ± 2.04 |
| piqa | 0 | acc | 52.12 | ± 1.17 |
| | | acc_norm | 49.40 | ± 1.17 |
| winogrande | 0 | acc | 49.41 | ± 1.41 |

Average: 36.07%
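
The GPT4All average mixes metrics: acc_norm where it is reported, plain acc for boolq and winogrande (the only metric those tasks emit). This matches the reported 36.07%:

```python
# per-task scores used in the GPT4All average
scores = {
    "arc_challenge": 25.85,  # acc_norm
    "arc_easy":      24.03,  # acc_norm
    "boolq":         47.86,  # acc (only metric reported)
    "hellaswag":     26.31,  # acc_norm
    "openbookqa":    29.60,  # acc_norm
    "piqa":          49.40,  # acc_norm
    "winogrande":    49.41,  # acc (only metric reported)
}
print(sum(scores.values()) / len(scores))  # ≈ 36.07
```
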
TruthfulQA:

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| truthfulqa_mc | 1 | mc1 | 22.52 | ± 1.46 |
| | | mc2 | 48.09 | ± 1.59 |

Average: 48.09%
Average score: 33.97%
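
The TruthfulQA entry is the mc2 score alone (mc1 is reported but not averaged), and the headline 33.97% is the unweighted mean of the four suite averages:

```python
# suite averages from the summary table at the top
suite_avgs = {"agieval": 22.95, "bigbench": 28.75,
              "gpt4all": 36.07, "truthfulqa": 48.09}
print(sum(suite_avgs.values()) / len(suite_avgs))  # ≈ 33.97
```
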
Metadata: {'elapsed_time': '02:32:24', 'gpu': 'NVIDIA GeForce RTX 4090'}