Adding verbal-reasoning-challenge as a Community Task #551

Open · wants to merge 1 commit into main

Conversation

aryawu0513

I am hoping to contribute verbal-reasoning-challenge as a new community task. Authored with @arjunguha.

The Verbal Reasoning Challenge is a dataset designed to evaluate the reasoning abilities of Large Language Models. It is based on the "off-air challenges" from the NPR Sunday Puzzle Challenge, which are designed to be understandable by any adult in the United States. 

Link to the paper: https://arxiv.org/abs/2502.01584
Link to the dataset: https://huggingface.co/datasets/nuprl/verbal-reasoning-challenge

    evaluation_splits=["test"],
    few_shots_split=None,
    few_shots_select=None,
    generation_size=2048,

I assume this can be changed? Very low for a reasoning model.

@NathanHB (Member) · Feb 12, 2025


Yes, this can be changed at runtime when using vllm or litellm models!
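
(For context, the fragment above is from the PR's LightevalTaskConfig. Below is a minimal sketch of what a full community-task definition along these lines might look like. This is a sketch only: the prompt function body, dataset column names, and metric are assumptions rather than the PR's actual code, and the field names follow lighteval's community-task template, which varies between versions.)

from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc

def prompt_fn(line: dict, task_name: str = None) -> Doc:
    # Hypothetical mapping from a dataset row to a Doc; the actual PR
    # defines its own prompt function, and the column names may differ.
    return Doc(
        task_name=task_name,
        query=line["challenge"],   # assumed column name
        choices=[line["answer"]],  # assumed column name
        gold_index=0,
    )

task = LightevalTaskConfig(
    name="verbal-reasoning-challenge",
    suite=["community"],
    prompt_function=prompt_fn,
    hf_repo="nuprl/verbal-reasoning-challenge",
    hf_subset="default",
    hf_avail_splits=["test"],
    evaluation_splits=["test"],
    few_shots_split=None,
    few_shots_select=None,
    generation_size=2048,
    metric=[],  # placeholder; the PR wires up its own metric
)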

@NathanHB (Member) left a comment


Thanks for the contribution! It looks good. Is it possible to share results, to see if they match the ones in the paper?



def _answer_without_thoughts(completion: str) -> str:
    if "<think>" not in completion[:200]:
Member


Why check the first 200 characters only?


We are discussing this. Our expectation is that the <think> token should come early, ideally immediately after the prompt. There are three cases (see the sketch after this list):

  1. No <think> tags. We find that Gemini Thinking sometimes does not produce reasoning tokens even when asked for them. So, if we don't find <think>, we just return the whole response.
  2. There is a <think> but no </think>: the model is stuck "thinking forever", so we return "" below.
  3. We find both a <think> and a </think>: we just return the text after </think>. This is the normal case.
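
A sketch of how the helper implements these three cases (reconstructed from this discussion; the actual PR code may differ in its details):

def _answer_without_thoughts(completion: str) -> str:
    # Case 1: no <think> tag near the start of the response, so treat
    # the whole completion as the answer.
    if "<think>" not in completion[:200]:
        return completion
    # Case 2: <think> was opened but never closed; the model is stuck
    # "thinking forever", so there is no answer to extract.
    if "</think>" not in completion:
        return ""
    # Case 3 (the normal case): return the text after </think>.
    return completion.split("</think>", maxsplit=1)[1].strip()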

@aryawu0513 (Author)

> Thanks for the contribution! It looks good. Is it possible to share results, to see if they match the ones in the paper?

Yes! I ran GPT-4o on the dataset using the LightEval framework, and the result (37/582 ≈ 0.06) is consistent with the paper's reported performance (36/582 ≈ 0.06). The questions that GPT-4o answered correctly are largely consistent with its results displayed on the Hugging Face space, with some variation due to sampling.
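
(For reference, the kind of per-question check this count implies, as an illustrative sketch only: the normalization and the matching rule below are assumptions, not the PR's actual metric.)

def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace before comparing.
    return " ".join(text.lower().split())

def is_correct(completion: str, gold: str) -> bool:
    # Strip the reasoning block, then check for the gold answer.
    answer = _answer_without_thoughts(completion)
    return _normalize(gold) in _normalize(answer)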
