Adding verbal-reasoning-challenge as a Community Task #551

Open · wants to merge 1 commit into main

Conversation

aryawu0513

I am hoping to contribute verbal-reasoning-challenge as a new community task. Authored with @arjunguha.

The Verbal Reasoning Challenge is a dataset designed to evaluate the reasoning abilities of Large Language Models. It is based on the "off-air challenges" from the NPR Sunday Puzzle Challenge, which are designed to be understandable by any adult in the United States. 

Link to the paper: https://arxiv.org/abs/2502.01584
Link to the dataset: https://huggingface.co/datasets/nuprl/verbal-reasoning-challenge

    evaluation_splits=["test"],
    few_shots_split=None,
    few_shots_select=None,
    generation_size=2048,

I assume this can be changed? Very low for a reasoning model.

@NathanHB (Member) · Feb 12, 2025


Yes, this can be changed at runtime when using vllm or litellm models!
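
(For context, the fragment above is from the PR's LightevalTaskConfig. Below is a minimal sketch of what a full community-task definition along these lines might look like. This is a sketch only: the prompt function body, dataset column names, and metric are assumptions rather than the PR's actual code, and the field names follow lighteval's community-task template, which varies between versions.)

from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc

def prompt_fn(line: dict, task_name: str = None) -> Doc:
    # Hypothetical mapping from a dataset row to a Doc; the actual PR
    # defines its own prompt function, and the column names may differ.
    return Doc(
        task_name=task_name,
        query=line["challenge"],   # assumed column name
        choices=[line["answer"]],  # assumed column name
        gold_index=0,
    )

task = LightevalTaskConfig(
    name="verbal-reasoning-challenge",
    suite=["community"],
    prompt_function=prompt_fn,
    hf_repo="nuprl/verbal-reasoning-challenge",
    hf_subset="default",
    hf_avail_splits=["test"],
    evaluation_splits=["test"],
    few_shots_split=None,
    few_shots_select=None,
    generation_size=2048,
    metric=[],  # placeholder; the PR wires up its own metric
)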

@NathanHB (Member) left a comment


Thanks for the contribution! It looks good. Is it possible to share results, to see if they match the ones in the paper?



def _answer_without_thoughts(completion: str) -> str:
    if "<think>" not in completion[:200]:
Member


Why check the first 200 characters only?


We are discussing this. Our expectation is that the <think> token should come early, ideally immediately after the prompt. There are three cases (see the sketch after this list):

  1. No <think> tags. We find that Gemini Thinking sometimes does not produce reasoning tokens even when asked for them. So, if we don't find <think>, we just return the whole response.
  2. There is a <think> but no </think>: the model is stuck "thinking forever", so we return "" below.
  3. We find both a <think> and a </think>: we just return the text after </think>. This is the normal case.
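
A sketch of how the helper implements these three cases (reconstructed from this discussion; the actual PR code may differ in its details):

def _answer_without_thoughts(completion: str) -> str:
    # Case 1: no <think> tag near the start of the response, so treat
    # the whole completion as the answer.
    if "<think>" not in completion[:200]:
        return completion
    # Case 2: <think> was opened but never closed; the model is stuck
    # "thinking forever", so there is no answer to extract.
    if "</think>" not in completion:
        return ""
    # Case 3 (the normal case): return the text after </think>.
    return completion.split("</think>", maxsplit=1)[1].strip()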

@aryawu0513 (Author)

> Thanks for the contribution! It looks good. Is it possible to share results, to see if they match the ones in the paper?

Yes! I ran GPT-4o on the dataset using the LightEval framework, and the result (37/582 ≈ 0.06) is consistent with the paper's reported performance (36/582 ≈ 0.06). The questions that GPT-4o answered correctly are largely consistent with its results displayed on the Hugging Face space, with some variation due to sampling.
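
(For reference, the kind of per-question check this count implies, as an illustrative sketch only: the normalization and the matching rule below are assumptions, not the PR's actual metric.)

def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace before comparing.
    return " ".join(text.lower().split())

def is_correct(completion: str, gold: str) -> bool:
    # Strip the reasoning block, then check for the gold answer.
    answer = _answer_without_thoughts(completion)
    return _normalize(gold) in _normalize(answer)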
