Adding verbal-reasoning-challenge as a Community Task #551
base: main
Conversation
evaluation_splits=["test"],
few_shots_split=None,
few_shots_select=None,
generation_size=2048,
I assume this can be changed? Very low for a reasoning model.
Yes, this can be changed at runtime when using vllm or litellm models!
Thanks for the contribution! It looks good. Is it possible to share results to see if they match the ones in the paper?
def _answer_without_thoughts(completion: str) -> str:
    if "<think>" not in completion[:200]:
Why check the first 200 characters only?
We are discussing this. Our expectation is that the `<think>` token should come early -- ideally immediately after the prompt. There are three cases:
- No `<think>` tags. We find that Gemini Thinking sometimes does not produce reasoning tokens even when asked for them. So, if we don't find `<think>`, we just return the whole response.
- There is a `<think>` but no `</think>`: the model is stuck "thinking forever", so we return `""` below.
- We find both `<think>` and `</think>`: we just return the text after `</think>`. This is the normal case.
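A minimal sketch of the logic described above (based only on the snippet and the three cases; the actual implementation in the PR may differ in details such as whitespace handling):

```python
def _answer_without_thoughts(completion: str) -> str:
    # Case 1: no <think> tag near the start of the response,
    # so treat the whole completion as the answer.
    if "<think>" not in completion[:200]:
        return completion
    # Case 2: an opening <think> with no closing </think> -- the model is
    # stuck "thinking forever", so there is no answer to extract.
    if "</think>" not in completion:
        return ""
    # Case 3 (normal case): return only the text after the last </think>.
    return completion.split("</think>")[-1].strip()
```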
Yes! I ran GPT-4o on the dataset using the LightEval framework, and the result (37/582 ≈ 0.06) is consistent with the paper's reported performance (36/582 ≈ 0.06). The questions that GPT-4o answered correctly are largely consistent with its results displayed on the Hugging Face space, with some variation due to sampling.
I am hoping to contribute verbal-reasoning-challenge as a new community task. Authored with @arjunguha.
The Verbal Reasoning Challenge is a dataset designed to evaluate the reasoning abilities of Large Language Models. It is based on the "off-air challenges" from the NPR Sunday Puzzle Challenge, which are designed to be understandable by any adult in the United States.
Link to the paper: https://arxiv.org/abs/2502.01584
Link to the dataset: https://huggingface.co/datasets/nuprl/verbal-reasoning-challenge
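For readers unfamiliar with LightEval community tasks, here is a rough sketch of what such a task definition might look like. Only the four fields visible in the diff above (`evaluation_splits`, `few_shots_split`, `few_shots_select`, `generation_size`) come from this PR, and the repo name comes from the dataset link; everything else (prompt function, column names, metric, subset) is an illustrative assumption, not the PR's actual code.

```python
# Sketch of a LightEval community task definition -- assumptions noted inline.
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


def vrc_prompt_fn(line, task_name: str = None) -> Doc:
    # Column names "challenge" and "answer" are assumed for illustration.
    return Doc(
        task_name=task_name,
        query=line["challenge"],
        choices=[line["answer"]],
        gold_index=0,
    )


vrc_task = LightevalTaskConfig(
    name="verbal-reasoning-challenge",
    prompt_function=vrc_prompt_fn,
    suite=["community"],
    hf_repo="nuprl/verbal-reasoning-challenge",
    hf_subset="default",            # assumed subset name
    hf_avail_splits=["test"],
    evaluation_splits=["test"],     # as in the diff above
    few_shots_split=None,
    few_shots_select=None,
    generation_size=2048,           # overridable at runtime, per the discussion above
    metric=[Metrics.exact_match],   # assumed; the PR may use a custom metric
)

# Community task modules expose their configs through TASKS_TABLE.
TASKS_TABLE = [vrc_task]
```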