This repository contains the code and data associated with our paper, "NoLiMa: Long-Context Evaluation Beyond Literal Matching".
Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information.
Models | Claimed Length | Effective Length | Base Score (×0.85: Thr.) | 1K | 2K | 4K | 8K | 16K | 32K |
---|---|---|---|---|---|---|---|---|---|
GPT-4o | 128K | 8K | 99.3 (84.4) | 98.1 | 98.0 | 95.7 | 89.2 | 81.6 | 69.7 |
Llama 3.3 70B | 128K | 2K | 97.3 (82.7) | 94.2 | 87.4 | 81.5 | 72.1 | 59.5 | 42.7 |
Llama 3.1 405B | 128K | 2K | 94.7 (80.5) | 89.0 | 85.0 | 74.5 | 60.1 | 48.4 | 38.0 |
Llama 3.1 70B | 128K | 2K | 94.5 (80.3) | 91.0 | 81.8 | 71.2 | 62.7 | 51.8 | 43.2 |
Gemini 1.5 Pro | 2M | 2K | 92.6 (78.7) | 86.4 | 82.7 | 75.4 | 63.9 | 55.5 | 48.2 |
Jamba 1.5 Mini | 256K | <1K | 92.4 (78.6) | 76.3 | 74.1 | 70.8 | 62.2 | 52.7 | 43.6 |
Command R+ | 128K | <1K | 90.9 (77.3) | 77.0 | 73.5 | 66.3 | 39.5 | 21.3 | 7.4 |
Mistral Large 2 | 128K | 2K | 87.9 (74.7) | 86.1 | 85.5 | 73.3 | 51.5 | 32.6 | 18.7 |
Claude 3.5 Sonnet | 200K | 4K | 87.6 (74.4) | 85.4 | 84.0 | 77.6 | 61.7 | 45.7 | 29.8 |
Gemini 1.5 Flash | 1M | <1K | 84.7 (72.0) | 68.6 | 61.6 | 51.0 | 44.4 | 35.5 | 28.6 |
GPT-4o mini | 128K | <1K | 84.9 (72.2) | 67.7 | 58.2 | 44.1 | 32.6 | 20.6 | 13.7 |
Llama 3.1 8B | 128K | 1K | 76.7 (65.2) | 65.7 | 54.4 | 44.1 | 31.9 | 22.6 | 14.2 |
This table presents the performance results of selected models on NoLiMa tests. The base score represents a model's accuracy on the task at short contexts (250, 500, and 1K tokens) and serves as a controlled reference for measuring performance degradation at longer contexts. The effective length is defined as the longest context at which a model maintains at least 85% of its base score. Scores above this threshold are underlined, while scores dropping below 50% of the base score are italicized.
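The effective-length rule described above can be sketched in a few lines of Python, using the GPT-4o row of the table as input (the function name is illustrative, not part of the repository's code):

```python
def effective_length(base_score, scores, ratio=0.85):
    """Longest context length at which the model keeps >= ratio of its base score.

    scores: dict mapping context length (in tokens) -> accuracy.
    Returns None if no length clears the threshold.
    """
    threshold = base_score * ratio
    passing = [length for length, score in scores.items() if score >= threshold]
    return max(passing) if passing else None

# Per-length scores for GPT-4o, taken from the table above.
gpt4o = {1_000: 98.1, 2_000: 98.0, 4_000: 95.7, 8_000: 89.2, 16_000: 81.6, 32_000: 69.7}
print(effective_length(99.3, gpt4o))  # -> 8000, matching the 8K effective length in the table
```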
Models | Base Score | 4K | 8K | 16K | 32K |
---|---|---|---|---|---|
Llama 3.3 70B | | | | | |
- w/o CoT | 98.3 | 55.5 | 37.2 | 16.7 | 8.9 |
- w/ CoT | 97.1 | 73.0 | 51.2 | 31.8 | 10.1 |
Reasoning Models | | | | | |
GPT-o1 | 99.9 | 92.0 | 78.0 | 60.1 | 31.1 |
GPT-o3 Mini | 98.8 | 52.8 | 36.9 | 25.5 | 18.9 |
DeepSeek R1-Distill-Llama-70B | 99.9 | 91.4 | 75.5 | 49.4 | 20.7 |
This table presents the performance results of selected reasoning models on NoLiMa-Hard, a subset of the original NoLiMa needle set containing the 10 most challenging question-needle pairs from our previous evaluations. Scores dropping below 50% of the base score are italicized.
Below are the general steps to evaluate models, whether serving them locally or using an API-based service.
1. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

2. Download the NoLiMa dataset:

   ```bash
   data/download_NoLiMa_data.sh
   ```

   The needle set and haystack data will be downloaded from our HuggingFace Datasets 🤗 repository into the `data` directory.

3. Start the model server (optional):

   For example, to serve the Meta Llama 3.3 (70B) model across 8 GPUs:

   ```bash
   evaluation/vllm_serve.sh --model_name meta-llama/Llama-3.3-70B-Instruct --num_gpus 8
   ```

   This script uses a tensor-parallel configuration by default. Modify it as needed.

4. Create or modify a model configuration:

   - For locally served models, use `llama_3.3_70b.json` in the `evaluation/model_configs` folder as a reference. Note that this configuration file is used by the evaluation script, not by the vLLM server.
   - For API-based services, use the existing config templates in the `evaluation/model_configs` folder. Some APIs may require additional credentials or authentication (AWS, Google Auth, etc.).

5. Prepare test configuration files:

   Add or modify configuration files in the `evaluation/run_config` directory, and ensure they reference the correct model config file from `evaluation/model_configs`.

6. Run the evaluations:

   ```bash
   evaluation/run_tests.sh
   ```

7. Collect the results:

   All outputs are automatically saved to the results directory specified in each run_config file.

8. Gather the results:

   Using the `evaluation/gather_results.ipynb` notebook, you can easily gather the results from the output files and generate a CSV file containing the accuracy of each test.
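If you prefer a script over the notebook, the aggregation step can be sketched as follows. This is a minimal sketch under assumptions: the file layout and the `correct` field are hypothetical, and the actual output schema is defined by the evaluation scripts, so adjust the field names and glob pattern to match your run_config outputs.

```python
import csv
import glob
import json
import os

def gather_results(results_dir, out_csv):
    """Compute per-test accuracy from JSON result files and write them to a CSV.

    Assumes each result file is a JSON list of records with a boolean
    "correct" field (hypothetical schema; adapt to the real outputs).
    """
    rows = []
    for path in sorted(glob.glob(os.path.join(results_dir, "*.json"))):
        with open(path) as f:
            records = json.load(f)
        accuracy = sum(r["correct"] for r in records) / len(records)
        rows.append({"test": os.path.basename(path), "accuracy": accuracy})
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["test", "accuracy"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```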
You can find various needle sets (e.g., CoT-style, multiple choice, direct, distractor-included) in `data/needlesets`.
Adjust any paths or configurations as needed for your specific environment.
To replicate our evaluation results, you can directly use the shuffled texts available in the `data/haystack/rand_shuffle` directory. If you prefer to generate your own shuffled texts or run the full processing pipeline from scratch, refer to the `data/README.md` file for more information.
If you use the NoLiMa dataset, filtering pipeline, or code from this repository, please cite the paper:
```bibtex
@misc{modarressi2025nolimalongcontextevaluationliteral,
      title={NoLiMa: Long-Context Evaluation Beyond Literal Matching},
      author={Ali Modarressi and Hanieh Deilamsalehy and Franck Dernoncourt and Trung Bui and Ryan A. Rossi and Seunghyun Yoon and Hinrich Schütze},
      year={2025},
      eprint={2502.05167},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.05167},
}
```
The evaluation code and needle set data are licensed under the Adobe Research License, which prohibits commercial use and allows non-commercial research use. For details about the haystack data, please refer to the `data/haystack/LICENSES.md` file.