This repository contains the code and data associated with our paper, "NoLiMa: Long-Context Evaluation Beyond Literal Matching".
Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information.
Models | Claimed Length | Effective Length | Base Score (×0.85: Thr.) | 1K | 2K | 4K | 8K | 16K | 32K |
---|---|---|---|---|---|---|---|---|---|
GPT-4o | 128K | 8K | 99.3 (84.4) | 98.1 | 98.0 | 95.7 | 89.2 | 81.6 | 69.7 |
Llama 3.3 70B | 128K | 2K | 97.3 (82.7) | 94.2 | 87.4 | 81.5 | 72.1 | 59.5 | 42.7 |
Llama 3.1 405B | 128K | 2K | 94.7 (80.5) | 89.0 | 85.0 | 74.5 | 60.1 | 48.4 | 38.0 |
Llama 3.1 70B | 128K | 2K | 94.5 (80.3) | 91.0 | 81.8 | 71.2 | 62.7 | 51.8 | 43.2 |
Gemini 1.5 Pro | 2M | 2K | 92.6 (78.7) | 86.4 | 82.7 | 75.4 | 63.9 | 55.5 | 48.2 |
Jamba 1.5 Mini | 256K | <1K | 92.4 (78.6) | 76.3 | 74.1 | 70.8 | 62.2 | 52.7 | 43.6 |
Command R+ | 128K | <1K | 90.9 (77.3) | 77.0 | 73.5 | 66.3 | 39.5 | 21.3 | 7.4 |
Mistral Large 2 | 128K | 2K | 87.9 (74.7) | 86.1 | 85.5 | 73.3 | 51.5 | 32.6 | 18.7 |
Claude 3.5 Sonnet | 200K | 4K | 87.6 (74.4) | 85.4 | 84.0 | 77.6 | 61.7 | 45.7 | 29.8 |
Gemini 1.5 Flash | 1M | <1K | 84.7 (72.0) | 68.6 | 61.6 | 51.0 | 44.4 | 35.5 | 28.6 |
GPT-4o mini | 128K | <1K | 84.9 (72.2) | 67.7 | 58.2 | 44.1 | 32.6 | 20.6 | 13.7 |
Llama 3.1 8B | 128K | 1K | 76.7 (65.2) | 65.7 | 54.4 | 44.1 | 31.9 | 22.6 | 14.2 |
This table presents the performance results of selected models on NoLiMa tests. The base score represents a model's accuracy on the task at short contexts (250, 500, and 1K tokens) and serves as a controlled reference for measuring performance degradation at longer contexts. The effective length is defined as the longest context at which a model maintains at least 85% of its base score. Scores above this threshold are underlined, while scores dropping below 50% of the base score are italicized.
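The effective-length rule described above can be sketched in a few lines of Python, using the GPT-4o row of the table as input (the function name is illustrative, not part of the repository's code):

```python
def effective_length(base_score, scores, ratio=0.85):
    """Longest context length at which the model keeps >= ratio of its base score.

    scores: dict mapping context length (in tokens) -> accuracy.
    Returns None if no length clears the threshold.
    """
    threshold = base_score * ratio
    passing = [length for length, score in scores.items() if score >= threshold]
    return max(passing) if passing else None

# Per-length scores for GPT-4o, taken from the table above.
gpt4o = {1_000: 98.1, 2_000: 98.0, 4_000: 95.7, 8_000: 89.2, 16_000: 81.6, 32_000: 69.7}
print(effective_length(99.3, gpt4o))  # -> 8000, matching the 8K effective length in the table
```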
Models | Base Score | 4K | 8K | 16K | 32K |
---|---|---|---|---|---|
Llama 3.3 70B | | | | | |
- w/o CoT | 98.3 | 55.5 | 37.2 | 16.7 | 8.9 |
- w/ CoT | 97.1 | 73.0 | 51.2 | 31.8 | 10.1 |
Reasoning Models | | | | | |
GPT-o1 | 99.9 | 92.0 | 78.0 | 60.1 | 31.1 |
GPT-o3 Mini | 98.8 | 52.8 | 36.9 | 25.5 | 18.9 |
DeepSeek R1-Distill-Llama-70B | 99.9 | 91.4 | 75.5 | 49.4 | 20.7 |
This table presents the performance results of selected reasoning models on NoLiMa-Hard, a subset of the original NoLiMa needle set containing the 10 most challenging question-needle pairs from our previous evaluations. Scores dropping below 50% of the base score are italicized.
Below are the general steps to evaluate models, whether serving them locally or using an API-based service.
1. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

2. Download the NoLiMa dataset:

   ```bash
   data/download_NoLiMa_data.sh
   ```

   The needle set and haystack data will be downloaded from our HuggingFace Datasets 🤗 repository into the `data` directory.

3. Start the model server (optional):

   For example, to serve the Meta Llama 3.3 (70B) model across 8 GPUs:

   ```bash
   evaluation/vllm_serve.sh --model_name meta-llama/Llama-3.3-70B-Instruct --num_gpus 8
   ```

   This script uses a tensor-parallel configuration by default. Modify it as needed.

4. Create or modify a model configuration:

   - For locally served models, use `llama_3.3_70b.json` in the `evaluation/model_configs` folder as a reference. Note that this configuration file is used by the evaluation script, not by the vLLM server.
   - For API-based services, use the existing config templates in the `evaluation/model_configs` folder. Some APIs may require additional credentials or authentication (AWS, Google Auth, etc.).

5. Prepare test configuration files:

   Add or modify configuration files in the `evaluation/run_config` directory, and ensure they reference the correct model config file from `evaluation/model_configs`.

6. Run the evaluations:

   ```bash
   evaluation/run_tests.sh
   ```

7. Collect the results:

   All outputs are automatically saved to the results directory specified in each run_config file.

8. Gather the results:

   Using the `evaluation/gather_results.ipynb` notebook, you can easily gather the results from the output files and generate a CSV file containing the accuracy of each test.
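If you prefer a script over the notebook, the aggregation step can be sketched as follows. This is a minimal sketch under assumptions: the file layout and the `correct` field are hypothetical, and the actual output schema is defined by the evaluation scripts, so adjust the field names and glob pattern to match your run_config outputs.

```python
import csv
import glob
import json
import os

def gather_results(results_dir, out_csv):
    """Compute per-test accuracy from JSON result files and write them to a CSV.

    Assumes each result file is a JSON list of records with a boolean
    "correct" field (hypothetical schema; adapt to the real outputs).
    """
    rows = []
    for path in sorted(glob.glob(os.path.join(results_dir, "*.json"))):
        with open(path) as f:
            records = json.load(f)
        accuracy = sum(r["correct"] for r in records) / len(records)
        rows.append({"test": os.path.basename(path), "accuracy": accuracy})
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["test", "accuracy"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```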
You can find various needle sets (e.g., CoT-style, multiple choice, direct, distractor-included) in `data/needlesets`.
Adjust any paths or configurations as needed for your specific environment.
To replicate our evaluation results, you can directly use the shuffled texts available in the `data/haystack/rand_shuffle` directory. If you prefer to generate your own shuffled texts or run the full processing pipeline from scratch, refer to the `data/README.md` file for more information.
If you use the NoLiMa dataset, filtering pipeline, or code from this repository, please cite the paper:
```bibtex
@misc{modarressi2025nolimalongcontextevaluationliteral,
      title={NoLiMa: Long-Context Evaluation Beyond Literal Matching},
      author={Ali Modarressi and Hanieh Deilamsalehy and Franck Dernoncourt and Trung Bui and Ryan A. Rossi and Seunghyun Yoon and Hinrich Schütze},
      year={2025},
      eprint={2502.05167},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.05167},
}
```
The evaluation code and needle set data are licensed under the Adobe Research License, which prohibits commercial use and allows non-commercial research use. For details about the haystack data, please refer to the `data/haystack/LICENSES.md` file.