Summary of a Haystack

This repository contains the data and code for the experiments in the SummHay EMNLP 2024 paper.

Figure: High-level diagram of the data synthesis process followed to generate Haystacks with accompanying queries, used to evaluate both RAG systems and long-context LLMs.

Table: Summary of a Haystack results for human performance, RAG systems, and long-context LLMs. Results are reported using three metrics: Coverage (left), Citation (center), and Joint (right) scores. `Full` corresponds to model performance when inputting the entire Haystack, whereas Rand, Vect, LongE, KWs, RR3, and Orac correspond to the retrieval components of RAG systems. Models are ranked by Oracle Joint Score. For each model, #W_b reports the average number of words per bullet point.

Accessing the Data

We publicly release the 10 Haystacks (5 in the conversational domain, 5 in the news domain). You can access the Haystacks via Hugging Face datasets:

from datasets import load_dataset

dataset = load_dataset("Salesforce/summary-of-a-haystack")['train']

They are also provided in the data/ folder. There is one Haystack per .json file.
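For working with the local copies, a minimal sketch of loading every Haystack file into memory could look like the following. The `load_haystacks` helper and the `topic_*.json` file pattern are assumptions for illustration (the pattern is inferred from the `data/topic_news1.json` example used later in this README); they are not part of the repository code.

```python
import json
from pathlib import Path


def load_haystacks(data_dir="data", pattern="topic_*.json"):
    """Load each matching JSON file in data_dir into a dict keyed by file name.

    Hypothetical helper: the default pattern assumes Haystack files are named
    like topic_news1.json, which keeps other JSON files in data/ out of the result.
    """
    haystacks = {}
    for path in sorted(Path(data_dir).glob(pattern)):
        with path.open(encoding="utf-8") as f:
            haystacks[path.name] = json.load(f)
    return haystacks
```

Calling `load_haystacks()` from the repository root would then return one dictionary per Haystack, each following the schema described in this section.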

Each Haystack uses the following schema:

{
    "topic_id": "ObjectId()",
    "topic": "",
    "topic_metadata": {"participants": []}, // can be domain specific
    "subtopics": [
        {
            "subtopic_id": "ObjectId()",
            "subtopic_name": "",
            "subtopic": "",
            "insights": [
                {
                    "insight_id": "ObjectId()",
                    "insight_name": "",
                    "insight": ""
                }
            ],
            "query": "question reformulation of the subtopic",
            "retriever": {
                "retriever_method": {
                    "document_id": "0|1"
                }
            },
            "summaries": {
                "summarization_method_xyz": ["line1", "line2", "line3"],
                "{retriever}-{llm_summarizer}": ["line1", "line2", "line3"],
                "summarization_method_abc": ["line1", "line2", "line3"]
            },
            "eval_summaries": {
                "summarization_method_xyz": [
                    {
                        "insight_id": "",
                        "coverage": "NO_COVERAGE|PARTIAL_COVERAGE|FULL_COVERAGE",
                        "bullet_id": "line_number"
                    }
                ]
            }
        }
    ],
    "documents": [
        {
            "document_id": "ObjectId()",
            "document_text": "",
            "document_metadata": [], // domain specific information
            "insights_included": [] // list of insight_ids
        }
    ]
}
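To make the schema more concrete, the sketch below walks a Haystack dictionary in two ways: it counts how many documents include each insight (via `insights_included`), and it tallies the coverage labels stored under `eval_summaries` for one summarization method. The helper names and the toy Haystack are made up for illustration, and the toy data only contains the fields the helpers read.

```python
from collections import Counter


def insight_document_counts(haystack):
    """Map each insight_id to the number of documents listing it in insights_included."""
    counts = Counter()
    for document in haystack["documents"]:
        # set() guards against an insight being listed twice in one document.
        counts.update(set(document["insights_included"]))
    # Insights that appear in no document still get an explicit count of 0.
    for subtopic in haystack["subtopics"]:
        for insight in subtopic["insights"]:
            counts.setdefault(insight["insight_id"], 0)
    return dict(counts)


def coverage_label_counts(subtopic, method):
    """Tally the coverage labels recorded for one summarization method."""
    entries = subtopic.get("eval_summaries", {}).get(method, [])
    return Counter(entry["coverage"] for entry in entries)


# Made-up toy Haystack that follows the schema; it is not real benchmark data.
toy_haystack = {
    "subtopics": [
        {
            "subtopic_id": "s1",
            "insights": [
                {"insight_id": "i1"},
                {"insight_id": "i2"},
                {"insight_id": "i3"},
            ],
            "eval_summaries": {
                "method_a": [
                    {"insight_id": "i1", "coverage": "FULL_COVERAGE"},
                    {"insight_id": "i2", "coverage": "NO_COVERAGE"},
                    {"insight_id": "i3", "coverage": "NO_COVERAGE"},
                ]
            },
        }
    ],
    "documents": [
        {"document_id": "d1", "insights_included": ["i1", "i2"]},
        {"document_id": "d2", "insights_included": ["i1"]},
    ],
}

print(insight_document_counts(toy_haystack))
print(coverage_label_counts(toy_haystack["subtopics"][0], "method_a"))
```

The same helpers should work on a real Haystack loaded from one of the .json files, provided the file follows the schema above.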

Running the Pipeline

The pipeline can be run with three consecutive scripts: (1) populate_retriever_scores.py (optional, only needed when implementing a new retriever), (2) populate_summaries.py, which populates the summary outputs, and (3) populate_eval.py, which generates the evaluation scores (using GPT-4o by default). Some notes:

To introduce a new retriever or summarizer, modify the generate functions (which currently map to our internal LLM API) so that they call the generative system to be evaluated.
We recommend keeping the prompts unmodified (they are provided in prompts/). If you do modify a prompt, we highly recommend stating so when reporting results. We did not perform extensive prompt engineering for the results reported in the paper.
Each script has argparse arguments that support more specific use cases.
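As a rough illustration of the first note, one way to route generation calls to your own system is a small registry keyed by model card name. Everything in this sketch is hypothetical: the `generate(prompt, model_card)` signature, the `register_backend` decorator, and the `echo-model` backend are assumptions for illustration, and the actual generate functions in the scripts may take different arguments.

```python
from typing import Callable, Dict

# Hypothetical registry: model card name -> function that turns a prompt into text.
BACKENDS: Dict[str, Callable[[str], str]] = {}


def register_backend(model_card: str):
    """Decorator that registers a prompt -> text function under a model card name."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        BACKENDS[model_card] = fn
        return fn
    return decorator


@register_backend("echo-model")
def echo_backend(prompt: str) -> str:
    # Placeholder standing in for a real call to the system under evaluation.
    return "- placeholder bullet for: " + prompt[:40]


def generate(prompt: str, model_card: str) -> str:
    """Dispatch a prompt to the backend registered for model_card (assumed signature)."""
    if model_card not in BACKENDS:
        raise ValueError(f"No backend registered for model card {model_card!r}")
    return BACKENDS[model_card](prompt)
```

With this layout, supporting a new model would only require registering one more backend function, leaving the prompts untouched.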

An example of running the pipeline might look like:

python populate_summaries.py --fn data/topic_news1.json --domain news --model_cards claude4 --full_sum --retrieval_summ
python populate_eval.py --fn data/topic_news1.json

The above would run the SummHay experiment (i.e., generate summaries) for all retrievers and for the full-context setting, using a model named claude4 (access to which would have to be implemented within populate_summaries.py), and then run the automatic evaluation on those summaries.

Visualizing Results on SummHay

The Results_Visualization.ipynb notebook contains the scripts that can be used to compile and visualize results. These are the exact scripts used to generate Table 2 in the paper.

Benchmarking Automatic Evaluation

Sections 4.2-4.3 of the paper describe an experimental setup to validate the automation of SummHay evaluation. We release the 200 annotated samples used for this experiment (see data/eval_benchmark_data.json). The file contains the manual annotations as well as the predictions obtained from the evaluator models included in our experiments.

The Eval_Benchmarking.ipynb notebook includes the code used to compile and visualize the results (corresponding to Table 1 in the paper), and the bias estimates for the automatic metrics (corresponding to Table 4 in the paper's Appendix).

This data can serve as a basis for evaluating future (potentially more efficient) methods for automatic evaluation of the SummHay task.

Citation

@article{laban2024SummHay,
  title={Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems},
  author={Laban, Philippe and Fabbri, Alexander R and Xiong, Caiming and Wu, Chien-Sheng},
  journal={arXiv preprint arXiv:2407.01370},
  year={2024}
}

Get Involved

Please create a GitHub issue if you have any questions, suggestions, requests, or bug reports. We welcome PRs!
