This repo contains the code for "LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs". We are still in the process to polish our repo.
- Introduction
- Installation
- Quick Start
- Corpus Preparation (Optional)
- Long Retriever
- Long Reader
- License
- Citation
In traditional RAG framework, the basic retrieval units are normally short. Such a design forces the retriever to search over a large corpus to find the "needle" unit. In contrast, the readers only need to extract answers from the short retrieved units. Such an imbalanced heavy retriever and light reader design can lead to sub-optimal performance. We propose a new framework LongRAG, consisting of a "long retriever" and a "long reader". Our framework use a 4K-token retrieval unit, which is 30x longer than before. Our study offers insights into the future roadmap for combining RAG with long-context LLMs.
Clone this repository and install the required packages:
git clone https://github.com/TIGER-AI-Lab/LongRAG.git
cd LongRAG
pip install -r requirements.txt
Please go to the "Long Reader" section and follow the instructions. This will help you get the final prediction for 100 examples.
The output will be similar to our sample files in the exp/
directory.
This is an optional step. You can use our processed corpus directly. We have released two versions of the retrieval corpus for NQ and HotpotQA on Hugging Face.
from datasets import load_dataset
corpus_nq = load_dataset("TIGER-Lab/LongRAG", "nq_corpus")
corpus_hotpotqa = load_dataset("TIGER-Lab/LongRAG", "hotpot_qa_corpus")
If you are still interested in how we craft the corpus, you can start reading here.
Wikipedia raw data clean: We first clean Wikipedia raw data by following the standard process. We use WikiExtractor. This is a widely-used Python script that extracts and cleans text from a Wikipedia database backup dump. Please ensure you use the required Python environment. A sample script is:
sh scripts/extract_and_clean_wiki_dump.sh
Preprocess Wikipedia data After cleaning the Wikipedia raw data, run the following script to gather more information.
sh scripts/process_wiki_page.sh
dir_path
: The directory path of the cleaned Wikipedia dump, which is the output of the previous step.output_path_dir
: The output directory will contain several pickle files, each representing a dictionary for the Wikipedia page.degree.pickle
: The key is the Wikipedia page title, and the value is the number of hyperlinks.abs_adj.pickle
: The key is the Wikipedia page title, and the value is the linked page in the abstract paragraph.full_adj.pickle
: The key is the Wikipedia page title, and the value is the linked page in the entire page.doc_size.pickle
: The key is the Wikipedia page title, and the value is the number of tokens on that page.doc_dict.pickle
: The key is the Wikipedia page title, and the value is the text of the page.corpus_title_path
: The key is used to filter the NQ dataset. In the original DPR paper, certain Wikipedia pages, such as list pages and disambiguation pages, were removed, reducing the total number of Wikipedia pages from 5 million to 3 million. For a fair comparison, we also chose to exclude these pages. (For HotpotQA, we did not remove any pages, so the number of Wikipedia pages remains at 5 million.) You can download the DPR's titles from this link.
We have provided the processed Wikipedia in our huggingface repo.
Please check out the nq_wiki
and hotpot_qa_wiki
subsets for more information. You could easily derive these
pickle files from these two datasets.
Retrieval Corpus: By grouping multiple related documents, we can construct long retrieval units with more than 4K tokens. This design could also significantly reduce the corpus size (number of retrieval units in the corpus). Then, the retriever’s task becomes much easier. Additionally, the long retrieval unit will also improve the information completeness to avoid ambiguity or confusion.
sh scripts/group_documents.sh
processed_wiki_dir
: The output directory of the above step.mode
:abs
is for HotpotQA corpus,full
is for NQ corpus.output_dir
: The output directory, The output directory will contain several pickle files, each representing a dictionary for the retrieval corpus. The most important one isgroup_text.pickle
, which maps the corpus ID to the corpus text. For more details, please refer to our released corpus on Hugging Face.
We leverage open-sourced dense retrieval toolkit, Tevatron. For all our retrieval experiments. The base embedding model we used is bge-large-en-v1.5. We have provided a sample script; make sure to update the parameters with your own dataset local path. Additionally, our script uses 4 GPUs to encode the corpus for time saving; please update this based on your own use case.
sh scripts/run_retrieve_tevatron.sh
We select Gemini-1.5-Pro and GPT-4o as our long reader given their strong ability to handle long context input. (We also plan to test other LLMs capable of handling long contexts in the future.)
The input of the reader is a concatenation of all the long retrieval units from the long retriever. We have provided the input file in our Huggingface repo.
mkdir -p exp/
sh scripts/run_eval_qa.sh
test_data_name
: Test set name,nq
(NQ) orhotpot_qa
(HotpotQA).test_data_split
: For each test set, there are three splits:full
,subset_1000
,subset_100
. We suggest starting withsubset_100
for a quick start or debugging and usingsubset_1000
to obtain relatively stable results.output_file_path
: The output file, here it's placed in theexp/
directory.reader_model
: The long context reader model we use, currently our code supportGPT-4o
,GPT-4-Turbo
,Gemini-1.5-Pro
,Claude-3-Opus
. Please note that you need to update the related API key and API configuration in the code. For example, if you are using the GPT-4 series, you need to configure the code inutils/gpt_inference.py
; if you are using the Gemini series, you need to configure the code inutils/gemini_inference.py
. We will continue to support more models in the future.
The output file contains one test case per row. The short_ans
field is our final prediction.
{
"query_id": "383",
"question": "how many episodes of touching evil are there",
"answers": ["16"],
"long_ans": "16 episodes.",
"short_ans": "16",
"is_exact_match": 1,
"is_substring_match": 1,
"is_retrieval": 1
}
We have provided some sample output files in our exp/
directory. For example, exp/nq_gpt4o_100.json
contains
the result from the running file:
python eval/eval_qa.py \
--test_data_name "nq" \
--test_data_split "subset_100" \
--output_file_path "./exp/nq_gpt4o_100.json" \
--reader_model "GPT-4o"
The top-1 retrieval accuracy is 88%, and the exact match rate is 64%.
Please check out the license of each subset we use in our work.
Dataset Name | License Type |
---|---|
NQ | Apache License 2.0 |
HotpotQA | CC BY-SA 4.0 License |
Please kindly cite our paper if you find our project is useful:
@article{jiang2024longrag,
title={LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs},
author={Ziyan Jiang, Xueguang Ma, Wenhu Chen},
journal={arXiv preprint arXiv:2406.15319},
year={2024},
url={https://arxiv.org/abs/2406.15319}
}