End-to-End Evaluation#
To test the end-to-end performance of an RQA system, we provide an automatic evaluation script that measures:
- retrieval performance, such as Recall@k;
- generation performance, such as BLEU, ROUGE, and GPT-4 Eval;
- end-to-end metrics, such as runtime.
BLEU and ROUGE are commonly used in open-ended generation tasks such as machine translation and summarization. GPT-4 Eval is a recent method that uses GPT-4 (OpenAI, 2023) to evaluate the quality of model-generated responses (Liu et al., 2023; Zheng et al., 2023).
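As a rough illustration of what these metrics capture, here is a minimal, self-contained sketch (independent of the evaluator's actual implementation) that computes Recall@k for retrieval and a token-level F1 for generation:

from collections import Counter

def recall_at_k(retrieved_ids, gold_ids, k):
    """Fraction of gold documents that appear among the top-k retrieved documents."""
    top_k = set(retrieved_ids[:k])
    return sum(1 for doc_id in gold_ids if doc_id in top_k) / len(gold_ids)

def token_f1(prediction, reference):
    """Token-level F1 between a generated answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    num_same = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(recall_at_k(["d3", "d7", "d1", "d9"], gold_ids=["d1", "d5"], k=4))  # 0.5: 1 of 2 gold docs in top-4
print(token_f1("the cat sat on the mat", "a cat sat on a mat"))           # ~0.67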
Running Evaluation#
By default, our evaluation script scripts/test/test_e2e.py is based on the SimpleRQA class, which is compatible with most models from Hugging Face and OpenAI. To use this script, you will need 1) an embedding model and a QA model, 2) a document/index database, and 3) a test dataset. For example:
python scripts/test/test_e2e.py \
--qa_model_name_or_path lmsys/vicuna-7b-v1.5 \
--embedding_model_name_or_path intfloat/e5-base-v2 \
--document_path <example/documents.pkl> \
--index_path <example/index> \
--eval_data_path <example/test_w_qa.jsonl> \
--gen_gpt4eval false \
--output_dir <example/output/dir>
This will output a JSONL file containing the evaluation results, saved under <example/output/dir>. After evaluation, the folder will contain the following files:
<example/output/dir>
├── all_args.json # arguments used for evaluation
├── score.json # models performance
├── test-predictions.jsonl # models predictions/outputs
└── test.log # test logs
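To take a quick look at these outputs, you can load them with a few lines of Python. The paths below follow the <example/output/dir> placeholder from the command above, and the exact JSON keys and prediction fields depend on your configuration (they are not assumed here):

import json

# aggregated metrics, e.g., the retrieval and generation scores you enabled
with open("example/output/dir/score.json") as f:
    scores = json.load(f)
print(scores)

# per-example predictions, one JSON object per line
with open("example/output/dir/test-predictions.jsonl") as f:
    for line in f:
        example = json.loads(line)
        # inspect the fields of your own run here
        print(example.keys())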
Note that:
- Currently, we assume evaluation on a single GPU. If you want to use a large QA model across multiple GPUs, please use Inference Acceleration Frameworks to serve the model and then use the --qa_model_name_or_path argument to specify the endpoint.
- The document_path and index_path arguments refer to the document database and the indexed folder. If index_path is empty, the script will also index the document database and save it to the specified path.
- By default, gen_gpt4eval is set to false. If you want to use GPT-4 Eval, make sure you have configured export OPENAI_API_KEY=xxx and export OPENAI_ORGANIZATION=xxx, and then set --gen_gpt4eval true.
- The test-predictions.jsonl file can then be directly used with the Static Human Evaluation module!
- For other available arguments, run python scripts/test/test_e2e.py -h.
Customization#
You can customize the behavior of the evaluation script either by modifying scripts/test/test_e2e.py, or by using the evaluators in your own code. Our evaluation procedure consists of three simple steps:
from local_rqa.pipelines.retrieval_qa import SimpleRQA
from local_rqa.evaluation.evaluator import E2EEvaluator, EvaluatorConfig

# ModelArguments, TestArguments, init_rqa_model, and load_eval_data are
# defined in scripts/test/test_e2e.py (omitted here for brevity)
def test(model_args: ModelArguments, test_args: TestArguments):
    ### 1. init the RQA model. This returns a SimpleRQA object
    rqa_model = init_rqa_model(model_args, test_args.document_path, test_args.index_path)

    ### 2. define which metrics to use, and other configurations for evaluation
    eval_config = EvaluatorConfig(
        batch_size=test_args.batch_size,
        retr_latency=False,
        gen_f1=True,
        gen_precision=True,
        gen_rouge=True,
        gen_latency=True,
        gen_gpt4eval=test_args.gen_gpt4eval,
        e2e_latency=True,
        ## eval model related configs
        assistant_prefix=model_args.assistant_prefix,
        user_prefix=model_args.user_prefix,
        sep_user=model_args.sep_user,
        sep_sys=model_args.sep_sys,
    )

    ### 3. load the evaluation data, and run the evaluation
    loaded_eval_data = load_eval_data(test_args.eval_data_path)
    evaluator = E2EEvaluator(
        config=eval_config,
        test_data=loaded_eval_data,
    )
    performance, predictions = evaluator.evaluate(rqa_model, prefix='test')
    # other code omitted
    return
Note
Under the hood, the E2EEvaluator class accepts any RQA system that subclasses the RQAPipeline class (e.g., our SimpleRQA). So if you wish to use a custom RQA system, first subclass RQAPipeline (or even SimpleRQA), and then simply pass the instance to E2EEvaluator.evaluate!
References:
OpenAI. 2023. GPT-4. https://openai.com/gpt-4.
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore. Association for Computational Linguistics.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.