Inference Acceleration Frameworks#

RQA systems need time to perform retrieval and to generate a response. If you are using large models, this latency can be high, which may not be acceptable for interactive chat systems such as the one in Interactive Chat. To address this, we provide support for multiple acceleration frameworks that can be used to reduce the latency of the RQA system:

To speed up retrieval: for example, FAISS (used by default).

To speed up text generation: for example, vLLM and SGLang (both shown below).

By default, our SimpleRQA class (in Quickstart and in Interactive Chat) uses FAISS for retrieval and no acceleration framework for text generation. However, you can easily drop in any of the above acceleration frameworks in a two-step process:

  1. Host your model using the acceleration framework of your choice, for instance, vLLM:

    python -m vllm.entrypoints.api_server --model lmsys/vicuna-7b-v1.5
    

    By default, this hosts the model at http://localhost:8000 (a quick way to verify the endpoint is sketched after these steps).

  2. Change the --qa_model_name_or_path argument to <framework-name>::<url>/generate, as shown in the sections below.
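
Before wiring the hosted model into SimpleRQA, it can help to confirm that the endpoint is reachable. The sketch below sends a test prompt to the vLLM demo API server started in step 1; the request and response fields (prompt, max_tokens, text) follow vLLM's demo api_server and may vary between vLLM versions, so treat this as a sanity check rather than a reference client:

    import requests

    # Assumes the vLLM demo api_server from step 1 is running at localhost:8000.
    # The payload/response format follows vLLM's demo server and may change
    # between versions; adjust if your deployment differs.
    resp = requests.post(
        "http://localhost:8000/generate",
        json={"prompt": "Hello, my name is", "max_tokens": 32, "temperature": 0.7},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json()["text"])  # list of generated completions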

Accelerating SimpleRQA#

The SimpleRQA class is used in many contexts, such as during evaluation and model serving. Since the procedure is very similar for each framework, we will use SGLang in this example:

  1. Use SGLang to host your generative model:

    python -m sglang.launch_server --model-path lmsys/vicuna-7b-v1.5 --port 30000
    
  2. Change the --qa_model_name_or_path argument to <framework-name>::<url>/generate. For example, when evaluating your model (a quick check of the SGLang endpoint is sketched after this example):

    # instead of --qa_model_name_or_path lmsys/vicuna-7b-v1.5, point to the SGLang server
    python scripts/test/test_e2e.py \
    --qa_model_name_or_path sglang::http://localhost:30000/generate \
    --embedding_model_name_or_path intfloat/e5-base-v2 \
    --document_path <example/documents.pkl> \
    --index_path <example/index> \
    --eval_data_path <example/test_w_qa.jsonl> \
    --output_dir <example/output/dir>
    

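As with vLLM, you can confirm that the SGLang server is up before running the evaluation. The sketch below posts a test prompt to SGLang's native /generate endpoint on port 30000; the payload fields (text, sampling_params, max_new_tokens) follow SGLang's native generation API and may vary across SGLang versions:

    import requests

    # Assumes the SGLang server from step 1 is running at localhost:30000.
    # The payload format follows SGLang's native /generate API and may change
    # between versions; adjust if your deployment differs.
    resp = requests.post(
        "http://localhost:30000/generate",
        json={
            "text": "Hello, my name is",
            "sampling_params": {"max_new_tokens": 32, "temperature": 0.7},
        },
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json()["text"])  # generated completion
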
Accelerating Model Serving#

You can also use the <framework-name>::<url>/generate format in our serving scripts, such as in Interactive Chat. Since the procedure is very similar for each framework, we will use vLLM as an example:

  1. Use vLLM to host your generative model:

    python -m vllm.entrypoints.api_server --model lmsys/vicuna-7b-v1.5
    

    By default, this hosts the model at http://localhost:8000.

  2. Change the --qa_model_name_or_path argument to <framework-name>::<url>/generate (the format of this value is illustrated after the example):

    export CUDA_VISIBLE_DEVICES=0
    # instead of --qa_model_name_or_path lmsys/vicuna-7b-v1.5, point to the vLLM server
    python local_rqa/serve/model_worker.py \
    --document_path <example/documents.pkl> \
    --index_path <example/index> \
    --embedding_model_name_or_path intfloat/e5-base-v2 \
    --qa_model_name_or_path vllm::http://localhost:8000/generate \
    --model_id simple_rqa
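
For reference, the value passed to --qa_model_name_or_path is simply the framework name and the server URL joined by ::. The sketch below shows how such a value can be split into its two parts; it is illustrative only, and the actual parsing inside local_rqa may differ:

    # Illustrative only: splits a "<framework-name>::<url>" value such as
    # "vllm::http://localhost:8000/generate" into framework and endpoint.
    # local_rqa's internal parsing may differ from this sketch.
    def split_accelerator_spec(spec: str) -> tuple[str, str]:
        framework, sep, url = spec.partition("::")
        if not sep:
            # No "::" present: treat the value as a plain model name or path.
            return "", spec
        return framework, url

    print(split_accelerator_spec("vllm::http://localhost:8000/generate"))
    # ('vllm', 'http://localhost:8000/generate')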