Configure vLLM’s high-performance inference library with Continue for chat, autocomplete, and embeddings, including setup instructions for the Llama 3.1, Qwen2.5-Coder, and Nomic Embed models.
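A rough sketch of what the resulting Continue `config.yaml` can look like is below. The model IDs, ports, and block names are illustrative assumptions rather than values from this page, and each model needs its own `vllm serve` instance (see the command in the next step):

```yaml
models:
  - name: Llama 3.1 8B
    provider: vllm
    model: meta-llama/Llama-3.1-8B-Instruct
    apiBase: http://localhost:8000/v1  # one vllm serve instance per model
    roles:
      - chat
  - name: Qwen2.5-Coder 1.5B
    provider: vllm
    model: Qwen/Qwen2.5-Coder-1.5B
    apiBase: http://localhost:8001/v1
    roles:
      - autocomplete
  - name: Nomic Embed Text
    provider: vllm
    model: nomic-ai/nomic-embed-text-v1.5
    apiBase: http://localhost:8002/v1
    roles:
      - embed
```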
Run vLLM’s OpenAI-compatible server with the `vllm serve` command. See their server documentation and the engine arguments documentation.
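For instance, a minimal sketch of starting a chat model (the model name, API key, and port are illustrative placeholders, not values from this page):

```bash
# Start vLLM’s OpenAI-compatible server on port 8000.
# Model name and API key are illustrative placeholders.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --api-key token-abc123 \
  --port 8000
```

The server then exposes OpenAI-style endpoints under `http://localhost:8000/v1`, which is the `apiBase` the config sketch above assumes.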
Note that vLLM’s response format differs slightly from the OpenAI spec in places (`results` instead of `data`).
For reranking, see the list of reranking model providers.
The Continue implementation uses the OpenAI provider under the hood. View the source for details.
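Because the server speaks the OpenAI API, a plausible alternative (all values illustrative) is to point Continue’s generic `openai` provider directly at the same endpoint:

```yaml
models:
  - name: Llama 3.1 8B (OpenAI-compatible endpoint)
    provider: openai
    model: meta-llama/Llama-3.1-8B-Instruct
    apiBase: http://localhost:8000/v1
    apiKey: token-abc123  # must match the --api-key passed to vllm serve
```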