vLLM is an open-source library for fast LLM inference, typically used to serve multiple users at the same time. It can also run a large model across multiple GPUs (for example, when the model doesn't fit on a single GPU). Start its OpenAI-compatible server with `vllm serve`. See the vLLM server documentation and the engine arguments documentation for details.
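As a minimal sketch, the commands below show one way to install vLLM and start the server; the model name and flag values are placeholders, so substitute your own (the engine arguments documentation lists all available options):

```bash
# Install vLLM (most models require a CUDA-capable GPU)
pip install vllm

# Start the OpenAI-compatible server on port 8000.
# --tensor-parallel-size shards the model across 2 GPUs when it
# does not fit on one; adjust or omit for a single-GPU setup.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --tensor-parallel-size 2
```

Once the server is running, point Continue's `apiBase` at `http://localhost:8000/v1`.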
Continue automatically handles vLLM's reranking response format (which uses `results` instead of `data`). For other options, see the list of reranking model providers. The Continue implementation uses the OpenAI provider under the hood; see the source for details.
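As a rough illustration, a Continue `config.yaml` pointing at local vLLM servers might look like the following; the model names, titles, and ports are assumptions for this example, so check Continue's configuration reference for the exact schema:

```yaml
models:
  # Chat model served by `vllm serve` on port 8000 (example model name)
  - name: Llama 3.1 8B (vLLM)
    provider: vllm
    model: meta-llama/Llama-3.1-8B-Instruct
    apiBase: http://localhost:8000/v1
    roles:
      - chat

  # Reranker served by a separate vLLM instance on port 8001 (example model name)
  - name: BGE Reranker (vLLM)
    provider: vllm
    model: BAAI/bge-reranker-base
    apiBase: http://localhost:8001/v1
    roles:
      - rerank
```

Because each vLLM instance serves a single model, the reranker in this sketch runs as its own `vllm serve` process on a separate port.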