OpenVINO Model Server

Configure OpenVINO Model Server with Continue to run Intel-optimized models on CPU, iGPU, GPU, and NPU through its OpenAI-compatible API, with support for chat and code completion using models such as CodeLlama and Qwen.

OpenVINO™ Model Server (OVMS) is a scalable inference server for models optimized with OpenVINO™ for Intel CPU, iGPU, GPU, and NPU.
OpenVINO™ Model Server supports text generation through the OpenAI Chat Completions API. Simply select the OpenAI provider and point apiBase at a running OVMS instance. Refer to this demo in the official OVMS documentation to set up your own local server.
Example configuration once OVMS is launched:
```yaml
name: My Config
version: 0.0.1
schema: v1

models:
  - name: OVMS CodeLlama-7b-Instruct-hf
    provider: openai
    model: codellama/CodeLlama-7b-Instruct-hf
    apiKey: unused
    apiBase: http://localhost:5555/v3
    roles:
      - chat
      - edit
      - apply
  - name: OVMS Qwen2.5-Coder-1.5B
    provider: openai
    model: Qwen/Qwen2.5-Coder-1.5B
    apiKey: unused
    apiBase: http://localhost:5555/v3
    roles:
      - autocomplete
```
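
With the models configured, you can sanity-check the endpoints outside of Continue using any OpenAI-compatible client. The sketch below uses the official openai Python package; it assumes the server from the example above is running on port 5555 and already serving both models, and the prompts are illustrative only.

```python
# Smoke-test the OVMS OpenAI-compatible endpoints with the openai Python client.
# Assumes OVMS is serving at http://localhost:5555/v3, as in the config above.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5555/v3",  # same value as apiBase in config.yaml
    api_key="unused",                     # OVMS does not validate the key
)

# Chat model (roles: chat, edit, apply) via the Chat Completions API
chat = client.chat.completions.create(
    model="codellama/CodeLlama-7b-Instruct-hf",
    messages=[{"role": "user", "content": "Write a function that reverses a string."}],
    max_tokens=128,
)
print(chat.choices[0].message.content)

# Autocomplete model (role: autocomplete) via the plain completions endpoint
completion = client.completions.create(
    model="Qwen/Qwen2.5-Coder-1.5B",
    prompt="def fibonacci(n):",
    max_tokens=64,
)
print(completion.choices[0].text)
```

If both calls return generated text, Continue's chat and autocomplete roles should work against the same apiBase.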