OpenVINO Model Server

Configure OpenVINO Model Server with Continue to run Intel-optimized models on CPU, iGPU, GPU, and NPU through its OpenAI-compatible API, with support for chat and code completion using models such as CodeLlama and Qwen.

OpenVINO™ Model Server (OVMS) is a scalable inference server for models optimized with OpenVINO™ for Intel CPU, iGPU, GPU, and NPU.
OpenVINO™ Model Server supports text generation through the OpenAI Chat Completions API. Simply select the OpenAI provider and point apiBase at a running OVMS instance. Refer to this demo in the official OVMS documentation to set up your own local server.
Example configuration once OVMS is launched:
```yaml
name: My Config
version: 0.0.1
schema: v1

models:
  - name: OVMS CodeLlama-7b-Instruct-hf
    provider: openai
    model: codellama/CodeLlama-7b-Instruct-hf
    apiKey: unused
    apiBase: http://localhost:5555/v3
    roles:
      - chat
      - edit
      - apply
  - name: OVMS Qwen2.5-Coder-1.5B
    provider: openai
    model: Qwen/Qwen2.5-Coder-1.5B
    apiKey: unused
    apiBase: http://localhost:5555/v3
    roles:
      - autocomplete
```
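
With the models configured, you can sanity-check the endpoints outside of Continue using any OpenAI-compatible client. The sketch below uses the official openai Python package; it assumes the server from the example above is running on port 5555 and already serving both models, and the prompts are illustrative only.

```python
# Smoke-test the OVMS OpenAI-compatible endpoints with the openai Python client.
# Assumes OVMS is serving at http://localhost:5555/v3, as in the config above.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5555/v3",  # same value as apiBase in config.yaml
    api_key="unused",                     # OVMS does not validate the key
)

# Chat model (roles: chat, edit, apply) via the Chat Completions API
chat = client.chat.completions.create(
    model="codellama/CodeLlama-7b-Instruct-hf",
    messages=[{"role": "user", "content": "Write a function that reverses a string."}],
    max_tokens=128,
)
print(chat.choices[0].message.content)

# Autocomplete model (role: autocomplete) via the plain completions endpoint
completion = client.completions.create(
    model="Qwen/Qwen2.5-Coder-1.5B",
    prompt="def fibonacci(n):",
    max_tokens=64,
)
print(completion.choices[0].text)
```

If both calls return generated text, Continue's chat and autocomplete roles should work against the same apiBase.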