OpenVINO™ Model Server

OpenVINO™ Model Server (OVMS) is a scalable inference server for models optimized with OpenVINO™, running on Intel CPU, iGPU, GPU, and NPU.

OpenVINO™ Model Server supports text generation via the OpenAI Chat Completions API. Simply select the openai provider and point apiBase at the running OVMS instance. Refer to this demo in the official OVMS documentation to set up your own local server.
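Once the server is running, you can sanity-check the Chat Completions endpoint directly, for example with the OpenAI Python client. This is a minimal sketch: the port (5555) and model name are assumptions taken from the example configuration below and may differ in your deployment.

```python
# Minimal sketch using the OpenAI Python client (pip install openai).
# The port and model name are assumptions matching the example config
# below; adjust them to your own OVMS deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5555/v3", api_key="unused")

response = client.chat.completions.create(
    model="codellama/CodeLlama-7b-Instruct-hf",
    messages=[{"role": "user", "content": "Write a function that reverses a string."}],
)
print(response.choices[0].message.content)
```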

Example configuration once OVMS is launched:

config.yaml

```yaml
models:
  - name: OVMS CodeLlama-7b-Instruct-hf
    provider: openai
    model: codellama/CodeLlama-7b-Instruct-hf
    apiKey: unused
    apiBase: http://localhost:5555/v3
    roles:
      - chat
      - edit
      - apply
  - name: OVMS Qwen2.5-Coder-1.5B
    provider: openai
    model: Qwen/Qwen2.5-Coder-1.5B
    apiKey: unused
    apiBase: http://localhost:5555/v3
    roles:
      - autocomplete
```
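Note that OVMS does not validate the API key, so apiKey only needs a placeholder value such as unused; the /v3 segment in apiBase is where OVMS serves its OpenAI-compatible endpoints.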