Vllm on Home

Deploying Hermes Agent on OpenShift

Sun, 24 May 2026 00:00:00 +0000

Hermes Agent on OpenShift: Private by default, cloud access when needed - AI generated

Introduction

In this post, I want to describe how to deploy the Hermes Agent on OpenShift and wire it to a self-hosted model endpoint running on the same cluster. This is a direct continuation of two earlier posts: Deploying OpenShift on AWS, which covers getting a cluster into place, and the post on running the Red Hat AI Inference Server on OpenShift, which covers the model serving layer that Hermes will talk to.

If you want background on what Hermes Agent is and why it is worth running, the companion post Hermes Agent: A Personal AI That Gets More Useful Over Time covers that in more detail. This post focuses on the mechanics of getting it running on OpenShift.

Architecture

The setup connects two namespaces on the same cluster. The Red Hat AI Inference Server (RHAIIS) runs in the rhaiis namespace and serves a model on port 8000. Hermes Agent runs in a separate hermes namespace and talks to the vLLM server over the internal cluster service network, using the DNS name rhaiis-vllm.rhaiis.svc.cluster.local:8000. No public route is involved in that hop.
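The internal address follows the usual Kubernetes service DNS pattern. As a small sketch, assuming the cluster uses the default cluster.local domain, a helper (svc_base_url is a name made up for this example) can assemble the base URL that Hermes is later configured with:

```shell
# Build the in-cluster base URL of an OpenAI-compatible Service.
# Arguments: service name, namespace, port.
# Assumption: the cluster domain is the default "cluster.local".
svc_base_url() {
  printf 'http://%s.%s.svc.cluster.local:%s/v1\n' "$1" "$2" "$3"
}

svc_base_url rhaiis-vllm rhaiis 8000
```

With the names used in this post, the result is the same URL that appears as base_url in the Hermes configuration.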

OpenRouter is wired as an automatic fallback. If the vLLM server is unavailable or returns an error, Hermes falls back to a remote model through OpenRouter without requiring any manual intervention.

Externally, Hermes exposes an OpenAI-compatible API on port 8642, secured with a bearer token. An OpenShift Route with TLS termination handles the public endpoint.

Prerequisites

The RHAIIS deployment from the previous post must be running in the rhaiis namespace with a deployment named rhaiis-vllm.
An OpenRouter API key for the fallback model.

Deploying Hermes Agent

All deployment files are available in the smichard/agent_on_ocp GitHub repository. The steps below apply them in sequence.

Clone the repository:

git clone https://github.com/smichard/agent_on_ocp.git
cd hermes_on_ocp

Create the Namespace and ServiceAccount

oc new-project hermes

Hermes Agent runs as UID 10000. The default restricted SCC in OpenShift assigns each pod a UID from the namespace's allocated range and rejects a fixed UID outside it, so the deployment needs a dedicated ServiceAccount with the anyuid SCC:

oc create serviceaccount hermes -n hermes

oc adm policy add-scc-to-user anyuid \
 -z hermes \
 -n hermes

Create Secrets

Three secrets are needed: one for the vLLM bearer token, one for the OpenRouter fallback key, and one for the Hermes API server key that clients must present.

vLLM bearer token:

Hermes reads the OPENAI_API_KEY environment variable for custom OpenAI-compatible endpoints. The vLLM API key from the rhaiis namespace is passed in under that name:

export RHAIIS_API_KEY=$(oc get secret vllm-api-key-secret -n rhaiis \
 -o jsonpath='{.data.VLLM_API_KEY}' | base64 -d)

oc create secret generic hermes-vllm-secret \
 --from-literal=OPENAI_API_KEY="${RHAIIS_API_KEY}" \
 -n hermes

OpenRouter fallback key:

oc create secret generic hermes-openrouter-secret \
 --from-literal=OPENROUTER_API_KEY=<your_openrouter_key> \
 -n hermes

Hermes API server key:

Clients calling the Hermes API must include this key as a bearer token. Generate a random value at creation time:

oc create secret generic hermes-api-secret \
 --from-literal=API_SERVER_KEY=$(openssl rand -hex 32) \
 -n hermes

Retrieve it later with:

oc get secret hermes-api-secret -n hermes \
 -o jsonpath='{.data.API_SERVER_KEY}' | base64 -d
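The base64 -d step is needed because values under .data in a Secret are stored base64-encoded, which is an encoding and not encryption. A minimal round trip, using a made-up value and a hypothetical helper name, illustrates what the pipe does:

```shell
# Decode a single base64-encoded Secret value, as returned by
# `oc get secret ... -o jsonpath='{.data.KEY}'`.
decode_secret_value() {
  printf '%s' "$1" | base64 -d
}

# Illustrative value only; a real key comes from the cluster.
encoded=$(printf '%s' 'example-key' | base64)
echo "encoded: ${encoded}"
echo "decoded: $(decode_secret_value "${encoded}")"
```

Anyone with read access to the Secret can therefore recover the key, so access to the hermes namespace should be restricted accordingly.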

Create the ConfigMap

The ConfigMap holds the Hermes Agent configuration file. It sets the primary model provider to the internal vLLM service and configures OpenRouter as the fallback:

apiVersion: v1
kind: ConfigMap
metadata:
  name: hermes-config
  namespace: hermes
  labels:
    app: hermes
data:
  config.yaml: |
    model:
      provider: "custom"
      base_url: "http://rhaiis-vllm.rhaiis.svc.cluster.local:8000/v1"
      default: "Qwen/Qwen3-Coder-30B-A3B-Instruct"

    fallback_model:
      provider: "openrouter"
      model: "anthropic/claude-sonnet-4-6"

    terminal:
      backend: "local"
      cwd: "/opt/data/workspace"
      timeout: 180
      lifetime_seconds: 300

    compression:
      enabled: true
      threshold: 0.50
      target_ratio: 0.20
      protect_last_n: 20

Adjust model.default to match the --served-model-name value used in the RHAIIS deployment. Adjust fallback_model.model to the OpenRouter model you want to use as a fallback.
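A mismatch between model.default and the served model name is easy to miss, so a quick check before applying the ConfigMap can help. The sketch below is only an illustration: the helper names are made up, and the sed pattern assumes the default value is written on one line in double quotes, as in the file above.

```shell
# Print the value of the first `default: "..."` line in a config file.
configured_model() {
  sed -n 's/^ *default: *"\(.*\)" *$/\1/p' "$1" | head -n 1
}

# Succeed only if the configured model equals the served model name.
# Arguments: config file, served model name.
check_model_match() {
  [ "$(configured_model "$1")" = "$2" ]
}

# Demo against a temporary file; in practice, point it at the config file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
model:
  provider: "custom"
  default: "Qwen/Qwen3-Coder-30B-A3B-Instruct"
EOF
check_model_match "$tmp" "Qwen/Qwen3-Coder-30B-A3B-Instruct" && echo "model names match"
rm -f "$tmp"
```

The served name to compare against can be read from the /v1/models endpoint of the inference server.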

oc apply -f configmap.yaml

Create a PersistentVolumeClaim

Hermes stores sessions, memories, and workspace data on a persistent volume:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hermes-data
  namespace: hermes
  labels:
    app: hermes
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi

Apply the file to create the PVC:

oc apply -f pvc.yaml

Deploy Hermes Agent

The Deployment mounts the ConfigMap and the PVC, injects the three secrets as environment variables, and runs the container as the hermes ServiceAccount. Check the Hermes Agent repository for the current container image reference before applying:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hermes
  namespace: hermes
  labels:
    app: hermes
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hermes
  template:
    metadata:
      labels:
        app: hermes
    spec:
      tolerations:
        - key: nvidia.com/gpu
          effect: NoSchedule
          operator: Exists
      serviceAccountName: hermes
      securityContext:
        runAsUser: 10000
        runAsNonRoot: true
        fsGroup: 10000
      volumes:
        - name: hermes-data
          persistentVolumeClaim:
            claimName: hermes-data
        - name: hermes-config
          configMap:
            name: hermes-config
      containers:
        - name: hermes
          image: nousresearch/hermes-agent:latest
          imagePullPolicy: Always
          args: ["gateway"]
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: hermes-vllm-secret
                  key: OPENAI_API_KEY
            - name: OPENROUTER_API_KEY
              valueFrom:
                secretKeyRef:
                  name: hermes-openrouter-secret
                  key: OPENROUTER_API_KEY
            - name: API_SERVER_KEY
              valueFrom:
                secretKeyRef:
                  name: hermes-api-secret
                  key: API_SERVER_KEY
            - name: API_SERVER_ENABLED
              value: "true"
            - name: API_SERVER_HOST
              value: "0.0.0.0"
            - name: API_SERVER_PORT
              value: "8642"
          ports:
            - name: api
              containerPort: 8642
              protocol: TCP
          volumeMounts:
            - name: hermes-data
              mountPath: /opt/data
            - name: hermes-config
              mountPath: /opt/data/config.yaml
              subPath: config.yaml
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2"
              memory: "2Gi"
          livenessProbe:
            tcpSocket:
              port: 8642
            initialDelaySeconds: 30
            periodSeconds: 30
          readinessProbe:
            tcpSocket:
              port: 8642
            initialDelaySeconds: 15
            periodSeconds: 10
      restartPolicy: Always

Apply the file to create the deployment:

oc apply -f deployment.yaml

The API server is ready when the logs show:

[api_server] Listening on 0.0.0.0:8642

Hermes Gateway starting up, with 83 skills bundled and the messaging platform scheduler ready to accept requests
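Instead of watching the log by eye, the readiness line can be matched with grep. This is a small sketch that relies only on the log line quoted above; the function name is made up, and the exact wording of the line may differ between Hermes versions.

```shell
# Succeed once the API server's readiness line appears on stdin.
api_server_ready() {
  grep -qF 'Listening on 0.0.0.0:8642'
}

# Demo with canned log lines. Against a cluster, the input would be:
#   oc logs deployment/hermes -n hermes | api_server_ready
printf '%s\n' 'starting gateway' '[api_server] Listening on 0.0.0.0:8642' \
  | api_server_ready && echo "api server is ready"
```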

Create a Service and Route

Create a Service that maps port 8642 to port 8642 on the pod:

apiVersion: v1
kind: Service
metadata:
  name: hermes
  namespace: hermes
  labels:
    app: hermes
spec:
  selector:
    app: hermes
  ports:
    - name: api
      protocol: TCP
      port: 8642
      targetPort: 8642

Create a TLS-terminated Route to expose the endpoint outside the cluster (optional):

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: hermes
  namespace: hermes
  labels:
    app: hermes
spec:
  to:
    kind: Service
    name: hermes
  port:
    targetPort: api
  tls:
    termination: edge
    insecureEdgeTerminationPolicy: Redirect

Apply both and retrieve the assigned hostname:

oc apply -f service.yaml
oc apply -f route.yaml
oc get route hermes -n hermes -o jsonpath='{.spec.host}'

Testing the Endpoint

Store the hostname and API key in shell variables to keep the commands readable:

export HERMES_HOST=$(oc get route hermes -n hermes \
 -o jsonpath='{.spec.host}')
export HERMES_KEY=$(oc get secret hermes-api-secret -n hermes \
 -o jsonpath='{.data.API_SERVER_KEY}' | base64 -d)

Verify that the variables are populated before proceeding:

echo "HERMES_HOST : ${HERMES_HOST}"
echo "HERMES_API_KEY : ${HERMES_KEY}"
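If you prefer not to print the API key to the terminal, the same check can be done by name only. The helper below is a sketch (require_vars is a name chosen for this example) that reports which variables are empty without showing their values:

```shell
# Fail if any of the named variables is unset or empty.
# Prints only the variable names, never the values.
require_vars() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Demo values only; in the real flow both variables come from `oc`.
HERMES_HOST="hermes.example.com"
HERMES_KEY="not-a-real-key"
require_vars HERMES_HOST HERMES_KEY && echo "all variables set"
```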

List available models:

curl -sS \
 "https://${HERMES_HOST}/v1/models" \
 -H "Authorization: Bearer ${HERMES_KEY}" | jq -r '.data[].id'

Send a chat completion request:

curl -sS \
 "https://${HERMES_HOST}/v1/chat/completions" \
 -H "Authorization: Bearer ${HERMES_KEY}" \
 -H "Content-Type: application/json" \
 -d '{
 "model": "Qwen2.5-1.5B-Instruct",
 "messages": [{"role": "user", "content": "What is OpenShift?"}]
 }' | jq -r '.choices[0].message.content'

A successful response confirms that Hermes is running, the API key is working, and the request reached the vLLM server over the internal cluster network.
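When the request does not succeed, the HTTP status code usually narrows down where to look. The mapping below is a rough sketch based on general HTTP semantics, not on documented Hermes behavior, and the function name is made up:

```shell
# Map an HTTP status code to a first troubleshooting hint.
explain_status() {
  case "$1" in
    200) echo "ok" ;;
    401|403) echo "authentication failed: check the bearer token" ;;
    5??) echo "server-side error: check the Hermes and vLLM logs" ;;
    *) echo "unexpected status: $1" ;;
  esac
}

# The status code can be captured with curl's --write-out option:
#   status=$(curl -sS -o /dev/null -w '%{http_code}' \
#     "https://${HERMES_HOST}/v1/models" \
#     -H "Authorization: Bearer ${HERMES_KEY}")
explain_status 401
```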

To verify the fallback path, scale down the RHAIIS deployment temporarily and send the same request. Hermes should return a response via OpenRouter instead:

oc scale deployment rhaiis-vllm -n rhaiis --replicas=0
# send a request, observe fallback in hermes logs
oc scale deployment rhaiis-vllm -n rhaiis --replicas=1

Changing the Model

Update configmap.yaml and set model.default to any model name served by the vLLM instance. The value must match the --served-model-name argument used in the RHAIIS deployment. Apply the updated ConfigMap and restart the Hermes deployment to pick up the change:

oc apply -f configmap.yaml
oc rollout restart deployment/hermes -n hermes

Connecting to Open WebUI

Hermes Agent exposes a standard OpenAI-compatible API, which means Open WebUI can connect to it directly as an external provider. As in the previous posts, adding the Hermes endpoint requires no changes to the existing stack.

In Open WebUI, go to Settings > Connections and add a new external connection. Set the URL to the Hermes route hostname with the /v1 suffix, add the Hermes API server key from the Create Secrets step as a bearer token, set the provider type to OpenAI, and the API type to Chat Completions. Leave the model ID field empty so Open WebUI queries the /v1/models endpoint and discovers available models automatically.

Open WebUI external connection configured against the Hermes Agent endpoint

Once saved, the model appears in the model selector alongside any other configured providers. Requests go from Open WebUI through Hermes to the vLLM server running on the same cluster.

Hermes agent is available in the Open WebUI interface alongside the model served by RHAIIS

Conclusion

This setup places Hermes Agent inside the same OpenShift cluster as the inference server and routes all model traffic over the internal service network. The public Hermes API endpoint is secured with a separate bearer token, so the vLLM key never leaves the cluster. OpenRouter handles the fallback case without any changes to the application code. The result is a self-hosted agent that uses a self-hosted model for most requests and degrades gracefully when the local server is unavailable.

References

GitHub repository with deployment files - link
Deploying OpenShift on AWS with Automated Cluster Provisioning - link
Running the Red Hat AI Inference Server on OpenShift - link
Hermes Agent: A Personal AI That Gets More Useful Over Time - link
OpenRouter - link
OpenShift CLI (oc) - link
smichard/agent_on_ocp - GitHub repository - link
Hermes Agent - GitHub repository - link
Hermes Agent - Documentation - link
Nous Research - link

Running the Red Hat AI Inference Server on OpenShift

Sun, 17 May 2026 00:00:00 +0000

Drop-in OpenAI-compatible inference on OpenShift — RHAIIS packages vLLM for production, with hardware flexibility and a secure external endpoint out of the box - AI generated

Introduction

In this post, I want to describe how to deploy the Red Hat AI Inference Server (RHAIIS) on OpenShift and expose it as an OpenAI-compatible API endpoint. This post builds on Deploying OpenShift on AWS with Automated Cluster Provisioning, which covers getting a working OpenShift cluster into place. If you already have a cluster running, you can skip directly to the deployment steps.

The inference server will load a model from Hugging Face Hub and expose a /v1/chat/completions endpoint that any OpenAI-compatible client can talk to. At the end, I show how to connect the endpoint to the Open WebUI setup described in My Local AI Stack.

What is Red Hat AI Inference Server

vLLM is an open-source inference engine designed for high-throughput LLM serving. It handles memory-efficient attention via PagedAttention, continuous batching, and GPU-optimized execution, and it exposes an OpenAI-compatible HTTP API out of the box. I covered how to run vLLM on the GPU cloud provider RunPod in a previous post.

The Red Hat AI Inference Server is the supported, enterprise-packaged distribution of vLLM. Red Hat provides a hardened container image distributed through registry.redhat.io, tested against specific GPU driver and CUDA versions and with a defined support lifecycle. The API surface is identical to upstream vLLM. Any client that works against a plain vLLM inference server works against RHAIIS without modification.

Deploying RHAIIS directly on OpenShift is one way to reach a running inference endpoint through Red Hat technology. Red Hat OpenShift AI offers other paths, e.g. model serving through KServe, where OpenShift AI manages the deployment lifecycle via a web dashboard and exposes RHAIIS through a ServingRuntime, or a Model as a Service approach that provisions shared inference endpoints across a cluster, so teams can consume models without operating their own deployment. The approach in this post is the most direct option, suited for cases where you want a single inference endpoint.

Prerequisites

This setup requires the following:

A running OpenShift cluster with at least one GPU-enabled worker node. The post Deploying OpenShift on AWS covers one way to get there.
Node Feature Discovery (NFD) Operator installed and running to detect GPU hardware on the node.
NVIDIA GPU Operator installed to provide the CUDA runtime and device plugin.
OpenShift CLI (oc) installed and logged in to the cluster.
A Hugging Face access token if you intend to use a gated model. Publicly available models like Granite do not require one.

Deploying the Red Hat AI Inference Server

The deployment consists of a namespace, two secrets, a PersistentVolumeClaim for model caching, a Deployment, a Service, and a Route. All deployment files are available in the smichard/agent_on_ocp GitHub repository. The steps below apply them in sequence.

Clone the repository:

git clone https://github.com/smichard/agent_on_ocp.git
cd rhaiis

Create a Namespace

oc new-project rhaiis

Create the required Secrets

Hugging Face access token:

oc create secret generic hf-secret \
 --from-literal=HF_TOKEN=<your_huggingface_token> \
 -n rhaiis

API key for the inference endpoint:

The server requires clients to present an API key as a bearer token. Storing it as a secret keeps it out of the Deployment spec.

oc create secret generic vllm-api-key-secret \
 --from-literal=VLLM_API_KEY=$(openssl rand -hex 32) \
 -n rhaiis

Create the ConfigMap

Set the Hugging Face model ID you want to serve. Research which model fits your use case before settling on one. The only hard requirement is that the model is supported by the vLLM inference server. The ConfigMap also carries the tool call parser name, which the deployment references to set the correct parsing mode for the chosen model.

apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-config
  namespace: rhaiis
data:
  MODEL_NAME: "Qwen/Qwen3-Coder-30B-A3B-Instruct"
  TOOL_CALL_PARSER: "qwen3_coder"

Apply the file to create the ConfigMap:

oc apply -f configmap.yaml

Create a PersistentVolumeClaim

The model weights are downloaded once on first startup and cached on a persistent volume. This avoids re-downloading the model on every pod restart.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: rhaiis
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 150Gi

Apply the file to create the PVC:

oc apply -f pvc.yaml

Deploy the Inference Server

The Deployment below references the RHAIIS container image and pulls the model ID from the ConfigMap created earlier. To serve a different model, update the ConfigMap rather than editing the Deployment spec. The HF_TOKEN and VLLM_API_KEY values are injected from the secrets created in the Secrets step.

Note

Depending on the model size, the number of GPUs and the CPU and memory allocations will need to be adjusted. The example below was tested on an AWS g5.12xlarge node (4x NVIDIA A10G, 24 GB VRAM per GPU) and uses all four GPUs via tensor parallelism.
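A back-of-the-envelope calculation helps when adapting these numbers to another model or GPU type. The sketch below assumes roughly 30 billion parameters stored in BF16 (2 bytes each). Both figures are my assumptions for illustration, and the function name is made up. The GPU count, the memory per GPU, and the 0.85 utilization setting are the values used in this post.

```shell
# Estimate per-GPU memory under tensor parallelism.
# Arguments: parameters (billions), bytes per parameter,
#            tensor-parallel size, VRAM per GPU (GB), gpu-memory-utilization.
gpu_budget() {
  awk -v p="$1" -v b="$2" -v tp="$3" -v v="$4" -v u="$5" 'BEGIN {
    weights = p * b / tp   # GB of model weights held by each GPU
    budget  = v * u        # GB vLLM is allowed to claim on each GPU
    printf "weights per GPU: %.1f GB\n", weights
    printf "vLLM budget per GPU: %.1f GB\n", budget
    printf "left for KV cache and activations: %.1f GB\n", budget - weights
  }'
}

# Assumed: 30B parameters in BF16. From the post: 4 GPUs, 24 GB each, 0.85.
gpu_budget 30 2 4 24 0.85
```

Under these assumptions each GPU holds about 15 GB of weights within a budget of about 20 GB, which leaves only a few gigabytes per GPU for the KV cache. A larger model or a longer context window would need more or bigger GPUs.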

apiVersion: apps/v1
kind: Deployment
metadata:
  name: rhaiis-vllm
  namespace: rhaiis
  labels:
    app: rhaiis-vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rhaiis-vllm
  template:
    metadata:
      labels:
        app: rhaiis-vllm
    spec:
      tolerations:
        - key: nvidia.com/gpu
          effect: NoSchedule
          operator: Exists
      serviceAccountName: default
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "16Gi"
      containers:
        - name: vllm
          image: registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.1-1775680192
          imagePullPolicy: Always
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: HF_TOKEN
            - name: VLLM_API_KEY
              valueFrom:
                secretKeyRef:
                  name: vllm-api-key-secret
                  key: VLLM_API_KEY
            - name: MODEL_NAME
              valueFrom:
                configMapKeyRef:
                  name: vllm-config
                  key: MODEL_NAME
            - name: HF_HOME
              value: /cache
            - name: HF_HUB_OFFLINE
              value: '0'
            - name: VLLM_ALLOW_LONG_MAX_MODEL_LEN
              value: '1'
            - name: TOOL_CALL_PARSER
              valueFrom:
                configMapKeyRef:
                  name: vllm-config
                  key: TOOL_CALL_PARSER
          command:
            - python
            - '-m'
            - vllm.entrypoints.openai.api_server
          args:
            - '--port=8000'
            - '--model=$(MODEL_NAME)'
            - '--served-model-name=$(MODEL_NAME)'
            - '--tensor-parallel-size=4'
            - '--gpu-memory-utilization=0.85'
            - '--max-model-len=65536'
            - '--enable-auto-tool-choice'
            - '--tool-call-parser=$(TOOL_CALL_PARSER)'
          resources:
            limits:
              cpu: '10'
              nvidia.com/gpu: '4'
              memory: 128Gi
            requests:
              cpu: '2'
              memory: 32Gi
              nvidia.com/gpu: '4'
          volumeMounts:
            - name: model-cache
              mountPath: /cache
            - name: shm
              mountPath: /dev/shm
      restartPolicy: Always

Apply the file to create the deployment:

oc apply -f deployment.yaml

The container reads the model ID from the ConfigMap at startup and downloads the model from Hugging Face into /cache (backed by the PVC). Initial startup takes several minutes depending on model size and network speed. Follow the progress with:

oc logs -f deployment/rhaiis-vllm -n rhaiis

The server is ready when the log shows Application startup complete.

vLLM server log output on startup, showing all registered API routes and the final Application startup complete confirmation

Once the pod is running, you can verify GPU access from the pod terminal with nvidia-smi. All four GPUs should be visible, each running a tensor-parallel worker process.

nvidia-smi output from inside the vLLM pod, confirming all four A10G GPUs are visible and each tensor-parallel worker has allocated approximately 20 GB of VRAM

Create a Service and Route

Create a Service that maps port 8000 to port 8000 on the pod:

apiVersion: v1
kind: Service
metadata:
  name: rhaiis-vllm
  namespace: rhaiis
  labels:
    app: rhaiis-vllm
spec:
  selector:
    app: rhaiis-vllm
  ports:
    - name: http
      protocol: TCP
      port: 8000
      targetPort: 8000

Create a TLS-terminated Route if you want to expose the endpoint outside the cluster:

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: rhaiis-vllm
  namespace: rhaiis
  labels:
    app: rhaiis-vllm
spec:
  to:
    kind: Service
    name: rhaiis-vllm
  port:
    targetPort: http
  tls:
    termination: edge
    insecureEdgeTerminationPolicy: Redirect

Apply both and retrieve the assigned hostname:

oc apply -f service.yaml
oc apply -f route.yaml
oc get route rhaiis-vllm -n rhaiis -o jsonpath='{.spec.host}'

OpenShift builds the hostname from the route and namespace names following the pattern <route-name>-<namespace>.apps.<cluster-domain>. The result looks something like rhaiis-vllm-rhaiis.apps.ocp.example.com.
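The naming pattern can be written down as a one-line helper. This is only a sketch of the default behavior when no explicit host is set on the Route. The function name and the example cluster domain are placeholders.

```shell
# Compose the default Route hostname.
# Arguments: route name, namespace, cluster domain.
route_host() {
  printf '%s-%s.apps.%s\n' "$1" "$2" "$3"
}

# Example cluster domain only; use the one of your cluster.
route_host rhaiis-vllm rhaiis ocp.example.com
```

The authoritative value is always the one returned by oc get route, since a Route can also carry a custom hostname.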

Testing the Endpoint

Store the hostname, API key, and model name in shell variables to keep the commands readable:

export RHAIIS_HOST=$(oc get route rhaiis-vllm -n rhaiis -o jsonpath='{.spec.host}')
export RHAIIS_API_KEY=$(oc get secret vllm-api-key-secret -n rhaiis \
 -o jsonpath='{.data.VLLM_API_KEY}' | base64 -d)
export MODEL=$(oc get configmap vllm-config -n rhaiis \
 -o jsonpath='{.data.MODEL_NAME}')

Verify all three are populated before proceeding:

echo "RHAIIS_HOST : ${RHAIIS_HOST}"
echo "RHAIIS_API_KEY : ${RHAIIS_API_KEY}"
echo "Model: ${MODEL}"

List available models:

curl -s https://$RHAIIS_HOST/v1/models \
 -H "Authorization: Bearer $RHAIIS_API_KEY" | jq .

Send a chat completion request:

curl -sS \
 "https://${RHAIIS_HOST}/v1/chat/completions" \
 -H "Authorization: Bearer ${RHAIIS_API_KEY}" \
 -H "Content-Type: application/json" \
 -d '{
 "model": "'"${MODEL}"'",
 "messages": [{"role": "user", "content": "What is OpenShift?"}],
 "temperature": 0.1,
 "max_tokens": 200
 }' | jq -r '.choices[0].message.content'

A successful response confirms the server is running, the model is loaded, and the API key authentication is working.
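The quote splicing around ${MODEL} in the request body works, but it gets fragile once the prompt itself contains quotes. As a sketch, the payload can be assembled by a small function instead. The helper names are made up, and the escaping only handles backslashes and double quotes, not newlines or other control characters.

```shell
# Escape backslashes and double quotes for use inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build a minimal chat completion payload.
# Arguments: model name, user message.
chat_payload() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

chat_payload "Qwen/Qwen3-Coder-30B-A3B-Instruct" 'What is "OpenShift"?'
```

The output can be passed to curl with -d "$(chat_payload "${MODEL}" "your question")". For anything beyond simple prompts, building the body with jq is the more robust choice.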

Connecting to Open WebUI

The inference server exposes a standard OpenAI-compatible API, which means Open WebUI can connect to it directly as an external provider. The setup in My Local AI Stack already runs Open WebUI. Adding the RHAIIS endpoint as a direct external connection requires no changes to the existing stack.

In Open WebUI, go to Settings > Connections and add a new external connection. Set the URL to the route hostname with the /v1 suffix, add the API key from the Secrets step as a bearer token, set the provider type to OpenAI, and the API type to Chat Completions. Leave the model ID field empty so Open WebUI queries the /v1/models endpoint and discovers available models automatically.

Open WebUI external connection configured against the Red Hat AI Inference Server endpoint

Once saved, the deployed model appears in the model selector alongside any other configured providers.

Conclusion

The Red Hat AI Inference Server puts the vLLM engine into OpenShift, or any other supported platform, with a supported container image and a deployment pattern that fits standard Kubernetes workflows. The outcome is an OpenAI-compatible endpoint running on your own cluster, backed by a model from Hugging Face Hub, secured with an API key, and accessible over a TLS-terminated OpenShift Route. Any client that speaks the OpenAI Chat Completions format can talk to it, including Open WebUI, which connects to it the same way it connects to any other provider.

References

GitHub repository with deployment files - link
Deploying OpenShift on AWS with Automated Cluster Provisioning - link
My Local AI Stack: Open WebUI, LiteLLM, SearXNG, and Docling - link
Extending the Local AI Stack with On-Demand GPU Inference on RunPod - link
Model as a Service GitHub repository - link
Node Feature Discovery Operator - link
NVIDIA GPU Operator - link
OpenShift CLI (oc) - link
Granite family of models on Hugging Face - link
smichard/agent_on_ocp - GitHub repository - link
Red Hat AI Inference Server - Documentation - link
Deploying Red Hat AI Inference Server on OpenShift - link
vLLM - upstream project - link
vLLM - OpenAI-compatible server documentation - link
Open WebUI - project site - link

Extending the Local AI Stack with On-Demand GPU Inference on RunPod

Sat, 07 Mar 2026 00:00:00 +0000

Conceptual illustration of the extended AI stack with elastic cloud GPU resources for running large language models on demand - AI generated

Introduction

In this post, I want to describe how I extended the local AI stack I built in my homelab with on-demand GPU-backed model inference, without adding any GPU hardware to the lab itself.

The two previous posts in this series provide the context for what follows. The homelab post covers the base infrastructure: thin clients, Docker Compose, Traefik, and internal DNS. The local AI stack post describes how Open WebUI, LiteLLM, SearXNG, and Docling sit on top of that infrastructure to form a self-hosted AI environment. That stack works well, and I have been using it for a while. Keeping the lab CPU-only is a deliberate choice. For orchestration, document workflows, and routing requests to publicly available AI services, dedicated GPU hardware at home is simply not necessary. When I want to try a particular model that is not available through a managed API, or experiment with something freshly released on Hugging Face, I rent the compute on demand rather than maintain it permanently.

The solution is straightforward: rent GPU capacity on demand from a specialized cloud provider, expose it as an OpenAI-compatible endpoint, and wire it into the existing stack. No new hardware, no permanent cost, no changes to the tools I already use.

A Note on Neo Clouds

The providers that specialize in this type of GPU-first infrastructure are sometimes called Neo Clouds. The term emerged around 2024 to distinguish GPU-specialist vendors such as RunPod, CoreWeave and others from traditional hyperscalers. In practice, I am not sure the new term adds much. For me these are specialized cloud providers focused on GPU compute and AI workloads. Useful services, somewhat unnecessary branding.

Why RunPod

I use RunPod for this setup for a few practical reasons. The interface is intuitive, the deployment path from template to running pod is short, and the GPU catalog is broad enough to cover most use cases. Pricing is per second with no ingress or egress fees, which makes on-demand experimentation economical. RunPod also exposes an API for its core operations, so deployments can be automated rather than driven entirely through the UI.

A detailed description of all RunPod services is out of scope for this post. The focus here is on one specific workflow: deploying a vLLM inference server with a model loaded from Hugging Face, and connecting the resulting endpoint to Open WebUI.

Deploying a vLLM Inference Server on RunPod

RunPod uses templates to save pod configurations for reuse. A template defines the container image, the start command, the storage allocation, and other runtime parameters. I maintain a small collection of private templates, each configured for a different model.

A selection of saved vLLM templates on RunPod, each pointing to a different model from Hugging Face

The container image for all of these templates is vllm/vllm-openai:latest, which bundles vLLM with an OpenAI-compatible API server. The model itself is specified in the container start command, which means swapping models is a matter of editing a single line.

Creating a Template

When creating or editing a template, the key fields are:

Type: Pod
Compute type: Nvidia GPU
Container image: vllm/vllm-openai:latest
Container start command: the vLLM arguments, including the model reference

Template configuration for the vllm_gemma-3-12b template, showing the container image and start command

Throughout the following steps, any value written in <angle brackets> is a placeholder and must be replaced with your actual value before running the command.

A start command for deploying Red Hat's validated RedHatAI/Qwen3-8B-FP8-dynamic model looks like this:

--host 0.0.0.0 --port 8000 \
 --model RedHatAI/Qwen3-8B-FP8-dynamic \
 --dtype bfloat16 \
 --enforce-eager \
 --gpu-memory-utilization 0.95 \
 --api-key <api_key> \
 --max-model-len 8128

The parameters worth noting:

--model: any model available on Hugging Face can be referenced here by its repository path
--dtype bfloat16: sets the compute dtype; bfloat16 is a good default for inference on NVIDIA hardware
--enforce-eager: disables CUDA graph capture, which reduces memory overhead at the cost of some throughput; useful when fitting larger models on a single GPU
--gpu-memory-utilization 0.95: allows vLLM to claim up to 95% of GPU memory for model weights, activations, and the KV cache
--api-key: sets a bearer token for the OpenAI-compatible endpoint; always set this when deploying a public endpoint
--max-model-len: caps the maximum sequence length; reducing this frees memory and allows larger models to fit on smaller GPUs
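To get a feeling for how --max-model-len interacts with GPU memory, the KV cache needed for one full-length sequence can be estimated. The architecture values below (36 layers, 8 KV heads, head dimension 128, 2 bytes per cached value) are my assumptions for an 8B-class model and should be checked against the config.json of the model you deploy. The function name is made up as well.

```shell
# Estimate the KV cache size in MiB for one sequence of max_len tokens.
# Arguments: layers, KV heads, head dimension, bytes per value, max_len.
kv_cache_mib() {
  # Factor 2: one key and one value vector per layer and KV head.
  per_token=$((2 * $1 * $2 * $3 * $4))
  echo $((per_token * $5 / 1024 / 1024))
}

# Assumed architecture values; 8128 is the --max-model-len used above.
echo "KV cache for one full sequence: $(kv_cache_mib 36 8 128 2 8128) MiB"
```

Under these assumptions a single full-length sequence needs a bit more than 1 GiB, and halving --max-model-len halves that figure, which is why lowering it helps to fit a model on a smaller GPU.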

Selecting a GPU and Deploying

Once the template is configured, deploying it requires selecting a GPU and clicking deploy. RunPod shows available hardware with current pricing.

GPU selection on RunPod, ranging from RTX 2000 Ada class cards to H200 and B200 datacenter accelerators

For most inference workloads with 8 to 12 billion parameter models, an RTX 4090 or L4 is a practical and cost-effective choice. Larger models with higher memory requirements will need 48 GB or 80 GB class cards. The per-hour pricing shown in the interface makes it easy to estimate cost for a session before committing.

After deployment, RunPod assigns a public HTTPS endpoint to the pod. The vLLM server is reachable at that endpoint on port 8000, with the path structure matching the OpenAI API.

Connecting the Endpoint to Open WebUI

With the pod running and the model loaded, the endpoint can be added to Open WebUI as an external connection. In Open WebUI, navigate to Admin Panel then Settings and add a new connection with the following values:

Connection type: External
URL: https://<runpod_endpoint>/v1
Auth: API key set in the vLLM start command
Provider type: OpenAI
API type: Chat Completions

Adding the RunPod vLLM endpoint as an external OpenAI-compatible connection in Open WebUI

Once saved, the model served by vLLM on RunPod appears in the model selector alongside any other configured backends. From a user perspective, the interface is identical to any other configured model, whether local or a commercial API.

Alternatively, the endpoint can be added to LiteLLM as a named model alias. This is the better option if you want centralized credential management or want to expose the RunPod model alongside other backends under a consistent naming scheme across the stack.

Why This Setup Works Well

The combination of a self-hosted orchestration stack and on-demand GPU inference fits well with a homelab where tooling and workflows are in place but on-premises compute is intentionally kept lean.

A few things make this pattern practical:

Low cost for experimentation. Models run only when needed. A session of an hour or two to test a new model costs a few dollars at most.
Access to current models. Most models published on Hugging Face can be loaded into vLLM, so a newly released model can be tested without waiting for it to appear in a managed API.
No changes to the existing stack. Open WebUI, LiteLLM, SearXNG, and Docling continue to work exactly as before. The RunPod endpoint is just another backend.
Automatable. RunPod exposes an API for managing pods, so deployments can be triggered programmatically. Combined with LiteLLM’s routing, it becomes possible to bring a model endpoint up on demand and tear it down again when it is no longer needed.

Conclusion

Adding RunPod as an on-demand GPU backend closes the main gap in a CPU-only homelab AI stack. The setup requires no changes to the existing infrastructure and takes only a few minutes from template to running endpoint. The result is the ability to experiment with current, capable models at low cost, using the same interface and workflows already in place.

For on-demand model access that does not warrant the cost of persistent GPU hardware, this pattern is worth considering.

References

My Homelab: A Traefik-centered Self-hosting Setup - link
My Local AI Stack: Open WebUI, LiteLLM, SearXNG, and Docling - link
RunPod - project site - link
RunPod - documentation - link
vLLM - project site - link
Hugging Face - model hub - link
RedHatAI models on Hugging Face - link
