# Running 34 LLMs on Consumer Hardware: A Practical Guide

*How I run 34 large language models locally using Ollama on consumer GPUs, with practical tips on model selection, performance, and integration.*
## Why Run LLMs Locally?
Cloud AI APIs are powerful, but there are compelling reasons to run models locally:
- **Privacy:** Sensitive data never leaves your network
- **Cost:** No per-token charges for experimentation
- **Latency:** Local inference can be faster than API round-trips for smaller models
- **Learning:** Understanding model behavior at the hardware level
- **Availability:** No dependency on external services or rate limits
## The Hardware
My inference workstation runs two NVIDIA GPUs in the same system:
| GPU | VRAM | Role |
|---|---|---|
| RTX 5090 | 32 GB | Primary inference — runs large models (70B quantized, 32B full) |
| RTX 3050 | 8 GB | Secondary — embedding models, small assistants |
Ollama handles model management and serving. It automatically distributes layers across available GPUs and spills to CPU/RAM when needed.
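Ollama decides layer placement itself, but a rough back-of-the-envelope sketch (my own illustration, not Ollama's actual algorithm) shows why a 39 GB model on a 32 GB card ends up with some layers spilled to CPU/RAM:

```python
def layers_on_gpu(model_size_gb: float, n_layers: int, vram_gb: float,
                  reserve_gb: float = 1.5) -> int:
    """Estimate how many transformer layers fit in VRAM; the rest spill to CPU/RAM.

    reserve_gb approximates the KV cache and CUDA overhead, and is a guess.
    """
    per_layer_gb = model_size_gb / n_layers
    usable = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable / per_layer_gb))

# A 39 GB, 80-layer quantized model on a 32 GB RTX 5090: most layers fit, not all
print(layers_on_gpu(39.0, 80, 32.0))

# A small model fits entirely on the 8 GB RTX 3050
print(layers_on_gpu(2.3, 32, 8.0))
```

The real split depends on quantization format, context length, and KV cache size, so treat this as intuition rather than a capacity planner.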
## Model Selection Strategy
Not every task needs a 70B model. I've organized my 34 models by use case:
```shell
# List installed models with sizes
ollama list

# Example output (abbreviated)
NAME                     SIZE      QUANTIZATION
llama3.3:70b-instruct    39 GB     Q4_K_M
qwen2.5:32b              18 GB     Q4_K_M
deepseek-coder-v2:16b    8.9 GB    Q4_K_M
nomic-embed-text         274 MB    F16
```
My model tiers:
- **Heavy reasoning (70B class):** Complex analysis, long-form writing, multi-step planning
- **General purpose (32B class):** Day-to-day coding assistance, summarization, Q&A
- **Code-specific (16B class):** Fast code completion, refactoring, test generation
- **Embeddings (small):** Vector representations for RAG pipelines and semantic search
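The tiering above can be captured in a few lines. This is a hypothetical router (the model names come from my library, but the task categories and the function itself are illustrative):

```python
# Illustrative task-to-tier mapping; categories are my own labels, not an Ollama feature
TIERS = {
    "reasoning": "llama3.3:70b-instruct",   # complex analysis, planning
    "general":   "qwen2.5:32b",             # coding help, summarization, Q&A
    "code":      "deepseek-coder-v2:16b",   # completion, refactoring, tests
    "embed":     "nomic-embed-text",        # RAG embeddings
}

def pick_model(task: str) -> str:
    # Unknown tasks fall back to the general-purpose tier
    return TIERS.get(task, TIERS["general"])

print(pick_model("code"))     # deepseek-coder-v2:16b
print(pick_model("poetry"))   # qwen2.5:32b
```

Routing by task keeps the big model free for the queries that actually need it.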
## Performance Benchmarks

Real inference speeds from my hardware (tokens per second, measured with `ollama run`):
| Model | Size | Tokens/sec | Notes |
|---|---|---|---|
| llama3.3:70b Q4_K_M | 39 GB | ~18 t/s | Fits entirely on RTX 5090 |
| qwen2.5:32b Q4_K_M | 18 GB | ~35 t/s | Excellent quality-to-speed ratio |
| deepseek-coder-v2:16b | 8.9 GB | ~55 t/s | Best coding model for the size |
| phi3:3.8b | 2.3 GB | ~120 t/s | Fast for simple tasks |
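Those decode rates translate directly into wall-clock latency. A quick sketch (it ignores prompt processing and model load time, so real responses take slightly longer):

```python
def generation_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock time to stream n_tokens at a steady decode rate."""
    return n_tokens / tokens_per_sec

# A 500-token answer at the measured rates from the table above
for name, tps in [("llama3.3:70b", 18), ("qwen2.5:32b", 35), ("phi3:3.8b", 120)]:
    print(f"{name}: {generation_seconds(500, tps):.1f}s")
```

This is why the 32B tier handles most day-to-day work: a 500-token reply lands in about 14 seconds instead of nearly half a minute.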
## Serving Architecture
Ollama runs as a system service on the workstation and serves the entire network:
```shell
# Ollama serves on all interfaces
OLLAMA_HOST=0.0.0.0:11434

# Any machine on the network can query models
curl http://workstation:11434/api/generate -d '{
  "model": "qwen2.5:32b",
  "prompt": "Explain Kubernetes pod scheduling",
  "stream": false
}'
```
Open WebUI provides a ChatGPT-like interface that connects to Ollama. It runs as a Docker container on my server and is accessible from any device on the network.
## RAG Pipeline Integration
Raw LLMs are useful, but grounded LLMs are powerful. I've built RAG (Retrieval Augmented Generation) pipelines that:
1. **Ingest documents:** PDFs, markdown, and code files are chunked and embedded
2. **Store vectors:** PostgreSQL with pgvector (SQL Server 2025 native vectors for newer projects)
3. **Retrieve context:** Semantic search finds relevant chunks at query time
4. **Generate answers:** A local LLM synthesizes responses grounded in your actual data
```python
# Simplified RAG query flow
from pydantic_ai import Agent

agent = Agent(
    model="ollama:qwen2.5:32b",
    system_prompt="Answer based on the provided context.",
)

# Retrieved context injected into the prompt
result = agent.run_sync(
    f"Context: {retrieved_chunks}\n\nQuestion: {user_query}"
)
```
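The retrieve step boils down to nearest-neighbor search over embeddings. Here is a minimal stand-in using plain cosine similarity — pgvector does the same ranking at scale with indexes, and the 3-dimensional vectors below are toys (real ones come from `nomic-embed-text`):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, chunks, k=2):
    """Return the k chunks whose embeddings are most similar to the query."""
    scored = sorted(zip(chunks, chunk_vecs),
                    key=lambda cv: cosine(query_vec, cv[1]),
                    reverse=True)
    return [chunk for chunk, _ in scored[:k]]

# Toy corpus: three chunks with hand-made embeddings
chunks = ["pods", "services", "volumes"]
vecs = [[1, 0, 0], [0, 1, 0], [0.9, 0.1, 0]]
print(top_k([1, 0, 0], vecs, chunks, k=2))  # ['pods', 'volumes']
```

The chunks returned here are what gets interpolated into the `Context:` portion of the prompt above.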
## Lessons Learned
- **VRAM is king.** More VRAM means larger models, which means better output quality. Budget for GPU memory above all else.
- **Quantization is free performance.** Q4_K_M quantization loses negligible quality while halving memory requirements. Always quantize unless you need research-grade precision.
- **Context length matters.** A model that supports 128K context but runs at 2 t/s isn't practical. Find the sweet spot for your use case.
- **Keep models warm.** Loading a 39 GB model takes 10-15 seconds. Keep frequently used models resident by setting the `OLLAMA_KEEP_ALIVE` environment variable or passing `keep_alive` in API requests.
- **Monitor GPU thermals.** Sustained inference pushes GPUs hard. I monitor temps via Prometheus and throttle workloads above 85°C.
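On keeping models warm: the `/api/generate` endpoint accepts a `keep_alive` field per request. This snippet only builds the request body, so it runs without a server (the 30-minute value is an example, not a recommendation):

```python
import json

# Request body asking Ollama to keep the model resident for 30 minutes after the
# call completes; keep_alive also accepts -1 to pin the model indefinitely
payload = {
    "model": "llama3.3:70b-instruct",
    "prompt": "warmup",
    "stream": False,
    "keep_alive": "30m",
}

print(json.dumps(payload, indent=2))
```

POSTing this body to `http://workstation:11434/api/generate` both warms the model and resets its eviction timer.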
Running LLMs locally isn't just a hobby — it's become a core part of how I build and test AI applications before deploying them to Azure. The feedback loop from local experimentation to production deployment is invaluable.