Running 34 LLMs on Consumer Hardware: A Practical Guide

How I run 34 large language models locally using Ollama on consumer GPUs, with practical tips on model selection, performance, and integration.

Why Run LLMs Locally?

Cloud AI APIs are powerful, but there are compelling reasons to run models locally:

  • Privacy: Sensitive data never leaves your network
  • Cost: No per-token charges for experimentation
  • Latency: Local inference can be faster than API round-trips for smaller models
  • Learning: Understanding model behavior at the hardware level
  • Availability: No dependency on external services or rate limits

The Hardware

My inference workstation pairs two NVIDIA GPUs:

GPU        VRAM    Role
RTX 5090   32 GB   Primary inference — runs large models (70B quantized, 32B full)
RTX 3050   8 GB    Secondary — embedding models, small assistants

Ollama handles model management and serving. It automatically distributes layers across available GPUs and spills to CPU/RAM when needed.
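
To check how a given model actually got split, Ollama's /api/ps endpoint reports how much of each loaded model sits in VRAM. A minimal sketch, assuming the default port and the requests library:

# Sketch: show what fraction of each loaded model landed in VRAM vs. system RAM.
import requests

resp = requests.get("http://localhost:11434/api/ps", timeout=5)
for m in resp.json().get("models", []):
    total = m["size"]
    in_vram = m.get("size_vram", 0)
    pct = 100 * in_vram / total if total else 0
    print(f"{m['name']}: {pct:.0f}% of {total / 1e9:.1f} GB in VRAM")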

Model Selection Strategy

Not every task needs a 70B model. I've organized my 34 models by use case:

# List loaded models with sizes
ollama list

# Example output (abbreviated)
NAME                    SIZE      QUANTIZATION
llama3.3:70b-instruct   39 GB     Q4_K_M
qwen2.5:32b             18 GB     Q4_K_M
deepseek-coder-v2:16b   8.9 GB    Q4_K_M
nomic-embed-text        274 MB    F16

My model tiers:

  • Heavy reasoning (70B class): Complex analysis, long-form writing, multi-step planning
  • General purpose (32B class): Day-to-day coding assistance, summarization, Q&A
  • Code-specific (16B class): Fast code completion, refactoring, test generation
  • Embeddings (small): Vector representations for RAG pipelines and semantic search
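
A small lookup table is enough to keep that choice explicit in code. This is only an illustrative sketch: the tier names and the pick_model helper are placeholders of mine, but the model tags are the ones listed above.

# Illustrative sketch: map a task tier to one of the models listed above.
# The tier names and this helper are hypothetical, not part of Ollama.
MODEL_TIERS = {
    "reasoning": "llama3.3:70b-instruct",   # complex analysis, planning
    "general":   "qwen2.5:32b",             # day-to-day coding help, Q&A
    "code":      "deepseek-coder-v2:16b",   # completion, refactoring, tests
    "embed":     "nomic-embed-text",        # vectors for RAG / semantic search
}

def pick_model(tier: str) -> str:
    """Return the model tag for a tier, falling back to the general model."""
    return MODEL_TIERS.get(tier, MODEL_TIERS["general"])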

Performance Benchmarks

Real inference speeds from my hardware (tokens per second, measured with ollama run):

Model                   Size      Tokens/sec   Notes
llama3.3:70b Q4_K_M     39 GB     ~18 t/s      Larger than the 5090's 32 GB, so some layers spill to CPU/RAM
qwen2.5:32b Q4_K_M      18 GB     ~35 t/s      Excellent quality-to-speed ratio
deepseek-coder-v2:16b   8.9 GB    ~55 t/s      Best coding model for the size
phi3:3.8b               2.3 GB    ~120 t/s     Fast for simple tasks
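
These figures come from watching ollama run output. If you want to measure throughput programmatically, the non-streaming /api/generate response includes eval_count (tokens generated) and eval_duration (nanoseconds); a rough sketch:

# Sketch: compute tokens/sec from the generate API's timing fields.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen2.5:32b", "prompt": "Explain Kubernetes pod scheduling", "stream": False},
    timeout=300,
)
data = resp.json()
tps = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"{tps:.1f} tokens/sec")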

Serving Architecture

Ollama runs as a system service on the workstation and serves the entire network:

# Ollama serves on all interfaces
OLLAMA_HOST=0.0.0.0:11434

# Any machine on the network can query models
curl http://workstation:11434/api/generate -d '{
  "model": "qwen2.5:32b",
  "prompt": "Explain Kubernetes pod scheduling",
  "stream": false
}'
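
The same endpoint can also be consumed token by token from Python with streaming enabled; a minimal sketch using the requests library, with the hostname from the curl example above:

# Sketch: stream a response from the networked Ollama instance.
# Each streamed line is a JSON object carrying a "response" fragment.
import json
import requests

with requests.post(
    "http://workstation:11434/api/generate",
    json={"model": "qwen2.5:32b", "prompt": "Explain Kubernetes pod scheduling", "stream": True},
    stream=True,
    timeout=300,
) as resp:
    for line in resp.iter_lines():
        if line:
            chunk = json.loads(line)
            print(chunk.get("response", ""), end="", flush=True)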

Open WebUI provides a ChatGPT-like interface that connects to Ollama. It runs as a Docker container on my server and is accessible from any device on the network.

RAG Pipeline Integration

Raw LLMs are useful, but grounded LLMs are powerful. I've built RAG (Retrieval Augmented Generation) pipelines that:

  1. Ingest documents — PDFs, markdown, code files chunked and embedded
  2. Store vectors — Using PostgreSQL with pgvector (SQL Server 2025 native vectors for newer projects)
  3. Retrieve context — Semantic search finds relevant chunks at query time
  4. Generate answers — Local LLM synthesizes responses grounded in your actual data

# Simplified RAG query flow
from pydantic_ai import Agent

agent = Agent(
    model="ollama:qwen2.5:32b",
    system_prompt="Answer based on the provided context.",
)

# Retrieved context (retrieved_chunks) and the user's question (user_query)
# are injected into the prompt
result = agent.run_sync(
    f"Context: {retrieved_chunks}\n\nQuestion: {user_query}"
)
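
The snippet above assumes retrieved_chunks has already been produced by steps 1–3. Here is a simplified sketch of that retrieval side, embedding the query with nomic-embed-text and searching a pgvector table; the table name, columns, and connection string are placeholders rather than my actual schema:

# Simplified sketch of the retrieval side: embed the query with Ollama,
# then pull the nearest chunks from a pgvector table.
# Table/column names and the connection string are placeholders.
import psycopg2
import requests

def embed(text: str) -> list[float]:
    # /api/embeddings returns {"embedding": [...]} for the given model/prompt
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60,
    )
    return resp.json()["embedding"]

def retrieve(user_query: str, k: int = 5) -> list[str]:
    vec = embed(user_query)
    vec_literal = "[" + ",".join(str(x) for x in vec) + "]"  # pgvector input format
    with psycopg2.connect("dbname=rag user=rag host=localhost") as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
                (vec_literal, k),
            )
            return [row[0] for row in cur.fetchall()]

retrieved_chunks = "\n\n".join(retrieve("How does pod scheduling work?"))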

Lessons Learned

  1. VRAM is king. More VRAM means larger models, which means better output quality. Budget for GPU memory above all else.
  2. Quantization is free performance. Q4_K_M quantization loses negligible quality while cutting memory requirements to roughly a third of FP16. Always quantize unless you need research-grade precision.
  3. Context length matters. A model that supports 128K context but runs at 2 t/s isn't practical. Find the sweet spot for your use case.
  4. Keep models warm. Loading a 39 GB model takes 10-15 seconds. Keep frequently-used models loaded in memory with Ollama's keep_alive setting (the OLLAMA_KEEP_ALIVE environment variable or the per-request keep_alive parameter); see the sketch after this list.
  5. Monitor GPU thermals. Sustained inference pushes GPUs hard. I monitor temps via Prometheus and throttle workloads above 85°C.
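
For point 4, the keep-alive behavior can be set per request; a minimal sketch that warms the 70B model and asks Ollama to keep it resident (the 24h duration is just an example value):

# Sketch: ask Ollama to keep the model loaded after this request.
# keep_alive accepts a duration ("30m", "24h") or -1 to keep it loaded indefinitely.
import requests

requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.3:70b-instruct", "prompt": "warm-up", "keep_alive": "24h", "stream": False},
    timeout=300,
)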

Running LLMs locally isn't just a hobby — it's become a core part of how I build and test AI applications before deploying them to Azure. The feedback loop from local experimentation to production deployment is invaluable.
