Building a Production Proxmox Cluster with Ceph Storage

How I built a 3-node Proxmox cluster with Ceph distributed storage, providing high availability and software-defined storage for my home lab.

Why Proxmox + Ceph?

When your home lab grows beyond a single server, you need two things: high availability (so a node failure doesn't take everything down) and shared storage (so VMs can migrate between nodes). Proxmox VE with integrated Ceph provides both in a single, free, open-source platform.

The Cluster

Three nodes of varying capability, because home labs run on whatever hardware you can get your hands on:

Node     CPU                              RAM      Ceph OSDs      Role
Node 1   Dual Xeon E5-2680 v4 (28C/56T)   220 GB   8 SSD + 6 HDD  Primary workloads
Node 2   Xeon X3470 (4C/8T)               32 GB    2 SSD + 4 HDD  Lightweight VMs, Ceph
Node 3   i7-4810MQ (4C/8T)                32 GB    1 HDD          GPU passthrough, Ceph

Total raw Ceph storage: ~22 TiB across 21 OSDs. The cluster has been running since early 2025 with zero unplanned downtime.
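Two standard Ceph commands confirm those totals from any node:

# Verify raw capacity and OSD count cluster-wide
ceph df        # the RAW STORAGE section shows total, used, and available
ceph osd stat  # one-line OSD summary: how many OSDs are up and in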

Network Architecture

Ceph demands a dedicated network for replication traffic. Mixing client and replication traffic on the same link invites latency spikes, and a saturated link can delay OSD heartbeats badly enough that healthy OSDs get marked down.

Management/Client Network: x.x.x.0/24  (all nodes)
Ceph Cluster Network:      x.x.x.0/24  (dedicated replication)

Each Proxmox node has two network interfaces — one for client traffic (VM access, API, web UI) and one exclusively for Ceph replication. This separation ensures that a large Ceph recovery operation won't impact VM network performance.
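In Ceph's terms, that split maps onto the public_network and cluster_network options. A minimal sketch of the relevant section of /etc/pve/ceph.conf, with documentation subnets standing in for my real ones:

# /etc/pve/ceph.conf (excerpt; subnets are placeholders)
[global]
    public_network  = 192.0.2.0/24       # client/management traffic
    cluster_network = 198.51.100.0/24    # OSD replication and heartbeats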

Ceph Storage Design

The storage architecture uses tiered pools to match workload requirements:

# Pool layout (simplified)
# SSD Pools (fast, for active VM disks)
rbd_ssd         - replicated, size 2  # All active VM disks live here
cephfs_ssd      - replicated, size 2  # SSD-backed CephFS for fast file access

# HDD Pools (capacity, for bulk data)
cephfs_data     - replicated, size 2  # Large file storage, media
ceph-rbd        - replicated, size 3  # Available for bulk block storage

Why different replication factors? Only two nodes carry SSDs, so SSD pools can replicate at most twice. HDD pools span all three nodes, so the pools holding critical data, like ceph-rbd and the CephFS metadata (not shown in the simplified layout), run at size 3 for maximum durability.
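For reference, pools pinned to a device class like this are created with class-aware CRUSH rules. A sketch using the pool names above (rule names and PG counts are illustrative, not my exact values):

# One CRUSH rule per device class, then pools bound to each rule
ceph osd crush rule create-replicated replicated_ssd default host ssd
ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd pool create rbd_ssd 64 64 replicated replicated_ssd
ceph osd pool set rbd_ssd size 2     # only two nodes carry SSDs
ceph osd pool create ceph-rbd 64 64 replicated replicated_hdd
ceph osd pool set ceph-rbd size 3    # HDDs span all three nodes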

VM Storage Strategy

All active VM disks run on the SSD RBD pool. This gives consistent IOPS for database workloads and container orchestration. The HDD CephFS pool handles bulk data — media libraries, backups, shared development resources.

# Example: Create a VM with SSD-backed storage
qm create 110 \
  --name "new-workload" \
  --memory 8192 \
  --cores 4 \
  --scsihw virtio-scsi-single \
  --scsi0 rbd_ssd:vm-110-disk-0,size=50G,iothread=1 \
  --net0 virtio,bridge=vmbr0

Live Migration

The killer feature of Ceph-backed Proxmox is seamless live migration. Because all nodes access the same Ceph storage, VMs can migrate between nodes with zero downtime:

# Migrate VM 100 from node1 to node2 with zero downtime
qm migrate 100 node2 --online

I use this for rolling maintenance: update one node at a time, migrating VMs off before rebooting. Maintenance has never taken the cluster offline.
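The drain step amounts to a loop over running VMs. A minimal sketch, assuming every disk lives on shared Ceph pools (the target node is hard-coded here for brevity):

# Live-migrate every running VM off this node before a reboot
for vmid in $(qm list | awk '$3 == "running" {print $1}'); do
    qm migrate "$vmid" node2 --online
done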

Backup Strategy

Proxmox Backup Server (PBS) runs as an LXC container with a CephFS-backed datastore (~4.3 TB). Daily vzdump jobs at 2 AM back up all VMs and containers with deduplication.

# Backup job, as stored in /etc/pve/jobs.cfg (storage name illustrative)
# Daily at 2 AM, all VMs and containers
# Retention: 7 daily, 4 weekly, 3 monthly
vzdump: daily-backup
        schedule *-*-* 02:00:00
        all 1
        mode snapshot
        compress zstd
        storage pbs-datastore
        prune-backups keep-daily=7,keep-weekly=4,keep-monthly=3

Maintenance Automation

Cron jobs run health checks and maintenance tasks across the cluster; a sketch of the crontab follows the list:

  • Daily at 6 AM: Cluster health check — reports OSD status, PG states, disk usage
  • Sunday at 3 AM: Maintenance script — clears old kernels, rotates logs, checks for updates
  • Daily at 3 AM: CephFS snapshots for point-in-time recovery
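The matching crontab entries look roughly like this. The script paths are placeholders, and the snapshot line uses CephFS's built-in .snap directory (note that cron requires % to be escaped):

# Times match the schedule above; script names are illustrative
0 6 * * *   /usr/local/sbin/ceph-health-report.sh
0 3 * * 0   /usr/local/sbin/node-maintenance.sh
0 3 * * *   mkdir /mnt/pve/cephfs/.snap/daily-$(date +\%F)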

Lessons Learned

  1. Start with 3 nodes minimum. Ceph needs quorum. Two nodes can't form consensus during a network partition.
  2. Dedicated Ceph network is mandatory. Not optional. Not "nice to have." Mandatory. Replication traffic will destroy your VM network otherwise.
  3. Match OSD sizes within pools. Mismatched OSD sizes cause uneven data distribution: small OSDs fill up while large ones sit half empty, and a full OSD can block recovery, leaving PGs degraded or undersized.
  4. Monitor Ceph like production. HEALTH_WARN is a warning, not an FYI. Fix it before it becomes HEALTH_ERR (the triage commands I start with are sketched after this list).
  5. Test recovery regularly. Pull a node's power cable once a quarter. If your cluster can't survive that, it's not HA — it's pretending to be.
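For point 4, these are the standard first-look commands when Ceph raises a warning:

# Triage a HEALTH_WARN before it becomes a HEALTH_ERR
ceph -s              # cluster summary: monitors, OSDs, PG states
ceph health detail   # expands each warning with the affected OSDs or PGs
ceph osd tree        # per-host view of which OSDs are down or out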

The Proxmox + Ceph combination gives enterprise-grade infrastructure at home lab prices. It's the foundation everything else in my lab runs on, and I wouldn't trade it for anything.
