Monitoring 95+ Docker Containers: A Complete Observability Stack
How I built a comprehensive monitoring stack with Grafana, Prometheus, Loki, and Alertmanager to observe 95+ Docker containers across my home lab.
The Challenge
When you're running 95+ Docker containers across a virtualized infrastructure, visibility becomes critical. Without proper monitoring, you're flying blind — containers crash silently, disks fill up overnight, and performance degrades without anyone noticing.
I needed a monitoring stack that could:
- Track metrics from every container, VM, and physical node
- Aggregate logs from all services into a searchable platform
- Alert me before problems become outages
- Scale without consuming all my resources
- Provide at-a-glance dashboards for quick triage
The Stack
After evaluating several options (Datadog was too expensive, Zabbix felt dated), I landed on the Grafana ecosystem:
| Component | Role | Why I Chose It |
|---|---|---|
| Prometheus | Metrics collection | Pull-based model, battle-tested, massive exporter ecosystem |
| Loki | Log aggregation | LogQL is powerful, integrates natively with Grafana, low resource footprint |
| Grafana | Visualization | Flexible alerting, single pane of glass; now home to 34 dashboards |
| Alertmanager | Alert routing | Deduplication, grouping, and silencing for 40+ rules |
| Alloy | Telemetry pipeline | Replaced Promtail; handles both log and metric collection |
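The core services all run as containers themselves. A minimal docker-compose sketch of the layout (image tags, ports, and volume paths here are illustrative, not my exact setup):

```yaml
# Hypothetical compose file for the core observability services
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus:/etc/prometheus    # scrape config + alert rules
      - prom-data:/prometheus           # TSDB storage
    ports:
      - "9090:9090"

  loki:
    image: grafana/loki:latest
    volumes:
      - loki-data:/loki
    ports:
      - "3100:3100"

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana-data:/var/lib/grafana
    ports:
      - "3000:3000"

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager:/etc/alertmanager
    ports:
      - "9093:9093"

  alloy:
    image: grafana/alloy:latest
    volumes:
      - ./alloy/config.alloy:/etc/alloy/config.alloy
      - /var/run/docker.sock:/var/run/docker.sock:ro  # container discovery

volumes:
  prom-data:
  loki-data:
  grafana-data:
```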
Metrics Collection Architecture
Prometheus scrapes metrics from a constellation of exporters. Here's a simplified view of the scrape configuration:
```yaml
# Example Prometheus scrape config (simplified)
scrape_configs:
  - job_name: 'cadvisor'
    scrape_interval: 15s
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'node-exporter'
    scrape_interval: 30s
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'pve-exporter'
    scrape_interval: 60s
    metrics_path: /pve
    static_configs:
      - targets: ['pve-exporter:9221']
```
Key exporters in the stack (a deployment sketch for the first two follows the list):
- cAdvisor: Container-level CPU, memory, network, and disk I/O metrics
- Node Exporter: Host-level system metrics (CPU, memory, filesystem, network)
- PVE Exporter: Proxmox-specific metrics — VM states, cluster health, Ceph status
- UnPoller: UniFi network device metrics — AP clients, switch ports, throughput
- Blackbox Exporter: Endpoint probing — HTTP checks, DNS resolution, SSL expiry
- Postgres Exporter: Database connection pools, query performance, replication lag
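cAdvisor and Node Exporter only see what you mount into them. A compose sketch using the standard host mounts each project documents (paths and tags illustrative):

```yaml
# Hypothetical compose excerpt for the two core exporters
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    volumes:
      - /:/rootfs:ro                        # host filesystem, read-only
      - /var/run:/var/run:ro                # Docker socket access
      - /sys:/sys:ro                        # cgroup and hardware info
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"

  node-exporter:
    image: quay.io/prometheus/node-exporter:latest
    pid: host                               # see host processes
    volumes:
      - /:/host:ro,rslave                   # host root for filesystem metrics
    command:
      - '--path.rootfs=/host'
    ports:
      - "9100:9100"
```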
34 Dashboards and Counting
Dashboards are organized by domain. A few highlights:
- Infrastructure Overview: Single dashboard showing all Proxmox nodes, Ceph health, and VM states
- Docker Fleet: Container status, restart counts, resource consumption sorted by usage (example queries below)
- Network Intelligence: UniFi device metrics, client counts per AP, inter-VLAN traffic
- Security Posture: Wazuh alert trends, IDS hits, failed authentication attempts
- Storage Health: Ceph OSD utilization, IOPS, recovery progress, pool distribution
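For a flavor of what drives the Docker Fleet panels, queries along these lines work against the standard cAdvisor metrics (the exact panel expressions in my dashboards vary):

```promql
# Top 10 containers by CPU usage over the last 5 minutes
topk(10, sum by (name) (rate(container_cpu_usage_seconds_total{name=~".+"}[5m])))

# Containers that restarted within the last hour
changes(container_start_time_seconds{name=~".+"}[1h]) > 0
```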
Alert Rules That Actually Matter
I started with too many alerts and suffered alert fatigue within a week. The fix was ruthless prioritization. My current 40+ rules follow a tiered approach:
```yaml
# Example alert rules (simplified)
groups:
  - name: container_alerts
    rules:
      - alert: ContainerDown
        # absent() over a regex matcher can't fire per container (and leaves
        # $labels.name empty), so compare against the last-seen timestamp.
        # Keep `for` short: the series goes stale ~5m after a container vanishes.
        expr: time() - container_last_seen{name=~".+"} > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} is down"

      - alert: HighMemoryUsage
        # The `> 0` filter drops containers with no memory limit configured,
        # whose limit cAdvisor reports as 0 (division would yield +Inf).
        expr: |
          container_memory_usage_bytes{name=~".+"}
            / (container_spec_memory_limit_bytes{name=~".+"} > 0) > 0.9
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.name }} using >90% of its memory limit"
```
Alerts route to different channels based on severity:
- Critical: Push notification via ntfy (immediate)
- Warning: Grouped digest every 30 minutes
- Info: Dashboard annotation only
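A minimal sketch of an Alertmanager routing tree behind these tiers. Receiver names and URLs are hypothetical, and ntfy needs a webhook bridge in front of it (it doesn't speak Alertmanager's payload format natively):

```yaml
route:
  receiver: dashboard-only          # default: info-level, annotation only
  group_by: ['alertname', 'severity']
  routes:
    - matchers:
        - severity="critical"
      receiver: ntfy-push
      group_wait: 0s                # notify immediately
    - matchers:
        - severity="warning"
      receiver: ntfy-digest
      group_interval: 30m           # batch new warnings per group at 30m

receivers:
  - name: ntfy-push
    webhook_configs:
      - url: 'http://ntfy-bridge:8080/critical'
  - name: ntfy-digest
    webhook_configs:
      - url: 'http://ntfy-bridge:8080/warning'
  - name: dashboard-only            # no notifier; surfaced in Grafana only
```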
Log Aggregation with Loki
Loki collects logs from every container via Alloy (successor to Promtail). The key insight: don't index log content. Loki indexes labels (container name, compose project, severity) and searches content at query time. This keeps storage costs manageable even at high log volumes.
```logql
# Example LogQL query: Find error logs from a specific container
{container_name="ghost-blog"} |= "error" | json | level="error"

# Aggregate error rate over time
sum(rate({job="docker"} |= "error" [5m])) by (container_name)
```
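On the collection side, a minimal Alloy pipeline that discovers containers over the Docker socket and ships their logs to Loki looks roughly like this (a sketch; the relabel rule and endpoint are my assumptions, not a copy of my config):

```alloy
// Discover running containers via the Docker socket
discovery.docker "containers" {
  host = "unix:///var/run/docker.sock"
}

// Map Docker metadata onto Loki labels (keep cardinality low!)
discovery.relabel "containers" {
  targets = discovery.docker.containers.targets

  rule {
    source_labels = ["__meta_docker_container_name"]
    regex         = "/(.*)"          // strip the leading slash
    target_label  = "container_name"
  }
}

// Tail container logs and forward them to Loki
loki.source.docker "default" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.relabel.containers.output
  forward_to = [loki.write.local.receiver]
}

loki.write "local" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}
```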
External Monitoring with Uptime Kuma
Prometheus monitors from the inside. Uptime Kuma monitors from the outside — 67 monitors checking HTTP endpoints, TCP ports, DNS resolution, and SSL certificate expiry. If Prometheus goes down, Uptime Kuma still catches it. If Uptime Kuma goes down, Prometheus alerts on it. Redundancy in monitoring is non-negotiable.
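Closing the loop on the Prometheus side takes one scrape job: Uptime Kuma can expose a Prometheus-compatible /metrics endpoint (protected by an API key; auth settings omitted here), and a standard alert on `up{job="uptime-kuma"} == 0` covers the "Uptime Kuma is down" case:

```yaml
# prometheus.yml: watch the watcher
scrape_configs:
  - job_name: 'uptime-kuma'
    scrape_interval: 60s
    static_configs:
      - targets: ['uptime-kuma:3001']
```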
Lessons Learned
- Start with alerts, not dashboards. A beautiful dashboard nobody watches is worthless. Alerts that wake you up at 3am are real.
- Retention matters. I keep 15 days of full-resolution metrics, 90 days of downsampled data. Adjust based on your storage budget.
- Label cardinality kills Prometheus. Don't create labels with unbounded values (like request URLs). Your TSDB will explode.
- Monitor the monitors. Deadman alerts ensure your alerting pipeline itself is healthy (see the sketch after this list).
- Document your dashboards. Future you won't remember what that custom panel formula means.
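The deadman pattern, for reference, is a single always-firing rule routed to an external dead man's switch: silence, not noise, is the failure signal. A minimal sketch:

```yaml
groups:
  - name: meta
    rules:
      - alert: Watchdog
        # vector(1) always returns 1, so this alert fires continuously;
        # if the external receiver stops hearing it, the pipeline is broken
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Alerting-pipeline heartbeat; its absence is the real alert"
```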
The monitoring stack is the foundation of everything else. Without visibility, you can't optimize, secure, or scale. It was the first thing I built and the last thing I'd ever remove.