Monitoring 95+ Docker Containers: A Complete Observability Stack

How I built a comprehensive monitoring stack with Grafana, Prometheus, Loki, and Alertmanager to observe 95+ Docker containers across my home lab.

The Challenge

When you're running 95+ Docker containers across a virtualized infrastructure, visibility becomes critical. Without proper monitoring, you're flying blind — containers crash silently, disks fill up overnight, and performance degrades without anyone noticing.

I needed a monitoring stack that could:

  • Track metrics from every container, VM, and physical node
  • Aggregate logs from all services into a searchable platform
  • Alert me before problems become outages
  • Scale without consuming all my resources
  • Provide at-a-glance dashboards for quick triage

The Stack

After evaluating several options (Datadog was too expensive, Zabbix felt dated), I landed on the Grafana ecosystem:

  • Prometheus (metrics collection): Pull-based model, battle-tested, massive exporter ecosystem
  • Loki (log aggregation): LogQL is powerful, integrates natively with Grafana, low resource footprint
  • Grafana (visualization): 34 dashboards, flexible alerting, one pane of glass
  • Alertmanager (alert routing): Deduplication, grouping, silencing; 40+ rules
  • Alloy (telemetry pipeline): Replaced Promtail, handles logs + metrics collection
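
All five components run as containers themselves. Here's a minimal Compose sketch of how such a stack can be wired together; the image tags, ports, and volume paths are upstream defaults and placeholders rather than my exact file:

# docker-compose.yml (sketch; trimmed to the monitoring core)
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus:/etc/prometheus        # scrape configs + alert rules
      - prom-data:/prometheus
    ports:
      - "9090:9090"

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager:/etc/alertmanager
    ports:
      - "9093:9093"

  loki:
    image: grafana/loki:latest
    volumes:
      - loki-data:/loki
    ports:
      - "3100:3100"

  alloy:
    image: grafana/alloy:latest
    volumes:
      - ./alloy:/etc/alloy                  # pipeline config
      - /var/run/docker.sock:/var/run/docker.sock:ro   # discover container logs

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana-data:/var/lib/grafana
    ports:
      - "3000:3000"

volumes:
  prom-data:
  loki-data:
  grafana-data: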

Metrics Collection Architecture

Prometheus scrapes metrics from a constellation of exporters. Here's how the data flows:

# Example Prometheus scrape config (simplified)
scrape_configs:
  - job_name: 'cadvisor'
    scrape_interval: 15s
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'node-exporter'
    scrape_interval: 30s
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'pve-exporter'
    scrape_interval: 60s
    metrics_path: /pve
    static_configs:
      - targets: ['pve-exporter:9221']

Key exporters in the stack (example queries follow the list):

  • cAdvisor: Container-level CPU, memory, network, and disk I/O metrics
  • Node Exporter: Host-level system metrics (CPU, memory, filesystem, network)
  • PVE Exporter: Proxmox-specific metrics — VM states, cluster health, Ceph status
  • UnPoller: UniFi network device metrics — AP clients, switch ports, throughput
  • Blackbox Exporter: Endpoint probing — HTTP checks, DNS resolution, SSL expiry
  • Postgres Exporter: Database connection pools, query performance, replication lag
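
To give a sense of what these exporters expose, here are a couple of PromQL queries you could run against them; the metric names are the exporters' defaults, and the name label comes from cAdvisor:

# Example PromQL: per-container CPU usage in cores (cAdvisor)
sum(rate(container_cpu_usage_seconds_total{name=~".+"}[5m])) by (name)

# Example PromQL: percent of filesystem space still free per mount (Node Exporter)
node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes * 100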

34 Dashboards and Counting

Dashboards are organized by domain. A few highlights:

  • Infrastructure Overview: Single dashboard showing all Proxmox nodes, Ceph health, and VM states
  • Docker Fleet: Container status, restart counts, resource consumption sorted by usage
  • Network Intelligence: UniFi device metrics, client counts per AP, inter-VLAN traffic
  • Security Posture: Wazuh alert trends, IDS hits, failed authentication attempts
  • Storage Health: Ceph OSD utilization, IOPS, recovery progress, pool distribution
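
One pattern worth noting: Grafana can provision these folders from files, which keeps the domain layout reproducible. A minimal sketch, assuming exported dashboard JSON lives in one directory per domain (the paths and folder names are placeholders):

# grafana/provisioning/dashboards/homelab.yaml (sketch)
apiVersion: 1

providers:
  - name: 'Infrastructure'
    folder: 'Infrastructure'            # Grafana folder the dashboards land in
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards/infrastructure
  - name: 'Docker Fleet'
    folder: 'Docker Fleet'
    type: file
    options:
      path: /var/lib/grafana/dashboards/docker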

Alert Rules That Actually Matter

I started with too many alerts and suffered alert fatigue within a week. The fix was ruthless prioritization. My current 40+ rules follow a tiered approach:

# Example alert rules (simplified)
groups:
  - name: container_alerts
    rules:
      # Compare against container_last_seen directly rather than absent():
      # absent() on a regex matcher drops the name label, so the summary
      # annotation below would come out empty.
      - alert: ContainerDown
        expr: time() - container_last_seen{name=~".+"} > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} is down"

      # The > 0 filter skips containers with no memory limit set,
      # which would otherwise divide by zero and always alert.
      - alert: HighMemoryUsage
        expr: container_memory_usage_bytes / (container_spec_memory_limit_bytes > 0) > 0.9
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.name }} using >90% memory limit"

Alerts route to different channels based on severity (see the Alertmanager route sketch after this list):

  • Critical: Push notification via ntfy (immediate)
  • Warning: Grouped digest every 30 minutes
  • Info: Dashboard annotation only
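
A simplified sketch of what that tiering looks like as an Alertmanager route tree; the receiver names and webhook URLs are placeholders (ntfy is driven through a webhook bridge here), and the 30-minute digest is approximated with group_interval:

# alertmanager.yml (routing excerpt, simplified)
route:
  receiver: dashboard-only              # default: info alerts stay on dashboards
  group_by: ['alertname', 'severity']
  routes:
    - matchers: ['severity="critical"']
      receiver: ntfy-push               # immediate push notification
      group_wait: 0s
    - matchers: ['severity="warning"']
      receiver: warning-digest
      group_interval: 30m               # batch warnings into a 30-minute digest

receivers:
  - name: ntfy-push
    webhook_configs:
      - url: 'http://ntfy-bridge:8080/critical'   # placeholder webhook-to-ntfy bridge
  - name: warning-digest
    webhook_configs:
      - url: 'http://ntfy-bridge:8080/warnings'
  - name: dashboard-only                # no notifier configs: nothing is sent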

Log Aggregation with Loki

Loki collects logs from every container via Alloy (successor to Promtail). The key insight: don't index log content. Loki indexes labels (container name, compose project, severity) and searches content at query time. This keeps storage costs manageable even at high log volumes.

# Example LogQL query: Find error logs from a specific container
{container_name="ghost-blog"} |= "error" | json | level="error"

# Aggregate error rate over time
sum(rate({job="docker"} |= "error" [5m])) by (container_name)

External Monitoring with Uptime Kuma

Prometheus monitors from the inside. Uptime Kuma monitors from the outside — 67 monitors checking HTTP endpoints, TCP ports, DNS resolution, and SSL certificate expiry. If Prometheus goes down, Uptime Kuma still catches it. If Uptime Kuma goes down, Prometheus alerts on it. Redundancy in monitoring is non-negotiable.
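
The Prometheus side of that cross-check can be a single rule; this sketch assumes a scrape job named uptime-kuma pointing at Uptime Kuma's metrics endpoint:

# rules/meta.yml (sketch): assumes a scrape job named "uptime-kuma"
groups:
  - name: monitor_the_monitors
    rules:
      - alert: UptimeKumaDown
        expr: up{job="uptime-kuma"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Uptime Kuma has not been scraped successfully for 5 minutes"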

Lessons Learned

  1. Start with alerts, not dashboards. A beautiful dashboard nobody watches is worthless. Alerts that wake you up at 3am are real.
  2. Retention matters. I keep 15 days of full-resolution metrics, 90 days of downsampled data. Adjust based on your storage budget (the retention flag is shown after this list).
  3. Label cardinality kills Prometheus. Don't create labels with unbounded values (like request URLs). Your TSDB will explode.
  4. Monitor the monitors. Deadman alerts ensure your alerting pipeline itself is healthy.
  5. Document your dashboards. Future you won't remember what that custom panel formula means.
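
For the full-resolution tier, retention is a single Prometheus startup flag; the downsampled long-term tier needs a separate store and isn't shown here. A Compose excerpt as a sketch (paths are placeholders):

# docker-compose.yml excerpt: cap local full-resolution retention at 15 days
services:
  prometheus:
    image: prom/prometheus:latest
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'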

The monitoring stack is the foundation of everything else. Without visibility, you can't optimize, secure, or scale. It was the first thing I built and the last thing I'd ever remove.
