Monitoring 95+ Docker Containers: A Complete Observability Stack

How I built a comprehensive monitoring stack with Grafana, Prometheus, Loki, and Alertmanager to observe 95+ Docker containers across my home lab.

The Challenge

When you're running 95+ Docker containers across a virtualized infrastructure, visibility becomes critical. Without proper monitoring, you're flying blind — containers crash silently, disks fill up overnight, and performance degrades without anyone noticing.

I needed a monitoring stack that could:

  • Track metrics from every container, VM, and physical node
  • Aggregate logs from all services into a searchable platform
  • Alert me before problems become outages
  • Scale without consuming all my resources
  • Provide at-a-glance dashboards for quick triage

The Stack

After evaluating several options (Datadog was too expensive, Zabbix felt dated), I landed on the Grafana ecosystem:

  • Prometheus (metrics collection): Pull-based model, battle-tested, massive exporter ecosystem
  • Loki (log aggregation): LogQL is powerful, integrates natively with Grafana, low resource footprint
  • Grafana (visualization): 34 dashboards, flexible alerting, one pane of glass
  • Alertmanager (alert routing): Deduplication, grouping, silencing; 40+ rules
  • Alloy (telemetry pipeline): Replaced Promtail, handles logs + metrics collection
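
All five components run as containers themselves. Here's a minimal Compose sketch of how such a stack can be wired together; the image tags, ports, and volume paths are upstream defaults and placeholders rather than my exact file:

# docker-compose.yml (sketch; trimmed to the monitoring core)
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus:/etc/prometheus        # scrape configs + alert rules
      - prom-data:/prometheus
    ports:
      - "9090:9090"

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager:/etc/alertmanager
    ports:
      - "9093:9093"

  loki:
    image: grafana/loki:latest
    volumes:
      - loki-data:/loki
    ports:
      - "3100:3100"

  alloy:
    image: grafana/alloy:latest
    volumes:
      - ./alloy:/etc/alloy                  # pipeline config
      - /var/run/docker.sock:/var/run/docker.sock:ro   # discover container logs

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana-data:/var/lib/grafana
    ports:
      - "3000:3000"

volumes:
  prom-data:
  loki-data:
  grafana-data: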

Metrics Collection Architecture

Prometheus scrapes metrics from a constellation of exporters. Here's how the data flows:

# Example Prometheus scrape config (simplified)
scrape_configs:
  - job_name: 'cadvisor'
    scrape_interval: 15s
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'node-exporter'
    scrape_interval: 30s
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'pve-exporter'
    scrape_interval: 60s
    metrics_path: /pve
    static_configs:
      - targets: ['pve-exporter:9221']

Key exporters in the stack (example queries follow the list):

  • cAdvisor: Container-level CPU, memory, network, and disk I/O metrics
  • Node Exporter: Host-level system metrics (CPU, memory, filesystem, network)
  • PVE Exporter: Proxmox-specific metrics — VM states, cluster health, Ceph status
  • UnPoller: UniFi network device metrics — AP clients, switch ports, throughput
  • Blackbox Exporter: Endpoint probing — HTTP checks, DNS resolution, SSL expiry
  • Postgres Exporter: Database connection pools, query performance, replication lag
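
To give a sense of what these exporters expose, here are a couple of PromQL queries you could run against them; the metric names are the exporters' defaults, and the name label comes from cAdvisor:

# Example PromQL: per-container CPU usage in cores (cAdvisor)
sum(rate(container_cpu_usage_seconds_total{name=~".+"}[5m])) by (name)

# Example PromQL: percent of filesystem space still free per mount (Node Exporter)
node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes * 100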

34 Dashboards and Counting

Dashboards are organized by domain. A few highlights:

  • Infrastructure Overview: Single dashboard showing all Proxmox nodes, Ceph health, and VM states
  • Docker Fleet: Container status, restart counts, resource consumption sorted by usage
  • Network Intelligence: UniFi device metrics, client counts per AP, inter-VLAN traffic
  • Security Posture: Wazuh alert trends, IDS hits, failed authentication attempts
  • Storage Health: Ceph OSD utilization, IOPS, recovery progress, pool distribution
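
One pattern worth noting: Grafana can provision these folders from files, which keeps the domain layout reproducible. A minimal sketch, assuming exported dashboard JSON lives in one directory per domain (the paths and folder names are placeholders):

# grafana/provisioning/dashboards/homelab.yaml (sketch)
apiVersion: 1

providers:
  - name: 'Infrastructure'
    folder: 'Infrastructure'            # Grafana folder the dashboards land in
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards/infrastructure
  - name: 'Docker Fleet'
    folder: 'Docker Fleet'
    type: file
    options:
      path: /var/lib/grafana/dashboards/docker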

Alert Rules That Actually Matter

I started with too many alerts and suffered alert fatigue within a week. The fix was ruthless prioritization. My current 40+ rules follow a tiered approach:

# Example alert rules (simplified)
groups:
  - name: container_alerts
    rules:
      # Compare against container_last_seen directly rather than absent():
      # absent() on a regex matcher drops the name label, so the summary
      # annotation below would come out empty.
      - alert: ContainerDown
        expr: time() - container_last_seen{name=~".+"} > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} is down"

      # The > 0 filter skips containers with no memory limit set,
      # which would otherwise divide by zero and always alert.
      - alert: HighMemoryUsage
        expr: container_memory_usage_bytes / (container_spec_memory_limit_bytes > 0) > 0.9
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.name }} using >90% memory limit"

Alerts route to different channels based on severity (see the Alertmanager route sketch after this list):

  • Critical: Push notification via ntfy (immediate)
  • Warning: Grouped digest every 30 minutes
  • Info: Dashboard annotation only
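
A simplified sketch of what that tiering looks like as an Alertmanager route tree; the receiver names and webhook URLs are placeholders (ntfy is driven through a webhook bridge here), and the 30-minute digest is approximated with group_interval:

# alertmanager.yml (routing excerpt, simplified)
route:
  receiver: dashboard-only              # default: info alerts stay on dashboards
  group_by: ['alertname', 'severity']
  routes:
    - matchers: ['severity="critical"']
      receiver: ntfy-push               # immediate push notification
      group_wait: 0s
    - matchers: ['severity="warning"']
      receiver: warning-digest
      group_interval: 30m               # batch warnings into a 30-minute digest

receivers:
  - name: ntfy-push
    webhook_configs:
      - url: 'http://ntfy-bridge:8080/critical'   # placeholder webhook-to-ntfy bridge
  - name: warning-digest
    webhook_configs:
      - url: 'http://ntfy-bridge:8080/warnings'
  - name: dashboard-only                # no notifier configs: nothing is sent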

Log Aggregation with Loki

Loki collects logs from every container via Alloy (successor to Promtail). The key insight: don't index log content. Loki indexes labels (container name, compose project, severity) and searches content at query time. This keeps storage costs manageable even at high log volumes.

# Example LogQL query: Find error logs from a specific container
{container_name="ghost-blog"} |= "error" | json | level="error"

# Aggregate error rate over time
sum(rate({job="docker"} |= "error" [5m])) by (container_name)

External Monitoring with Uptime Kuma

Prometheus monitors from the inside. Uptime Kuma monitors from the outside — 67 monitors checking HTTP endpoints, TCP ports, DNS resolution, and SSL certificate expiry. If Prometheus goes down, Uptime Kuma still catches it. If Uptime Kuma goes down, Prometheus alerts on it. Redundancy in monitoring is non-negotiable.
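
The Prometheus side of that cross-check can be a single rule; this sketch assumes a scrape job named uptime-kuma pointing at Uptime Kuma's metrics endpoint:

# rules/meta.yml (sketch): assumes a scrape job named "uptime-kuma"
groups:
  - name: monitor_the_monitors
    rules:
      - alert: UptimeKumaDown
        expr: up{job="uptime-kuma"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Uptime Kuma has not been scraped successfully for 5 minutes"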

Lessons Learned

  1. Start with alerts, not dashboards. A beautiful dashboard nobody watches is worthless. Alerts that wake you up at 3am are real.
  2. Retention matters. I keep 15 days of full-resolution metrics, 90 days of downsampled data. Adjust based on your storage budget (the retention flag is shown after this list).
  3. Label cardinality kills Prometheus. Don't create labels with unbounded values (like request URLs). Your TSDB will explode.
  4. Monitor the monitors. Deadman alerts ensure your alerting pipeline itself is healthy.
  5. Document your dashboards. Future you won't remember what that custom panel formula means.
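
For the full-resolution tier, retention is a single Prometheus startup flag; the downsampled long-term tier needs a separate store and isn't shown here. A Compose excerpt as a sketch (paths are placeholders):

# docker-compose.yml excerpt: cap local full-resolution retention at 15 days
services:
  prometheus:
    image: prom/prometheus:latest
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'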

The monitoring stack is the foundation of everything else. Without visibility, you can't optimize, secure, or scale. It was the first thing I built and the last thing I'd ever remove.
