Monitoring Infrastructure

This document outlines the monitoring and observability infrastructure for the AI Agent Orchestration Platform.

Overview

The platform implements a comprehensive monitoring and observability stack to ensure reliability, performance, and security. This infrastructure provides real-time insights into system health, performance metrics, and user behavior.

Monitoring Architecture

The monitoring architecture consists of several components:

  1. Metrics Collection: Gather performance and health metrics
  2. Logging: Collect and aggregate logs
  3. Tracing: Track requests across services
  4. Alerting: Notify team of issues
  5. Dashboards: Visualize system status
  6. Anomaly Detection: Identify unusual patterns

Monitoring Architecture Diagram

Note: This is a placeholder for a monitoring architecture diagram. The actual diagram should be created and added to the project.

Metrics Collection

Prometheus

Prometheus is the primary metrics collection system:

  • Scrape Configuration: Collect metrics from services
  • Service Discovery: Automatically find services to monitor
  • Storage: Time-series database for metrics
  • Query Language: PromQL for data analysis
  • Alerting: Define alert conditions

Example Prometheus configuration:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093

rule_files:
  - "/etc/prometheus/rules/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'backend'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['backend:8000']

  - job_name: 'temporal'
    static_configs:
      - targets: ['temporal:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'postgres-exporter'
    static_configs:
      - targets: ['postgres-exporter:9187']

Key Metrics

The platform tracks these key metrics:

  • System Metrics:
      • CPU usage
      • Memory usage
      • Disk I/O
      • Network traffic

  • Application Metrics:
      • Request rate
      • Error rate
      • Response time
      • Queue length
      • Active workflows
      • Agent execution time
      • Database query performance

  • Business Metrics:
      • Active users
      • Workflow completions
      • Agent usage
      • HITL response time
      • Marketplace activity
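
As an illustration of how such metrics can be exposed for Prometheus to scrape, the sketch below uses the Python prometheus_client library. The metric names, labels, and the /metrics port are example values chosen for this illustration, not the platform's actual instrumentation.

# Illustrative instrumentation for a backend service using prometheus_client;
# metric and label names here are example values, not the platform's real ones.
from prometheus_client import Counter, Gauge, Histogram, start_http_server
import time

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests", ["method", "status"]
)
ACTIVE_WORKFLOWS = Gauge(
    "active_workflows", "Number of workflows currently executing"
)
AGENT_EXECUTION_SECONDS = Histogram(
    "agent_execution_seconds", "Agent execution time in seconds", ["agent_type"]
)

def run_agent(agent_type: str) -> None:
    """Record one agent execution with the key application metrics."""
    REQUESTS.labels(method="POST", status="200").inc()
    ACTIVE_WORKFLOWS.inc()
    try:
        with AGENT_EXECUTION_SECONDS.labels(agent_type=agent_type).time():
            time.sleep(0.05)  # stand-in for the actual agent work
    finally:
        ACTIVE_WORKFLOWS.dec()

if __name__ == "__main__":
    # Expose /metrics on port 8000, matching the 'backend' scrape job above.
    start_http_server(8000)
    run_agent("vision")
    time.sleep(60)  # keep the process alive so Prometheus can scrape it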

Logging Infrastructure

Loki

Loki is used for log aggregation:

  • Log Collection: Gather logs from all services
  • Log Storage: Efficient storage of log data
  • Log Query: LogQL for searching logs
  • Log Visualization: Grafana for log display

Example Loki configuration:

auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/boltdb-shipper-active
    cache_location: /loki/boltdb-shipper-cache
    cache_ttl: 24h
    shared_store: filesystem
  filesystem:
    directory: /loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

chunk_store_config:
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: false
  retention_period: 0s

Structured Logging

All services implement structured logging:

  • JSON Format: Machine-readable log entries
  • Correlation IDs: Track requests across services
  • Log Levels: Debug, Info, Warning, Error, Critical
  • Contextual Information: Include relevant context in logs

Example structured log entry:

{
  "timestamp": "2025-04-18T14:30:45.123Z",
  "level": "info",
  "service": "backend",
  "trace_id": "abcdef123456",
  "user_id": "user-123",
  "message": "Workflow execution started",
  "workflow_id": "wf-456",
  "agent_count": 3,
  "duration_ms": 45
}
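
A minimal sketch of how a service could emit entries in this shape, assuming the structlog library is used for JSON logging; the processor configuration and the bound fields mirror the example above and are illustrative rather than the platform's actual setup.

# Illustrative structured-logging setup with structlog; the platform's actual
# logging configuration may differ.
import structlog

structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso", key="timestamp"),
        structlog.processors.EventRenamer("message"),
        structlog.processors.JSONRenderer(),
    ]
)

log = structlog.get_logger(service="backend")

# Bind the correlation/trace ID once so every subsequent entry carries it.
log = log.bind(trace_id="abcdef123456", user_id="user-123")

log.info(
    "Workflow execution started",
    workflow_id="wf-456",
    agent_count=3,
    duration_ms=45,
)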

Distributed Tracing

Jaeger

Jaeger is used for distributed tracing:

  • Trace Collection: Gather traces from services
  • Trace Storage: Store trace data
  • Trace Visualization: Jaeger UI for trace display
  • Trace Analysis: Identify performance bottlenecks

Example Jaeger configuration:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: meta-agent-jaeger
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
  ingress:
    enabled: true
    hosts:
      - jaeger.meta-agent.example.com
  ui:
    options:
      dependencies:
        menuEnabled: true
      tracking:
        gaID: UA-000000-0
  agent:
    strategy: DaemonSet

OpenTelemetry Integration

The platform uses OpenTelemetry for instrumentation:

  • Trace Context Propagation: Pass context between services
  • Automatic Instrumentation: Add tracing to common libraries
  • Manual Instrumentation: Add custom spans for business logic
  • Sampling: Configure trace sampling rate
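
As a sketch of manual instrumentation with the OpenTelemetry Python SDK, the example below creates a custom span around workflow execution and configures ratio-based sampling. The OTLP endpoint, sampling rate, and span and attribute names are assumptions made for illustration, not the platform's actual configuration.

# Minimal OpenTelemetry tracing sketch; exporter endpoint, sampling rate, and
# span/attribute names are illustrative assumptions.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

provider = TracerProvider(
    resource=Resource.create({"service.name": "backend"}),
    sampler=TraceIdRatioBased(0.25),  # sample 25% of traces
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://jaeger-collector:4317"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("meta-agent.workflows")

def execute_workflow(workflow_id: str) -> None:
    # Manual span around business logic; the trace context propagates to child
    # spans and, via instrumented HTTP clients, to downstream services.
    with tracer.start_as_current_span("workflow.execute") as span:
        span.set_attribute("workflow.id", workflow_id)
        ...  # run the workflow steps here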

Alerting System

Alertmanager

Alertmanager handles alerting:

  • Alert Routing: Send alerts to appropriate channels
  • Alert Grouping: Group related alerts
  • Alert Silencing: Temporarily disable alerts
  • Alert Inhibition: Prevent redundant alerts

Example Alertmanager configuration:

global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'

route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications'
  routes:
  - match:
      severity: critical
    receiver: 'pagerduty-critical'
    continue: true

receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#monitoring'
    send_resolved: true
    title: '{{ template "slack.default.title" . }}'
    text: '{{ template "slack.default.text" . }}'

- name: 'pagerduty-critical'
  pagerduty_configs:
  - service_key: 'your-pagerduty-service-key'
    send_resolved: true

templates:
- '/etc/alertmanager/template/*.tmpl'

Alert Rules

Example alert rules:

groups:
- name: meta-agent-alerts
  rules:
  - alert: HighErrorRate
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate detected"
      description: "Error rate is above 5% for 5 minutes (current value: {{ $value }})"

  - alert: HighLatency
    expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High latency detected"
      description: "95th percentile latency is above 1 second for 5 minutes (current value: {{ $value }}s)"

  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes"

  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.9
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage on {{ $labels.instance }}"
      description: "Memory usage is above 90% for 5 minutes (current value: {{ $value | humanizePercentage }})"

Visualization

Grafana

Grafana is used for visualization:

  • Dashboards: Visual representation of metrics
  • Alerts: Visual alert management
  • Data Sources: Connect to Prometheus, Loki, Jaeger
  • User Management: Control dashboard access

Example Grafana dashboard configuration:

{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "gnetId": null,
  "graphTooltip": 0,
  "id": 1,
  "links": [],
  "panels": [
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 0
      },
      "hiddenSeries": false,
      "id": 2,
      "legend": {
        "avg": false,
        "current": false,
        "max": false,
        "min": false,
        "show": true,
        "total": false,
        "values": false
      },
      "lines": true,
      "linewidth": 1,
      "nullPointMode": "null",
      "options": {
        "alertThreshold": true
      },
      "percentage": false,
      "pluginVersion": "7.3.7",
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[5m])) by (status)",
          "interval": "",
          "legendFormat": "{{status}}",
          "refId": "A"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "HTTP Request Rate",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    }
  ],
  "schemaVersion": 26,
  "style": "dark",
  "tags": [],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "",
  "title": "Meta Agent Platform Overview",
  "uid": "meta-agent-overview",
  "version": 1
}

Key Dashboards

The platform includes these key dashboards:

  • Platform Overview: High-level system health
  • Service Performance: Detailed service metrics
  • Workflow Execution: Workflow performance and status
  • Agent Metrics: Agent execution statistics
  • User Activity: User behavior and engagement
  • Resource Usage: Infrastructure resource utilization
  • Error Analysis: Error patterns and trends
  • SLO/SLI Tracking: Service level objectives and indicators

Anomaly Detection

The platform includes anomaly detection:

  • Machine Learning Models: Detect unusual patterns
  • Baseline Comparison: Compare current to historical metrics
  • Trend Analysis: Identify concerning trends
  • Correlation: Find related anomalies
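
As a simplified illustration of the baseline-comparison approach, the sketch below flags a metric sample that deviates strongly from its recent history using a z-score; the window and threshold are arbitrary examples, and the platform's actual detectors may use trained models and seasonality-aware baselines instead.

# Simple baseline-comparison sketch: flag the latest value if it deviates from
# the recent history by more than a fixed number of standard deviations.
from statistics import mean, stdev

def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Return True if `latest` is more than `threshold` standard deviations
    away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return latest != baseline
    return abs(latest - baseline) / spread > threshold

# Example: request latencies (ms) over the last window vs. the newest sample.
recent_latencies = [120, 118, 125, 130, 122, 119, 127]
print(is_anomalous(recent_latencies, 480))  # True: a clear spike
print(is_anomalous(recent_latencies, 131))  # False: within normal variation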

Monitoring for Multi-Modal Agents

The platform provides specialized monitoring for multi-modal agents:

  • Vision Agent Metrics: Image processing time, accuracy
  • Audio Agent Metrics: Speech recognition accuracy, processing time
  • Sensor Data Metrics: Data throughput, processing latency

Edge Monitoring

For edge deployments, the platform provides specialized monitoring:

  • Edge Device Health: CPU, memory, disk, network
  • Connectivity Status: Online/offline status
  • Sync Status: Data synchronization status
  • Resource Constraints: Battery level, storage capacity

Federated Monitoring

For federated deployments, the platform provides specialized monitoring:

  • Cross-Organization Workflows: End-to-end performance
  • Data Transfer Metrics: Volume, latency, success rate
  • Privacy Compliance: Data access patterns
  • Secure Computation: Performance of privacy-preserving computation

Monitoring Scripts

Scripts for monitoring management are located in /infra/scripts/:

  • setup_monitoring.sh - Set up monitoring stack
  • backup_dashboards.sh - Backup Grafana dashboards
  • restore_dashboards.sh - Restore Grafana dashboards
  • alert_test.sh - Test alerting system

Best Practices

  • Implement comprehensive instrumentation
  • Use structured logging
  • Correlate logs, metrics, and traces
  • Set up meaningful alerts
  • Create actionable dashboards
  • Monitor from the user perspective
  • Implement SLOs and SLIs
  • Regularly review and improve monitoring

Last updated: 2025-04-18