Monitoring Infrastructure

This document outlines the monitoring and observability infrastructure for the AI Agent Orchestration Platform.

Overview

The platform implements a comprehensive monitoring and observability stack to ensure reliability, performance, and security. This infrastructure provides real-time insights into system health, performance metrics, and user behavior.

Monitoring Architecture

The monitoring architecture consists of several components:

  1. Metrics Collection: Gather performance and health metrics
  2. Logging: Collect and aggregate logs
  3. Tracing: Track requests across services
  4. Alerting: Notify team of issues
  5. Dashboards: Visualize system status
  6. Anomaly Detection: Identify unusual patterns

Monitoring Architecture Diagram

Note: This is a placeholder for a monitoring architecture diagram. The actual diagram should be created and added to the project.

Metrics Collection

Prometheus

Prometheus is the primary metrics collection system:

  • Scrape Configuration: Collect metrics from services
  • Service Discovery: Automatically find services to monitor
  • Storage: Time-series database for metrics
  • Query Language: PromQL for data analysis
  • Alerting: Define alert conditions

Example Prometheus configuration:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093

rule_files:
  - "/etc/prometheus/rules/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'backend'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['backend:8000']

  - job_name: 'temporal'
    static_configs:
      - targets: ['temporal:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'postgres-exporter'
    static_configs:
      - targets: ['postgres-exporter:9187']

Key Metrics

The platform tracks these key metrics:

  • System Metrics:
      • CPU usage
      • Memory usage
      • Disk I/O
      • Network traffic

  • Application Metrics:
      • Request rate
      • Error rate
      • Response time
      • Queue length
      • Active workflows
      • Agent execution time
      • Database query performance

  • Business Metrics:
      • Active users
      • Workflow completions
      • Agent usage
      • HITL response time
      • Marketplace activity
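
As an illustration of how such metrics can be exposed for Prometheus to scrape, the sketch below uses the Python prometheus_client library. The metric names, labels, and the /metrics port are example values chosen for this illustration, not the platform's actual instrumentation.

# Illustrative instrumentation for a backend service using prometheus_client;
# metric and label names here are example values, not the platform's real ones.
from prometheus_client import Counter, Gauge, Histogram, start_http_server
import time

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests", ["method", "status"]
)
ACTIVE_WORKFLOWS = Gauge(
    "active_workflows", "Number of workflows currently executing"
)
AGENT_EXECUTION_SECONDS = Histogram(
    "agent_execution_seconds", "Agent execution time in seconds", ["agent_type"]
)

def run_agent(agent_type: str) -> None:
    """Record one agent execution with the key application metrics."""
    REQUESTS.labels(method="POST", status="200").inc()
    ACTIVE_WORKFLOWS.inc()
    try:
        with AGENT_EXECUTION_SECONDS.labels(agent_type=agent_type).time():
            time.sleep(0.05)  # stand-in for the actual agent work
    finally:
        ACTIVE_WORKFLOWS.dec()

if __name__ == "__main__":
    # Expose /metrics on port 8000, matching the 'backend' scrape job above.
    start_http_server(8000)
    run_agent("vision")
    time.sleep(60)  # keep the process alive so Prometheus can scrape it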

Logging Infrastructure

Loki

Loki is used for log aggregation:

  • Log Collection: Gather logs from all services
  • Log Storage: Efficient storage of log data
  • Log Query: LogQL for searching logs
  • Log Visualization: Grafana for log display

Example Loki configuration:

auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/boltdb-shipper-active
    cache_location: /loki/boltdb-shipper-cache
    cache_ttl: 24h
    shared_store: filesystem
  filesystem:
    directory: /loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

chunk_store_config:
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: false
  retention_period: 0s

Structured Logging

All services implement structured logging:

  • JSON Format: Machine-readable log entries
  • Correlation IDs: Track requests across services
  • Log Levels: Debug, Info, Warning, Error, Critical
  • Contextual Information: Include relevant context in logs

Example structured log entry:

{
  "timestamp": "2025-04-18T14:30:45.123Z",
  "level": "info",
  "service": "backend",
  "trace_id": "abcdef123456",
  "user_id": "user-123",
  "message": "Workflow execution started",
  "workflow_id": "wf-456",
  "agent_count": 3,
  "duration_ms": 45
}
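
A minimal sketch of how a service could emit entries in this shape, assuming the structlog library is used for JSON logging; the processor configuration and the bound fields mirror the example above and are illustrative rather than the platform's actual setup.

# Illustrative structured-logging setup with structlog; the platform's actual
# logging configuration may differ.
import structlog

structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso", key="timestamp"),
        structlog.processors.EventRenamer("message"),
        structlog.processors.JSONRenderer(),
    ]
)

log = structlog.get_logger(service="backend")

# Bind the correlation/trace ID once so every subsequent entry carries it.
log = log.bind(trace_id="abcdef123456", user_id="user-123")

log.info(
    "Workflow execution started",
    workflow_id="wf-456",
    agent_count=3,
    duration_ms=45,
)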

Distributed Tracing

Jaeger

Jaeger is used for distributed tracing:

  • Trace Collection: Gather traces from services
  • Trace Storage: Store trace data
  • Trace Visualization: Jaeger UI for trace display
  • Trace Analysis: Identify performance bottlenecks

Example Jaeger configuration:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: meta-agent-jaeger
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
  ingress:
    enabled: true
    hosts:
      - jaeger.meta-agent.example.com
  ui:
    options:
      dependencies:
        menuEnabled: true
      tracking:
        gaID: UA-000000-0
  agent:
    strategy: DaemonSet

OpenTelemetry Integration

The platform uses OpenTelemetry for instrumentation:

  • Trace Context Propagation: Pass context between services
  • Automatic Instrumentation: Add tracing to common libraries
  • Manual Instrumentation: Add custom spans for business logic
  • Sampling: Configure trace sampling rate
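
As a sketch of manual instrumentation with the OpenTelemetry Python SDK, the example below creates a custom span around workflow execution and configures ratio-based sampling. The OTLP endpoint, sampling rate, and span and attribute names are assumptions made for illustration, not the platform's actual configuration.

# Minimal OpenTelemetry tracing sketch; exporter endpoint, sampling rate, and
# span/attribute names are illustrative assumptions.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

provider = TracerProvider(
    resource=Resource.create({"service.name": "backend"}),
    sampler=TraceIdRatioBased(0.25),  # sample 25% of traces
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://jaeger-collector:4317"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("meta-agent.workflows")

def execute_workflow(workflow_id: str) -> None:
    # Manual span around business logic; the trace context propagates to child
    # spans and, via instrumented HTTP clients, to downstream services.
    with tracer.start_as_current_span("workflow.execute") as span:
        span.set_attribute("workflow.id", workflow_id)
        ...  # run the workflow steps here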

Alerting System

Alertmanager

Alertmanager handles alerting:

  • Alert Routing: Send alerts to appropriate channels
  • Alert Grouping: Group related alerts
  • Alert Silencing: Temporarily disable alerts
  • Alert Inhibition: Prevent redundant alerts

Example Alertmanager configuration:

global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'

route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications'
  routes:
  - match:
      severity: critical
    receiver: 'pagerduty-critical'
    continue: true

receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#monitoring'
    send_resolved: true
    title: '{{ template "slack.default.title" . }}'
    text: '{{ template "slack.default.text" . }}'

- name: 'pagerduty-critical'
  pagerduty_configs:
  - service_key: 'your-pagerduty-service-key'
    send_resolved: true

templates:
- '/etc/alertmanager/template/*.tmpl'

Alert Rules

Example alert rules:

groups:
- name: meta-agent-alerts
  rules:
  - alert: HighErrorRate
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate detected"
      description: "Error rate is above 5% for 5 minutes (current value: {{ $value }})"

  - alert: HighLatency
    expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High latency detected"
      description: "95th percentile latency is above 1 second for 5 minutes (current value: {{ $value }}s)"

  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes"

  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.9
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage on {{ $labels.instance }}"
      description: "Memory usage is above 90% for 5 minutes (current value: {{ $value | humanizePercentage }})"

Visualization

Grafana

Grafana is used for visualization:

  • Dashboards: Visual representation of metrics
  • Alerts: Visual alert management
  • Data Sources: Connect to Prometheus, Loki, Jaeger
  • User Management: Control dashboard access

Example Grafana dashboard configuration:

{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "gnetId": null,
  "graphTooltip": 0,
  "id": 1,
  "links": [],
  "panels": [
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 0
      },
      "hiddenSeries": false,
      "id": 2,
      "legend": {
        "avg": false,
        "current": false,
        "max": false,
        "min": false,
        "show": true,
        "total": false,
        "values": false
      },
      "lines": true,
      "linewidth": 1,
      "nullPointMode": "null",
      "options": {
        "alertThreshold": true
      },
      "percentage": false,
      "pluginVersion": "7.3.7",
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[5m])) by (status)",
          "interval": "",
          "legendFormat": "{{status}}",
          "refId": "A"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "HTTP Request Rate",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    }
  ],
  "schemaVersion": 26,
  "style": "dark",
  "tags": [],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "",
  "title": "Meta Agent Platform Overview",
  "uid": "meta-agent-overview",
  "version": 1
}

Key Dashboards

The platform includes these key dashboards:

  • Platform Overview: High-level system health
  • Service Performance: Detailed service metrics
  • Workflow Execution: Workflow performance and status
  • Agent Metrics: Agent execution statistics
  • User Activity: User behavior and engagement
  • Resource Usage: Infrastructure resource utilization
  • Error Analysis: Error patterns and trends
  • SLO/SLI Tracking: Service level objectives and indicators

Anomaly Detection

The platform includes anomaly detection:

  • Machine Learning Models: Detect unusual patterns
  • Baseline Comparison: Compare current to historical metrics
  • Trend Analysis: Identify concerning trends
  • Correlation: Find related anomalies
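
As a simplified illustration of the baseline-comparison approach, the sketch below flags a metric sample that deviates strongly from its recent history using a z-score; the window and threshold are arbitrary examples, and the platform's actual detectors may use trained models and seasonality-aware baselines instead.

# Simple baseline-comparison sketch: flag the latest value if it deviates from
# the recent history by more than a fixed number of standard deviations.
from statistics import mean, stdev

def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Return True if `latest` is more than `threshold` standard deviations
    away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return latest != baseline
    return abs(latest - baseline) / spread > threshold

# Example: request latencies (ms) over the last window vs. the newest sample.
recent_latencies = [120, 118, 125, 130, 122, 119, 127]
print(is_anomalous(recent_latencies, 480))  # True: a clear spike
print(is_anomalous(recent_latencies, 131))  # False: within normal variation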

Monitoring for Multi-Modal Agents

The platform provides specialized monitoring for multi-modal agents:

  • Vision Agent Metrics: Image processing time, accuracy
  • Audio Agent Metrics: Speech recognition accuracy, processing time
  • Sensor Data Metrics: Data throughput, processing latency

Edge Monitoring

For edge deployments, the platform provides specialized monitoring:

  • Edge Device Health: CPU, memory, disk, network
  • Connectivity Status: Online/offline status
  • Sync Status: Data synchronization status
  • Resource Constraints: Battery level, storage capacity

Federated Monitoring

For federated deployments, the platform provides specialized monitoring:

  • Cross-Organization Workflows: End-to-end performance
  • Data Transfer Metrics: Volume, latency, success rate
  • Privacy Compliance: Data access patterns
  • Secure Computation: Performance of privacy-preserving computation

Monitoring Scripts

Scripts for monitoring management are located in /infra/scripts/:

  • setup_monitoring.sh - Set up monitoring stack
  • backup_dashboards.sh - Backup Grafana dashboards
  • restore_dashboards.sh - Restore Grafana dashboards
  • alert_test.sh - Test alerting system

Best Practices

  • Implement comprehensive instrumentation
  • Use structured logging
  • Correlate logs, metrics, and traces
  • Set up meaningful alerts
  • Create actionable dashboards
  • Monitor from the user perspective
  • Implement SLOs and SLIs
  • Regularly review and improve monitoring

Last updated: 2025-04-18