Monitoring and Observability¶
A comprehensive monitoring and observability guide for the Route ANS Resolver.
Metrics¶
The resolver exposes Prometheus metrics at the /metrics endpoint (default port 9090).
Available Metrics¶
Request Metrics¶
# Total resolution requests
ans_resolver_requests_total{status="success|failure"}
# Request duration histogram
ans_resolver_request_duration_seconds{operation="resolve|batch"}
# Active requests
ans_resolver_active_requests{operation="resolve|batch"}
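These counters and histograms translate directly into PromQL. For example (illustrative queries, assuming the metric names above):

```promql
# Overall success rate over the last 5 minutes
sum(rate(ans_resolver_requests_total{status="success"}[5m]))
  / sum(rate(ans_resolver_requests_total[5m]))

# P95 resolution latency per operation
histogram_quantile(0.95,
  sum(rate(ans_resolver_request_duration_seconds_bucket[5m])) by (le, operation))
```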
Cache Metrics¶
# Cache hits/misses
ans_cache_hits_total
ans_cache_misses_total
# Cache size
ans_cache_size_bytes
ans_cache_entries_total
# Cache operations
ans_cache_operations_total{operation="get|set|delete"}
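The hit and miss counters combine into a cache hit ratio; an illustrative PromQL query:

```promql
# Cache hit ratio over the last 5 minutes
rate(ans_cache_hits_total[5m])
  / (rate(ans_cache_hits_total[5m]) + rate(ans_cache_misses_total[5m]))
```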
Registry Metrics¶
# Registry lookup duration
ans_registry_lookup_duration_seconds{registry="godaddy|mock"}
# Registry errors
ans_registry_errors_total{registry="godaddy|mock",error_type="timeout|not_found"}
Verification Metrics¶
# Verification operations
ans_verifier_operations_total{result="verified|unverified|error"}
# Verification duration
ans_verifier_duration_seconds
Prometheus Configuration¶
scrape_configs:
  - job_name: 'ans-resolver'
    static_configs:
      - targets: ['localhost:9090']
    metrics_path: '/metrics'
    scrape_interval: 15s
Kubernetes ServiceMonitor¶
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ans-resolver
  namespace: ans-system
spec:
  selector:
    matchLabels:
      app: ans-resolver
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
Health Checks¶
Endpoints¶
# Liveness probe
curl http://localhost:8080/health
# Response: {"status":"healthy"}
# Readiness probe
curl http://localhost:8080/ready
# Response: {"status":"ready","checks":{"cache":"ok","registry":"ok"}}
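The readiness payload above can also be checked programmatically, e.g. in a deployment script. A minimal Python sketch (the `is_ready` helper and the sample payloads are illustrative, not part of the resolver API):

```python
import json

def is_ready(body: str) -> bool:
    """Return True only if status is 'ready' and every subsystem check is 'ok'."""
    payload = json.loads(body)
    if payload.get("status") != "ready":
        return False
    return all(v == "ok" for v in payload.get("checks", {}).values())

# Sample payloads mirroring the /ready response shown above
ready = '{"status":"ready","checks":{"cache":"ok","registry":"ok"}}'
degraded = '{"status":"ready","checks":{"cache":"ok","registry":"timeout"}}'

print(is_ready(ready))     # True
print(is_ready(degraded))  # False
```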
Kubernetes Probes¶
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 2
Logging¶
Log Levels¶
- DEBUG: Detailed troubleshooting info
- INFO: General operational messages
- WARN: Warning conditions
- ERROR: Error conditions
Configuration¶
logging:
  level: info    # debug, info, warn, error
  format: json   # json, text
  output: stdout # stdout, file
  file: /var/log/ans-resolver.log
Log Structure (JSON)¶
{
  "time": "2024-01-15T10:30:45Z",
  "level": "info",
  "msg": "Resolution successful",
  "ans_name": "mcp://chatbot.conversation.PID-5678.v1.2.3.example.com",
  "version_range": "^1.0.0",
  "selected_version": "1.2.3",
  "duration_ms": 45,
  "cache_hit": true,
  "trace_id": "abc123"
}
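Structured JSON logs like the one above are easy to filter in a pipeline. A small Python sketch (field names taken from the example entry; the sample lines and the `slow_requests` helper are illustrative):

```python
import json

def slow_requests(lines, threshold_ms=100):
    """Yield (ans_name, duration_ms) for log entries slower than threshold_ms."""
    for line in lines:
        entry = json.loads(line)
        if entry.get("duration_ms", 0) > threshold_ms:
            yield entry["ans_name"], entry["duration_ms"]

logs = [
    '{"level":"info","msg":"Resolution successful","ans_name":"mcp://a.example.com","duration_ms":45,"cache_hit":true}',
    '{"level":"info","msg":"Resolution successful","ans_name":"mcp://b.example.com","duration_ms":230,"cache_hit":false}',
]
print(list(slow_requests(logs)))  # [('mcp://b.example.com', 230)]
```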
Alerting¶
Prometheus Alerts¶
groups:
  - name: ans-resolver
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(ans_resolver_requests_total{status="failure"}[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} req/s"
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum(rate(ans_resolver_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s"
      - alert: CacheMissRate
        expr: rate(ans_cache_misses_total[5m]) / (rate(ans_cache_hits_total[5m]) + rate(ans_cache_misses_total[5m])) > 0.8
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "High cache miss rate"
          description: "Cache miss rate is {{ $value }}"
      - alert: ServiceDown
        expr: up{job="ans-resolver"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ANS Resolver is down"
AlertManager Configuration¶
route:
  receiver: 'team'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: 'team'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}'
Tracing¶
OpenTelemetry¶
telemetry:
  tracing:
    enabled: true
    provider: otlp              # otlp (default)
    serviceName: ans-resolver   # Service name in traces
    sampleRate: 0.1             # 10% sampling (0.0-1.0)
    otlp:
      endpoint: otel-collector:4317
      insecure: true            # Use insecure connection (dev only)
      headers: ""               # Optional custom headers
Jaeger Integration¶
telemetry:
  tracing:
    enabled: true
    provider: jaeger
    serviceName: ans-resolver
    sampleRate: 1.0
    jaeger:
      endpoint: jaeger:14268
      agentHost: localhost
      agentPort: 6831
View traces at:
- Jaeger: http://localhost:16686
- Zipkin: http://localhost:9411
- Grafana Tempo: via Grafana dashboard
Debugging¶
Debug Endpoints¶
# Enable debug logging (quote the URL so the shell does not glob the ?)
curl -X POST "http://localhost:8080/debug/log-level?level=debug"
# Get pprof profiles
curl http://localhost:8080/debug/pprof/profile > cpu.prof
curl http://localhost:8080/debug/pprof/heap > mem.prof
# Analyze with go tool pprof
go tool pprof cpu.prof
Request Tracing¶
Add trace headers for detailed logging:
curl -H "X-Trace-ID: abc123" \
  "http://localhost:8080/v1/resolve?name=..."
Next Steps¶
- Security - Security configuration
- Troubleshooting - Debug common issues
- Deployment - Production deployment