Technical · 25 min read

Cloud Native Monitoring and Observability

Implement a complete observability stack with Prometheus, Grafana, Loki, and Jaeger. Metrics, logs, and traces in one place.

Miguel Torres

SRE Lead

Observability: The Three Pillars of Modern Monitoring

At InfraUX, observability is not optional; it is a necessity. This is our approach to implementing end-to-end monitoring in cloud native architectures.

The Three Pillars of Observability

<div class="info-box"> 📊 **Metrics**: numeric data aggregated over time (CPU, memory, latency) 📝 **Logs**: discrete events with detailed context 🔍 **Traces**: requests followed across distributed systems </div>

A Complete Observability Stack

```yaml
# docker-compose.yml - Local stack for development
version: '3.8'

services:
  # Metrics
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    ports:
      - "9090:9090"

  # Visualization
  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    ports:
      - "3000:3000"

  # Logs
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml
    volumes:
      - loki_data:/loki

  # Traces
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411
    ports:
      - "5775:5775/udp"
      - "6831:6831/udp"
      - "6832:6832/udp"
      - "5778:5778"
      - "16686:16686"
      - "14268:14268"
      - "14250:14250"
      - "9411:9411"

volumes:
  prometheus_data:
  grafana_data:
  loki_data:
```

Prometheus Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-east-1'

# Alerting
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Rules
rule_files:
  - "alerts/*.yml"
  - "recording/*.yml"

# Scrape configs
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # Kubernetes metrics
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  # Application metrics
  - job_name: 'apps'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```
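The `relabel_configs` entries above decide which discovered targets are scraped. To make the `keep` semantics concrete, here is a simplified, dependency-free model of how Prometheus evaluates one: source label values are joined with `;` and the target survives only if the regex matches the entire joined string (Prometheus anchors relabel regexes implicitly). The function name is ours, for illustration only.

```javascript
// Simplified model of a Prometheus 'keep' relabel rule (illustration only).
// Source label values are joined with ';' and the regex must match the
// whole joined string, mirroring Prometheus's implicit anchoring.
function keepTarget(labels, sourceLabels, regex) {
  const joined = sourceLabels.map((l) => labels[l] ?? '').join(';');
  return new RegExp(`^(?:${regex})$`).test(joined);
}

// Mirrors the 'kubernetes-apiservers' rule in the config above
const rule = {
  sourceLabels: [
    '__meta_kubernetes_namespace',
    '__meta_kubernetes_service_name',
    '__meta_kubernetes_endpoint_port_name',
  ],
  regex: 'default;kubernetes;https',
};

const apiServer = {
  __meta_kubernetes_namespace: 'default',
  __meta_kubernetes_service_name: 'kubernetes',
  __meta_kubernetes_endpoint_port_name: 'https',
};
const appPod = {
  __meta_kubernetes_namespace: 'apps',
  __meta_kubernetes_service_name: 'api',
  __meta_kubernetes_endpoint_port_name: 'http',
};

console.log(keepTarget(apiServer, rule.sourceLabels, rule.regex)); // true
console.log(keepTarget(appPod, rule.sourceLabels, rule.regex));    // false
```

The same joined-string-plus-regex mechanism underlies `replace` actions, which write a capture group into `target_label` instead of dropping the target.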

Instrumenting Applications

1. Metrics with Prometheus (Node.js)

```javascript
// metrics.js
const client = require('prom-client');
const express = require('express');
const app = express();

// Create a registry
const register = new client.Registry();

// Default metrics (CPU, memory, event loop, etc.)
client.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
});
register.registerMetric(httpRequestDuration);

const httpRequestTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});
register.registerMetric(httpRequestTotal);

// Tracking middleware
app.use((req, res, next) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = Date.now() - start;
    httpRequestDuration
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .observe(duration / 1000);

    httpRequestTotal
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .inc();
  });

  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
```

2. Structured Logging with Loki

```javascript
// logging.js
const winston = require('winston');
const LokiTransport = require('winston-loki');

const logger = winston.createLogger({
  format: winston.format.json(),
  defaultMeta: {
    service: 'api-service',
    environment: process.env.NODE_ENV
  },
  transports: [
    new LokiTransport({
      host: 'http://loki:3100',
      labels: { job: 'api-service' },
      json: true,
      format: winston.format.json(),
      replaceTimestamp: true,
      onConnectionError: (err) => console.error(err)
    }),
    new winston.transports.Console({
      format: winston.format.combine(
        winston.format.colorize(),
        winston.format.simple()
      )
    })
  ]
});

// Usage
logger.info('User logged in', {
  userId: user.id,
  email: user.email,
  ip: req.ip,
  userAgent: req.headers['user-agent']
});

// Error logging with context
logger.error('Database connection failed', {
  error: err.message,
  stack: err.stack,
  query: query,
  duration: Date.now() - startTime
});
```
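Stripped of the winston and Loki machinery, structured logging boils down to emitting one JSON object per event with shared default metadata merged in; Loki indexes the labels and stores the line as-is. A dependency-free sketch of that shape (function and field names are ours, for illustration):

```javascript
// Minimal sketch of what a structured JSON logger emits per event.
// Per-call fields are merged over shared default metadata.
function makeLogger(defaultMeta) {
  return {
    log(level, message, fields = {}) {
      const entry = {
        level,
        message,
        timestamp: new Date().toISOString(),
        ...defaultMeta,
        ...fields,
      };
      console.log(JSON.stringify(entry)); // one JSON object per line
      return entry;
    },
  };
}

const logger = makeLogger({ service: 'api-service', environment: 'development' });
const entry = logger.log('info', 'User logged in', { userId: 42 });
// entry carries both the default metadata and the per-call fields
```

Because every line is self-describing JSON, queries can filter on any field without the brittle regex parsing that unstructured logs require.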

3. Distributed Tracing con Jaeger

```javascript
// tracing.js
const { context, trace, SpanStatusCode } = require('@opentelemetry/api');
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { Resource } = require('@opentelemetry/resources');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

// Configure the provider
const provider = new NodeTracerProvider({
  resource: new Resource({
    'service.name': 'api-service',
    'service.version': '1.0.0',
    'deployment.environment': process.env.NODE_ENV
  })
});

// Configure the exporter
const jaegerExporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces',
});

// Add the span processor
provider.addSpanProcessor(new BatchSpanProcessor(jaegerExporter));

// Register the provider globally
provider.register();

// Auto-instrumentation
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation({
      requestHook: (span, request) => {
        span.setAttributes({
          'http.request.body': JSON.stringify(request.body)
        });
      }
    }),
    new ExpressInstrumentation(),
  ],
});

// Manual tracing
const tracer = provider.getTracer('api-service');

async function processPayment(userId, amount) {
  const span = tracer.startSpan('process-payment', {
    attributes: {
      'user.id': userId,
      'payment.amount': amount,
      'payment.currency': 'USD'
    }
  });
  // Make the payment span the parent of the child spans below
  const ctx = trace.setSpan(context.active(), span);

  try {
    // Validation
    const validationSpan = tracer.startSpan('validate-payment', undefined, ctx);
    await validatePayment(userId, amount);
    validationSpan.end();

    // Processing
    const processingSpan = tracer.startSpan('charge-payment', undefined, ctx);
    const result = await chargeCard(userId, amount);
    processingSpan.end();

    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    throw error;
  } finally {
    span.end();
  }
}
```

Grafana Dashboards

Application Dashboard

```json
{
  "dashboard": {
    "title": "Application Performance",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{
          "expr": "sum(rate(http_requests_total[5m])) by (method, route)"
        }],
        "type": "graph"
      },
      {
        "title": "Response Time P95",
        "targets": [{
          "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))"
        }],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [{
          "expr": "sum(rate(http_requests_total{status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))"
        }],
        "type": "stat"
      },
      {
        "title": "Active Connections",
        "targets": [{
          "expr": "nodejs_active_handles_total"
        }],
        "type": "gauge"
      }
    ]
  }
}
```
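The P95 panel leans on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket where the target rank falls. To make the estimate less magical, here is a simplified, dependency-free reimplementation of that interpolation (real Prometheus also handles the `+Inf` bucket, counter resets, and NaN cases; the function name is ours):

```javascript
// Simplified version of Prometheus's histogram_quantile estimation.
// Buckets are { le: upperBound, count: cumulativeCount }, sorted by le.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total; // position of the target observation
  for (let i = 0; i < buckets.length; i++) {
    if (buckets[i].count >= rank) {
      const lower = i === 0 ? 0 : buckets[i - 1].le;
      const prevCount = i === 0 ? 0 : buckets[i - 1].count;
      const inBucket = buckets[i].count - prevCount;
      if (inBucket === 0) return buckets[i].le;
      // Linear interpolation between the bucket's bounds
      return lower + ((buckets[i].le - lower) * (rank - prevCount)) / inBucket;
    }
  }
  return buckets[buckets.length - 1].le;
}

// 100 requests: 60 under 0.1s, 30 more under 0.5s, 10 more under 1s
const buckets = [
  { le: 0.1, count: 60 },
  { le: 0.5, count: 90 },
  { le: 1.0, count: 100 },
];
console.log(histogramQuantile(0.95, buckets)); // 0.75
```

This is also why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile lands in, so choose `buckets` around the latencies you actually care about.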

Intelligent Alerts

```yaml
# alerts/application.yml
groups:
  - name: application
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
            /
          sum(rate(http_requests_total[5m])) by (service)
            > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for the last 5 minutes"

      - alert: HighResponseTime
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High response time on {{ $labels.service }}"
          description: "95th percentile response time is {{ $value }}s"

      - alert: PodCrashLooping
        expr: |
          rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
          description: "Pod has restarted {{ $value }} times in the last 15 minutes"
```

Data Correlation

```javascript
// Correlating metrics, logs, and traces through a shared trace ID
const crypto = require('crypto');

class ObservabilityContext {
  constructor() {
    this.traceId = this.generateTraceId();
    this.currentSpanId = null;
  }

  generateTraceId() {
    // 128-bit random ID, hex-encoded
    return crypto.randomBytes(16).toString('hex');
  }

  // Attach the trace ID to log entries
  enrichLog(logData) {
    return {
      ...logData,
      traceId: this.traceId,
      spanId: this.currentSpanId,
      timestamp: new Date().toISOString()
    };
  }

  // Attach the trace ID to metric labels
  // (prefer exemplars for this: trace IDs are extremely high-cardinality)
  enrichMetric(labels) {
    return {
      ...labels,
      trace_id: this.traceId
    };
  }

  // Express middleware
  middleware() {
    return (req, res, next) => {
      req.observability = new ObservabilityContext();

      // Propagate the trace ID via a response header
      res.setHeader('X-Trace-Id', req.observability.traceId);

      next();
    };
  }
}
```

Best Practices

| Practice | Description | Benefit |
|---|---|---|
| USE Method | Utilization, Saturation, Errors | Fast diagnosis |
| RED Method | Rate, Errors, Duration | Service-level metrics |
| Golden Signals | Latency, Traffic, Errors, Saturation | Complete SLIs |
| Cardinality Control | Limit unique labels | Controlled costs |
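As a concrete illustration of the RED method's arithmetic, here is a dependency-free sketch that reduces a window of observed requests to Rate, Errors, and Duration; the 5% error threshold mirrors the `HighErrorRate` alert above (function and field names are ours):

```javascript
// RED summary over a window of requests: { durationMs, statusCode }.
function redSummary(requests, windowSeconds) {
  const errors = requests.filter((r) => r.statusCode >= 500).length;
  const sorted = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  const p95Index = Math.min(sorted.length - 1, Math.floor(sorted.length * 0.95));
  return {
    rate: requests.length / windowSeconds, // Rate: requests per second
    errorRatio: errors / requests.length,  // Errors: fraction of 5xx responses
    p95DurationMs: sorted[p95Index],       // Duration: 95th percentile latency
  };
}

// 100 requests over 60s: 94 fast successes, 6 slow server errors
const requests = [
  ...Array.from({ length: 94 }, (_, i) => ({ durationMs: 50 + i, statusCode: 200 })),
  ...Array.from({ length: 6 }, () => ({ durationMs: 900, statusCode: 500 })),
];

const summary = redSummary(requests, 60);
console.log(summary.errorRatio > 0.05); // true: would fire HighErrorRate
```

In production these three numbers come from PromQL over `http_requests_total` and `http_request_duration_seconds_bucket` rather than raw request lists, but the reduction is the same.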

Cost Optimization

<div class="warning-box"> ⚠️ **Cardinality Control**: avoid high-cardinality labels (unique IDs, timestamps); they can blow up storage costs. </div>
```javascript
// ❌ BAD: high cardinality
httpRequestTotal.labels(userId, timestamp, sessionId).inc();

// ✅ GOOD: controlled cardinality
httpRequestTotal.labels(userType, endpoint, statusCode).inc();
```
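The reason this matters is that cardinality is multiplicative: a metric produces up to one time series per combination of label values, so the series count is the product of the distinct values of each label. A quick sketch of that arithmetic (function name is ours):

```javascript
// Potential series for one metric = product of distinct values per label.
function seriesCount(labelCardinalities) {
  return Object.values(labelCardinalities).reduce((acc, n) => acc * n, 1);
}

// Controlled: 3 user types × 20 endpoints × 10 status codes = 600 series
console.log(seriesCount({ userType: 3, endpoint: 20, statusCode: 10 }));

// Uncontrolled: one userId label with 100k distinct values dominates everything
console.log(seriesCount({ userId: 100000, endpoint: 20, statusCode: 10 }));
```

A single unbounded label can turn hundreds of series into tens of millions, which is why IDs and timestamps belong in logs and traces, not in metric labels.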

The ROI of Observability

  • MTTR reduced by 70%: from an average of 4 hours down to 1.2 hours
  • Incidents prevented: 40% fewer incidents in production
  • Cost savings: $50k/month in engineering time

Observability is not an expense; it is an investment in the reliability and efficiency of your infrastructure.

#monitoring #observability #prometheus #grafana #jaeger