Technical • 25 min
Cloud Native Monitoring and Observability
Implement a complete observability stack with Prometheus, Grafana, Loki, and Jaeger. Metrics, logs, and traces in one place.
Miguel Torres
SRE Lead
Observability: The Three Pillars of Modern Monitoring
At InfraUX, observability is not optional; it is a necessity. This is our approach to implementing complete monitoring in cloud native architectures.
The Three Pillars of Observability
- 📊 **Metrics**: Numeric data aggregated over time (CPU, memory, latency)
- 📝 **Logs**: Discrete events with detailed context
- 🔍 **Traces**: Tracking of requests across distributed systems

A Complete Observability Stack
```yaml
# docker-compose.yml - Local stack for development
version: '3.8'

services:
  # Metrics
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    ports:
      - "9090:9090"

  # Visualization
  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    ports:
      - "3000:3000"

  # Logs
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml
    volumes:
      - loki_data:/loki

  # Traces
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411
    ports:
      - "5775:5775/udp"
      - "6831:6831/udp"
      - "6832:6832/udp"
      - "5778:5778"
      - "16686:16686"
      - "14268:14268"
      - "14250:14250"
      - "9411:9411"

volumes:
  prometheus_data:
  grafana_data:
  loki_data:
```
Prometheus Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-east-1'

# Alerting
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Rules
rule_files:
  - "alerts/*.yml"
  - "recording/*.yml"

# Scrape configs
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # Kubernetes metrics
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  # Application metrics
  - job_name: 'apps'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```
Application Instrumentation
1. Metrics with Prometheus (Node.js)
```javascript
// metrics.js
const client = require('prom-client');
const express = require('express');
const app = express();

// Create a registry
const register = new client.Registry();

// Default metrics (CPU, memory, etc.)
client.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
});
register.registerMetric(httpRequestDuration);

const httpRequestTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});
register.registerMetric(httpRequestTotal);

// Tracking middleware
app.use((req, res, next) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = Date.now() - start;
    httpRequestDuration
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .observe(duration / 1000);

    httpRequestTotal
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .inc();
  });

  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
```
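As an alternative to the manual `Date.now()` arithmetic above, `prom-client` histograms expose a `startTimer()` helper that records the elapsed seconds when invoked. A minimal sketch, reusing the `app`, `httpRequestDuration`, and `httpRequestTotal` objects defined above:

```javascript
// Sketch: timing middleware built on prom-client's startTimer() helper.
// Assumes app, httpRequestDuration, and httpRequestTotal from metrics.js above.
app.use((req, res, next) => {
  // startTimer() returns a function that observes the duration (in seconds) when called
  const endTimer = httpRequestDuration.startTimer();

  res.on('finish', () => {
    const labels = {
      method: req.method,
      route: req.route?.path || req.path,
      status_code: res.statusCode
    };
    endTimer(labels);             // observes elapsed seconds into the histogram
    httpRequestTotal.inc(labels); // counters also accept a labels object
  });

  next();
});
```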
2. Structured Logging with Loki
```javascript
// logging.js
const winston = require('winston');
const LokiTransport = require('winston-loki');

const logger = winston.createLogger({
  format: winston.format.json(),
  defaultMeta: {
    service: 'api-service',
    environment: process.env.NODE_ENV
  },
  transports: [
    new LokiTransport({
      host: 'http://loki:3100',
      labels: { job: 'api-service' },
      json: true,
      format: winston.format.json(),
      replaceTimestamp: true,
      onConnectionError: (err) => console.error(err)
    }),
    new winston.transports.Console({
      format: winston.format.combine(
        winston.format.colorize(),
        winston.format.simple()
      )
    })
  ]
});

// Usage
logger.info('User logged in', {
  userId: user.id,
  email: user.email,
  ip: req.ip,
  userAgent: req.headers['user-agent']
});

// Error logging with context
logger.error('Database connection failed', {
  error: err.message,
  stack: err.stack,
  query: query,
  duration: Date.now() - startTime
});
```
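To avoid repeating request context on every log call, winston's `child()` can bind per-request metadata once. A minimal sketch (the `X-Request-Id` header handling and the example route are assumptions, not from the original; `app` and `logger` come from the sections above):

```javascript
// Sketch: request-scoped child logger (assumed helper, not from the original article)
const { randomUUID } = require('crypto');

app.use((req, res, next) => {
  // Reuse an upstream request ID if present, otherwise generate one (assumption)
  const requestId = req.headers['x-request-id'] || randomUUID();

  // child() returns a logger that merges this metadata into every entry it writes
  req.log = logger.child({ requestId, method: req.method, path: req.path });
  res.setHeader('X-Request-Id', requestId);
  next();
});

// Usage: every log line now carries the request context automatically
app.get('/orders/:id', (req, res) => {
  req.log.info('Fetching order', { orderId: req.params.id });
  res.json({ ok: true });
});
```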
3. Distributed Tracing with Jaeger
```javascript
// tracing.js
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { Resource } = require('@opentelemetry/resources');
const { trace, context, SpanStatusCode } = require('@opentelemetry/api');

// Configure the provider
const provider = new NodeTracerProvider({
  resource: new Resource({
    'service.name': 'api-service',
    'service.version': '1.0.0',
    'deployment.environment': process.env.NODE_ENV
  })
});

// Configure the exporter
const jaegerExporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces',
});

// Add the span processor
provider.addSpanProcessor(new BatchSpanProcessor(jaegerExporter));

// Register the provider globally
provider.register();

// Auto-instrumentation
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation({
      requestHook: (span, request) => {
        span.setAttributes({
          'http.request.body': JSON.stringify(request.body)
        });
      }
    }),
    new ExpressInstrumentation(),
  ],
});

// Manual tracing
const tracer = provider.getTracer('api-service');

async function processPayment(userId, amount) {
  const span = tracer.startSpan('process-payment', {
    attributes: {
      'user.id': userId,
      'payment.amount': amount,
      'payment.currency': 'USD'
    }
  });

  // Make the new span the parent context for the child spans below
  const ctx = trace.setSpan(context.active(), span);

  try {
    // Validation
    const validationSpan = tracer.startSpan('validate-payment', {}, ctx);
    await validatePayment(userId, amount);
    validationSpan.end();

    // Processing
    const processingSpan = tracer.startSpan('charge-payment', {}, ctx);
    const result = await chargeCard(userId, amount);
    processingSpan.end();

    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    throw error;
  } finally {
    span.end();
  }
}
```
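The 1.x OpenTelemetry API also offers `tracer.startActiveSpan()`, which manages parent/child context automatically instead of passing a context by hand. A hedged sketch of the same payment flow using that pattern (the `validatePayment` and `chargeCard` helpers are the ones referenced, but not defined, in the original example):

```javascript
// Sketch: the same flow using startActiveSpan for automatic context propagation
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const apiTracer = trace.getTracer('api-service');

async function processPaymentActive(userId, amount) {
  // startActiveSpan makes the span active for the callback,
  // so spans started inside it become its children automatically
  return apiTracer.startActiveSpan('process-payment', async (span) => {
    try {
      await apiTracer.startActiveSpan('validate-payment', async (child) => {
        await validatePayment(userId, amount); // helper from the original example
        child.end();
      });

      const result = await apiTracer.startActiveSpan('charge-payment', async (child) => {
        const r = await chargeCard(userId, amount); // helper from the original example
        child.end();
        return r;
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      span.end();
    }
  });
}
```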
Grafana Dashboards
Application Dashboard
```json
{
  "dashboard": {
    "title": "Application Performance",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{
          "expr": "sum(rate(http_requests_total[5m])) by (method, route)"
        }],
        "type": "graph"
      },
      {
        "title": "Response Time P95",
        "targets": [{
          "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))"
        }],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [{
          "expr": "sum(rate(http_requests_total{status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))"
        }],
        "type": "stat"
      },
      {
        "title": "Active Connections",
        "targets": [{
          "expr": "nodejs_active_handles_total"
        }],
        "type": "gauge"
      }
    ]
  }
}
```
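Dashboards like this one can also be provisioned programmatically through Grafana's HTTP API (`POST /api/dashboards/db`). A minimal sketch; the Grafana URL and service account token are assumptions, and it relies on Node 18+'s global `fetch`:

```javascript
// Sketch: pushing a dashboard definition via Grafana's HTTP API
// GRAFANA_URL and GRAFANA_TOKEN are assumptions (e.g. a service account token)
const GRAFANA_URL = process.env.GRAFANA_URL || 'http://localhost:3000';
const GRAFANA_TOKEN = process.env.GRAFANA_TOKEN;

async function pushDashboard(dashboard) {
  const res = await fetch(`${GRAFANA_URL}/api/dashboards/db`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${GRAFANA_TOKEN}`
    },
    // overwrite: true replaces an existing dashboard with the same uid/title
    body: JSON.stringify({ dashboard, overwrite: true })
  });

  if (!res.ok) {
    throw new Error(`Grafana API returned ${res.status}: ${await res.text()}`);
  }
  return res.json();
}
```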
Smart Alerts
```yaml
# alerts/application.yml
groups:
  - name: application
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for the last 5 minutes"

      - alert: HighResponseTime
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High response time on {{ $labels.service }}"
          description: "95th percentile response time is {{ $value }}s"

      - alert: PodCrashLooping
        expr: |
          rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
          description: "Pod has restarted {{ $value }} times in the last 15 minutes"
```
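When Alertmanager routes these alerts to a webhook receiver, the payload carries the labels and annotations defined above. A minimal Express sketch of such a receiver; the `/alerts` path and port 8081 are assumptions:

```javascript
// Sketch: minimal Alertmanager webhook receiver (path and port are assumptions)
const express = require('express');
const app = express();
app.use(express.json());

app.post('/alerts', (req, res) => {
  // Alertmanager's webhook payload contains a group status and an array of alerts
  const { status, alerts = [] } = req.body;

  for (const alert of alerts) {
    // Labels and annotations come from the Prometheus rules defined above
    console.log(
      `[${status}] ${alert.labels.alertname} (${alert.labels.severity || 'n/a'}):`,
      alert.annotations.summary
    );
  }

  res.status(200).end();
});

app.listen(8081); // assumed port
```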
Data Correlation
```javascript
// Correlating metrics, logs, and traces
const { randomUUID } = require('crypto');

class ObservabilityContext {
  constructor() {
    this.traceId = this.generateTraceId();
  }

  // Simple ID generator (in practice, reuse the active OpenTelemetry trace ID)
  generateTraceId() {
    return randomUUID().replace(/-/g, '');
  }

  // Attach the trace ID to log entries
  enrichLog(logData) {
    return {
      ...logData,
      traceId: this.traceId,
      spanId: this.currentSpanId,
      timestamp: new Date().toISOString()
    };
  }

  // Attach the trace ID to metric labels
  enrichMetric(labels) {
    return {
      ...labels,
      trace_id: this.traceId
    };
  }

  // Express middleware
  middleware() {
    return (req, res, next) => {
      req.observability = new ObservabilityContext();

      // Propagate the trace ID via response headers
      res.setHeader('X-Trace-Id', req.observability.traceId);

      next();
    };
  }
}
```
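A brief usage sketch for the class above, assuming the Express `app` and the winston `logger` from the earlier sections (the `/checkout` route is illustrative only):

```javascript
// Sketch: wiring ObservabilityContext into an Express app
const ctx = new ObservabilityContext();
app.use(ctx.middleware());

app.get('/checkout', (req, res) => {
  // Log lines now carry the same trace ID returned in the X-Trace-Id header
  logger.info('Checkout started', req.observability.enrichLog({ userId: req.user?.id }));

  // Note: a trace ID is unique per request, so using enrichMetric() for labels is
  // high-cardinality; prefer logs (or exemplars) for per-request correlation.
  res.json({ ok: true });
});
```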
Best Practices
| Practice | Description | Benefit |
|---|---|---|
| USE Method | Utilization, Saturation, Errors | Fast diagnosis |
| RED Method | Rate, Errors, Duration | Service-level metrics |
| Golden Signals | Latency, Traffic, Errors, Saturation | Complete SLIs |
| Cardinality Control | Limit unique label values | Controlled costs |
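Of the four Golden Signals, the instrumentation shown earlier already covers traffic, errors, and latency; saturation can be approximated with a gauge of in-flight requests. A minimal `prom-client` sketch (the metric name is an assumption; `app` and `register` come from the metrics section):

```javascript
// Sketch: in-flight requests gauge as a rough saturation signal
const client = require('prom-client');

const inFlightRequests = new client.Gauge({
  name: 'http_requests_in_flight',  // assumed name, not from the original
  help: 'Number of HTTP requests currently being handled',
  labelNames: ['method'],
  registers: [register]             // reuse the registry from metrics.js
});

app.use((req, res, next) => {
  inFlightRequests.inc({ method: req.method });
  res.on('finish', () => inFlightRequests.dec({ method: req.method }));
  next();
});
```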
Cost Optimization
⚠️ **Cardinality control**: Avoid high-cardinality labels (unique IDs, timestamps); they can blow up storage costs.

```javascript
// ❌ BAD: high cardinality
httpRequestTotal.labels(userId, timestamp, sessionId).inc();

// ✅ GOOD: controlled cardinality
httpRequestTotal.labels(userType, endpoint, statusCode).inc();
```
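A common source of label explosion is the route label itself when raw URLs contain IDs. A hedged sketch of normalizing paths before using them as labels (the regex patterns and helper name are assumptions):

```javascript
// Sketch: collapse dynamic path segments so the route label stays bounded
function normalizeRoute(path) {
  return path
    // UUIDs -> :id
    .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, ':id')
    // numeric IDs -> :id
    .replace(/\/\d+(?=\/|$)/g, '/:id');
}

// Usage: label with the normalized template, never the raw URL
httpRequestTotal.labels('GET', normalizeRoute('/users/12345/orders/99'), 200).inc();
// -> labels: method="GET", route="/users/:id/orders/:id", status_code="200"
```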
Observability ROI
- MTTR reduced by 70%: from 4 hours to 1.2 hours on average
- Incidents prevented: 40% fewer production incidents
- Cost savings: $50k/month in engineering time
Observability is not an expense; it is an investment in the reliability and efficiency of your infrastructure.
#monitoring #observability #prometheus #grafana #jaeger