Technical · 18 min read
Kubernetes in Production: A Best Practices Guide
Everything we learned deploying Kubernetes in production, from cluster configuration to monitoring, security, and troubleshooting.
Carlos Rodríguez
CEO & Co-founder
Kubernetes in Production: Lessons from the Battlefield
After managing Kubernetes clusters that serve millions of requests per day, these are the most valuable lessons we have learned.
Cluster Architecture
Recommended Topology
# cluster-topology.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    environment: production
    team: platform
---
# Namespace separation (illustrative list, not a manifest you can apply)
namespaces:
  - kube-system    # System components
  - kube-public    # Public resources
  - default        # Do not use in production
  - production     # Production applications
  - staging        # Staging environment
  - monitoring     # Monitoring stack
  - ingress-nginx  # Ingress controller
  - cert-manager   # Certificate management
Resource Management
1. Always Define Limits and Requests
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: production
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: infraux/api:v1.2.3
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          # Liveness and readiness probes are critical
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
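Per-workload requests and limits are easy to forget. As a safety net, a LimitRange can apply namespace-level defaults to any container that omits them. The sketch below is an assumption layered on the Deployment above: the object name `production-defaults` is hypothetical, and the default values simply mirror the ones used for `api-server`.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: production-defaults  # hypothetical name
  namespace: production
spec:
  limits:
    - type: Container
      # Used as requests when a container declares none
      defaultRequest:
        cpu: 250m
        memory: 256Mi
      # Used as limits when a container declares none
      default:
        cpu: 500m
        memory: 512Mi
```

Explicit values in a pod spec still take precedence; the LimitRange only fills in what is missing.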
2. Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production  # Must be the namespace of the target Deployment
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    # Custom metric (requires a custom metrics adapter to be installed)
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max
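For intuition about the utilization targets above, the Kubernetes documentation describes the core scaling rule as a ratio of the current metric value to the target value. A worked example, assuming three replicas averaging 90% CPU utilization against the 70% target (illustrative numbers, not measurements):

$$
\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil = \left\lceil 3 \times \frac{90}{70} \right\rceil = \lceil 3.86 \rceil = 4
$$

When several metrics are configured, the HPA computes a proposal for each one and uses the largest. The `behavior` policies then cap how quickly the replica count may move toward that number.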
Security in Kubernetes
1. Network Policies
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-server-netpol
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        # Only allow traffic from the ingress controller namespace...
        - namespaceSelector:
            matchLabels:
              # Label set automatically on every namespace (Kubernetes 1.21+)
              kubernetes.io/metadata.name: ingress-nginx
        # ...and from frontend pods in this namespace
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
  egress:
    # Allow DNS (UDP, plus TCP for large responses)
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # Allow access to the database
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432
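A policy like the one above only restricts the pods it selects; pods that no policy selects remain fully open. A common complement is a namespace-wide default-deny policy, so that every allowed flow has to be declared explicitly. A minimal sketch, assuming the hypothetical name `default-deny-all` and a CNI plugin that actually enforces NetworkPolicy:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all  # hypothetical name
  namespace: production
spec:
  podSelector: {}  # An empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
  # No ingress or egress rules are listed, so all traffic is denied
  # unless another policy explicitly allows it
```

Network policies are additive, so `api-server-netpol` keeps working alongside this one. Every other workload in the namespace then needs its own allow rules, including DNS egress.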
2. Pod Security Standards
apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      # Prefer a pinned tag or digest over :latest in production
      image: infraux/app:latest
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
      # Writable scratch space, needed because the root filesystem is read-only
      volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: cache
          mountPath: /app/cache
  volumes:
    - name: tmp
      emptyDir: {}
    - name: cache
      emptyDir: {}
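The pod spec above follows the main requirements of the `restricted` Pod Security Standard, but nothing forces other pods to do the same. Pod Security Admission, which is built into Kubernetes and stable since 1.25, can enforce a standard through namespace labels. A sketch, assuming you want `restricted` enforced on the `production` namespace defined earlier:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    environment: production
    team: platform
    # Reject pods that violate the restricted profile
    pod-security.kubernetes.io/enforce: restricted
    # Also surface violations as client warnings and audit annotations
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```

Starting with only `warn` and `audit` and adding `enforce` later is a gentler rollout, since it shows which existing workloads would be rejected.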
Monitoring and Observability
Complete Stack with Prometheus
# prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
    resources:
      requests:
        memory: 2Gi
        cpu: 1
      limits:
        memory: 4Gi
        cpu: 2
    # Extra scrape job: keep only pods annotated with prometheus.io/scrape: "true"
    additionalScrapeConfigs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: "true"

alertmanager:
  config:
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'slack-critical'
      routes:
        - match:
            severity: critical
          receiver: pagerduty-critical
    receivers:
      - name: 'slack-critical'
        slack_configs:
          - api_url: 'YOUR_SLACK_WEBHOOK'
            channel: '#alerts-critical'
      # Every receiver referenced by a route must be defined (placeholder key)
      - name: 'pagerduty-critical'
        pagerduty_configs:
          - routing_key: 'YOUR_PAGERDUTY_ROUTING_KEY'
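The `kubernetes-pods` job above keeps only pods that carry the `prometheus.io/scrape` annotation with the value `true`. On the workload side, that looks roughly like the fragment below, which is an excerpt of a pod template rather than a complete manifest. The scrape job as shown reads only this one annotation; honoring something like `prometheus.io/port` or `prometheus.io/path` would need additional relabel rules.

```yaml
# Excerpt from a Deployment's spec.template
metadata:
  labels:
    app: api-server
  annotations:
    # Annotation values are strings, so the quotes are required
    prometheus.io/scrape: "true"
```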
Essential Dashboards
{
  "dashboard": {
    "title": "Kubernetes Cluster Health",
    "panels": [
      {
        "title": "CPU Usage by Namespace",
        "targets": [{
          "expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"\"}[5m])) by (namespace)"
        }]
      },
      {
        "title": "Memory Usage by Pod",
        "targets": [{
          "expr": "sum(container_memory_working_set_bytes{container!=\"\"}) by (pod, namespace)"
        }]
      },
      {
        "title": "Pod Restart Count",
        "targets": [{
          "expr": "sum(increase(kube_pod_container_status_restarts_total[1h])) by (pod)"
        }]
      }
    ]
  }
}
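The restart-count query in the last panel can also drive an alert. The sketch below assumes the Prometheus Operator and its `PrometheusRule` resource are installed, as they are with kube-prometheus-stack. The rule name and the `release` label value are assumptions to adapt, because the operator only loads rules whose labels match its configured rule selector. The threshold of 5 restarts per hour is the one this article lists among its critical alerts.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-restart-alerts  # hypothetical name
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # assumption: must match your rule selector
spec:
  groups:
    - name: pod-health
      rules:
        - alert: PodRestartingFrequently
          expr: sum(increase(kube_pod_container_status_restarts_total[1h])) by (namespace, pod) > 5
          labels:
            severity: critical  # routed to the critical receiver by Alertmanager
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted more than 5 times in 1 hour"
```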
Configuration Management
1. ConfigMaps and Secrets
# Use Sealed Secrets so encrypted values can be stored safely in Git
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: api-secrets
  namespace: production
spec:
  encryptedData:
    database-url: AgBvV2x5M3J0... # Encrypted
    api-key: AgCmV9x2M5K1... # Encrypted
---
# ConfigMap for non-sensitive configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: api-config
  namespace: production
data:
  config.yaml: |
    server:
      port: 8080
      timeout: 30s
    features:
      rateLimit: true
      cache: true
    redis:
      host: redis-service
      port: 6379
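Once the Sealed Secrets controller decrypts `api-secrets`, it becomes a regular Secret with the same name and keys. A container might then consume both objects roughly as follows. This is an excerpt of a pod spec; the environment variable names and the mount path `/etc/api` are hypothetical, and it assumes `api-config` lives in the same namespace as the pod.

```yaml
# Excerpt from a Deployment's spec.template.spec
containers:
  - name: api
    image: infraux/api:v1.2.3
    env:
      - name: DATABASE_URL  # hypothetical variable name
        valueFrom:
          secretKeyRef:
            name: api-secrets
            key: database-url
      - name: API_KEY  # hypothetical variable name
        valueFrom:
          secretKeyRef:
            name: api-secrets
            key: api-key
    volumeMounts:
      - name: config
        mountPath: /etc/api  # hypothetical path; exposes /etc/api/config.yaml
        readOnly: true
volumes:
  - name: config
    configMap:
      name: api-config
```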
2. Professional Helm Charts
# Chart.yaml
apiVersion: v2
name: infraux-app
description: InfraUX Application Helm Chart
type: application
version: 1.0.0
appVersion: "2.1.0"

# values.yaml (separate file)
replicaCount: 3

image:
  repository: infraux/app
  pullPolicy: IfNotPresent
  tag: "" # Overridden by CI/CD

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    # ingress-nginx rate limiting uses limit-rps / limit-rpm / limit-connections;
    # requests per second is assumed here
    nginx.ingress.kubernetes.io/limit-rps: "100"
  hosts:
    - host: api.infraux.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: api-tls
      hosts:
        - api.infraux.com

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80
Disaster Recovery
1. Backup with Velero
# Install Velero
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.5.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1

# Create a daily backup schedule (02:00, retained for 30 days)
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --include-namespaces production,staging \
  --ttl 720h
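A schedule only proves that backups are being written. To check that they can actually be restored, one option is to periodically restore the latest backup into a scratch namespace. A sketch using Velero's `Restore` resource, assuming Velero runs in the `velero` namespace and that `restore-test` is a hypothetical throwaway namespace:

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-drill  # hypothetical name
  namespace: velero
spec:
  # Restore from the most recent successful backup of this schedule
  scheduleName: daily-backup
  includedNamespaces:
    - production
  # Restore into a scratch namespace instead of overwriting production
  namespaceMapping:
    production: restore-test
```

After the restore completes, verifying that the workloads in the scratch namespace start and pass their health checks is what turns the backup into a tested one.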
2. Multi-Region Setup
# NLB-backed Service with cross-zone load balancing.
# On its own this spreads traffic across availability zones within one region;
# a multi-region setup would typically run it in each regional cluster
# behind DNS-based routing.
apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: production
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
spec:
  type: LoadBalancer
  selector:
    app: api-server
  ports:
    - port: 443
      targetPort: 8080
Advanced Troubleshooting
Essential Commands
# Debug pods that fail to start
kubectl describe pod <pod-name> -n production
kubectl logs <pod-name> -n production --previous

# Check node resources
kubectl top nodes
kubectl describe node <node-name>

# Debug networking (DNS resolution, then TCP connectivity to a service port)
kubectl exec -it <pod-name> -n production -- nslookup kubernetes.default
kubectl exec -it <pod-name> -n production -- curl -v telnet://<service-name>:<port>

# Review cluster events across all namespaces
kubectl get events --sort-by='.lastTimestamp' -A

# Debug RBAC permissions for a service account
kubectl auth can-i --list --as=system:serviceaccount:production:api-sa
Performance Optimizations
Area | Optimization | Impact |
---|---|---|
Image Size | Distroless images | -80% size |
Startup Time | Parallelized init tasks (init containers run in sequence) | -50% time |
DNS | NodeLocal DNSCache | -90% DNS latency |
Scheduling | Pod Topology Spread | +40% availability |
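As one illustration of the last row, topology spread constraints ask the scheduler to balance replicas across failure domains. A sketch for the `api-server` pods, assuming nodes carry the standard `topology.kubernetes.io/zone` label. The availability figure in the table is the authors' observation, not something this fragment guarantees.

```yaml
# Excerpt from a Deployment's spec.template.spec
topologySpreadConstraints:
  - maxSkew: 1  # Aim for at most one pod of difference between zones
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway  # Prefer balance, but never block scheduling
    labelSelector:
      matchLabels:
        app: api-server
```

Switching `whenUnsatisfiable` to `DoNotSchedule` turns the preference into a hard rule, at the cost of leaving pods pending when a zone has no capacity.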
Key Metrics to Monitor
<div class="warning-box">

⚠️ **Critical Alerts:**

- CPU/Memory > 80% for more than 5 minutes
- Pod restarts > 5 in 1 hour
- Node NotReady
- PVC nearly full (>85%)
- Certificate expiration < 7 days

</div>

Lessons Learned
<div class="info-box">

💡 **The 5 Commandments of K8s in Production:**

1. Never trust a pod without health checks
2. Resources without limits are time bombs
3. Monitoring is not optional; it is critical
4. Untested backups are not backups
5. Default security is insufficient

</div>

Essential Tools
- kubectl-neat: Cleans up YAML output for reuse
- k9s: Terminal UI for Kubernetes
- stern: Multi-pod log tailing
- kubectx/kubens: Fast context and namespace switching
- kustomize: Template-free configuration management
Kubernetes is powerful but complex. With these practices in place, you can sleep more soundly knowing your cluster is prepared for the failures it is most likely to face.