Technical · 18 min

Kubernetes in Production: A Best Practices Guide

Everything we learned deploying Kubernetes in production, from cluster configuration to monitoring, security, and troubleshooting.

Carlos Rodríguez

CEO & Co-founder

Kubernetes in Production: Lessons from the Battlefield

After managing Kubernetes clusters that serve millions of requests per day, these are the most valuable lessons we have learned.

Cluster Architecture

Recommended Topology

```yaml
# cluster-topology.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    environment: production
    team: platform
---
# Namespace separation (illustrative list, not a manifest you can apply)
namespaces:
  - kube-system     # System components
  - kube-public     # Public resources
  - default         # Do not use in production
  - production      # Production applications
  - staging         # Staging environment
  - monitoring      # Monitoring stack
  - ingress-nginx   # Ingress controller
  - cert-manager    # Certificate management
```

Resource Management

1. Always Define Limits and Requests

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: production
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: infraux/api:v1.2.3
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          # Liveness and readiness probes are critical
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
```
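Per-Deployment requests and limits only help if every workload sets them. One way to backstop that rule is a LimitRange, which fills in defaults for containers that omit them. The following is a minimal sketch, not something taken from our manifests: the object name is hypothetical and the default values simply mirror the api-server Deployment above.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-limits   # hypothetical name
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest:              # applied when a container omits requests
        cpu: "250m"
        memory: "256Mi"
      default:                     # applied when a container omits limits
        cpu: "500m"
        memory: "512Mi"
```

Defaults like these are a safety net rather than a substitute for sizing each workload from its observed usage.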

2. Horizontal Pod Autoscaler

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production   # must be the namespace of the target Deployment
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    # Custom metric (requires a custom metrics adapter in the cluster)
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
  behavior:
    scaleDown:
      # Scale down slowly: at most 10% of pods per minute after 5 stable minutes
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      # Scale up immediately, using whichever policy allows more pods
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max
```

Security in Kubernetes

1. Network Policies

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-server-netpol
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        # Allow traffic from the ingress controller namespace.
        # kubernetes.io/metadata.name is set automatically on every namespace;
        # a plain "name" label only works if you add it yourself.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
        # Also allow frontend pods in this namespace
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
  egress:
    # Allow DNS (UDP, plus TCP for large responses)
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # Allow access to the database
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432
```
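A policy like the one above only restricts the pods it selects; any other pod in the namespace stays fully open. A common companion, sketched here as an assumption rather than as part of our original setup, is a default-deny policy for the whole namespace, so that every workload needs an explicit allow rule. The policy name is hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all   # hypothetical name
  namespace: production
spec:
  podSelector: {}          # an empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```

Both policies take effect only when the cluster's network plugin enforces NetworkPolicy. Because this one also blocks DNS, each workload then needs its own egress rule for port 53, as api-server has above.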

2. Pod Security Standards

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      # In production, pin a specific tag or digest instead of "latest"
      image: infraux/app:latest
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
      # Writable paths must be mounted explicitly with a read-only root filesystem
      volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: cache
          mountPath: /app/cache
  volumes:
    - name: tmp
      emptyDir: {}
    - name: cache
      emptyDir: {}
```
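The pod spec above hardens a single workload. To have the cluster reject pods that skip these settings, the built-in Pod Security Admission controller can enforce a Pod Security Standard per namespace through labels. The sketch below assumes the `restricted` profile suits everything running in `production`, which you should verify first; the `warn` and `audit` modes can be enabled on their own to see what would be rejected.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    environment: production
    team: platform
    # Reject pods that violate the restricted profile
    pod-security.kubernetes.io/enforce: restricted
    # Surface violations as client warnings and in audit logs
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```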

Monitoring and Observability

Complete Stack with Prometheus

```yaml
# prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
    resources:
      requests:
        memory: 2Gi
        cpu: 1
      limits:
        memory: 4Gi
        cpu: 2
    # Extra scrape job: only pods annotated with prometheus.io/scrape: "true"
    additionalScrapeConfigs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: "true"

alertmanager:
  config:
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'slack-critical'
      routes:
        - match:
            severity: critical
          receiver: pagerduty-critical
    receivers:
      - name: 'slack-critical'
        slack_configs:
          - api_url: 'YOUR_SLACK_WEBHOOK'
            channel: '#alerts-critical'
      # Every receiver referenced by a route must be defined,
      # otherwise Alertmanager rejects the configuration
      - name: 'pagerduty-critical'
        pagerduty_configs:
          - routing_key: 'YOUR_PAGERDUTY_ROUTING_KEY'
```

Essential Dashboards

```json
{
  "dashboard": {
    "title": "Kubernetes Cluster Health",
    "panels": [
      {
        "title": "CPU Usage by Namespace",
        "targets": [{
          "expr": "sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)"
        }]
      },
      {
        "title": "Memory Usage by Pod",
        "targets": [{
          "expr": "sum(container_memory_working_set_bytes) by (pod, namespace)"
        }]
      },
      {
        "title": "Pod Restart Count",
        "targets": [{
          "expr": "sum(increase(kube_pod_container_status_restarts_total[1h])) by (pod)"
        }]
      }
    ]
  }
}
```

Configuration Management

1. ConfigMaps and Secrets

```yaml
# Use Sealed Secrets so that secrets can be stored encrypted in Git
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: api-secrets
  namespace: production
spec:
  encryptedData:
    database-url: AgBvV2x5M3J0...   # Encrypted
    api-key: AgCmV9x2M5K1...        # Encrypted
---
# ConfigMap for non-sensitive configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: api-config
  namespace: production
data:
  config.yaml: |
    server:
      port: 8080
      timeout: 30s
    features:
      rateLimit: true
      cache: true
    redis:
      host: redis-service
      port: 6379
```

2. Professional Helm Charts

```yaml
# Chart.yaml
apiVersion: v2
name: infraux-app
description: InfraUX Application Helm Chart
type: application
version: 1.0.0
appVersion: "2.1.0"
```

```yaml
# values.yaml
replicaCount: 3

image:
  repository: infraux/app
  pullPolicy: IfNotPresent
  tag: ""   # Overridden by CI/CD

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    # ingress-nginx rate limiting is set with limit-rps (requests per second
    # per client IP); there is no "rate-limit" annotation. The unit is assumed.
    nginx.ingress.kubernetes.io/limit-rps: "100"
  hosts:
    - host: api.infraux.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: api-tls
      hosts:
        - api.infraux.com

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80
```

Disaster Recovery

1. Backup with Velero

```bash
# Install Velero
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.5.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1

# Create a daily backup schedule (02:00, retained for 30 days)
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --include-namespaces production,staging \
  --ttl 720h
```
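A schedule only proves that backups are being taken, not that they can be restored. One way to rehearse a restore without touching live workloads is a Velero Restore object that maps `production` into a scratch namespace. This is a sketch under assumptions: the Restore name and the target namespace `restore-test` are hypothetical, and Velero is assumed to run in its default `velero` namespace.

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: production-restore-drill   # hypothetical name
  namespace: velero                # assumes the default install namespace
spec:
  # Restore from the most recent successful backup of this schedule
  scheduleName: daily-backup
  includedNamespaces:
    - production
  # Restore into a scratch namespace instead of overwriting production
  namespaceMapping:
    production: restore-test       # hypothetical target namespace
```

After the restore completes, checking that the workloads in the scratch namespace start and pass their readiness probes gives a basic signal that the backup is usable.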

2. Multi-Region Setup

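```yaml
# Per-cluster entry point behind an AWS Network Load Balancer.
# Note: cross-zone load balancing spreads traffic across availability zones
# within one region. A multi-region setup additionally needs a cluster per
# region and global traffic routing in front of them, which is not shown here.
apiVersion: v1
kind: Service
metadata:
  name: api-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
spec:
  type: LoadBalancer
  selector:
    app: api-server
  ports:
    - port: 443
      targetPort: 8080
```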

Advanced Troubleshooting

Essential Commands

```bash
# Debug pods that fail to start
kubectl describe pod <pod-name> -n production
kubectl logs <pod-name> -n production --previous

# Check node resources (kubectl top requires metrics-server)
kubectl top nodes
kubectl describe node <node-name>

# Debug networking (the container image must include nslookup and curl)
kubectl exec -it <pod-name> -n production -- nslookup kubernetes.default
kubectl exec -it <pod-name> -n production -- curl -v telnet://service-name:port

# Review cluster events in chronological order
kubectl get events --sort-by='.lastTimestamp' -A

# Debug RBAC permissions for a service account
kubectl auth can-i --list --as=system:serviceaccount:production:api-sa
```

Performance Optimizations

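| Area | Optimization | Impact |
|---|---|---|
| Image size | Distroless images | -80% size |
| Startup time | Parallelized initialization | -50% time |
| DNS | NodeLocal DNSCache | -90% DNS latency |
| Scheduling | Pod Topology Spread | +40% availability |

Kubernetes runs init containers one after another, so parallelizing startup means running independent setup steps concurrently inside a single init container rather than adding more init containers.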
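For the scheduling row, Pod Topology Spread is configured in the pod template. The fragment below is a minimal sketch, assuming the `app: api-server` label used elsewhere in this article and nodes that carry the standard zone label; the `maxSkew` value and the choice of `ScheduleAnyway` are illustrative and not taken from our configuration.

```yaml
# Fragment of a Deployment; only the fields relevant to spreading are shown
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                                 # zones may differ by at most one pod
          topologyKey: topology.kubernetes.io/zone   # spread across availability zones
          whenUnsatisfiable: ScheduleAnyway          # prefer spreading, but never block scheduling
          labelSelector:
            matchLabels:
              app: api-server
```

Switching `whenUnsatisfiable` to `DoNotSchedule` makes the spread a hard requirement, at the cost of pods staying pending when a zone has no capacity.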

Key Metrics to Monitor

<div class="warning-box">

⚠️ **Critical Alerts:**

- CPU/Memory > 80% for more than 5 minutes
- Pod restarts > 5 in 1 hour
- Node NotReady
- PVC almost full (> 85%)
- Certificate expiration < 7 days

</div>
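These thresholds can be expressed as Prometheus alerting rules. The sketch below uses the Prometheus Operator `PrometheusRule` resource and makes several assumptions that are not stated above: the rule and alert names are hypothetical, CPU and memory are measured per node, and the metrics come from node_exporter, kube-state-metrics, the kubelet, and cert-manager, all of which must already be scraped.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: critical-alerts   # hypothetical name
  namespace: monitoring
spec:
  groups:
    - name: critical.rules
      rules:
        - alert: NodeHighCpu
          expr: '1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.80'
          for: 5m
          labels:
            severity: critical
        - alert: NodeHighMemory
          expr: '1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.80'
          for: 5m
          labels:
            severity: critical
        - alert: PodRestartingTooOften
          expr: 'increase(kube_pod_container_status_restarts_total[1h]) > 5'
          labels:
            severity: critical
        - alert: NodeNotReady
          expr: 'kube_node_status_condition{condition="Ready",status="true"} == 0'
          labels:
            severity: critical
        - alert: PvcAlmostFull
          expr: 'kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85'
          labels:
            severity: critical
        - alert: CertificateExpiringSoon
          expr: 'certmanager_certificate_expiration_timestamp_seconds - time() < 7 * 24 * 3600'
          labels:
            severity: critical
```

The `severity: critical` label is what the Alertmanager route shown earlier matches on. Depending on how the operator is installed, the rule may also need extra metadata labels before Prometheus picks it up.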

Lessons Learned

<div class="info-box">

💡 **The 5 Commandments of K8s in Production:**

1. Never trust a pod without health checks
2. Resources without limits are time bombs
3. Monitoring is not optional, it is critical
4. Untested backups are not backups
5. Default security settings are not enough

</div>

Essential Tools

  • kubectl-neat: Strips clutter from YAML manifests so they can be reused
  • k9s: Terminal UI for Kubernetes
  • stern: Multi-pod log tailing
  • kubectx/kubens: Fast context and namespace switching
  • kustomize: Configuration management without templates

Kubernetes is powerful but complex. With these practices in place, you can sleep better knowing your cluster is prepared for the failures it is most likely to face.

#kubernetes #k8s #devops #containers #orchestration