OpenTelemetry Auto-Instrumentation & Observability-Driven Autoscaling Pipeline

⸻

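🎯 Goal

Build an end-to-end Observability-as-Code pipeline that injects distributed tracing into applications automatically, with no code changes, uses the resulting **trace-based SLO metrics (p95 latency, error rate, and so on)** to autoscale Kubernetes workloads, and declares and manages the entire configuration through GitOps.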

⸻

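⚙️ Core Components
1. OpenTelemetry Operator
• Injects auto-instrumentation agents into Kubernetes Pods (through an init container and environment variables, not a sidecar)
• Instruments Java, Node.js, and Python applications at runtime without code changes
2. Collector + Tempo/Jaeger
• The Collector receives trace data and forwards it to Grafana Tempo or Jaeger for storage
• The backend indexes the traces so they can be searched
3. Prometheus Adapter
• Exposes trace-derived metrics such as trace_latency_p95 to Prometheus, produced either by a Collector exporter or by Tempo's metrics-generator
4. KEDA ScaledObject
• Adjusts the Pod count from PromQL queries over trace metrics (p95, error rate)
5. Argo CD & GitOps
• Every OpenTelemetry resource, the Collector configuration, and the KEDA ScaledObject are declared in a Git repository
• Automated sync plus selfHeal keeps the cluster state consistent with Git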
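
The GitOps item above can be sketched as an Argo CD Application. This is a minimal illustration rather than a definitive setup: the repository URL, path, application name, and target namespace are all hypothetical placeholders, while `syncPolicy.automated` with `prune` and `selfHeal` are the standard Argo CD fields for automated sync and self-healing.

```yaml
# Sketch of an Argo CD Application that keeps the observability stack in sync with Git.
# repoURL, path, name, and the destination namespace are hypothetical placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: observability-stack
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/your-org/observability-config.git
    targetRevision: main
    path: observability
  destination:
    server: https://kubernetes.default.svc
    namespace: otel
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual changes made directly in the cluster
    syncOptions:
      - CreateNamespace=true
```

With a policy like this, a change merged to the tracked branch is applied without manual intervention, and drift introduced with kubectl is reverted on the next reconciliation.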

⸻

🏗️ Pipeline Workflow

[Git Push] ──▶ Argo CD Sync ──▶ Deploy OpenTelemetry Operator
                                   │
                                   ▼
            [Inject instrumentation into application Pods]
                                   │
               [Collect traces → store in Tempo/Jaeger]
                                   │
                            ┌──────┴──────┐
                            │ Collector   │
                            └──────┬──────┘
                                   │ metrics
                                   ▼
                             [Prometheus]
                                   │
              PromQL: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[1m])))
                                   │
                                   ▼
                         ┌───────────────────┐
                         │ KEDA ScaledObject │
                         └─────────┬─────────┘
                                   │ scale up/down
                                   ▼
                        [Kubernetes Deployment]
                                   │
                                   ▼
                            [Argo CD Sync]

⸻

🧪 Hands-On Examples

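1) Deploy the OpenTelemetry Operator (GitOps)

# Minimal sketch. In practice the operator is usually installed from its official
# manifest or Helm chart, because it also needs CRDs, RBAC, and an admission webhook.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: opentelemetry-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      app: opentelemetry-operator
  template:
    metadata:
      labels:
        app: opentelemetry-operator
    spec:
      containers:
        - name: manager
          # Pin a specific version instead of :latest so Git fully describes the deployed state
          image: otel/opentelemetry-operator:latest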

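2) Instrumentation CR for automatic injection

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrument
spec:
  exporter:
    # OTLP gRPC endpoint. Some language agents default to OTLP/HTTP, which normally uses port 4318.
    # Point this at the Collector's OTLP service instead if the Collector should process spans first.
    endpoint: http://tempo.otel.svc:4317
  propagators:
    - tracecontext
    - baggage
  # The CR has no language selector. Pods opt in per language through annotations
  # such as instrumentation.opentelemetry.io/inject-java.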
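
Because injection is selected per workload, each application opts in through an annotation on its Pod template. The sketch below assumes a Java service; the `app: myapp` label and the image reference are hypothetical, and the Deployment name reuses the `myapp-deployment` target from the ScaledObject example.

```yaml
# Sketch: opting a workload into Java auto-instrumentation.
# The annotation must sit on the Pod template, not on the Deployment's own metadata.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp                    # hypothetical label
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        # "true" selects the Instrumentation CR in the Pod's own namespace;
        # a CR name such as "auto-instrument" can be given instead.
        instrumentation.opentelemetry.io/inject-java: "true"
    spec:
      containers:
        - name: myapp
          image: registry.example.com/myapp:1.0.0   # hypothetical image
```

Node.js and Python workloads use the `inject-nodejs` and `inject-python` variants of the same annotation.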

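3) Collector configuration (otel-collector-config.yaml)

receivers:
  otlp:
    protocols:
      grpc:
      http:
connectors:
  spanmetrics: {}   # derives call counts and duration histograms from spans (contrib distribution)
exporters:
  otlp/tempo:
    endpoint: "tempo.otel.svc:4317"   # Tempo's OTLP gRPC receiver; 3200 is its HTTP query port
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"   # 8888 is the Collector's own telemetry port by default
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo, spanmetrics]
    metrics:   # the prometheus exporter only accepts metrics, so spans reach it through the connector
      receivers: [spanmetrics]
      exporters: [prometheus]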
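
Prometheus still has to scrape the Collector's Prometheus exporter endpoint. Since the later example already uses the Prometheus Operator (`PrometheusRule`), one way to do that is a ServiceMonitor. This is a sketch under assumptions: a Service for the Collector exists in the `otel` namespace, it carries the label shown, and it exposes the exporter port under the name `prometheus`. All three names are hypothetical.

```yaml
# Sketch: scraping the Collector's Prometheus exporter through the Prometheus Operator.
# The namespace, label selector, and port name are assumptions about your Collector Service.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: otel-collector
  namespace: otel
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: otel-collector   # hypothetical Service label
  endpoints:
    - port: prometheus    # name of the Service port that maps to the exporter endpoint
      interval: 30s
```

Depending on how the Prometheus instance is configured, the ServiceMonitor may also need labels that match its `serviceMonitorSelector`.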

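4) PrometheusRule & KEDA ScaledObject

# PrometheusRule
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata: { name: trace-slo-rules }
spec:
  groups:
  - name: trace-slo
    rules:
    # p95 latency in seconds per service (a duration, despite the :ratio suffix in the name).
    # Assumes a histogram named http_request_duration_seconds with a service label;
    # adjust to the metric and label names your pipeline actually emits.
    - record: app:request_latency_p95:ratio
      expr: |
        histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))
---
# KEDA ScaledObject
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: otel-autoscaler }
spec:
  scaleTargetRef: { name: myapp-deployment }
  pollingInterval: 30    # seconds between metric checks
  cooldownPeriod: 120    # seconds to wait after the last active trigger before scaling to zero
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
  - type: prometheus
    # Compare the query result itself with the threshold instead of averaging it across replicas
    metricType: Value
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      metricName: app:request_latency_p95:ratio
      threshold: "0.90"    # target p95 latency of 0.9 seconds
      query: |
        app:request_latency_p95:ratio{service="myapp"}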
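
The goal also names error rate as an SLO signal, but only latency is recorded above. A recording rule for it might look like the following sketch. The counter `http_requests_total` and its `status` and `service` labels are assumptions, as are the rule and record names; substitute whatever request counter your instrumentation exports.

```yaml
# Sketch: recording the 5xx error ratio per service over a 5-minute window.
# http_requests_total and its status/service labels are assumed, not taken from the pipeline above.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: trace-error-rules   # hypothetical name
spec:
  groups:
    - name: trace-error-rate
      rules:
        - record: app:request_error_ratio:rate5m
          expr: |
            sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
              /
            sum by (service) (rate(http_requests_total[5m]))
```

A second `prometheus` trigger on the same ScaledObject could then query this series with its own threshold. Note that the ratio is undefined while a service receives no traffic, so the trigger should tolerate an empty result.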

⸻

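✅ Expected Benefits
• Instrumentation without code changes: applications are traced automatically, with no source modifications
• SLO-aligned scaling: scaling directly on p95 latency lets the system react before degradation turns into an outage
• Unified GitOps view: everything from observability resources to scaling policies is declared and tracked in Git
• Minimal operator involvement: automated sync and self-healing keep operations stable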

⸻

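“Next-generation DevOps that turns observability into code and controls infrastructure automatically from trace metrics”
Built around this theme, the pipeline can raise service reliability and development efficiency at the same time.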
