sentimental programmer: liveness

Sentimental Programmer | ysoftman

prometheus timeout

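When viewing long time ranges such as 7 or 10 days in Grafana, the memory of the prometheus pod kept growing and then the prometheus container restarted.

Running kubectl describe on the pod shows context deadline exceeded for the healthy and ready probes.

Large Grafana requests (for example, long date ranges) take a long time to serve. While they run, the readiness / liveness probes appear to time out, so the kubelet decides the container is dead and restarts it.

Note that no resource limits are set for prometheus, so its memory is free to grow.

(If memory had run out, pod describe would have shown a message like OOMKilled.)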

# http://ysoftman-prometheus.test.abc:9090/-/ready when it responds normally

Prometheus Server is Ready.

# http://ysoftman-prometheus.test.abc:9090/-/healthy when it responds normally

Prometheus Server is Healthy.
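
To see how close these endpoints are to the probe timeout, a single probe can be imitated by hand. The following is only a minimal sketch: the function name probe is made up for illustration, and it simply issues one HTTP GET with a timeout and treats a 2xx/3xx status as success, which is how an httpGet probe is evaluated.

```python
import time
import urllib.request


def probe(url: str, timeout_seconds: float):
    """Issue one HTTP GET and return (ok, elapsed_seconds, detail).

    ok is True only for a 2xx/3xx response received within timeout_seconds.
    The name and return shape are illustrative, not part of any tool.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as resp:
            detail = resp.read().decode("utf-8", errors="replace").strip()
            ok = 200 <= resp.status < 400
    except Exception as exc:  # timeout, connection error, HTTP 4xx/5xx
        ok = False
        detail = repr(exc)
    return ok, time.monotonic() - start, detail
```

For example, probe("http://ysoftman-prometheus.test.abc:9090/-/ready", 3) run while Grafana is loading a long range shows whether the response arrives within 3 seconds and how long it actually took.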

# The prometheus pod is managed by a statefulset (sts).

# It is run by the prometheus operator, so editing the prometheus statefulset (sts) directly has no lasting effect; the operator overwrites it.

# Find the name of the prometheus (CRD) resource

kubectl get prometheus

# Increase prometheus (crd) > spec > containers > {readinessProbe, livenessProbe} > timeoutSeconds as follows

kubectl edit prometheus {prometheus-name} -n {prometheus-namespace}

spec:
  containers:
  - livenessProbe:
      failureThreshold: 6
      httpGet:
        path: /-/healthy
        port: 9090 # the port is usually named http-web, so http-web also works here
        scheme: HTTP
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 30
    name: prometheus
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /-/ready
        port: 9090
        scheme: HTTP
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 30

# The sts is now updated as well, and the prometheus pod restarts automatically with the settings above.

# With a larger timeoutSeconds on {livenessProbe, readinessProbe}, the pod no longer dies on slow Grafana requests.
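
As a rough feel for how much slack the larger timeout buys, the probe settings can be turned into an approximate number of seconds the endpoint may hang before failureThreshold is reached. This is a simplified model, not the kubelet's exact behaviour: it assumes every probe hangs until its timeout and that the next probe does not start before the previous one finishes. The function name is made up for illustration.

```python
def tolerated_stall_seconds(period_seconds: int, timeout_seconds: int,
                            failure_threshold: int) -> int:
    """Approximate time an endpoint can hang before the threshold is hit.

    Simplifying assumption: each probe blocks until its timeout and probes
    do not overlap, so consecutive probes are spaced by
    max(period_seconds, timeout_seconds).
    """
    return failure_threshold * max(period_seconds, timeout_seconds)


# Values from the livenessProbe above.
print(tolerated_stall_seconds(5, 30, 6))  # 180
# Hypothetical shorter timeout, for comparison only.
print(tolerated_stall_seconds(5, 3, 6))   # 30
```

Under this model, the liveness settings above tolerate about three minutes of a hung /-/healthy before a restart, compared with about half a minute for the hypothetical 3-second timeout.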

#####

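# Runtime info

http://{prometheus}/api/v1/status/runtimeinfo

# TSDB info

http://{prometheus}/api/v1/status/tsdb

# Under prometheus > Status menu > Command-Line Flags, check the number of queries handled concurrently: --query.max-concurrency=20 (default)

http://{prometheus}/flags

# The same value is available as the prometheus_engine_queries_concurrent_max metric

# prometheus_engine_queries is the metric for the number of queries currently running (it cannot be queried at the moment /ready is slow, so look at the graph afterwards.)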
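
Since the metric has to be inspected after the fact, a range query over the period around the slowdown is one way to pull it. The sketch below only builds the URL for the /api/v1/query_range endpoint of the Prometheus HTTP API; the helper name, host, and timestamps are placeholders chosen for illustration.

```python
from urllib.parse import urlencode


def query_range_url(base_url: str, query: str, start: float, end: float,
                    step: str) -> str:
    """Build a /api/v1/query_range URL for the given PromQL expression.

    start and end are Unix timestamps; step is a duration such as "15s".
    """
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


# Placeholder host and one-hour window, for illustration only.
print(query_range_url("http://prometheus.example:9090",
                      "prometheus_engine_queries",
                      1700000000, 1700003600, "15s"))
```

Comparing the returned series against prometheus_engine_queries_concurrent_max shows whether the concurrency limit was being hit around the time /-/ready slowed down.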

# Test script for finding out why the prometheus ready response gets slow

https://github.com/ysoftman/test_code/blob/master/promql/promql.sh

#####

# Logging queries

# To write prometheus queries to a log file, set global > query_log_file

# Specify /dev/stdout or an actual file path

# With a file, prometheus does not rotate the log itself, so be careful: the file keeps growing.

https://prometheus.io/docs/guides/query-log/

# In the helm chart it is prometheusSpec > queryLogFile

https://github.com/prometheus-community/helm-charts/blob/7e8cc15d1106e55b91438610f223aa762c201be3/charts/kube-prometheus-stack/values.yaml#L4501

# I am using the prometheus operator.

# For testing, temporarily add the following under parameters > values.

kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      # queryLogFile: /prometheus/query.log # when logging to a file
      queryLogFile: /dev/stdout

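# The setting is now applied to the prometheus (crd), and logs containing each query and its execution time are written to the prometheus container's stdout.

# Check whether it is enabled with the prometheus_engine_query_log_enabled metric

# Filter queries that took 3.5 seconds or longer

# -R (--raw-input) and fromjson? are used to skip log lines that are not JSON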

kubectl logs -f ysoftman-prometheus-0 -n ysoftman-ns -c prometheus | jq -R 'fromjson? | select(.stats.timings.execTotalTime >= 3.5)'
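
The same filtering can be sketched in Python, which is handy when the slow queries need further processing such as sorting or grouping. This is a rough equivalent of the jq filter above, not a replacement for it. The function name and the sample lines are made up for illustration; only the stats.timings.execTotalTime and params.query fields are taken from the query log format.

```python
import json
from typing import Iterable, Iterator


def slow_queries(lines: Iterable[str], min_seconds: float = 3.5) -> Iterator[dict]:
    """Yield query-log entries whose execTotalTime is at least min_seconds.

    Lines that are not JSON objects are skipped, like `fromjson?` in jq.
    """
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ordinary (non-JSON) log line
        if not isinstance(entry, dict):
            continue
        try:
            exec_total = entry["stats"]["timings"]["execTotalTime"]
        except (KeyError, TypeError):
            continue  # JSON, but not a query-log entry
        if isinstance(exec_total, (int, float)) and exec_total >= min_seconds:
            yield entry


# Made-up sample lines, for illustration only.
sample = [
    'ts=2024-01-01T00:00:00Z level=info msg="not json"',
    '{"params":{"query":"up"},"stats":{"timings":{"execTotalTime":0.2}}}',
    '{"params":{"query":"sum(rate(http_requests_total[7d]))"},'
    '"stats":{"timings":{"execTotalTime":4.1}}}',
]
for entry in slow_queries(sample):
    print(entry["params"]["query"], entry["stats"]["timings"]["execTotalTime"])
```

To use it on live logs, the kubectl logs output can be piped into a script that passes sys.stdin to slow_queries instead of the sample list.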

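# Once the queries have been identified, remove the queryLogFile setting to stop logging queries.

# Note that the container is not restarted, so if you logged to a file the file is still there; delete it manually if needed.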