filebeat kafka out of acceptable range

One of the filebeat pods used for log collection keeps showing high CPU usage.

Its log contains:

{"log.level":"error","@timestamp":"2026-04-07T08:12:32.206+0900","log.logger":"kafka","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/outputs/kafka.(*msgRef).dec","file.name":"kafka/client.go","file.line":446},"message":"Kafka publish failed with: kafka server: The timestamp of the message is out of acceptable range","service.name":"filebeat","ecs.version":"1.6.0"}

Even after restarting the pod, CPU usage climbs back above 90% after a while.

kafka publish failure -> endless retries -> CPU spike

Check the broker-level timestamp settings

kafka-configs --bootstrap-server kafka-host:9092 --entity-type brokers --describe --all | rg -i timestamp

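log.message.timestamp.type=CreateTime (the timestamp type being checked)

log.message.timestamp.after.max.ms=3600000 (1 hour; a message is rejected if its timestamp is more than 1 hour ahead of the broker's time)

log.message.timestamp.before.max.ms=9223372036854775807 (effectively unlimited; all past messages are accepted)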

When a message's timestamp falls outside the before / after bounds, the kafka broker rejects the message as being out of the acceptable timestamp range.
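The accept/reject rule above can be sketched in shell. This is a simplified model, not Kafka's actual code: it assumes only the broker-level values listed above apply (any topic-level overrides are ignored), and the function name check_timestamp and the sample epoch value are made up for illustration.

```shell
# Simplified model of the broker-side timestamp range check (an assumption
# based on the settings above, not Kafka source code). Values are epoch ms.
AFTER_MAX_MS=3600000
BEFORE_MAX_MS=9223372036854775807

# check_timestamp MSG_TS BROKER_NOW -> prints "accept" or "reject"
check_timestamp() {
  msg_ts=$1
  broker_now=$2
  diff=$((msg_ts - broker_now))
  if [ "$diff" -gt "$AFTER_MAX_MS" ]; then
    echo "reject"   # timestamp is too far in the future
  elif [ "$((-diff))" -gt "$BEFORE_MAX_MS" ]; then
    echo "reject"   # timestamp is too far in the past
  else
    echo "accept"
  fi
}

now=1700000000000   # hypothetical broker time
check_timestamp "$((now + 2 * 3600000))" "$now"         # 2 hours ahead
check_timestamp "$((now - 30 * 24 * 3600000))" "$now"   # 30 days old
```

With these values, the timestamp two hours in the future is rejected and the one 30 days in the past is accepted.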

The kafka timestamp settings look fine.

filebeat.yml does not set max_retries explicitly, so the default of 3 applies.

https://www.elastic.co/docs/reference/beats/filebeat/filebeat-reference-yml

output.kafka:
  enabled: true
  hosts: "${kafka_hosts}"
  topic: "%{[@metadata][topic]}"
  partition.round_robin:
    reachable_only: false
    group_events: 1

If CPU stays high even though max_retries is capped, it may mean that there is simply a huge number of old log lines left to process.

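After the registry is reset, filebeat re-reads every log file from the beginning, and for each event:

send attempt -> failure -> 3 retries (with backoff) -> drop -> next event -> failure again

This cycle may be repeating hundreds of thousands of times or more. Given enough time the CPU usage will eventually come down, but with a large log volume that can take a long while.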
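As a rough back-of-the-envelope check of that cycle, the number of publish attempts can be estimated as below. The backlog size is a hypothetical figure, and the sketch treats every event independently, as the cycle above describes; real batching behavior may differ.

```shell
# Rough estimate of publish attempts for a backlog (numbers are hypothetical).
events=300000                             # assumed number of backlog events
max_retries=3                             # filebeat default noted above
attempts_per_event=$((1 + max_retries))   # first send plus retries
total_attempts=$((events * attempts_per_event))
echo "$total_attempts"
```

Each of those attempts also waits out a backoff, which is why the backlog can take a long time to drain.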

For an immediate fix, restart the pod.

If the same symptom comes back after the restart, delete the registry files (active.dat, log.json, 12345.json) inside the container and restart again. Do not delete meta.json: without it the pod crashes on restart, so it must be kept.

If you deleted it by mistake, recreate meta.json with the following content.

{"version":"1"}
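The cleanup can be wrapped in a small helper so that meta.json is never removed by accident. This is a minimal sketch: the function name clean_registry is made up, and the registry directory has to be passed in, because its location depends on how filebeat is deployed.

```shell
# clean_registry DIR: delete registry state files in DIR but keep meta.json.
clean_registry() {
  dir=$1
  [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
  # Remove every regular file directly in DIR except meta.json
  # (for example active.dat, log.json and the numbered .json snapshot).
  find "$dir" -maxdepth 1 -type f ! -name 'meta.json' -exec rm -f {} +
  # Recreate meta.json with the content shown above if it is missing.
  [ -f "$dir/meta.json" ] || printf '%s\n' '{"version":"1"}' > "$dir/meta.json"
}
```

It would be called as clean_registry /usr/share/filebeat/data/registry/filebeat, where that path is only an assumption about the container layout, followed by a pod restart.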

The filebeat input pattern was *.log*, so rotated log files were being matched as well. This looks like the root cause, so I changed the pattern from *.log* to *.log.
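The difference between the two patterns can be checked with shell glob matching. The file names below are hypothetical examples of rotated logs, and the helper matches is made up for illustration. Filebeat does its own glob matching, so this only shows how the wildcard behaves for simple patterns like these.

```shell
# matches PATTERN NAME: succeeds if NAME matches the glob PATTERN.
matches() {
  # $1 is intentionally unquoted so that case treats it as a pattern.
  case $2 in
    $1) return 0 ;;
    *)  return 1 ;;
  esac
}

# Hypothetical file names: the live log plus two rotated copies.
for name in app.log app.log.1 app.log.2026-04-06.gz; do
  for pattern in '*.log*' '*.log'; do
    if matches "$pattern" "$name"; then
      echo "$pattern matches $name"
    else
      echo "$pattern does not match $name"
    fi
  done
done
```

Only app.log matches *.log, while all three names match *.log*, which is how the rotated files ended up being collected.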
