Kafkaサービスが利用できない(ダウンタイム)問題の踏み込み

11078 ワード

背景


あるオンラインログ収集サービスアラームは、ドメイン名レポート502エラーコードを開く.収集サービスは2台のnetty HAサーバーから構成され、nettyサーバーはクライアントから送られてきたprotobufログを解析しkafkaに送信し、そのうちの1つのアプリケーションのログを開き、以下のエラーを発見した.
org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) 

nettyサービスのエラーを排除した後、kafkaのログを表示します.報告ミスを発見し、調査過程は以下の通りである.

構成情報


システム
kafkaバージョン
broker数
CentOS7.4
2.1.0
3
オンライン上にはKafka Brokerが3台あり,idはそれぞれ0,1,2であり,サーバにはKafkaサービスのみが配備されている.

に質問


スレッドが生存しているかどうか
まずjpsはKafkaスレッドが生存しているかどうかを確認し、3台のマシンは大丈夫で、kafkaは依然として稼働しています.
GC問題
kafkaServer-gcを表示log.1.currentのログ、gcログに異常は見つかりませんでした.
Broker 0/server.log
[2019-08-02 15:17:03,699] WARN Attempting to send response via channel for which there is no open connection, connection id 172.21.3.14:9092-172.21.3.11:54311-107706 (kafka.network.Processor)
[2019-08-02 15:19:12,490] INFO [GroupMetadataManager brokerId=0] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager) [2019-08-02 15:26:54,405] INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Error sending fetch request (sessionId=1112819217, epoch=1897450) to node 1: java.io.IOException: Connection to 1 was disconnected before the response was read. (org.apache.kafka.clients.FetchSessionHandler) [2019-08-02 15:26:54,411] WARN [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=0, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={galaxy_client-7=(offset=15680912, logStartOffset=14755985, maxBytes=1048576, currentLeaderEpoch=Optional[9])}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1112819217, epoch=1897450)) (kafka.server.ReplicaFetcherThread) java.io.IOException: Connection to 1 was disconnected before the response was read at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97) at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97) at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190) at kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241) at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130) at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129) at scala.Option.foreach(Option.scala:257) at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129) at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) [2019-08-02 15:27:26,433] INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Error sending fetch request (sessionId=1112819217, epoch=INITIAL) to node 1: java.io.IOException: Connection to 1 was disconnected before the response was read. (org.apache.kafka.clients.FetchSessionHandler) 

Broker 1/server.log
[2019-08-02 15:26:53,751] WARN [GroupCoordinator 1]: Failed to write empty metadata for group StoreToHiveV2: The group is rebalancing, so a rejoin is needed. (kafka.coordinator.group.GroupCoordinator)
[2019-08-02 15:26:55,515] WARN [GroupCoordinator 1]: Failed to write empty metadata for group LPmkt: The group is rebalancing, so a rejoin is needed. (kafka.coordinator.group.GroupCoordinator) [2019-08-02 15:26:56,124] WARN [GroupCoordinator 1]: Failed to write empty metadata for group Store2CarbonClientServer: The group is rebalancing, so a rejoin is needed. (kafka.coordinator.group.GroupCoordinator) [2019-08-02 15:26:56,575] WARN [GroupCoordinator 1]: Failed to write empty metadata for group StoreToHive: The group is rebalancing, so a rejoin is needed. (kafka.coordinator.group.GroupCoordinator) 

Broker 2/server.log
[2019-08-02 15:26:54,514] WARN [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=2, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={galaxy_client-7=(offset=15680912, logStartOffset=14755985, maxBytes=1048576, currentLeaderEpoch=Optional[9])}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1550207796, epoch=1890003)) (kafka.server.ReplicaFetcherThread) java.io.IOException: Connection to 1 was disconnected before the response was read at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97) at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97) at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190) at kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241) at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130) at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129) at scala.Option.foreach(Option.scala:257) at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129) at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) [2019-08-02 15:27:26,569] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Error sending fetch request (sessionId=1550207796, epoch=INITIAL) to node 1: java.io.IOException: Connection to 1 was disconnected before the response was read. (org.apache.kafka.clients.FetchSessionHandler) 

問題の調査


オンラインアプリケーションなので、問題を調べる時間がほとんどないので、すぐに3台のKafkaを再起動し、再起動してから緩和されました.
サービスが正常になった後、ネット上で多くの答えを探して、ある人はネットの問題だと言って、close waitが多すぎてbrokerのネットが通じないことを招いて、Kafkaは脳裂の問題を生みます.各サーバでコマンドを実行します.
netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}' 

結果
ESTABLISHED 172
TIME_WAIT 32

異常も発見されず、次に問題が発生した場合は、このコマンドを実行してサーバーの状態を確認します.

コミュニティバグ?


https://issues.apache.org/jira/browse/KAFKA-6582
下にコメントがあります:We are running 2.1.1 in production(and four other environments)since March without this issue showing again.
だから時間を見つけてKafkaバージョンをアップグレードしましょう~