Alertmanager alert rules in detail

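Please credit the source when reposting. Original link: http://tailnode.tk/2017/03/al...

Overview
This article describes the alerting and notification rules of Prometheus and Alertmanager. The Prometheus configuration file is prometheus.yml and the Alertmanager configuration file is alertmanager.yml.
Alert: Prometheus sends an abnormal event it has detected to Alertmanager. It does not mean sending an email notification.
Notification: Alertmanager sends out a notification about an abnormal event (email, webhook, and so on).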
Alert rules
In prometheus.yml, specify the interval at which alert rules are evaluated:

# How frequently to evaluate rules.
[ evaluation_interval: <duration> | default = 1m ]

Specify the rule files in prometheus.yml (wildcards such as rules/*.rules can be used):

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - rules/mengyuan.rules

Add mengyuan.rules to the rules directory:

ALERT goroutines_gt_70
  IF go_goroutines > 70
  FOR 5s  
  LABELS { status = "yellow" }
  ANNOTATIONS {
    summary = "goroutines exceeds 70, current value {{ $value }}",
    description = "instance {{ $labels.instance }}",
  }

ALERT goroutines_gt_90
  IF go_goroutines > 90
  FOR 5s  
  LABELS { status = "red" }
  ANNOTATIONS {
    summary = "goroutines exceeds 90, current value {{ $value }}",
    description = "instance {{ $labels.instance }}",
  }

After changing the configuration, Prometheus has to reload it. There are two ways to do this:

Send a POST request to /-/reload through the HTTP API, for example: curl -X POST http://localhost:9090/-/reload

Send a SIGHUP signal to the prometheus process
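
The HTTP reload can also be triggered from a program. The following is a minimal sketch in Go, assuming the endpoint answers 200 OK on success. The helper names reloadPrometheus and newFakePrometheus are hypothetical, and the stand-in test server is there only so the example runs without a real Prometheus instance.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// reloadPrometheus sends an empty POST to <baseURL>/-/reload and returns
// an error unless the server answers 200 OK.
func reloadPrometheus(baseURL string) error {
	resp, err := http.Post(baseURL+"/-/reload", "", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("reload failed: %s", resp.Status)
	}
	return nil
}

// newFakePrometheus is a stand-in server (not Prometheus itself) that
// accepts only POST /-/reload, so the sketch runs on its own.
func newFakePrometheus() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || r.URL.Path != "/-/reload" {
			http.Error(w, "unsupported", http.StatusMethodNotAllowed)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
}

func main() {
	srv := newFakePrometheus()
	defer srv.Close()

	// Against a real instance, pass "http://localhost:9090" instead of srv.URL.
	if err := reloadPrometheus(srv.URL); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("reload requested")
}
```

Note that newer Prometheus releases only serve this endpoint when started with the --web.enable-lifecycle flag; SIGHUP works either way.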
Figure: an email notification compared with the rules (alertmanager.yml must also be configured before any email is received)

Notification rules
Configure route and receivers in alertmanager.yml:

route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['alertname']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first 
  # notification.
  group_wait: 5s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 1m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 3h 

  # A default receiver
  receiver: mengyuan

receivers:
- name: 'mengyuan'
  webhook_configs:
  - url: http://192.168.0.53:8080
  email_configs:
  - to: '[email protected]'

Terminology
Route
The route property configures how alerts are dispatched. It is a tree structure that is matched depth-first, left to right.

// Match does a depth-first left-to-right search through the route tree
// and returns the matching routing nodes.
func (r *Route) Match(lset model.LabelSet) []*Route {
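
To make the depth-first, left-to-right search concrete, here is a simplified sketch. It is not the Alertmanager source: the Route type below is a reduced stand-in that supports only exact label matches, and the receiver names pager and chat are hypothetical. The idea is that a node whose children do not match is itself the match, and the search stops at the first matching child unless that child asks to continue.

```go
package main

import "fmt"

// Route is a simplified stand-in for a routing node. It matches when every
// entry in Match equals the alert's label of the same name.
type Route struct {
	Receiver string
	Match    map[string]string
	Continue bool // keep trying later siblings after this route matched
	Routes   []*Route
}

func (r *Route) matches(labels map[string]string) bool {
	for name, want := range r.Match {
		if labels[name] != want {
			return false
		}
	}
	return true
}

// MatchRoutes walks the tree depth-first, left to right, and returns the
// deepest matching nodes. A node with no matching children matches itself.
func (r *Route) MatchRoutes(labels map[string]string) []*Route {
	if !r.matches(labels) {
		return nil
	}
	var all []*Route
	for _, child := range r.Routes {
		found := child.MatchRoutes(labels)
		all = append(all, found...)
		if found != nil && !child.Continue {
			break
		}
	}
	if len(all) == 0 {
		all = append(all, r)
	}
	return all
}

// sampleTree builds a hypothetical tree: a catch-all root with two children.
func sampleTree() *Route {
	return &Route{
		Receiver: "mengyuan",
		Routes: []*Route{
			{Receiver: "pager", Match: map[string]string{"status": "red"}},
			{Receiver: "chat", Match: map[string]string{"status": "yellow"}},
		},
	}
}

func main() {
	root := sampleTree()
	for _, status := range []string{"red", "yellow", "test"} {
		for _, r := range root.MatchRoutes(map[string]string{"status": status}) {
			fmt.Println(status, "->", r.Receiver)
		}
	}
}
```

With this tree, status "red" lands on pager, "yellow" on chat, and any other status falls back to the root receiver.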

Alert
An Alert is an alert received by Alertmanager. Its type is defined as follows:

// Alert is a generic representation of an alert in the Prometheus eco-system.
type Alert struct {
    // Label value pairs for purpose of aggregation, matching, and disposition
    // dispatching. This must minimally include an "alertname" label.
    Labels LabelSet `json:"labels"`

    // Extra key/value information which does not define alert identity.
    Annotations LabelSet `json:"annotations"`

    // The known time range for this alert. Both ends are optional.
    StartsAt     time.Time `json:"startsAt,omitempty"`
    EndsAt       time.Time `json:"endsAt,omitempty"`
    GeneratorURL string    `json:"generatorURL"`
}

Alerts with the same Labels (identical keys and values) are treated as the same alert. A single rule in a Prometheus rules file can produce multiple alerts.
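
The identity rule can be illustrated with a small sketch. Alertmanager derives a fingerprint from the label set; the string key below is a stand-in for that, and the function name identityKey and the instance values are hypothetical. Annotations play no part in the key, which is why they are absent here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// identityKey builds a deterministic key from the label set only, so two
// alerts with equal labels get the same key whatever their annotations are.
func identityKey(labels map[string]string) string {
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names) // map iteration order is random, so sort first

	var b strings.Builder
	for _, name := range names {
		fmt.Fprintf(&b, "%s=%q,", name, labels[name])
	}
	return b.String()
}

func main() {
	a := map[string]string{"alertname": "goroutines_gt_70", "instance": "host-a:8080"}
	b := map[string]string{"instance": "host-a:8080", "alertname": "goroutines_gt_70"}
	c := map[string]string{"alertname": "goroutines_gt_70", "instance": "host-b:8080"}

	fmt.Println(identityKey(a) == identityKey(b)) // same labels: same alert
	fmt.Println(identityKey(a) == identityKey(c)) // other instance: another alert
}
```

The third map shows how one rule yields several alerts: each instance that satisfies the rule expression carries its own instance label and therefore its own identity.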
Group
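Alertmanager groups Alerts according to the group_by setting. With the rules below, three alerts are received when go_goroutines equals 4. If group_by is ['label1'] or ['label2'], Alertmanager splits these three alerts into two groups before notifying the receivers; with the group_by: ['alertname'] setting shown earlier, each alert would form its own group.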

ALERT test1
  IF go_goroutines > 1
  LABELS {label1="l1", label2="l2", status="test"}
ALERT test2
  IF go_goroutines > 2
  LABELS {label1="l2", label2="l2", status="test"}
ALERT test3
  IF go_goroutines > 3
  LABELS {label1="l2", label2="l1", status="test"}
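
The grouping of these three alerts can be checked with a short sketch. It assumes group_by is set to label1 (an assumption for illustration; the alertmanager.yml earlier in this article groups by alertname) and uses a reduced Alert type and hypothetical helper names rather than Alertmanager's own code.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Alert carries only what the grouping sketch needs.
type Alert struct {
	Name   string
	Labels map[string]string
}

// groupKey joins the values of the group_by labels in the order given.
func groupKey(a Alert, groupBy []string) string {
	parts := make([]string, 0, len(groupBy))
	for _, name := range groupBy {
		parts = append(parts, name+"="+a.Labels[name])
	}
	return strings.Join(parts, ",")
}

// groupAlerts buckets alert names by their group key.
func groupAlerts(alerts []Alert, groupBy []string) map[string][]string {
	groups := make(map[string][]string)
	for _, a := range alerts {
		key := groupKey(a, groupBy)
		groups[key] = append(groups[key], a.Name)
	}
	return groups
}

// sampleAlerts mirrors the three rules above, all firing at go_goroutines == 4.
func sampleAlerts() []Alert {
	return []Alert{
		{"test1", map[string]string{"alertname": "test1", "label1": "l1", "label2": "l2", "status": "test"}},
		{"test2", map[string]string{"alertname": "test2", "label1": "l2", "label2": "l2", "status": "test"}},
		{"test3", map[string]string{"alertname": "test3", "label1": "l2", "label2": "l1", "status": "test"}},
	}
}

func main() {
	for _, groupBy := range [][]string{{"label1"}, {"alertname"}} {
		groups := groupAlerts(sampleAlerts(), groupBy)
		keys := make([]string, 0, len(groups))
		for k := range groups {
			keys = append(keys, k)
		}
		sort.Strings(keys)
		fmt.Println("group_by", groupBy, "->", len(groups), "groups")
		for _, k := range keys {
			fmt.Println("  ", k, groups[k])
		}
	}
}
```

Grouping by label1 puts test1 in one group and test2 and test3 in another, while grouping by alertname produces three single-alert groups.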

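Main flow

On receiving an Alert, determine from its labels which Route it belongs to (there can be multiple Routes; one Route has multiple Groups, and one Group has multiple Alerts).

Assign the Alert to a Group, creating a new Group if none exists yet.

A new Group waits for the time specified by group_wait (more Alerts of the same Group may arrive during the wait), determines from resolve_timeout whether each Alert has been resolved, and then sends a notification.

An existing Group waits for the time specified by group_interval, determines whether each Alert has been resolved, and sends a notification if the time since the last notification is greater than repeat_interval or if the Group has been updated.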
TODO

How restarts affect alert delivery and notifications

Whether Alertmanager can be run as a cluster
