Monitoring metrics and Prometheus rules (under continuous improvement)


(1) Node exporter standard performance metrics
1) Monitored items
CPU usage: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Memory usage: 100 - ((node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes) / node_memory_MemTotal_bytes) * 100
Disk usage: (1 - (node_filesystem_free_bytes{fstype=~"ext3|ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext3|ext4|xfs"})) * 100
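The arithmetic behind the three node expressions can be checked by hand. The following is a minimal Python sketch, not part of any exporter or Prometheus API; the function names and the sample values are hypothetical and chosen only to make the percentages easy to verify.

```python
def cpu_usage_percent(idle_rates):
    """100 - avg(idle rate) * 100; each rate is idle seconds per second for one core."""
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


def memory_usage_percent(free, buffers, cached, total):
    """100 - (free + buffers + cached) / total * 100, all arguments in the same unit."""
    return 100 - (free + buffers + cached) / total * 100


def disk_usage_percent(free, size):
    """(1 - free / size) * 100, both arguments in the same unit."""
    return (1 - free / size) * 100


# Hypothetical sample values, not taken from a real host.
cpu = cpu_usage_percent([0.5, 0.75, 0.25, 0.5])                     # four cores
mem = memory_usage_percent(free=2, buffers=1, cached=1, total=16)   # GiB
disk = disk_usage_percent(free=25, size=100)                        # GiB
print(cpu, mem, disk)
```

With these inputs none of the three values exceeds 80, so none of the node alerts would trigger.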
2) Prometheus rules
groups:
- name: alert-rule
  rules:
  # Filesystem (ext3/ext4/xfs) more than 80% full for 2 minutes.
  - alert: NodeFilesystemUsage-high
    expr: (1 - (node_filesystem_free_bytes{fstype=~"ext3|ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext3|ext4|xfs"})) * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.instance}}: High Node Filesystem usage detected"
      description: "{{$labels.instance}}: Node Filesystem usage is above 80% (current value is: {{ $value }})"
  # Memory not counted as free, buffers or cache is above 80% of total for 2 minutes.
  - alert: NodeMemoryUsage
    expr: (100 - (((node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes) / node_memory_MemTotal_bytes) * 100)) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.instance}}: High Node Memory usage detected"
      description: "{{$labels.instance}}: Node Memory usage is above 80% (current value is: {{ $value }})"
  # Non-idle CPU time, averaged over all cores of an instance, above 80% for 2 minutes.
  - alert: NodeCPUUsage
    expr: (100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.instance}}: Node High CPU usage detected"
      description: "{{$labels.instance}}: Node CPU usage is above 80% (current value is: {{ $value }})"
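
Each rule above uses `for: 2m`, which means the expression must keep returning a result for two minutes before the alert moves from pending to firing. The following Python sketch is a simplified model of that behaviour, assuming one rule evaluation per minute; the function name `alert_states` is hypothetical and the model ignores everything else the rule engine does.

```python
def alert_states(condition_by_minute, for_minutes):
    """Return the alert state at each evaluation, one evaluation per minute."""
    states = []
    active_since = None  # minute at which the condition first became true
    for minute, condition in enumerate(condition_by_minute):
        if not condition:
            # The condition cleared: the alert resets and the timer starts over.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = minute
        held = minute - active_since
        states.append("firing" if held >= for_minutes else "pending")
    return states


# Condition true for three evaluations, false once, then true again.
timeline = alert_states([True, True, True, False, True], for_minutes=2)
print(timeline)
```

A short spike that clears before the two minutes have elapsed never reaches the firing state, which is the reason for setting `for` on threshold alerts.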

(2) MySQL monitoring performance metrics
1) MySQL performance metrics
MySQL is down: mysql_up

Slow queries: rate(mysql_global_status_slow_queries[5m])

Too many connections: mysql_global_status_threads_connected > 200
Remaining connections: mysql_global_variables_max_connections - mysql_global_status_threads_connected < 200

Slave SQL thread running: mysql_slave_status_slave_sql_running
Slave replication lag: mysql_slave_status_seconds_behind_master
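
Two of the MySQL expressions involve simple arithmetic: the per-second rate of the slow-query counter and the number of connections still available. The Python sketch below illustrates both with hypothetical sample values. It is a simplification: the real `rate()` function also handles counter resets and extrapolates to the window boundaries, which this sketch does not.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter.

    samples: list of (unix_seconds, counter_value) pairs in time order.
    Simplified: no counter-reset handling, no extrapolation.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def connection_headroom(max_connections, threads_connected):
    """Connections still available before max_connections is reached."""
    return max_connections - threads_connected


# Hypothetical slow-query counter sampled over a 5-minute (300 s) window.
slow = per_second_rate([(0, 1000), (150, 1450), (300, 1900)])
# Hypothetical server with max_connections=1000 and 850 open connections.
headroom = connection_headroom(1000, 850)
print(slow, headroom)
```

Here the counter grew by 900 over 300 seconds, and the headroom of 150 is below the 200 threshold used in the expression above.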

2) Prometheus rules
groups:
- name: MySQLStatsAlert
  rules:
  - alert: MySQL is down
    expr: mysql_up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} MySQL is down"
      description: "MySQL database is down. This requires immediate action!"
  - alert: Mysql_High_QPS
    expr: rate(mysql_global_status_questions[5m]) > 500
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.instance}}: Mysql_High_QPS detected"
      description: "{{$labels.instance}}: MySQL is handling more than 500 operations per second (current value is: {{ $value }})"
  - alert: Mysql_Too_Many_Connections
    # threads_connected is a gauge (connections currently open), so it is compared directly rather than through rate().
    expr: mysql_global_status_threads_connected > 200
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.instance}}: Mysql Too Many Connections detected"
      description: "{{$labels.instance}}: MySQL has more than 200 open connections (current value is: {{ $value }})"
  - alert: Mysql_Too_Many_slow_queries
    expr: rate(mysql_global_status_slow_queries[5m]) > 3
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.instance}}: Mysql_Too_Many_slow_queries detected"
      description: "{{$labels.instance}}: MySQL slow queries exceed 3 per second (current value is: {{ $value }})"
  - alert: SQL thread stopped
    expr: mysql_slave_status_slave_sql_running == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} SQL thread stopped"
      description: "SQL thread has stopped. This is usually because it cannot apply a SQL statement received from the master."
  - alert: Slave lagging behind Master
    # seconds_behind_master is a gauge (current lag in seconds), so it is compared directly rather than through rate().
    expr: mysql_slave_status_seconds_behind_master > 30
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Slave lagging behind Master"
      description: "Slave is lagging behind Master. Please check if Slave threads are running and if there are some performance issues!"
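
Before loading a rule, its expression can be tried against the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The following Python sketch builds the request URL and parses the JSON response. The server address `http://localhost:9090`, the helper names, and the sample response are assumptions for illustration; the sketch does not perform a network request.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, expr):
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def parse_instant_vector(payload):
    """Turn an instant-query JSON response into (labels, value) pairs."""
    body = json.loads(payload)
    if body["status"] != "success":
        raise ValueError(f"query failed: {body.get('error')}")
    # Each sample value is [timestamp, "value-as-string"].
    return [(r["metric"], float(r["value"][1])) for r in body["data"]["result"]]


# Assumed server address; adjust to the actual Prometheus instance.
url = build_query_url("http://localhost:9090", "mysql_up == 0")

# Hypothetical response for one MySQL instance that is down.
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "mysql_up", "instance": "db1:9104"},
             "value": [1700000000, "0"]}
        ],
    },
})
down = parse_instant_vector(sample)
print(url)
print(down)
```

An empty result list means the expression currently matches no series, so the corresponding alert would not be pending or firing.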

(3) Pod performance metrics
1) Container performance metrics
Pod memory usage: container_memory_usage_bytes{container_name!=""} / container_spec_memory_limit_bytes{container_name!=""} * 100 != +Inf
Pod CPU usage: sum by (pod_name) (rate(container_cpu_usage_seconds_total{image!=""}[1m])) * 100
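
The `!= +Inf` filter in the memory expression matters when a container has no memory limit: the limit metric is then typically reported as 0, and the division yields +Inf instead of a percentage. The Python sketch below mimics that filter and the per-pod CPU sum. The helper names and sample values are hypothetical, and the 0/0 case (NaN in PromQL) is not modelled.

```python
import math
from collections import defaultdict


def memory_percent(usage_bytes, limit_bytes):
    """Usage as a percentage of the limit; +Inf when no limit is set (limit 0)."""
    if limit_bytes == 0:
        return math.inf  # PromQL division by zero yields +Inf, not an error
    return usage_bytes / limit_bytes * 100


def over_threshold(containers, threshold=80):
    """Keep containers with a finite usage percentage above the threshold."""
    result = {}
    for name, (usage, limit) in containers.items():
        pct = memory_percent(usage, limit)
        if pct != math.inf and pct > threshold:
            result[name] = pct
    return result


def cpu_percent_by_pod(rates):
    """Sum per-container CPU rates (cores) by pod, as a percentage of one core."""
    totals = defaultdict(float)
    for pod, rate in rates:
        totals[pod] += rate
    return {pod: total * 100 for pod, total in totals.items()}


# Hypothetical containers: (usage_bytes, limit_bytes); "batch" has no limit.
mem_alerts = over_threshold({"web": (896, 1024), "batch": (500, 0), "cache": (256, 1024)})
cpu = cpu_percent_by_pod([("web", 0.5), ("web", 0.25), ("batch", 0.125)])
print(mem_alerts, cpu)
```

Because the CPU figure is a sum of cores used multiplied by 100, a pod using more than one core can exceed 100.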

2) Prometheus rules
groups:
- name: noah_pod.rules
  rules:
  - alert: PodMemUsage
    expr: container_memory_usage_bytes{container_name!=""} / container_spec_memory_limit_bytes{container_name!=""}  *100 != +Inf > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.name}}: Pod High Mem usage detected"
      description: "{{$labels.name}}: Pod Mem is above 80% (current value is: {{ $value }})"
  - alert: PodCpuUsage
    expr: sum by (pod_name)( rate(container_cpu_usage_seconds_total{image!=""}[1m] ) ) * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
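      # The expression aggregates with sum by (pod_name), so pod_name is the only label left on the result.
      summary: "{{$labels.pod_name}}: Pod High CPU usage detected"
      description: "{{$labels.pod_name}}: Pod CPU is above 80% (current value is: {{ $value }})"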

References:
http://ylzheng.com/2018/04/02/use-prometheus-monitor-mysql/
https://www.cnblogs.com/zengkefu/p/5658252.html
https://blog.csdn.net/qq_25934401/article/details/82594478
https://blog.csdn.net/qq_39570637/article/details/81711328
https://blog.csdn.net/ichglauben/article/details/82381438