Envoy Outlier Detection


Outlier detection


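In anomaly detection, you often need to decide whether a new observation belongs to the same distribution as the existing observations (in which case it is called an inlier) or should be considered different (an outlier). An outlier is an anomalous data point, but not necessarily an erroneous one.
In Envoy, outlier detection is the process of dynamically determining whether some hosts in an upstream cluster are behaving abnormally, and then removing them from the healthy load-balancing set. Outlier detection can be enabled together with active health checking or independently of it, and the two together form the basis of an overall upstream health checking solution.
The concepts are not covered in depth here; for details, see the official documentation or search online.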

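Detection types

  • Consecutive 5xx
  • Consecutive gateway failure
  • Consecutive local origin failure
  • For more, see the outlier detection section of the official documentation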
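
The consecutive-5xx idea can be sketched as a per-host counter that resets on any non-5xx response. This is a toy model for illustration only, not Envoy's actual implementation; the class name `Consecutive5xxTracker` and the host labels are hypothetical.

```python
from collections import defaultdict


class Consecutive5xxTracker:
    """Toy model of consecutive-5xx detection; not Envoy's real code."""

    def __init__(self, threshold):
        self.threshold = threshold      # plays the role of consecutive_5xx
        self.streak = defaultdict(int)  # host -> current run of 5xx responses
        self.ejected = set()

    def record(self, host, status):
        """Record one response; return True if it triggers an ejection."""
        if 500 <= status <= 599:
            self.streak[host] += 1
        else:
            self.streak[host] = 0       # any non-5xx response resets the run
        if self.streak[host] >= self.threshold and host not in self.ejected:
            self.ejected.add(host)
            return True
        return False


tracker = Consecutive5xxTracker(threshold=2)
responses = [("web3", 502), ("web2", 200), ("web3", 502), ("web2", 502), ("web2", 200)]
for host, status in responses:
    tracker.record(host, status)
print(sorted(tracker.ejected))  # ['web3']
```

Here web2 is not ejected because its single 502 is followed by a 200, which resets its streak.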

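    Outlier detection test


    Note: this test was run on a single machine only; for anything beyond that, rely on your actual environment.

    Environment setup


    docker-compose file simulating 5 backend nodes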
    version: '3'
    services:
      envoy:
        image: envoyproxy/envoy-alpine:v1.15-latest
        environment: 
        - ENVOY_UID=0
        ports:
        - 80:80
        - 443:443
        - 82:9901
        volumes:
        - ./envoy.yaml:/etc/envoy/envoy.yaml
        networks:
          envoymesh:
            aliases:
            - envoy
        depends_on:
        - webserver1
        - webserver2
    
      webserver1:
        image: sealloong/envoy-end:latest
        networks:
          envoymesh:
            aliases:
            - myservice
            - webservice
        expose:
        - 90
      webserver2:
        image: sealloong/envoy-end:latest
        networks:
          envoymesh:
            aliases:
            - myservice
            - webservice
        expose:
        - 90
      webserver3:
        image: sealloong/envoy-end:latest
        networks:
          envoymesh:
            aliases:
            - myservice
            - webservice
        expose:
        - 90
      webserver4:
        image: sealloong/envoy-end:latest
        networks:
          envoymesh:
            aliases:
            - myservice
            - webservice
        expose:
        - 90
      webserver5:
        image: sealloong/envoy-end:latest
        networks:
          envoymesh:
            aliases:
            - myservice
            - webservice
        expose:
        - 90
    networks:
      envoymesh: {}
    
    Envoy configuration file
    admin:
      access_log_path: /dev/null
      address:
        socket_address: { address: 0.0.0.0, port_value: 9901 }
    
    static_resources:
      listeners:
      - name: listener_0
        address:
          socket_address: { address: 0.0.0.0, port_value: 80 }
        filter_chains:
        - filters:
          - name: envoy_http_connection_manager
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
              stat_prefix: ingress_http
              codec_type: AUTO
              route_config:
                name: local_route
                virtual_hosts:
                - name: local_service
                  domains: [ "*" ]
                  routes:
                  - match: { prefix: "/" }
                    route: { cluster: local_service }
              http_filters:
              - name: envoy.filters.http.router
    
      clusters:
      - name: local_service
        connect_timeout: 0.25s
        type: STRICT_DNS
        lb_policy: ROUND_ROBIN
        load_assignment:
          cluster_name: local_service
          endpoints:
          - lb_endpoints:
            - endpoint:
                address:
                  socket_address: { address: webservice, port_value: 90 }
        health_checks:
          timeout: 3s
          interval: 90s
          unhealthy_threshold: 5
          healthy_threshold: 5
          no_traffic_interval: 240s
          http_health_check:
            path: "/ping"
            expected_statuses:
              start: 200
              end: 201
        outlier_detection:
          consecutive_5xx: 2
          base_ejection_time: 30s
          max_ejection_percent: 40
          interval: 20s
          success_rate_minimum_hosts: 5
          success_rate_request_volume: 10
    

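    Configuration notes


        outlier_detection:
          consecutive_5xx: 2 # number of consecutive 5xx responses before a host is ejected
          base_ejection_time: 30s # base ejection time; the actual time is the base time multiplied by the number of times the host has been ejected
          max_ejection_percent: 40 # maximum percentage of the cluster that can be ejected; the default is 10%, here 40%, i.e. 2 of the 5 nodes
          interval: 20s # interval between ejection analysis sweeps
          success_rate_minimum_hosts: 5 # minimum number of hosts in the cluster required for success-rate detection
          success_rate_request_volume: 10 # minimum number of requests a host must collect in one interval to be included in success-rate detection
    
    To make the effect easier to observe, the active health check interval and the host ejection time have been increased.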
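
The arithmetic behind `max_ejection_percent` and `base_ejection_time` can be sketched as follows. This is a minimal sketch that assumes the behaviour described in the configuration notes (at least one host may always be ejected, and the ejection time grows linearly with the ejection count); the function names are hypothetical, and any upper cap on ejection time that other Envoy versions may apply is ignored.

```python
def max_ejectable_hosts(cluster_size, max_ejection_percent):
    """Hosts that may be ejected at once; at least one is always allowed."""
    return max(1, cluster_size * max_ejection_percent // 100)


def ejection_seconds(base_ejection_time_s, times_ejected):
    """Ejection duration: base time multiplied by the host's ejection count."""
    return base_ejection_time_s * times_ejected


print(max_ejectable_hosts(5, 40))                    # 2
print([ejection_seconds(30, n) for n in (1, 2, 3)])  # [30, 60, 90]
```

With the values used in this test, at most 2 of the 5 nodes can be ejected at once, and a host ejected for the first time stays out for 30 seconds.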

    Routes

    /502bad simulates a 502 error
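
One possible way to generate the mixed traffic is a small script like the one below. It is a sketch under assumptions: the base URL `http://localhost:80` assumes the compose file's `80:80` port mapping on the local machine, and the helper names `build_plan`, `send`, and `run_plan` are hypothetical.

```python
import urllib.error
import urllib.request

BASE_URL = "http://localhost:80"  # assumption: Envoy listener published by docker-compose


def build_plan(ok_count, bad_count):
    """Return request paths: ok_count hits on "/" followed by bad_count on "/502bad"."""
    return ["/"] * ok_count + ["/502bad"] * bad_count


def send(path, base_url=BASE_URL, timeout=2.0):
    """Send one GET and return the HTTP status code, including error statuses."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises HTTPError for 4xx and 5xx responses


def run_plan(paths, sender=send):
    """Send every path in order and return (path, status) pairs."""
    return [(path, sender(path)) for path in paths]
```

With the stack running, calling `run_plan(build_plan(3, 10))` sends three normal requests followed by ten requests to `/502bad`; because the cluster uses round robin, each backend should receive two of the failing requests in a row.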

    Results


    Simulate some 5xx requests and some 200 requests
    envoy_1       | [2020-09-13 06:10:01.093][1][warning][main] [source/server/server.cc:537] there is no configured limit to the number of allowed active connections. Set a limit via the runtime key overload.global_downstream_max_connections
    webserver2_1  | [GIN] 2020/09/13 - 06:10:08 | 200 |      63.272µs |      172.22.0.7 | GET      "/"
    webserver5_1  | [GIN] 2020/09/13 - 06:10:10 | 200 |      46.732µs |      172.22.0.7 | GET      "/"
    webserver1_1  | [GIN] 2020/09/13 - 06:10:11 | 200 |       45.43µs |      172.22.0.7 | GET      "/"
    webserver3_1  | [GIN] 2020/09/13 - 06:10:13 | 502 |      43.858µs |      172.22.0.7 | GET      "/502bad"
    webserver4_1  | [GIN] 2020/09/13 - 06:10:14 | 502 |      47.486µs |      172.22.0.7 | GET      "/502bad"
    webserver2_1  | [GIN] 2020/09/13 - 06:10:15 | 200 |      15.691µs |      172.22.0.7 | GET      "/"
    webserver5_1  | [GIN] 2020/09/13 - 06:10:16 | 200 |      14.719µs |      172.22.0.7 | GET      "/"
    webserver1_1  | [GIN] 2020/09/13 - 06:10:16 | 200 |      15.758µs |      172.22.0.7 | GET      "/"
    webserver3_1  | [GIN] 2020/09/13 - 06:10:17 | 502 |      15.697µs |      172.22.0.7 | GET      "/502bad"
    webserver2_1  | [GIN] 2020/09/13 - 06:10:17 | 502 |      14.002µs |      172.22.0.7 | GET      "/502bad"
    webserver5_1  | [GIN] 2020/09/13 - 06:10:17 | 502 |      14.913µs |      172.22.0.7 | GET      "/502bad"
    webserver1_1  | [GIN] 2020/09/13 - 06:10:18 | 502 |      14.911µs |      172.22.0.7 | GET      "/502bad"
    webserver4_1  | [GIN] 2020/09/13 - 06:10:18 | 502 |      30.429µs |      172.22.0.7 | GET      "/502bad"
    webserver5_1  | [GIN] 2020/09/13 - 06:10:19 | 200 |      14.377µs |      172.22.0.7 | GET      "/"
    webserver1_1  | [GIN] 2020/09/13 - 06:10:19 | 200 |      14.861µs |      172.22.0.7 | GET      "/"
    webserver2_1  | [GIN] 2020/09/13 - 06:10:19 | 200 |      18.924µs |      172.22.0.7 | GET      "/"
    webserver5_1  | [GIN] 2020/09/13 - 06:10:19 | 200 |      15.899µs |      172.22.0.7 | GET      "/"
    webserver1_1  | [GIN] 2020/09/13 - 06:10:19 | 200 |      24.849µs |      172.22.0.7 | GET      "/"
    
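    The cluster has ejected 40% of its nodes (webserver3 and webserver4, each after two consecutive 502 responses); their health status shows failed_outlier_check
    Requests are now distributed across the remaining three nodes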
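
The health flag can be read from the text output of Envoy's admin `/clusters` endpoint, which the compose file publishes on host port 82. The parser below is a minimal sketch: the function name `ejected_hosts` is hypothetical, the sample text is illustrative rather than captured from this test run, and the split on `::` assumes IPv4 addresses.

```python
def ejected_hosts(clusters_text, cluster="local_service"):
    """Return addresses whose health_flags include failed_outlier_check."""
    hosts = []
    for line in clusters_text.splitlines():
        parts = line.strip().split("::")
        # Per-host lines look like: <cluster>::<ip:port>::<stat>::<value>
        if len(parts) == 4 and parts[0] == cluster and parts[2] == "health_flags":
            if "failed_outlier_check" in parts[3]:
                hosts.append(parts[1])
    return hosts


# Illustrative sample only; addresses and values are made up.
sample = """\
local_service::172.22.0.2:90::health_flags::healthy
local_service::172.22.0.3:90::health_flags::/failed_outlier_check
local_service::172.22.0.3:90::rq_error::2
"""
print(ejected_hosts(sample))  # ['172.22.0.3:90']
```

To use it against the running stack, fetch `http://localhost:82/clusters` (for example with `urllib.request.urlopen`) and pass the decoded body to `ejected_hosts`.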

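    After 30 seconds, the ejected hosts have returned to normal

    Simulate requests again

    After 30 seconds, if no new requests arrive within the interval, the node still shows failed_outlier_check; it recovers once new requests arrive.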