A High-Tech Startup's Road to Building a Monitoring System

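    The company currently runs about 100 machines with no monitoring, so we only learned that a machine was down after it had already failed, and business problems surfaced only when testers reported them. Internally there are several K8s clusters, and each environment has its own monitoring and log collection system, so we needed a single, unified operations monitoring system.

    We first tried Grafana + Mimir + Loki. The secondary development cost was too high and it could not deliver effective alerting in the short term, so we abandoned it and tried Nightingale (夜莺) V5 next.

    Nightingale spared us the cost of developing alert notifications. With a traditional Grafana or Alertmanager setup, you have to integrate your own IM yourself. Nightingale supports business groups and departments, which we use to route alerts at a finer granularity without a separate IM integration. Its alert configuration is also more detailed and easier to use. It works out of the box, with a learning cost close to zero.

    What follows is our hands-on process: building the monitoring system from the angles of system operations, business operations, and database operations.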

    https://github.com/ccfos/nightingale

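    Here we use the simplest option, Docker Compose, to deploy Nightingale. As the documentation says, this approach is not recommended unless you are a Docker expert.

    Nightingale documentation

    The startup command is as follows.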
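
    A minimal sketch of the startup step, assuming the repository archive was unpacked to nightingale-main (the directory name used in later paths of this article) and that docker-compose is installed. The function name start_nightingale and the COMPOSE_CMD variable are hypothetical and exist only for illustration.

```shell
# Hypothetical helper: start the Nightingale stack from its docker directory.
# Assumes the archive was unpacked to nightingale-main, as in the paths used
# later in this article. COMPOSE_CMD is an illustrative override for hosts
# that use "docker compose" instead of "docker-compose".
start_nightingale() {
  compose_dir="${1:-nightingale-main/docker}"
  # Run in a subshell so the caller's working directory is unchanged.
  ( cd "$compose_dir" && ${COMPOSE_CMD:-docker-compose} up -d )
}

# Usage: start_nightingale /path/to/nightingale-main/docker
```

    The same "up -d" invocation is what this article uses later to restart Nightingale after configuration changes.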

    Once the services are up, open the nwebapi port, 18000, in a browser. The default user is root and the password is root.2020.

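    Installing host monitoring

    For the host monitoring agent we chose grafana-agent. It bundles most of the exporters we are likely to use, making it all-in-one.
    It also supports push mode, which simplifies the workflow: grafana-agent only needs to be preinstalled when a host boots, and it then pushes metrics to the central server on its own.

    The installation script is shown below.

    Two things to note about this script:

    1. Change the remote_write address to match your own Nightingale deployment: replace x.x.x.x with your IP.

    2. $_hostip: we recommend setting this to the host IP, because for operations staff the IP is the most direct identifier.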

    function InstallMonitor(){
    # Download the agent binary only if it is not already present.
    [ ! -f /usr/local/bin/grafana-agent ] && wget -O /usr/local/bin/grafana-agent https://lcc-init.oss-cn-hangzhou-internal.aliyuncs.com/grafana-agent
    chmod +x /usr/local/bin/grafana-agent
    mkdir -p /metrics /etc/grafana-agent
    # systemd unit for the agent.
    cat >/etc/systemd/system/grafana-agent.service <<EOF
    [Unit]
    Description="grafana-agent"
    After=network.target
    [Service]
    Type=simple
    ExecStart=/usr/local/bin/grafana-agent -config.file /etc/grafana-agent/grafana-agent.yml
    WorkingDirectory=/usr/local/bin
    SuccessExitStatus=0
    LimitNOFILE=65536
    StandardOutput=syslog
    StandardError=syslog
    SyslogIdentifier=grafana-agent
    KillMode=process
    KillSignal=SIGQUIT
    TimeoutStopSec=5
    Restart=always
    [Install]
    WantedBy=multi-user.target
    EOF
    chmod 0644 /etc/systemd/system/grafana-agent.service
    # Agent configuration. The heredoc is unquoted on purpose so that
    # $_hostip is expanded when the file is written.
    cat >/etc/grafana-agent/grafana-agent.yml <<EOF
    server:
      log_level: info
      http_listen_port: 12345
    metrics:
      wal_directory: /metrics
      global:
        scrape_interval: 15s
        scrape_timeout: 10s
        remote_write:
          # Switch the remote write address depending on whether the host is in the cloud or on-premises.
          - url: http://x.x.x.x:19000/prometheus/v1/write
    integrations:
      agent:
        enabled: true
      node_exporter:
        enabled: true
        instance: "$_hostip"
        include_exporter_metrics: true
      process_exporter:
        enabled: true
        instance: "$_hostip"
        process_names:
          - name: "{{.Comm}}"
            cmdline:
              - '.+'
    EOF
    systemctl daemon-reload
    systemctl enable --now grafana-agent
    }
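
    The script expands $_hostip at the moment it writes grafana-agent.yml, so the variable has to be set before InstallMonitor is called. The following is a minimal sketch of one way to derive it, assuming a Linux host where `hostname -I` lists the host's addresses. The helper name pick_host_ip is hypothetical.

```shell
# Hypothetical helper: print the first non-loopback IPv4 address from a
# whitespace-separated list, such as the output of `hostname -I` on Linux.
pick_host_ip() {
  for addr in $1; do
    case "$addr" in
      127.*|*:*) continue ;;                 # skip loopback and IPv6
      *) printf '%s\n' "$addr"; return 0 ;;
    esac
  done
  return 1                                   # nothing usable found
}

# Usage (before calling InstallMonitor):
#   _hostip=$(pick_host_ip "$(hostname -I)")
```

    On hosts with several interfaces the first address may not be the one you want as the instance label, in which case setting _hostip by hand is the safer choice.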

    BlackBox Exporter

    Download location:

    Download the binary and extract it to /usr/local/bin/.

    function InstallBlackboxExporter(){
    cat >/etc/systemd/system/blackbox_exporter.service <<EOF
    [Unit]
    Description="blackbox_exporter"
    After=network.target
    [Service]
    Type=simple
    ExecStart=/usr/local/bin/blackbox_exporter --config.file=/etc/blackbox-exporter/blackbox.yml
    WorkingDirectory=/usr/local/bin
    SuccessExitStatus=0
    LimitNOFILE=65536
    StandardOutput=syslog
    StandardError=syslog
    SyslogIdentifier=blackbox_exporter
    KillMode=process
    KillSignal=SIGQUIT
    TimeoutStopSec=5
    Restart=always
    [Install]
    WantedBy=multi-user.target
    EOF
    chmod 0644 /etc/systemd/system/blackbox_exporter.service
    # Write the module configuration to the path given in ExecStart.
    # The heredoc is quoted so the shell does not expand ${1} below.
    mkdir -p /etc/blackbox-exporter
    cat >/etc/blackbox-exporter/blackbox.yml <<'EOF'
    modules:
      http_2xx:
        prober: http
      http_post_2xx:
        prober: http
        http:
          method: POST
      tcp_connect:
        prober: tcp
      pop3s_banner:
        prober: tcp
        tcp:
          query_response:
            - expect: "^+OK"
          tls: true
          tls_config:
            insecure_skip_verify: false
      grpc:
        prober: grpc
        grpc:
          tls: true
          preferred_ip_protocol: "ip4"
      grpc_plain:
        prober: grpc
        grpc:
          tls: false
          service: "service1"
      ssh_banner:
        prober: tcp
        tcp:
          query_response:
            - expect: "^SSH-2.0-"
            - send: "SSH-2.0-blackbox-ssh-check"
      irc_banner:
        prober: tcp
        tcp:
          query_response:
            - send: "NICK prober"
            - send: "USER prober prober prober :prober"
            - expect: "PING :([^ ]+)"
              send: "PONG ${1}"
            - expect: "^:[^ ]+ 001"
      icmp:
        prober: icmp
    EOF
    systemctl daemon-reload
    systemctl enable --now blackbox_exporter
    }
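
    Once the exporter is running, a probe can be triggered by hand to confirm that a module works before wiring it into Prometheus. This is a minimal sketch: the helper name probe_ok and the target https://example.com are hypothetical, and 9115 is the exporter port used in the Prometheus configuration later in this article.

```shell
# Hypothetical helper: read Blackbox Exporter /probe output on stdin and
# succeed only if it reports "probe_success 1".
probe_ok() {
  grep -q '^probe_success 1$'
}

# Usage (x.x.x.x is the exporter host, as elsewhere in this article):
#   curl -s 'http://x.x.x.x:9115/probe?module=http_2xx&target=https://example.com' | probe_ok \
#     && echo "probe passed" || echo "probe failed"
```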

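    MySQL Exporter

    Download: https://github.com/prometheus/mysqld_exporter

    Download the binary and extract it to /usr/local/bin/.

    Run the following SQL on each database you want to monitor:

    -- On MySQL 5.7.6+ and 8.0 the connection limit is set in CREATE USER rather than in GRANT.
    CREATE USER 'exporter'@'%' IDENTIFIED BY 'xxxxx' WITH MAX_USER_CONNECTIONS 3;
    GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';
    FLUSH PRIVILEGES;

    The installation script is as follows.

    The user and password in mysqld_exporter.cnf are those of the account created by the SQL above.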

    function InstallMysqldExporter(){
    cat >/etc/systemd/system/mysqld_exporter.service <<EOF
    [Unit]
    Description="mysqld_exporter"
    After=network.target
    [Service]
    Type=simple
    ExecStart=/usr/local/bin/mysqld_exporter --config.my-cnf=/etc/mysqld_exporter.cnf --collect.auto_increment.columns --collect.binlog_size --collect.global_status --collect.global_variables --collect.info_schema.innodb_metrics --collect.info_schema.innodb_cmp --collect.info_schema.innodb_cmpmem --collect.info_schema.processlist --collect.info_schema.query_response_time --collect.info_schema.tables --collect.info_schema.tablestats --collect.info_schema.userstats --collect.perf_schema.eventswaits --collect.perf_schema.file_events --collect.perf_schema.indexiowaits --collect.perf_schema.tableiowaits --collect.perf_schema.tablelocks --collect.slave_status
    WorkingDirectory=/usr/local/bin
    SuccessExitStatus=0
    LimitNOFILE=65536
    StandardOutput=syslog
    StandardError=syslog
    SyslogIdentifier=mysqld_exporter
    KillMode=process
    KillSignal=SIGQUIT
    TimeoutStopSec=5
    Restart=always
    [Install]
    WantedBy=multi-user.target
    EOF
    chmod 0644 /etc/systemd/system/mysqld_exporter.service
    # Credentials of the exporter account created with the SQL above.
    cat >/etc/mysqld_exporter.cnf <<EOF
    [client]
    user=exporter
    password=xxxxx
    host=x.x.x.x
    port=3306
    EOF
    systemctl daemon-reload
    systemctl enable --now mysqld_exporter
    }

    Generating configuration dynamically with consul + consul-template

    Installing Consul

    Replace the -bind and -client addresses with the local IP.

    function InstallConsul(){
    yum-config-manager --add-repo https://rpm.releases.hashicorp.com/RHEL/hashicorp.repo
    yum -y install consul
    mkdir -p /data/consul
    cat >/etc/systemd/system/consul.service <<EOF
    [Unit]
    Description="consul"
    After=network.target
    [Service]
    Type=simple
    ExecStart=/usr/bin/consul agent -server -bootstrap-expect 1 -bind=x.x.x.x -client=x.x.x.x -data-dir=/data/consul -node=agent-one -config-dir=/etc/consul.d -ui
    WorkingDirectory=/usr/bin/
    SuccessExitStatus=0
    LimitNOFILE=65536
    StandardOutput=syslog
    StandardError=syslog
    SyslogIdentifier=consul
    KillMode=process
    KillSignal=SIGQUIT
    TimeoutStopSec=5
    Restart=always
    [Install]
    WantedBy=multi-user.target
    EOF
    chmod 0644 /etc/systemd/system/consul.service
    systemctl daemon-reload
    systemctl enable --now consul
    }

    Installing consul-template

    The installation script is shown below.

    Replace x.x.x.x with the Nightingale address and a.b.c.d with the address where Consul is deployed.

    wget https://releases.hashicorp.com/consul-template/0.29.0/consul-template_0.29.0_linux_amd64.zip
    unzip consul-template_0.29.0_linux_amd64.zip
    chmod +x consul-template
    mv consul-template /usr/local/bin/consul-template
    # The directory name must match the "source" paths in the config below.
    mkdir -p /etc/consul-template/templates
    cat > /etc/consul-template/consul-template.conf << EOF
    log_level = "warn"
    syslog {
      # This enables syslog logging.
      enabled = true
      # This is the name of the syslog facility to log to.
      facility = "LOCAL5"
    }
    consul {
      # auth {
      #   enabled = true
      #   username = "test"
      #   password = "test"
      # }
      # Replace with your Consul address.
      address = "a.b.c.d:8500"
      retry {
        enabled = true
        attempts = 12
        backoff = "250ms"
        # If max_backoff is set to 10s and backoff is set to 1s, sleep times
        # would be: 1s, 2s, 4s, 8s, 10s, 10s, ...
        max_backoff = "3m"
      }
    }
    template {
      source = "/etc/consul-template/templates/url-monitor.ctmpl"
      destination = "/home/nightingale-main/docker/prometc/conf.d/url/url.yaml"
      # The /-/reload endpoint only works when Prometheus runs with --web.enable-lifecycle.
      command = "curl -X POST http://x.x.x.x:9090/-/reload"
      command_timeout = "60s"
      backup = true
      wait {
        max = "20s"
      }
    }
    template {
      source = "/etc/consul-template/templates/icmp-monitor.ctmpl"
      destination = "/home/nightingale-main/docker/prometc/conf.d/icmp/icmp.yaml"
      command = ""
      command_timeout = "60s"
      backup = true
      wait {
        min = "2s"
        max = "20s"
      }
    }
    EOF
    # Template for URL probes: key is the domain, value is the path.
    cat > /etc/consul-template/templates/url-monitor.ctmpl <<'EOF'
    - targets:
      {{- range ls "blackbox/url/http200" }}
      - http://{{ .Key }}{{ .Value }}
      {{- end }}
    EOF
    # Template for ICMP probes: one target group per key.
    cat > /etc/consul-template/templates/icmp-monitor.ctmpl <<'EOF'
    {{- range ls "blackbox/icmp" }}
    - targets:
      - {{ .Key }}
      labels:
        instance: {{ .Key }}
    {{- end }}
    EOF
    cat > /etc/systemd/system/consul-template.service <<EOF
    [Unit]
    Description="consul-template"
    After=network.target
    [Service]
    Type=simple
    ExecStart=/usr/local/bin/consul-template -config /etc/consul-template/consul-template.conf
    WorkingDirectory=/usr/local/bin
    SuccessExitStatus=0
    LimitNOFILE=65536
    StandardOutput=syslog
    StandardError=syslog
    SyslogIdentifier=consul-template
    KillMode=process
    KillSignal=SIGQUIT
    TimeoutStopSec=5
    Restart=always
    [Install]
    WantedBy=multi-user.target
    EOF
    systemctl daemon-reload
    systemctl enable --now consul-template.service

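    Configuring Consul K/V to generate URL monitoring dynamically

    Add the following K/V entries. They correspond to the paths rendered in the *.ctmpl files above: the key is the domain name and the value is the path.

    Consul configuration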
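
    To make the key/value convention concrete, here is a minimal sketch with made-up entries. The domains and paths are hypothetical, and render_url_targets is an illustrative stand-in that only mimics what the URL template is meant to produce; consul-template does the real rendering.

```shell
# Example K/V entries (hypothetical domains and paths), written with the
# Consul CLI:
#   consul kv put blackbox/url/http200/www.example.com /healthz
#   consul kv put blackbox/url/http200/api.example.com /ping
#
# Illustration only: mimic the file_sd target list that url-monitor.ctmpl
# is meant to render. Reads "key value" pairs on stdin.
render_url_targets() {
  echo "- targets:"
  while read -r key value; do
    [ -n "$key" ] && echo "  - http://${key}${value}"
  done
  return 0
}

printf '%s\n' "www.example.com /healthz" "api.example.com /ping" | render_url_targets
```

    Each key/value pair becomes one probe URL, so adding or removing a monitored URL is only a K/V change in Consul.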

    Modifying the Prometheus configuration

    Append the following to nightingale-main/docker/prometc/prometheus.yml:

      - job_name: MySQL
        static_configs:
          - targets:
              - x.x.x.x:9104
            labels:
              instance: MySQL-dev
      - job_name: process
        static_configs:
          - targets:
              - x.x.x.x:9256
      - job_name: 'blackbox-url-monitor'
        metrics_path: /probe
        params:
          module: [http_2xx] # Look for a HTTP 200 response.
        file_sd_configs:
          - refresh_interval: 1m
            files:
              - ./conf.d/url/*.yaml
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: x.x.x.x:9115
      - job_name: 'blackbox-icmp-monitor'
        scrape_interval: 1m
        metrics_path: /probe
        params:
          module: [icmp]
        file_sd_configs:
          - refresh_interval: 1m
            files:
              - ./conf.d/icmp/*.yaml
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - target_label: __address__
            replacement: x.x.x.x:9115

    Create the conf.d directory under nightingale-main/docker/prometc/ with the following commands:

    cd nightingale-main/docker/prometc/
    mkdir -p conf.d/{icmp,url}

    Restart Prometheus with the following command:

    docker restart prometheus

    After the restart, check the status of Prometheus.
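
    One way to run this check from the command line is to ask Prometheus for its target list and count the targets that are down. This is a minimal sketch: the helper name count_down_targets is hypothetical, and it relies on plain text matching against the compact JSON that the /api/v1/targets endpoint normally returns.

```shell
# Hypothetical helper: count targets reported as down in the JSON returned by
# the Prometheus /api/v1/targets endpoint (read from stdin). Plain text
# matching keeps it free of JSON tooling; "|| true" keeps the pipeline
# successful when nothing is down.
count_down_targets() {
  { grep -o '"health":"down"' || true; } | wc -l | tr -d ' '
}

# Usage (x.x.x.x is the Prometheus host):
#   curl -s http://x.x.x.x:9090/api/v1/targets | count_down_targets
```

    A result of 0 means every configured job, including the two blackbox jobs, is being scraped successfully.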

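    Thanks to the Nightingale community for their support with the Loki integration.

    1. Prerequisite: the Nightingale version must be higher than 5.9.2.
    2. Loki is already deployed and already supports multi-tenancy.

    Loki configuration is not covered here; plenty of tutorials are available online.

    Append the following to docker-compose.yml, at the same level as nserver.

    Generate the configuration files for the lokinserver container as follows.

    cp -r n9eetc lokin9eetc
    cd lokin9eetc

    Modify the Reader section of lokin9eetc/server.conf as follows: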

    [Reader]
    # prometheus base url
    Url = "http://loki.xxx.xxx/loki/"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Timeout = 30000
    DialTimeout = 10000
    TLSHandshakeTimeout = 30000
    ExpectContinueTimeout = 1000
    IdleConnTimeout = 90000
    # time duration, unit: ms
    KeepAlive = 30000
    MaxConnsPerHost = 0
    MaxIdleConns = 100
    MaxIdleConnsPerHost = 10
    Headers = ["X-Scope-OrgID","lcc-loki"]

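    Edit the configuration file nightingale-main/docker/n9eetc/webapi.conf and append the following.

    If multi-tenancy is enabled, remember to pass Headers; if it is not, remove the Headers field. Loki's APIs under the loki prefix are Prometheus-compatible, so the prefix must be included. Replace the Prom field with your own domain.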

    [[Clusters]]
    # Prometheus cluster name
    Name = "Loki"
    # Prometheus APIs base url
    Prom = "http://loki.xxx.xxx/loki/"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Timeout = 30000
    DialTimeout = 10000
    TLSHandshakeTimeout = 30000
    ExpectContinueTimeout = 1000
    IdleConnTimeout = 90000
    # time duration, unit: ms
    KeepAlive = 30000
    MaxConnsPerHost = 0
    MaxIdleConns = 100
    MaxIdleConnsPerHost = 100
    Headers = ["X-Scope-OrgID","lcc-loki"]
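
    Before restarting Nightingale it can help to confirm that the base URL and tenant header are accepted by Loki itself. A minimal sketch, assuming Loki's label-listing endpoint /loki/api/v1/labels is reachable at the same base URL; the helper name loki_labels_url is hypothetical.

```shell
# Hypothetical helper: build the Loki label-listing URL from the same base
# URL used for Prom/Url in the configuration, tolerating a trailing slash.
loki_labels_url() {
  printf '%s/api/v1/labels\n' "${1%/}"
}

# Usage: query Loki directly with the tenant header (lcc-loki is the tenant
# ID from the configuration in this article):
#   curl -s -H 'X-Scope-OrgID: lcc-loki' "$(loki_labels_url http://loki.xxx.xxx/loki/)"
```

    If multi-tenancy is disabled, the same request should work without the X-Scope-OrgID header.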

    Restart Nightingale:

    docker-compose up -d

    System operations

    CPU utilization > 90

    (100-(avg by (mode, instance)(rate(node_cpu_seconds_total{mode="idle"}[1m])))*100) > 90

    Inode utilization > 90

    (100 - ((node_filesystem_files_free * 100) / node_filesystem_files)) > 90

    sshd service is down

    (namedprocess_namegroup_num_procs{groupname="sshd"}) == 0

    Memory utilization > 95

    (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - (node_memory_Cached_bytes + node_memory_Buffers_bytes))/node_memory_MemTotal_bytes*100 > 95
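
    The memory rule subtracts free memory, page cache, and buffers from the total, so cache that the kernel can reclaim does not count as used. As a minimal sketch, the same arithmetic can be reproduced on a single host from /proc/meminfo, which is where node_exporter reads these values. The helper name mem_used_percent is hypothetical.

```shell
# Hypothetical helper: compute (MemTotal - MemFree - Cached - Buffers) / MemTotal * 100
# from /proc/meminfo-style input on stdin, mirroring the alert expression.
mem_used_percent() {
  awk '
    $1 == "MemTotal:" { total = $2 }
    $1 == "MemFree:"  { free = $2 }
    $1 == "Cached:"   { cached = $2 }
    $1 == "Buffers:"  { buffers = $2 }
    END { if (total > 0) printf "%.1f\n", (total - free - cached - buffers) / total * 100 }
  '
}

# Usage on a Linux host:
#   mem_used_percent < /proc/meminfo
```

    For example, with 1000 kB total, 200 kB free, 250 kB cached, and 50 kB buffers, the result is 50.0, well below the 95 threshold.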

    File handle usage > 90

    (node_filefd_allocated{}/node_filefd_maximum{}*100) > 90

    IO wait > 30%

    IO util over the past minute > 80

    (rate(node_disk_io_time_seconds_total{}[1m]) * 100) > 80

    Ping > 1s

    avg_over_time(probe_icmp_duration_seconds[1m]) > 1

    Load average per CPU core > 2

    (avg(node_load1) by(instance)/count by (instance)(node_cpu_seconds_total{mode='idle'})) > 2

    TCP retransmission rate > 5%

    (rate(node_netstat_Tcp_RetransSegs{}[5m])/ rate(node_netstat_Tcp_OutSegs{}[5m])*100) > 5

    Disk utilization > 85%

    (100 - ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes) ) > 85

    Node requires a reboot

    node_reboot_required > 0

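    Business operations

    Our applications are written in Go; set rules for other applications as needed.

    More than 10 ERROR log entries within one minute

    For logs, select the Loki cluster we added above.

    Error logs

    URL probe fails

    probe_http_status_code <= 199 OR probe_http_status_code >= 400

    Panic in the past minute

    Panic logs

    Database operations

    Only some rules are listed here; more can be found in the importable rules.

    MySQL rules

    Database restarted

    Connections exceed 80%

    avg by (instance) (mysql_global_status_threads_connected) / avg by (instance) (mysql_global_variables_max_connections) * 100 > 80

    Slow queries in the past minute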