Prometheus监控Ceph集群并设置AlertManager告警

释放双眼,带上耳机,听听看~!
之前写过ceph的搭建,那么我们ceph同样也需要prometheus进行监控数据。我这里使用prometheus监控ceph并配置alertmanager告警
🤖 由 ChatGPT 生成的文章摘要

MGR

Manager:管理器 Ceph管理器守护程序(Cephmgr)负责跟踪运行时指标和Ceph群集的当前状态,包括存储利用率、当前性能指标和系统负载。Ceph管理器守护进程还托管基于python的模块,以管理和公开Ceph集群信息,包括基于web的Ceph仪表板和Restful APT。高可用性通常需要至少两个管理器。

 

CEPH安装可以参考下面文章

Ceph-deploy 快速部署Ceph集群

Ceph基础知识和基础架构

查看集群状态

[root@ceph-01 ~]# ceph -s
  cluster:
    id:     c8ae7537-8693-40df-8943-733f82049642
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03 (age 2h)
    mgr: ceph-03(active, since 17h), standbys: ceph-01, ceph-02
    mds: cephfs-abcdocker:1 cephfs:1 {cephfs-abcdocker:0=ceph-01=up:active,cephfs:0=ceph-02=up:active} 1 up:standby
    osd: 4 osds: 4 up (since 17h), 4 in (since 3w)
    rgw: 2 daemons active (ceph-01, ceph-02)

  task status:

  data:
    pools:   13 pools, 656 pgs
    objects: 3.77k objects, 12 GiB
    usage:   41 GiB used, 139 GiB / 180 GiB avail
    pgs:     656 active+clean

  io:
    client:   2.0 KiB/s wr, 0 op/s rd, 0 op/s wr

#目前我们集群是ceph-03提供服务,ceph-02,ceph-01为从节点

Ceph

ceph-03为提供服务的节点,所以先需要在ceph-03开启dashboard

我们首先需要安装ceph dashboard

#在ceph-mgr节点开启dashboard

yum install -y ceph-mgr-dashboard

检查是否开启dashboard模块

默认没有开启

#如果可以过滤出来,则不需要开启
ceph mgr module ls | grep dashboard

#开启dashboard
ceph mgr module enable dashboard
#提示报错可以使用
ceph mgr module enable dashboard --force

设置mgr-dashboard监听信息

#设置监听地址
ceph config set mgr mgr/dashboard/server_addr 0.0.0.0

#设置端口号
ceph config set mgr mgr/dashboard/server_port 7000

#关闭ssl验证
ceph config set mgr mgr/dashboard/ssl false

创建管理员用户

最新的ceph dashboard不支持直接在命令行里面创建用户的密码,所以需要.先创建一个包含用户密码的文件

# 创建密码文本
cat >/opt/secretkey<<EOF
123123
EOF
# 使用该文本作为密钥 
ceph dashboard ac-user-create admin administrator -i /opt/secretkey 
#administrator为管理员权限

Ceph Dashboard

创建完用户名密码,我们就可以访问mgr:7000 端口测试

mgr-dashboard可以设置多个节点,然后nginx upstream代理

[root@ceph-03 ~]# lsof -i:7000
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
ceph-mgr 9568 ceph   36u  IPv4 353517      0t0  TCP *:afs3-fileserver (LISTEN)

还可以使用ceph命令查看dashboard节点

[root@ceph-03 ~]# ceph mgr services
{
    "dashboard": "http://ceph-03:7000/"
}

访问ceph mgr ip:7000 端口

http://ceph-mgr节点:7000/#/login?returnUrl=%2Fdashboard

输入用户名密码

Prometheus监控Ceph集群并设置AlertManager告警

登陆成功后界面

Prometheus监控Ceph集群并设置AlertManager告警

Prometheus

Promethues相关知识可以参考,Docker版本安装

Prometheus 监控MySQL数据库

开启Prometheus metric模块

ceph mgr module enable prometheus

默认ceph mgr metric端口为9283

[root@ceph-03 ~]# lsof -i:9283
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
ceph-mgr 9568 ceph   33u  IPv6 360155      0t0  TCP *:callwaveiam (LISTEN)

#测试metric
[root@ceph-03 ~]# curl 127.0.0.1:9283/metrics|head
# HELP ceph_mds_mem_dir_minus Directories closed
# TYPE ceph_mds_mem_dir_minus counter
ceph_mds_mem_dir_minus{ceph_daemon="mds.ceph-01"} 0.0
# HELP ceph_mds_mem_dir_plus Directories opened
# TYPE ceph_mds_mem_dir_plus counter
ceph_mds_mem_dir_plus{ceph_daemon="mds.ceph-01"} 12.0
# HELP ceph_osd_flag_norebalance OSD Flag norebalance
# TYPE ceph_osd_flag_norebalance untyped
ceph_osd_flag_norebalance 0.0

加入Prometheus监控

  - job_name: 'ceph-mgr'
    static_configs:
    - targets: ['82.157.142.150:7002']

Prometheus监控Ceph集群并设置AlertManager告警

reload Prometheus,自行重启Prometheus

查看效果

Prometheus监控Ceph集群并设置AlertManager告警

AlertManager告警

监控项设置完了,我们设置一下ceph的告警规则

alertmanager搭建可以看下面的文章

AlertManager 微信告警配置

[root@prometheus ~]# vim /etc/prometheus/rules/ceph_exporter.yaml

groups:
    - name: Ceph status
      rules:
      - alert: Ceph 实例不健康
        expr: ceph_health_status != 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Ceph 实例不健康{{ $labels.instance }})
          description: "Ceph instance unhealthyn  VALUE = {{ $value }}n  LABELS = {{ $labels }}"
      - alert: 检测到Ceph监视器时钟偏差
        expr: abs(ceph_monitor_clock_skew_seconds) > 0.2
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Ceph monitor clock skew (instance {{ $labels.instance }})
          description: "Ceph monitor clock skew detected. Please check ntp and hardware clock settingsn  VALUE = {{ $value     }}n  LABELS = {{ $labels }}"
      - alert: Ceph监视器存储空间不足
        expr: ceph_monitor_avail_percent < 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Ceph monitor low space (instance {{ $labels.instance }})
          description: "Ceph monitor storage is low.n  VALUE = {{ $value }}n  LABELS = {{ $labels }}"
      - alert: Ceph对象存储守护进程关闭
        expr: ceph_osd_up == 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Ceph OSD Down (instance {{ $labels.instance }})
          description: "Ceph Object Storage Daemon Downn  VALUE = {{ $value }}n  LABELS = {{ $labels }}"

      - alert: Ceph高OSD延迟
        expr: ceph_osd_perf_apply_latency_seconds > 5
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: Ceph high OSD latency (instance {{ $labels.instance }})
          description: "Ceph Object Storage Daemon latency is high. Please check if it doesn't stuck in weird state.n  VALUE     = {{ $value }}n  LABELS = {{ $labels }}"

      - alert: CephOSD空间不足
        expr: ceph_osd_utilization > 90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Ceph OSD low space (instance {{ $labels.instance }})
          description: "Ceph Object Storage Daemon is going out of space. Please add more disks.n  VALUE = {{ $value }}n      LABELS = {{ $labels }}"

      - alert: CephOSD重新加权
        expr: ceph_osd_weight < 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Ceph OSD reweighted (instance {{ $labels.instance }})
          description: "Ceph Object Storage Daemon takes too much time to resize.n  VALUE = {{ $value }}n  LABELS = {{     $labels }}"
      - alert: CephPG下降
        expr: ceph_pg_down > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Ceph PG down (instance {{ $labels.instance }})
          description: "Some Ceph placement groups are down. Please ensure that all the data are available.n  VALUE = {{     $value }}n  LABELS = {{ $labels }}"
      - alert: CephPG不完整
        expr: ceph_pg_incomplete > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Ceph PG incomplete (instance {{ $labels.instance }})
          description: "Some Ceph placement groups are incomplete. Please ensure that all the data are available.n  VALUE =     {{ $value }}n  LABELS = {{ $labels }}"
      - alert: CephPG不一致
        expr: ceph_pg_inconsistent > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: Ceph PG inconsistent (instance {{ $labels.instance }})
          description: "Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.n  VALUE     = {{ $value }}n  LABELS = {{ $labels }}"
      - alert: CephPG激活时间长
        expr: ceph_pg_activating > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Ceph PG activation long (instance {{ $labels.instance }})
          description: "Some Ceph placement groups are too long to activate.n  VALUE = {{ $value }}n  LABELS = {{ $labels }}"
      - alert: Ceph PG回填已满
        expr: ceph_pg_backfill_toofull > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Ceph PG backfill full (instance {{ $labels.instance }})
          description: "Some Ceph placement groups are located on full Object Storage Daemon on cluster. Those PGs can be     unavailable shortly. Please check OSDs, change weight or reconfigure CRUSH rules.n  VALUE = {{ $value }}n  LABELS     = {{ $labels }}"
      - alert: Ceph PG不可用
        expr: ceph_pg_total - ceph_pg_active > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Ceph PG unavailable (instance {{ $labels.instance }})
          description: "Some Ceph placement groups are unavailable.n  VALUE = {{ $value }}n  LABELS = {{ $labels }}"

重启Prometheus和alertmanager

[root@prometheus ~]# docker restart prometheus_new
prometheus_new
[root@prometheus ~]# docker restart alertmanager
alertmanager

进入Prometheus就可以看到alert,看到ceph 告警信息

Prometheus监控Ceph集群并设置AlertManager告警

Grafana

Grafana Docker安装

Prometheus 监控MySQL数据库

Grafana上传模板

#我提供的grafana下载地址
wget https://d.frps.cn/file/tools/grafana/ceph/ceph-cluster_rev1.json
#将这个json下载下来

#windows下载
https://d.frps.cn/?tools/grafana/ceph

Prometheus监控Ceph集群并设置AlertManager告警

手动上传json
Prometheus监控Ceph集群并设置AlertManager告警

或者使用这个id2842
https://grafana.com/grafana/dashboards/2842

9966也可以使用

Prometheus监控Ceph集群并设置AlertManager告警
Prometheus监控Ceph集群并设置AlertManager告警

告警测试

我们将ceph-02重启,检查是否可以告警
Prometheus监控Ceph集群并设置AlertManager告警

ceph状态已经异常

[root@ceph-01 ~]# ceph -s
  cluster:
    id:     c8ae7537-8693-40df-8943-733f82049642
    health: HEALTH_WARN
            1 filesystem is degraded
            insufficient standby MDS daemons available
            1 osds down
            1 host (1 osds) down
            Degraded data redundancy: 3803/11409 objects degraded (33.333%), 330 pgs degraded
            1/3 mons down, quorum ceph-01,ceph-03

  services:
    mon: 3 daemons, quorum ceph-01,ceph-03 (age 16s), out of quorum: ceph-02
    mgr: ceph-03(active, since 40m), standbys: ceph-01, ceph-02
    mds: cephfs-abcdocker:1 cephfs:1/1 {cephfs-abcdocker:0=ceph-01=up:active,cephfs:0=ceph-03=up:replay}
    osd: 4 osds: 3 up (since 15s), 4 in (since 3w)
    rgw: 1 daemon active (ceph-01)

  task status:

  data:
    pools:   13 pools, 656 pgs
    objects: 3.80k objects, 13 GiB
    usage:   41 GiB used, 139 GiB / 180 GiB avail
    pgs:     3803/11409 objects degraded (33.333%)
             330 active+undersized+degraded
             326 active+undersized

  io:
    client:   115 KiB/s wr, 0 op/s rd, 3 op/s wr

Prometheus 已经提示告警

Prometheus监控Ceph集群并设置AlertManager告警

等待alertmanager发送告警内容

具体发送内容,自行优化即可

Prometheus监控Ceph集群并设置AlertManager告警

给TA打赏
共{{data.count}}人
人已打赏
DockerGrafanaprometheus

Prometheus Blackbox域名SSL证书监控并设置AlertManager告警

2022-6-26 17:29:16

CephLinux

Linux客户端远程挂载Ceph RBD并设置开机启动自动挂载

2022-7-16 17:06:06

2 条回复 A文章作者 M管理员
  1. 眈꯭眈꯭逐꯭逐꯭

    告警模板是发给企业微信的吗? 模板可不可以分享下?

  2. […] […]

个人中心
购物车
优惠劵
今日签到
有新私信 私信列表
搜索