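Building a Prometheus Monitoring and Alerting System (with Feishu Alert Integration)
Prometheus is an open-source system monitoring and alerting framework that is well suited to monitoring large clusters. It was the second project to join the CNCF, after Kubernetes, and is generally regarded as second only to Kubernetes in popularity among CNCF projects. This article walks through setting up a complete Prometheus monitoring and alerting service.
Prometheus is today's mainstream monitoring system. Rather than a single program, it is a set of services that work together. The overall architecture is previewed below: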
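The monitoring stack built in this tutorial consists of the following services:
- prometheus: the core of the monitoring stack; it aggregates and stores monitoring data and generates alerts
- exporter: the collection side of monitoring; it gathers the data
- grafana: data visualization; it presents the collected data in rich dashboards
- alertmanager: alert management; it processes alerts, including alert intervals and message priority
- prometheusAlert: the component that actually sends alerts; it holds the alert templates and delivers the alert messages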
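Apart from the monitored collection nodes, every service is deployed with docker-compose. The deployment environment is:
- OS: Ubuntu 20.04
- Server IP: 172.16.9.124
- docker version: 20.10.21
- docker-compose version: 1.29.2
- Configuration path: /root/prometheus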
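Deploying prometheus
prometheus is mainly responsible for collecting and storing data, and it provides the PromQL query language. Deploying it takes two steps:
- Prepare the configuration files
- Start prometheus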
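Preparing the configuration files
All configuration files for the stack live under /root/prometheus. First create the directory for the prometheus service, /root/prometheus/prometheus, and inside it create:
- config: holds the service's main configuration file, prometheus.yml
- data: holds the service's database files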
root@ubuntu-System-Product-Name:~/prometheus# tree . -L 3
.
├── docker-compose.yaml
└── prometheus
    ├── config
    │   └── prometheus.yml
    └── data
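The layout above can be created in one step. The following is a minimal sketch; `make_layout` is a hypothetical helper name introduced here for illustration, and the two files are created empty so they can be filled in during the following steps.

```shell
# make_layout BASE: hypothetical helper that creates the directory layout
# used in this tutorial underneath BASE.
make_layout() (
    base="$1"
    mkdir -p "$base/prometheus/config" "$base/prometheus/data"
    # Create empty placeholders for the two files shown in the tree output.
    # touch does not truncate files that already exist.
    touch "$base/docker-compose.yaml" "$base/prometheus/config/prometheus.yml"
)

# Usage with the tutorial's base path:
#   make_layout /root/prometheus
```

Taking the base directory as an argument keeps the helper usable on machines where the stack lives somewhere other than /root/prometheus.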
Create prometheus.yml, the main configuration file of the prometheus service:
global:
  scrape_interval: 30s     # scrape data every 30s
  evaluation_interval: 30s # evaluate alerting rules every 30s
scrape_configs:
  # scrape the prometheus service itself
  - job_name: prometheus
    static_configs:
      - targets: ['172.16.9.124:9090']
        labels:
          instance: prometheus
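Change the permissions of the data directory so that the container is allowed to write its data files there (run this from /root/prometheus/prometheus):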
chmod 777 data
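Mode 777 works, but it makes the directory writable by every local user. A narrower alternative is to hand the directory to the user the container runs as. The sketch below assumes the prom/prometheus image runs as user nobody with UID/GID 65534; that is an assumption worth checking for your image, for example with `docker inspect --format '{{.Config.User}}' prom/prometheus`. `set_data_owner` is a hypothetical helper name.

```shell
# set_data_owner DIR OWNER: hypothetical helper that gives DIR to OWNER
# (in uid:gid form) and restricts it to owner-only access instead of 777.
set_data_owner() (
    dir="$1"
    owner="$2"
    chown -R "$owner" "$dir"
    chmod 700 "$dir"
)

# Usage, run as root. 65534:65534 is an assumption about the image's user:
#   set_data_owner /root/prometheus/prometheus/data 65534:65534
```

If the container then fails to start with permission errors in its log, the assumed UID does not match the image and should be adjusted.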
Create docker-compose.yaml in /root/prometheus:
version: '3'
services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    restart: always
    ports:
      - "9090:9090"
    volumes:
      - /root/prometheus/prometheus/config:/etc/prometheus
      - /root/prometheus/prometheus/data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.enable-lifecycle'
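Parameter notes:
command:
- --config.file=/etc/prometheus/prometheus.yml specifies the configuration file to use
- --storage.tsdb.path=/prometheus specifies the path of the time-series database
- --web.enable-lifecycle enables reloading the configuration (and shutting down) through HTTP requests
volumes:
- /root/prometheus/prometheus/config:/etc/prometheus mounts the directory that contains the configuration file
- /root/prometheus/prometheus/data:/prometheus mounts the database directory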
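With --web.enable-lifecycle set, an edited prometheus.yml can be applied without restarting the container by sending an HTTP POST to the /-/reload endpoint. The following is a minimal sketch; the function names are hypothetical helpers, and the host and port are the ones used in this tutorial.

```shell
# Build the lifecycle reload URL for a Prometheus server at HOST:PORT.
reload_url() {
    printf 'http://%s:%s/-/reload\n' "$1" "$2"
}

# Ask Prometheus to re-read its configuration file.
# -f makes curl exit non-zero on an HTTP error, for example when the
# lifecycle API is not enabled or the new configuration is rejected.
reload_prometheus() {
    curl -fsS -X POST "$(reload_url "$1" "$2")"
}

# Usage:
#   reload_prometheus 172.16.9.124 9090
```

After a reload, the container log should show the "Loading configuration file" message again, which is a quick way to confirm that the request took effect.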
Starting prometheus
Start the service with docker-compose:
docker-compose up -d
Check that the container is running:
root@ubuntu-System-Product-Name:~/prometheus# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
776772d69b20 prom/prometheus "/bin/prometheus --c…" 5 minutes ago Up 5 minutes 0.0.0.0:9090->9090/tcp, :::9090->9090/tcp prometheus
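An "Up" status only says that the process is running. Prometheus also exposes a /-/ready endpoint that returns success once it can serve requests, and polling it is handy in scripts. The sketch below is one way to do that; `wait_until` and `prometheus_ready` are hypothetical helper names.

```shell
# wait_until TRIES DELAY CMD...: hypothetical helper that reruns CMD until it
# succeeds, sleeping DELAY seconds after each failed attempt. It fails once
# TRIES attempts have been used up.
wait_until() (
    tries="$1"
    delay="$2"
    shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        "$@" && exit 0
        i=$((i + 1))
        sleep "$delay"
    done
    exit 1
)

# Succeeds once Prometheus at HOST:PORT reports that it is ready.
prometheus_ready() {
    curl -fsS -o /dev/null "http://$1:$2/-/ready"
}

# Usage: poll up to 10 times, 3 seconds apart.
#   wait_until 10 3 prometheus_ready 172.16.9.124 9090
```

Keeping the retry loop separate from the probe means the same loop can later wait for the other services in the stack.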
View the container's logs:
docker logs -f 776
ts=2023-12-25T10:21:17.560Z caller=main.go:478 level=info msg="No time or size retention was set so using the default time retention" duration=15d
ts=2023-12-25T10:21:17.560Z caller=main.go:515 level=info msg="Starting prometheus" version="(version=2.32.1, branch=HEAD, revision=41f1a8125e664985dd30674e5bdf6b683eff5d32)"
ts=2023-12-25T10:21:17.561Z caller=main.go:520 level=info build_context="(go=go1.17.5, user=root@54b6dbd48b97, date=20211217-22:08:06)"
ts=2023-12-25T10:21:17.561Z caller=main.go:521 level=info host_details="(Linux 5.15.0-56-generic #62~20.04.1-Ubuntu SMP Tue Nov 22 21:24:20 UTC 2022 x86_64 776772d69b20 (none))"
ts=2023-12-25T10:21:17.561Z caller=main.go:522 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2023-12-25T10:21:17.561Z caller=main.go:523 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2023-12-25T10:21:17.562Z caller=web.go:570 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
ts=2023-12-25T10:21:17.562Z caller=main.go:924 level=info msg="Starting TSDB ..."
ts=2023-12-25T10:21:17.562Z caller=tls_config.go:195 level=info component=web msg="TLS is disabled." http2=false
ts=2023-12-25T10:21:17.564Z caller=head.go:488 level=info component=tsdb msg="Replaying on-disk memory mappable chunks if any"
ts=2023-12-25T10:21:17.564Z caller=head.go:522 level=info component=tsdb msg="On-disk memory mappable chunks replay completed" duration=1.305µs
ts=2023-12-25T10:21:17.564Z caller=head.go:528 level=info component=tsdb msg="Replaying WAL, this may take a while"
ts=2023-12-25T10:21:17.564Z caller=head.go:599 level=info component=tsdb msg="WAL segment loaded" segment=0 maxSegment=1
ts=2023-12-25T10:21:17.564Z caller=head.go:599 level=info component=tsdb msg="WAL segment loaded" segment=1 maxSegment=1
ts=2023-12-25T10:21:17.564Z caller=head.go:605 level=info component=tsdb msg="WAL replay completed" checkpoint_replay_duration=14.305µs wal_replay_duration=301.534µs total_replay_duration=327.342µs
ts=2023-12-25T10:21:17.565Z caller=main.go:945 level=info fs_type=EXT4_SUPER_MAGIC
ts=2023-12-25T10:21:17.565Z caller=main.go:948 level=info msg="TSDB started"
ts=2023-12-25T10:21:17.565Z caller=main.go:1129 level=info msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
ts=2023-12-25T10:21:17.565Z caller=main.go:1166 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml totalDuration=217.62µs db_storage=666666ns remote_storage=860ns web_handler=182ns query_engine=371ns scrape=90.382µs scrape_sd=10.238µs notify=450ns notify_sd=788ns rules=737ns
ts=2023-12-25T10:21:17.565Z caller=main.go:897 level=info msg="Server is ready to receive web requests."
Logs matter! Logs matter! Logs matter!
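When the log gets long, it helps to look only at the lines that signal trouble. Prometheus writes one key=value record per line with a level field, so a small filter is enough. The following is a minimal sketch; `log_problems` is a hypothetical helper name, and `prometheus` is the container_name set in the compose file.

```shell
# Keep only warning and error lines from Prometheus' log output (read from stdin).
# "|| true" makes the helper exit 0 even when nothing matches.
log_problems() {
    grep -E 'level=(warn|error)' || true
}

# Usage: the container's stderr is replayed on stderr by docker logs,
# so merge it into stdout before piping.
#   docker logs prometheus 2>&1 | log_problems
```

An empty result after startup means every line was logged at info or debug level, as in the output shown above.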