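Requirements:
Download the Prometheus package that matches your operating system.
Download the Alertmanager package that matches the operating system of the Prometheus server.
Grafana, for dashboards.
Common exporters: node_exporter for Linux, windows_exporter for Windows (download links).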
Ports
Tool | Port |
---|---|
Prometheus | 9090 |
Grafana | 3000 |
windows_exporter | 9182 |
node_exporter | 9100 |
postgres_exporter | 9187 |
SMTP (email) | 25 for QQ Mail, 465 for Alibaba Mail |
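Before going further, it can help to confirm which of these ports actually accept connections. The following is a minimal sketch, not part of the original setup: the port numbers come from the table above, while the function names and the `localhost` host are illustrative assumptions.

```python
import socket

# Port numbers taken from the table above; adjust the host to your own machines.
DEFAULT_PORTS = {
    "prometheus": 9090,
    "grafana": 3000,
    "windows_exporter": 9182,
    "node_exporter": 9100,
    "postgres_exporter": 9187,
}


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_services(host: str, ports: dict) -> dict:
    """Map each service name to whether its port accepts connections on host."""
    return {name: is_port_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, reachable in check_services("localhost", DEFAULT_PORTS).items():
        print(f"{name}: {'open' if reachable else 'closed'}")
```

A closed port only means nothing is listening yet, which is expected until the corresponding tool has been started.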
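1. Start with the Prometheus architecture diagram.
Reference: Prometheus architecture overview
Once the architecture makes sense, we can start building the system.
2. After downloading the Prometheus package, double-click prometheus.exe and open http://localhost:9090. If the page loads, Prometheus is running.
At this point Prometheus has no monitoring targets configured. Next we add two of them.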
OS | IP | Role |
---|---|---|
Linux | 192.36.168.1 | monitored client |
Windows | 192.36.168.2 | monitored client |
Windows | 192.36.168.3 | Prometheus server |
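3. On each client, download the exporter that matches its operating system.
For example, on Linux:
ubuntu@ip-192.36.168.1:~$ wget https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-amd64.tar.gz  # download the archive
ubuntu@ip-192.36.168.1:~$ tar xvfz node_exporter-*.*-amd64.tar.gz  # unpack the archive
The unit file below expects the binary at /usr/local/bin/node_exporter/node_exporter, so copy the unpacked node_exporter binary there first.
# Register node_exporter as a systemd service so that it starts at boot
ubuntu@ip-192.36.168.1:~$ cd /etc/systemd/system/  # the unit file goes in this directory
ubuntu@ip-192.36.168.1:~$ sudo vi node_exporter.service
# Enter the following content
[Unit]
Description=node_exporter
After=network.target

[Service]
User=root
ExecStart=/usr/local/bin/node_exporter/node_exporter

[Install]
WantedBy=multi-user.target
ubuntu@ip-192.36.168.1:~$ sudo systemctl daemon-reload  # make systemd pick up the new unit file
ubuntu@ip-192.36.168.1:~$ sudo systemctl enable node_exporter  # start the service at boot
ubuntu@ip-192.36.168.1:~$ sudo systemctl start node_exporter  # start the service now
On Windows, running the downloaded package is enough to start the exporter.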
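Each exporter serves its metrics as plain text on `/metrics`, which is what Prometheus scrapes. As a rough sketch of what that text looks like to a consumer, the parser below reads it into a dictionary. It is deliberately simplified (it assumes label values contain no spaces and ignores optional timestamps), and the function names are illustrative, not part of any exporter.

```python
import urllib.request


def parse_metrics(text: str) -> dict:
    """Parse Prometheus text exposition lines into {series: value}.

    Simplified on purpose: comment lines are skipped, optional timestamps
    are ignored, and label values are assumed to contain no spaces.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        parts = line.split()
        if len(parts) < 2:
            continue
        series, value = parts[0], parts[1]
        try:
            samples[series] = float(value)
        except ValueError:
            continue  # not a sample line this simple parser understands
    return samples


def fetch_metrics(url: str, timeout: float = 5.0) -> dict:
    """Download and parse a /metrics page, e.g. http://192.36.168.1:9100/metrics."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

If `fetch_metrics` returns a non-empty dictionary for a client, the exporter on that client is up and reachable from the machine running the script.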
4. With the clients in place, edit prometheus.yml on the Prometheus server so that it scrapes their metrics. (This setup uses Prometheus's file-based service discovery.)
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
# Alert rule file
rule_files:
  - "node-up.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']

  # A second job that keeps reloading its targets from the file listed below
  - job_name: 'dynami_service'
    file_sd_configs:
      - files: ['D:\监控系统\prometheus-2.21.0-rc.0.windows-amd64\conf\exporter.yml']
        # How often the file is re-read
        refresh_interval: 10s
The exporter.yml file that lists the monitored clients:
- targets: ['192.36.168.2:9182']
  labels: # everything under labels is a custom label
    app: 'local windows(1)'
    env: 'windows Service'
    # region: 'us-west-2'
- targets: ['192.36.168.1:9100']
  labels:
    app: 'example-linux'
    env: 'Linux'
    # region: 'ap-southeast-1'
The same kind of target list can also be written as JSON (exporter.json):
[
{
"targets": [
"monitor.gimyingao:9100"
],
"labels": {
"app":"ubuntu@52.83.68.66-Linux",
"hostname": "test1",
"env":"Linux_Service"
}
},
{
"targets": [
"52.82.5.91:9187"
],
"labels": {
"hostname": "test2",
"app":"ubuntu@52.82.5.91-pgsql",
"env":"pgsql_Service"
}
},
{
"targets": [
"monitor.gimyingao:9121"
],
"labels":{
"app":"redis_exporter-Linux",
"hostname":"test3",
"env":"redis"
}
},
{
"targets":[
"173.0.1.98:9100"
],
"labels":{
"app":"@ubuntu-173.0.1.98-Linux",
"hostname":"test4",
"env":"Linux-sercice-app1-server"
}
},
{
"targets":[
"173.0.1.83:9100"
],
"labels":{
"app":"@ubuntu-173.0.1.83-Linux",
"hostname":"test5",
"env":"Linux-sercice-app2-server"
}
}
]
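When the target list grows, writing the JSON by hand gets error-prone. The sketch below generates a target file with the same shape as the one above. The helper names are illustrative assumptions; the file is written to a temporary name and then renamed, so that Prometheus, which re-reads the file on a schedule, is unlikely to see a half-written version.

```python
import json
import os
import tempfile


def build_target_group(address: str, app: str, hostname: str, env: str) -> dict:
    """Build one entry in the same shape as the exporter.json entries above."""
    return {
        "targets": [address],
        "labels": {"app": app, "hostname": hostname, "env": env},
    }


def write_targets(path: str, groups: list) -> None:
    """Write the groups as JSON via a temp file followed by a rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(groups, handle, indent=2, ensure_ascii=False)
    # os.replace swaps the finished file into place in a single step.
    os.replace(tmp_path, path)


if __name__ == "__main__":
    groups = [
        build_target_group("192.36.168.1:9100", "example-linux", "test1", "Linux"),
        build_target_group("192.36.168.2:9182", "example-windows", "test2", "Windows"),
    ]
    print(json.dumps(groups, indent=2))
```

To use the generated file, point the `files` entry of the `file_sd_configs` job at it; the file name should end in `.json`.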
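5. Prometheus now starts collecting metrics from the monitored targets. The Targets page of the Prometheus web UI shows the state of each one.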
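The same information is available from the Prometheus HTTP API at `/api/v1/targets`. The following sketch, which assumes Prometheus is listening on `http://localhost:9090` and uses illustrative function names, reduces that response to a map from scrape URL to health.

```python
import json
import urllib.request


def summarize_targets(payload: dict) -> dict:
    """Map each active target's scrape URL to its health ('up', 'down', 'unknown')."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return {target["scrapeUrl"]: target["health"] for target in targets}


def fetch_targets(base_url: str = "http://localhost:9090", timeout: float = 5.0) -> dict:
    """Query a running Prometheus server and summarize its active targets."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return summarize_targets(json.load(resp))
```

A target that stays `down` usually points to a firewall, a wrong port in the target file, or an exporter that is not running.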
6. Now that metrics are arriving from the clients, the setup needs alerting to be complete. Next we configure Alertmanager.
The alertmanager.yml file:
global:
  resolve_timeout: 5m
  # smtp_from: '{{ template "email.from" . }}'
  smtp_from: 'haoyacong@gimmake.com'
  smtp_smarthost: 'smtp.mxhichina.com:465'
  # smtp_auth_username: '{{ template "email.from" . }}'
  smtp_auth_username: 'haoyacong@gimmake.com'
  # Use your own SMTP password or authorization code here and keep it out of published configs
  smtp_auth_password: '<your-smtp-password>'
  smtp_require_tls: false
  # smtp_hello: 'mailsso.mxhichina'
templates:
  # Custom email template
  - 'email.tmpl'
route:
  group_by: ['alertname']
  group_wait: 15s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: '2716966498@qq.com, weilina@gimmake.com'
        # - to: '{{ template "email.to" . }}'
        html: '{{ template "email.to.html" . }}'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
The email.tmpl file:
{{ define "email.from" }}2716966498@qq.com{{ end }}
{{ define "email.to" }}2716966498@qq.com{{ end }}
{{ define "email.to.html" }}
{{ range .Alerts }}
=========start==========<br>
Alert source: prometheus_alert <br>
Severity: {{ .Labels.severity }} <br>
Alert name: {{ .Labels.alertname }} <br>
Instance: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Description: {{ .Annotations.description }} <br>
=========end==========<br>
{{ end }}
{{ end }}
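To check the mail route and the template without waiting for a real incident, one option is to post a synthetic alert straight to Alertmanager's `/api/v2/alerts` endpoint. This is a sketch under assumptions: Alertmanager is taken to be on `http://localhost:9093`, and the alert name `manual_test` and the function names are made up for illustration.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(instance: str, minutes: int = 5) -> list:
    """Build a one-element alert list carrying the labels the email template reads."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {
            "alertname": "manual_test",  # hypothetical name, used only for this test
            "severity": "warning",
            "instance": instance,
        },
        "annotations": {
            "summary": f"Test alert for {instance}",
            "description": "Synthetic alert used to check the email route.",
        },
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def post_alerts(alerts: list, base_url: str = "http://localhost:9093",
                timeout: float = 5.0) -> int:
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

Calling `post_alerts(build_test_alert("192.36.168.1:9100"))` should lead to an email once the route's `group_wait` has passed, assuming the SMTP settings are valid.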
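With the sender and recipients in place, we now define the rules that trigger alerts.
The alert rule file (commonly used rules, based on the official ones):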
groups:
  - name: node-up.yml
    rules:
      - alert: Linux_memory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: host is low on memory (less than 10% available)" # summaries are free text
      - alert: Prometheus_task_down
        expr: absent(up{job="dynami_service"}) # use a job name defined in prometheus.yml
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus job {{ $labels.job }} has no targets reporting"
      - alert: PrometheusNotConnectedToAlertmanager
        expr: prometheus_notifications_alertmanagers_discovered < 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "The Prometheus server cannot connect to Alertmanager"
      - alert: PrometheusConfigurationReloadFailure
        expr: prometheus_config_last_reload_successful != 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "The Prometheus server failed to reload its configuration"
      - alert: service_down
        expr: up == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: the target exporter is down"
      - alert: Linux_internal
        expr: rate(node_vmstat_pgmajfault[1m]) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: host is under heavy memory pressure"
      - alert: Linux_pull_datas
        expr: sum by (instance) (irate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: network interfaces may be receiving too much data (> 100 MB/s)"
      - alert: Linux_push_datas
        expr: sum by (instance) (irate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: network interfaces may be sending too much data (> 100 MB/s)"
      # - alert: windows_exporter_down
      #   expr: windows_exporter_collector_success == 0
      #   for: 5m
      #   labels:
      #     severity: warning
      #   annotations:
      #     summary: "Instance {{ $labels.instance }}: an exporter collector has stopped"
      - alert: windows_cpu
        expr: 100 - (avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[2m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: CPU usage is above 80%"
      - alert: windows_internal
        expr: 100 - 100 * windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: memory usage is above 90%"
      - alert: windows_panl
        expr: 100 - 100 * (windows_logical_disk_free_bytes / windows_logical_disk_size_bytes) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: disk usage on {{ $labels.volume }} is above 80%"
Note: indent YAML configuration files with spaces, never with tabs.
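A stray tab is easy to miss in an editor. As a small sketch (the function name is an illustrative assumption), the check below reports every line of a config file whose indentation contains a tab.

```python
def find_tab_indented_lines(text: str) -> list:
    """Return the 1-based numbers of lines whose leading whitespace contains a tab."""
    bad_lines = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Everything before the first non-whitespace character is the indentation.
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            bad_lines.append(number)
    return bad_lines


if __name__ == "__main__":
    example = "global:\n\tscrape_interval: 15s\n  evaluation_interval: 15s\n"
    print(find_tab_indented_lines(example))
```

Running it over prometheus.yml, alertmanager.yml, and the rule file before starting the services catches the problem early. Tabs inside quoted values are left alone, since only indentation is inspected.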
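7. Alerting and the server are now configured. Prometheus has built-in graphs, but they are fairly plain, so we use Grafana for dashboards.
Run grafana-server.exe, then open http://localhost:3000 on the same machine.
The default username and password are both admin.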
8. Add Prometheus as a Grafana data source.
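The data source is normally added through the Grafana web UI, but it can also be created through Grafana's HTTP API (`POST /api/datasources`). The sketch below assumes the default admin/admin credentials are still active and that both services run on localhost; the function names are illustrative.

```python
import base64
import json
import urllib.request


def build_datasource(name: str = "Prometheus",
                     url: str = "http://localhost:9090") -> dict:
    """Describe a Prometheus data source that Grafana queries through its backend."""
    return {
        "name": name,
        "type": "prometheus",
        "url": url,
        "access": "proxy",
        "isDefault": True,
    }


def create_datasource(payload: dict, grafana_url: str = "http://localhost:3000",
                      user: str = "admin", password: str = "admin",
                      timeout: float = 5.0) -> int:
    """POST the data source to Grafana with basic auth and return the HTTP status."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

If the admin password has already been changed, pass the new credentials to `create_datasource` instead of relying on the defaults.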
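9. With the data source in place, add dashboard templates for the metrics of each environment:
A dashboard template for node_exporter metrics
A dashboard template for windows_exporter metrics
Suitable templates can be found on the official Grafana site.
Then import the chosen template into Grafana.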