Prometheus: An Introduction

Author: xinyu_he
Published: April 1, 2025
Category: prometheus

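# Introduction to Prometheus

Prometheus is an open-source combination of monitoring, alerting, and a time series database, written in Go. SoundCloud began developing it in 2012. Prometheus joined the CNCF (Cloud Native Computing Foundation) in 2016, and on August 9, 2018 it became the second project to graduate from the CNCF, after Kubernetes. It is widely used with containers and microservices. Its main characteristics are:

* Data is stored in a multi-dimensional model: each time series is identified by a metric name plus key-value labels, so the same metric can be viewed from many angles.
* Data is not kept in a traditional database such as MySQL but in a time series database; Prometheus currently uses its own TSDB.
* Third-party dashboards such as Grafana (version 2.5.0 and later) provide richer visualization.
* Components are modular.
* No dependency on external storage: data can be stored locally or written to remote storage.
* Each sample takes only about 3.5 bytes on average, and a single Prometheus server can handle millions of metrics.
* Service discovery is supported (for example, dynamically discovering monitored targets through Consul).
* A powerful query language, PromQL (Prometheus Query Language).
* Arithmetic can be applied directly to the data.
* Easy to scale horizontally.
* Many official and third-party exporters collect metrics from different systems.

CNCF graduated projects: [https://www.cncf.io/projects](https://www.cncf.io/projects)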

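# Prometheus architecture

* Prometheus server: the main service. It accepts external HTTP requests and collects, stores, and queries metric data.
* Prometheus targets: statically configured targets from which metrics are scraped.
* Service discovery: dynamically discovers targets, which are then scraped.
* Prometheus alerting: calls the Alertmanager component, which delivers the alert notifications.
* Pushgateway: a metrics collection proxy (similar to a Zabbix proxy, except that clients must actively push data to the Pushgateway).
* Data visualization and export: visualization and data export through a browser or other clients.

[![Prometheus architecture diagram](https://shackles.cn/Learning_pictures/Prometheus/Prometheus-JGT.jpg "Prometheus architecture diagram")](https://shackles.cn/Learning_pictures/Prometheus/Prometheus-JGT.jpg)

Prometheus architecture diagram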

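# Data collection flow and TSDB overview

## Prometheus data collection flow

* Targets are obtained from static configuration files or from dynamic discovery.
* Prometheus sends an HTTP/HTTPS request to each target URL.
* The target accepts the request and returns its metric data.
* The Prometheus server receives and stores the data and checks it against the alerting rules. If a rule is triggered, the alert action is carried out as well; otherwise the data is only stored.
* Grafana visualizes the data.

[![Data collection flow](https://oss.shackles.cn/Prometheus/data_metrics_pull.jpg "Data collection flow")](https://oss.shackles.cn/Prometheus/data_metrics_pull.jpg)

Data collection flow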
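The flow above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `fake_fetch` is a hypothetical stand-in for the HTTP request to a target, the `HighLoad` rule and its threshold are made up, and a real Prometheus server scrapes and evaluates rules on separate, configurable intervals.

```python
def scrape_cycle(targets, fetch, storage, rules, now):
    """Scrape every target once, store the samples, and return the alerts that fire."""
    for target in targets:
        # Request the target and read the metrics it returns.
        for name, value in fetch(target).items():
            # Store each sample under (metric name, target).
            storage.setdefault((name, target), []).append((now, value))
    firing = []
    # Compare the latest stored sample of each series with the alerting rules.
    for alert, (metric, threshold) in rules.items():
        for (name, target), points in storage.items():
            if name == metric and points[-1][1] > threshold:
                firing.append((alert, target))
    return firing


def fake_fetch(target):
    """Hypothetical stand-in for an HTTP GET of http://<target>/metrics."""
    return {"node_load1": 5.2 if target == "10.2.0.21:9100" else 0.3}


storage = {}
targets = ["10.2.0.21:9100", "10.2.0.22:9100"]   # a static target list
rules = {"HighLoad": ("node_load1", 4.0)}        # made-up rule: load above 4
firing = scrape_cycle(targets, fake_fetch, storage, rules, now=1697699171)
print(firing)
```

Every sample is stored whether or not a rule fires; the rule check only decides whether an alert is raised in addition.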

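## TSDB overview and characteristics

### TSDB overview

* Prometheus stores time series data very efficiently. Each sample takes only about 3.5 bytes; with a million time series, a 30-second interval, and 60 days of retention, the total is roughly 200+ GB (figures quoted from official material).
* By default, Prometheus stores collected data in its local TSDB, in the `data` directory under the Prometheus installation directory. Incoming data is first written to the WAL (write-ahead log) and held in memory; after 2 hours the in-memory data is saved to a new block, while newly collected data goes to memory and is saved to another new block 2 hours later, and so on.
* Prometheus first keeps the collected metric data in in-memory chunks; the chunk is the most basic unit of storage in Prometheus.
* Every two hours, the chunks currently in memory are saved together into one block, where the data is merged and compressed and the metadata files `index`, `meta.json`, and `tombstones` are generated.

Alibaba Cloud's commercial time series database product: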
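The sizing figures can be checked with simple arithmetic. The sketch below assumes decimal gigabytes and ignores index and WAL overhead; it is a rough estimate, not a capacity-planning tool. Note that 3.5 bytes per sample alone gives about 605 GB for this workload, so a total of around 200 GB corresponds to closer to 1.2 bytes per sample; both numbers should be read as approximate.

```python
GB = 1000 ** 3  # decimal gigabytes (an assumption for this estimate)


def estimate_bytes(series, interval_s, retention_days, bytes_per_sample):
    """Approximate on-disk size: series x samples per series x bytes per sample."""
    samples_per_series = retention_days * 86_400 // interval_s
    return series * samples_per_series * bytes_per_sample


# One million series, 30-second interval, 60 days of retention.
size_at_3_5 = estimate_bytes(1_000_000, 30, 60, 3.5)
print(f"{size_at_3_5 / GB:.1f} GB at 3.5 bytes per sample")

# Per-sample size implied by a 200 GB total for the same workload.
total_samples = 1_000_000 * (60 * 86_400 // 30)
implied = 200 * GB / total_samples
print(f"{implied:.2f} bytes per sample implied by 200 GB")
```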

```
https://www.aliyun.com/product/hitsdb
```

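[![TSDB data diagram](https://shackles.cn/Learning_pictures/Prometheus/TSDB1.jpg "TSDB data diagram")](https://shackles.cn/Learning_pictures/Prometheus/TSDB1.jpg)

TSDB data diagram

### TSDB characteristics

* TSDB stands for Time Series Database: a database that stores time series data.
* Time series data is immutable, unique, and ordered by time.
* Continuous periodic writes with high concurrent throughput: at every interval, metric data from thousands of nodes is written.
* Write-heavy, read-light: Prometheus collects hundreds of thousands of metrics or more every 15 seconds, but usually only the most recent, important metrics are viewed.
* Time-ordered: each batch of collected metric data is appended at the current time and never overwrites historical data.
* Large volume: historical data can reach hundreds of GB, hundreds of TB, or more.
* Retention: only data from a recent period is kept; data past the retention period is deleted.
* Clear hot/cold split: usually only recent hot data is viewed, and older cold data is rarely read.

### TSDB block characteristics

Prometheus compacts and merges historical blocks and deletes expired ones, so the number of blocks shrinks as compaction proceeds. Compaction involves three things: it runs periodically, it merges small blocks into larger ones, and it cleans up expired blocks. Each block consists of 4 parts: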

```
tree /apps/prometheus/data/01FQNCYZ0BPFA8AQDDZM1C5PRN/
/apps/prometheus/data/01FQNCYZ0BPFA8AQDDZM1C5PRN/
├── chunks
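│ └── 000001    # chunk data files; each is capped at 512MB, and anything larger is split into several files
├── index       # index file: records index information for the stored data; tables inside the file are used to locate time series
├── meta.json   # block metadata: sample count, time range of the collected data, compaction history
└── tombstones  # deletion markers: records deletions and data marked for deletion, so those samples can be excluded when the block is queried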
```

### TSDB block overview

Each block is a directory under the data directory whose name begins with `01`, as shown below:

```
ls -l /apps/prometheus/data/
total 4
drwxr-xr-x 3 root root    68 Oct 10 19:01 01HCCKYCZXW40V7KQP295KK2TD #block
drwxr-xr-x 3 root root    68 Oct 13 01:02 01HCJDAH1WM0EQGA5H0Q9FYANY #block
drwxr-xr-x 3 root root    68 Oct 15 07:02 01HCR6PW52WZ45K8YF4XWCFFPA #block
```
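The directory names are ULIDs, which is why they all begin with `01`: the first 10 characters encode a millisecond timestamp in Crockford base32. The sketch below decodes that timestamp for the three directories listed above; it is a minimal decoder written for illustration, not code taken from Prometheus.

```python
from datetime import datetime, timezone

# Crockford base32 alphabet used by ULIDs (no I, L, O, or U).
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def ulid_time(ulid):
    """Decode the 48-bit millisecond timestamp held in the first 10 characters."""
    millis = 0
    for ch in ulid[:10]:
        millis = millis * 32 + CROCKFORD.index(ch)
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


blocks = [
    "01HCCKYCZXW40V7KQP295KK2TD",
    "01HCJDAH1WM0EQGA5H0Q9FYANY",
    "01HCR6PW52WZ45K8YF4XWCFFPA",
]
for name in blocks:
    print(name, ulid_time(name).isoformat())
```

Because the timestamp comes first, sorting block directories by name also sorts them by creation time.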

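[![TSDB storage directory](https://shackles.cn/Learning_pictures/Prometheus/TSDB2.jpg "TSDB storage directory")](https://shackles.cn/Learning_pictures/Prometheus/TSDB2.jpg)

TSDB storage directory

# PromQL: metric data, data types, and matchers

## PromQL overview

Prometheus provides a functional expression language, PromQL (Prometheus Query Language), that lets users select and aggregate time series data in real time. The result of an expression can be shown as a graph, shown as a table in the Prometheus expression browser, or consumed by external systems through the HTTP API.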

```
https://prometheus.io/docs/prometheus/latest/querying/basics
```

[![PromQL](https://shackles.cn/Learning_pictures/Prometheus/PromQL1.jpg "PromQL")](https://shackles.cn/Learning_pictures/Prometheus/PromQL1.jpg)

PromQL

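## PromQL query data types

### Instant Vector

An instant vector is a set of time series queried from the target instances that all share the same timestamp, with a single sample per series (a time series being data stored and displayed as it changes over time). For example, node\_memory\_MemFree\_bytes, which queries the memory currently free, is an instant vector: the expression returns only the most recent sample of each series. Such an expression is called an instant vector expression.

The following instant vector expression queries the free memory of a node: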

```
root@prometheus-server:~# curl 'http://10.2.0.18:9090/api/v1/query' --data 'query=node_memory_MemFree_bytes' --data time=1697699171

{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"node_memory_MemFree_bytes","country":"中国上海","instance":"10.2.0.21:9100","job":"prometheus-ShangHai"},"value":[1697699171,"1202761728"]}]}}
```

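### Range Vector

A range vector is all the samples of a metric collected over a given time range, for example the NIC traffic trend over the last day, or a node's free memory in bytes over the last 5 minutes.

The following range vector expression queries the free memory of a node: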

```
root@prometheus-server:~# curl 'http://10.2.0.18:9090/api/v1/query' --data 'query=node_memory_MemFree_bytes{instance="10.2.0.21:9100"}[5m]' --data time=1697699171

{"status":"success","data":{"resultType":"matrix","result":[{"metric":{"__name__":"node_memory_MemFree_bytes","country":"中国上海","instance":"10.2.0.21:9100","job":"prometheus-ShangHai"},"values":[[1697698872.270,"1202761728"],[1697698887.269,"1202761728"],[1697698902.270,"1202761728"],[1697698917.269,"1202761728"],[1697698932.269,"1202761728"],[1697698947.269,"1202761728"],[1697698962.269,"1202761728"],[1697698977.269,"1202761728"],[1697698992.269,"1202761728"],[1697699007.269,"1202761728"],[1697699022.269,"1202761728"],[1697699037.269,"1202761728"],[1697699052.269,"1202761728"],[1697699067.270,"1202761728"],[1697699082.269,"1202761728"],[1697699097.270,"1202761728"],[1697699112.269,"1202761728"],[1697699127.269,"1202761728"],[1697699142.269,"1202761728"],[1697699157.270,"1202761728"]]}]}}
```

## Instant Vector vs. Range Vector

* Instant vector: each series contains a single sample.
* Range vector: each series contains a set of samples (for example, those from the last few minutes).

[![Instant vector vs. range vector](https://shackles.cn/Learning_pictures/Prometheus/Instant_Vector_VS_Range_Vector.jpg "Instant vector vs. range vector")](https://shackles.cn/Learning_pictures/Prometheus/Instant_Vector_VS_Range_Vector.jpg)

Instant vector vs. range vector
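The difference is visible in the HTTP API responses shown earlier: a `vector` result carries one `value` pair per series, while a `matrix` result carries a list of `values`. The sketch below parses trimmed copies of those two responses; the helper name `series_points` is made up for this illustration.

```python
import json

# Trimmed copies of the two API responses shown earlier.
INSTANT = '''{"status":"success","data":{"resultType":"vector","result":[
  {"metric":{"__name__":"node_memory_MemFree_bytes","instance":"10.2.0.21:9100"},
   "value":[1697699171,"1202761728"]}]}}'''
RANGE = '''{"status":"success","data":{"resultType":"matrix","result":[
  {"metric":{"__name__":"node_memory_MemFree_bytes","instance":"10.2.0.21:9100"},
   "values":[[1697698872.270,"1202761728"],[1697698887.269,"1202761728"],
             [1697698902.270,"1202761728"]]}]}}'''


def series_points(response_text):
    """Return [(labels, [(timestamp, value), ...]), ...] for a vector or matrix result."""
    data = json.loads(response_text)["data"]
    out = []
    for series in data["result"]:
        if data["resultType"] == "vector":
            raw = [series["value"]]      # exactly one sample per series
        elif data["resultType"] == "matrix":
            raw = series["values"]       # a list of samples per series
        else:
            raise ValueError(f"unsupported resultType: {data['resultType']}")
        # Sample values arrive as strings and are converted to float here.
        out.append((series["metric"], [(float(t), float(v)) for t, v in raw]))
    return out


for text in (INSTANT, RANGE):
    for labels, points in series_points(text):
        print(labels["instance"], len(points), "sample(s)")
```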

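### Scalar

A scalar is a single floating-point value. After node\_load1 returns an instant vector, the built-in Prometheus function scalar() converts that instant vector into a scalar. scalar() expects a vector with exactly one element and returns NaN otherwise, which is why the example wraps the selector in sum().

For example: scalar(sum(node\_load1))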

```
root@prometheus-server:~# curl 'http://10.2.0.18:9090/api/v1/query' --data 'query=scalar(sum(node_load1{instance="10.2.0.21:9100"}))' --data time=1697699171

{"status":"success","data":{"resultType":"scalar","result":[1697699171,"0"]}}
```

[![scalar](https://shackles.cn/Learning_pictures/Prometheus/scalar.png "scalar")](https://shackles.cn/Learning_pictures/Prometheus/scalar.png)

scalar

# Prometheus metric types

[![Prometheus_metrics](https://shackles.cn/Learning_pictures/Prometheus/Prometheus_metrics.jpg "Prometheus_metrics")](https://shackles.cn/Learning_pictures/Prometheus/Prometheus_metrics.jpg)

Prometheus_metrics

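* **Counter**: a counter is a cumulative metric that only increases as long as the process is not restarted (like an electricity or water meter). Examples: total disk I/O operations, total Nginx/API requests, total packets through a network interface.
* **Gauge**: a gauge is a metric that can change arbitrarily, going up or down at any time. Examples: bandwidth rate, CPU load, memory utilization, Nginx active connections.
* **Histogram**: a cumulative histogram. A histogram samples observations (usually request durations or response sizes) and counts them in configurable buckets. The buckets are cumulative: each bucket is identified by an upper bound in its `le` (less than or equal) label and counts every observation at or below that bound, so a larger bucket always includes everything counted by the smaller ones. The same idea expressed over time: with one data point per minute (24 hours \* 60 minutes = 1440 points a day) drawn in 2-hour steps, the bar at 02:00 covers 00:00 to 02:00, the bar at 04:00 covers 00:00 to 04:00, and the bar at 06:00 covers 00:00 to 06:00, which gives running totals since midnight (for example, HTTP success rate, packet loss, or the per-day client IP statistics in ELK). In Prometheus itself the bucket boundaries are observed values rather than times of day, and a histogram also exposes `_sum` and `_count` series.
* **Summary**: a summary is also a set of data. It reports quantiles of the observed values over a time window, by default the last 10 minutes; the window can be configured. A quantile (also called a quantile point) is a cut point that divides the observed data into consecutive intervals of equal probability. Quartiles are the common case: they split the sample into four intervals, expressed from 0 to 1 (that is, 0% to 100%): 0%\~25%, 25%\~50%, 50%\~75%, 75%\~100%. Quartiles give a quick overall picture of how the data is distributed.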
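To make the cumulative behaviour of histogram buckets concrete, here is a small sketch that counts observations the way `le` buckets do. The bucket bounds and the request durations are hypothetical values chosen for the example; this is not the client library implementation.

```python
from bisect import bisect_left


def cumulative_buckets(observations, bounds):
    """Count observations per cumulative `le` bucket, plus +Inf, sum, and count."""
    bounds = sorted(bounds)
    counts = [0] * (len(bounds) + 1)          # the last slot is le="+Inf"
    for value in observations:
        # Index of the first bucket whose upper bound is >= value.
        first = bisect_left(bounds, value)
        # That bucket and every larger one count this observation.
        for i in range(first, len(counts)):
            counts[i] += 1
    labels = [str(b) for b in bounds] + ["+Inf"]
    return dict(zip(labels, counts)), sum(observations), len(observations)


durations = [0.05, 0.2, 0.3, 0.7, 1.5]        # hypothetical request durations (seconds)
buckets, total, count = cumulative_buckets(durations, [0.1, 0.5, 1.0])
for le, n in buckets.items():
    print(f'le="{le}" {n}')
print("count", count)
```

The `+Inf` bucket always equals the total observation count, since every observation is at or below infinity.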

## node-exporter metric data format

Without labels:

```
#metric_name metric_value
# TYPE node_load15 gauge
node_load15 0.1
```

With one label:

```
#metric_name{label1_name="label1-value"} metric_value
# TYPE node_network_receive_bytes_total counter
node_network_receive_bytes_total{device="eth0"} 1.44096e+07
```

With multiple labels:

```
#metric_name{label1_name="label1-value",labelN_name="labelN-value"} metric_value
# TYPE node_filesystem_files_free gauge
node_filesystem_files_free{device="/dev/sda2",fstype="xfs",mountpoint="/boot"} 523984
```
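The sample lines above are regular enough to parse with two regular expressions. The sketch below is deliberately simplified: it handles the `name{labels} value` shape shown here and ignores comments, optional timestamps, and escape sequences inside label values, so it is not a full parser for the exposition format.

```python
import re

# metric name, optional {label block}, then the value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# one label pair: name="value"
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one sample line into (metric name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, raw_labels, value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(value)


lines = [
    'node_load15 0.1',
    'node_network_receive_bytes_total{device="eth0"} 1.44096e+07',
    'node_filesystem_files_free{device="/dev/sda2",fstype="xfs",mountpoint="/boot"} 523984',
]
for line in lines:
    print(parse_sample(line))
```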

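## PromQL query examples

* node\_memory\_MemTotal\_bytes # total memory of the nodes
* node\_memory\_MemFree\_bytes # free memory of the nodes
* node\_memory\_MemTotal\_bytes{instance="10.2.0.21:9100"} # total memory of one node, selected by label
* node\_memory\_MemFree\_bytes{instance="10.2.0.21:9100"} # free memory of one node, selected by label
* node\_disk\_io\_time\_seconds\_total{device="sda"} # cumulative seconds the given disk has spent on I/O (a counter; wrap it in rate() to get a per-second value)
* node\_filesystem\_free\_bytes{device="/dev/sda1",fstype="xfs",mountpoint="/"} # free space on the given filesystem

## Matching metric data by label

* = : selects labels that are exactly equal to the provided string (exact match).
* != : selects labels that are not equal to the provided string (negation).
* =\~ : selects labels whose value matches the provided regular expression.
* !\~ : selects labels whose value does not match the provided regular expression.

Regular expression matches are fully anchored: the pattern must match the whole label value, not just a substring.

Query format: `<metric name>{<label name>=<label value>, ...}`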

```
node_load1{instance="10.2.0.21:9100"}
node_load1{country="中国上海"}
node_load1{country="中国上海", instance="10.2.0.21:9100"} # exact match
node_load1{country="中国上海",instance!="10.2.0.21:9100"} # negation
node_load1{instance=~"10.2.0.2.*:9100$"} # regex match
node_load1{instance!~"10.2.0.21:9100"} # negated regex match
```
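The four matchers can be imitated in a few lines of Python. This is a sketch of the semantics only: Prometheus uses RE2 syntax while Python's `re` module is used here, and the helper name `matches` is made up for the example. `re.fullmatch` reproduces the fully anchored behaviour of `=~` and `!~`.

```python
import re


def matches(labels, name, op, pattern):
    """Evaluate one label matcher against a series' label set."""
    value = labels.get(name, "")   # a missing label behaves like an empty value
    if op == "=":
        return value == pattern
    if op == "!=":
        return value != pattern
    if op == "=~":
        return re.fullmatch(pattern, value) is not None
    if op == "!~":
        return re.fullmatch(pattern, value) is None
    raise ValueError(f"unknown matcher operator: {op}")


series = {"__name__": "node_load1", "country": "中国上海", "instance": "10.2.0.21:9100"}

print(matches(series, "instance", "=", "10.2.0.21:9100"))
print(matches(series, "instance", "=~", "10.2.0.2.*:9100$"))
print(matches(series, "instance", "=~", "10.2.0"))   # a bare substring does not match
```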

[![Metric_format](https://shackles.cn/Learning_pictures/Prometheus/Metric_format.png "Metric_format")](https://shackles.cn/Learning_pictures/Prometheus/Metric_format.png)

Metric_format

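# PromQL: time ranges, operators, aggregation, and examples

## Specifying a time range for metric data

* s - seconds
* m - minutes
* h - hours
* d - days
* w - weeks
* y - years

An instant vector expression selects the most recent data: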

```
node_memory_MemTotal_bytes{}
```

A range vector expression, relative to the current time: the last 5 minutes of node\_memory\_MemTotal\_bytes for all nodes:

```
node_memory_MemTotal_bytes{}[5m]
```

A range vector expression, relative to the current time: the last 5 minutes of node\_memory\_MemTotal\_bytes for one specific node:

```
node_memory_MemTotal_bytes{instance="172.31.1.181:9100"}[5m]
```

## PromQL operators

### Arithmetic on metric data

```
+ addition
- subtraction
* multiplication
/ division
% modulo
^ power (exponentiation)
```

* node\_memory\_MemFree\_bytes/1024/1024 # convert free memory from bytes to MiB
* node\_disk\_read\_bytes\_total{device="sda"} + node\_disk\_written\_bytes\_total{device="sda"} # total bytes read from and written to the disk
* (node\_disk\_read\_bytes\_total{device="sda"} + node\_disk\_written\_bytes\_total{device="sda"}) / 1024 / 1024 # the same total, converted to MiB

[![Operational_examples](https://shackles.cn/Learning_pictures/Prometheus/Operational_examples.png "Operational_examples")](https://shackles.cn/Learning_pictures/Prometheus/Operational_examples.png)

Operational_examples

### Aggregation operations on metric data

* max() # maximum
* min() # minimum
* avg() # average

#### Maximum traffic value per node

```
max(node_network_receive_bytes_total) by (instance)
```

#### Maximum traffic rate per device over the last five minutes

```
max(rate(node_network_receive_bytes_total[5m])) by (device)
```

#### sum() # sum of the sample values (total)

```
sum(prometheus_http_requests_total)
{} 2495
```

The total number of requests is 2495. sum() adds up the returned values (for example, HTTP request counts).

#### count() # number of series returned

```
count(node_os_version)
{} 3
```

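Three series are returned in total. count() can be used to count nodes, pods, and so on.

#### count\_values() # counts how many series share each value and stores the value in a custom label, which becomes a new label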

```
count_values("node_version",node_os_version) # how many nodes run each OS version
{node_version="22.04"} 3
```

#### abs() # returns the absolute value of the sample values

```
abs(sum(prometheus_http_requests_total{handler="/metrics"}))
```

#### absent() # returns nothing if the metric has data and 1 if it has none; useful for alerting on a missing metric (fire the alert when the result equals 1)

```
absent(sum(prometheus_http_requests_total{handler="/metrics"}))
```

#### stddev() # standard deviation

```
stddev(prometheus_http_requests_total) # 5+5=10 and 1+9=10 have the same sum, but the 1 and 9 pair differs far more; in a system that means the data fluctuates heavily and is unstable
```

#### stdvar() # variance

```
stdvar(prometheus_http_requests_total)
```
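The 5+5 versus 1+9 comparison can be worked through with Python's `statistics` module. PromQL's stddev() and stdvar() are population statistics, so the population variants `pstdev` and `pvariance` are used here; the two value lists are just the illustrative pairs from the comment above.

```python
from statistics import pstdev, pvariance

steady = [5, 5]   # same total, no spread
spiky = [1, 9]    # same total, large spread

for name, values in (("steady", steady), ("spiky", spiky)):
    # The sums are equal; only the spread around the mean differs.
    print(name, "sum:", sum(values),
          "stddev:", pstdev(values),
          "stdvar:", pvariance(values))
```

Both groups sum to 10, yet only the second has a non-zero deviation, which is the signal that the underlying values are unstable.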

#### topk() # the N series with the largest sample values

For example, take the 6 largest:

```
topk(6, prometheus_http_requests_total)
```

#### bottomk() # the N series with the smallest sample values

For example, take the 6 smallest:

```
bottomk(6, prometheus_http_requests_total)
```
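As a rough analogy, topk() and bottomk() behave like `heapq.nlargest` and `heapq.nsmallest` applied to the current value of each series. The handler names and counts below are hypothetical sample data, not real output.

```python
import heapq

# Hypothetical current values of prometheus_http_requests_total, keyed by handler.
requests = {
    "/metrics": 2100,
    "/api/v1/query": 310,
    "/graph": 40,
    "/federate": 33,
    "/-/ready": 12,
}


def topk(k, samples):
    """The k series with the largest values, largest first."""
    return heapq.nlargest(k, samples.items(), key=lambda kv: kv[1])


def bottomk(k, samples):
    """The k series with the smallest values, smallest first."""
    return heapq.nsmallest(k, samples.items(), key=lambda kv: kv[1])


print(topk(2, requests))
print(bottomk(2, requests))
```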

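#### rate()

rate() is designed for the counter metric type. It uses the samples in the given time range to compute the average per-second rate of increase over that range, which makes it a good fit for data that changes relatively smoothly.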

```
rate(prometheus_http_requests_total[5m])
rate(apiserver_request_total{code=~"^(?:2..)$"}[5m])
rate(node_network_receive_bytes_total[5m])
```

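#### irate()

irate() is also designed for the counter metric type. It computes the rate from only the last two samples in the given time range, so it follows sharp changes closely and suits data that changes quickly. As the official documentation puts it, irate suits fast-moving counters, while rate suits slow-moving counters.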

```
irate(prometheus_http_requests_total[5m])
irate(node_network_receive_bytes_total[5m])
irate(apiserver_request_total{code=~"^(?:2..)$"}[5m])
```
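The difference between the two functions is easiest to see on a counter that is steady and then bursts. The sketch below is a simplification under stated assumptions: it treats any drop as a counter reset, and it skips the extrapolation to the window boundaries that the real rate() performs. The sample points are hypothetical.

```python
def increase(points):
    """Total increase across (timestamp, value) points; a drop counts as a counter reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(points, points[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def simple_rate(points):
    """Average per-second increase over the whole window."""
    return increase(points) / (points[-1][0] - points[0][0])


def simple_irate(points):
    """Per-second increase between the last two samples only."""
    return simple_rate(points[-2:])


# Hypothetical counter scraped every 15 s: steady growth, then a burst at the end.
points = [(0, 100), (15, 130), (30, 160), (45, 190), (60, 490)]
print("rate :", simple_rate(points))
print("irate:", simple_irate(points))
```

The averaged rate smooths the burst across the full minute, while the instant rate reports only the final 15-second jump.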

#### by

Keeps only the labels listed in by in the result and removes all others.

```
sum(rate(node_network_receive_packets_total{instance=~".*"}[10m])) by (instance)
sum(node_memory_MemFree_bytes) by (instance)
```

without removes the listed labels (here instance and job) from the result and keeps all other labels:

```
sum(prometheus_http_requests_total) without (instance,job)
```
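Grouping with by and without can be imitated with a dictionary keyed on the labels that survive. The series below are hypothetical and the helper names are made up for the sketch; it illustrates the grouping rule only, not the PromQL engine.

```python
from collections import defaultdict

# Hypothetical series: (labels, value)
series = [
    ({"instance": "10.2.0.21:9100", "job": "node", "device": "eth0"}, 10.0),
    ({"instance": "10.2.0.21:9100", "job": "node", "device": "lo"}, 20.0),
    ({"instance": "10.2.0.22:9100", "job": "node", "device": "eth0"}, 5.0),
]


def sum_by(samples, keep):
    """sum(...) by (keep): group on the listed labels only."""
    groups = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels.get(name, "")) for name in keep)
        groups[key] += value
    return dict(groups)


def sum_without(samples, drop):
    """sum(...) without (drop): group on every label except the listed ones."""
    groups = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        groups[key] += value
    return dict(groups)


print(sum_by(series, ["instance"]))
print(sum_without(series, ["instance", "job"]))
```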

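# Prometheus Pushgateway

## Pushgateway overview

* The Pushgateway is used to collect metrics from temporary (short-lived) sources.
* The Pushgateway does not pull data from clients; clients must actively push their data to it.
* The Pushgateway can run on its own node. Custom monitoring scripts push the data to be monitored to the Pushgateway API, and the Pushgateway then waits for the Prometheus server to scrape it. In other words, the Pushgateway cannot collect monitoring data itself; it only waits passively for clients to push.
* `--persistence.file=""`: the file in which pushed data is saved; by default data is kept only in memory.
* `--persistence.interval=5m`: the interval at which data is written to the persistence file.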

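## Pushing a single metric from a client, and the Pushgateway data flow

To push data to the Pushgateway manually, use its standard API. The default URL is: `http://<ip>:9091/metrics/job/<JOBNAME>{/<LABEL_NAME>/<LABEL_VALUE>}`

`<JOBNAME>` is required and is the name of the job. Any number of label pairs can follow it; usually an `instance/<INSTANCE_NAME>` label is added so that it is easy to tell which node produced each metric.
The following pushes a metric named mytest\_metric with the value 2088 under the job name mytest\_job: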

```
echo "mytest_metric 2088" | curl --data-binary @- http://10.2.0.24:9091/metrics/job/mytest_job
```
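The URL layout described above can also be built programmatically. The sketch below only constructs the URL and the request body and sends nothing; the helper names are made up, and label values that contain a slash are not handled. Posting the body with an HTTP client would be the equivalent of the curl command.

```python
from urllib.parse import quote


def push_url(base, job, labels=None):
    """Build <base>/metrics/job/<JOBNAME>{/<LABEL_NAME>/<LABEL_VALUE>}."""
    parts = [base.rstrip("/"), "metrics", "job", quote(job, safe="")]
    for name, value in (labels or {}).items():
        parts += [quote(name, safe=""), quote(value, safe="")]
    return "/".join(parts)


def push_body(samples):
    """Render {metric name: value} as sample lines; the body ends with a newline."""
    return "".join(f"{name} {value}\n" for name, value in samples.items())


url = push_url("http://10.2.0.24:9091", "mytest_job")
url_with_instance = push_url("http://10.2.0.24:9091", "mytest_job",
                             {"instance": "10.2.0.24"})
body = push_body({"mytest_metric": 2088})
print(url)
print(url_with_instance)
print(body, end="")
```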

[![Pushgateway_flowchart](https://shackles.cn/Learning_pictures/Prometheus/Pushgateway_flowchart.jpg "Pushgateway_flowchart")](https://shackles.cn/Learning_pictures/Prometheus/Pushgateway_flowchart.jpg)

Pushgateway_flowchart

## Deploying the Pushgateway

```
root@prometheus-pushgateway:/apps# tar xvf pushgateway-1.6.2.linux-amd64.tar.gz
root@prometheus-pushgateway:/apps# ln -sv /apps/pushgateway-1.6.2.linux-amd64 /apps/pushgateway
root@prometheus-pushgateway:/apps# cat /etc/systemd/system/pushgateway.service
[Unit]
Description=Prometheus pushgateway
After=network.target

[Service]
ExecStart=/apps/pushgateway/pushgateway

[Install]
WantedBy=multi-user.target

root@prometheus-pushgateway:/apps/pushgateway# systemctl daemon-reload && systemctl start pushgateway && systemctl enable pushgateway
```

## Verifying the Pushgateway

The Pushgateway listens on port 9091 by default and exposes its metrics for scraping at `http://10.2.0.24:9091/metrics`.

[![pushgateway_ui](https://shackles.cn/Learning_pictures/Prometheus/pushgateway_ui.png "pushgateway_ui")](https://shackles.cn/Learning_pictures/Prometheus/pushgateway_ui.png)

pushgateway_ui

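Besides the metrics we pushed manually, the Pushgateway attaches two more metrics to each pushed group: push\_time\_seconds and push\_failure\_time\_seconds. Both are generated automatically by the Pushgateway and record the time of the last successful push and the last failed push, respectively.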
[![push_time_seconds&push_failure_time_seconds](https://shackles.cn/Learning_pictures/Prometheus/push_time_seconds&push_failure_time_seconds.png "push_time_seconds&push_failure_time_seconds")](https://shackles.cn/Learning_pictures/Prometheus/push_time_seconds&push_failure_time_seconds.png)

push_time_seconds&push_failure_time_seconds

## Configuring the Prometheus server to scrape the Pushgateway

```
root@prometheus-server:/apps/prometheus# vim prometheus.yml
- job_name: 'prometheus-pushgateway'
  scrape_interval: 5s
  honor_labels: true
  static_configs:
    - targets: ['10.2.0.24:9091']
root@prometheus-server1:/apps/prometheus# systemctl restart prometheus.service
```

## Verifying the metrics on the Prometheus server

[![pushgateway_data](https://shackles.cn/Learning_pictures/Prometheus/pushgateway_data.png "pushgateway_data")](https://shackles.cn/Learning_pictures/Prometheus/pushgateway_data.png)

pushgateway_data

## Pushing multiple metrics from a client: method 1

```
root@prometheus-node1:~# cat <<EOF | curl --data-binary @- http://10.2.0.24:9091/metrics/job/test_job/instance/10.2.0.24
# TYPE node_memory_usage gauge
node_memory_usage 4311744512
# TYPE node_memory_total gauge
node_memory_total 103481868288
EOF
```

## Pushing multiple metrics from a client: method 2

Collect and push the data with a custom script:

```
root@prometheus-node1:~# cat memory_monitor.sh
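#!/bin/bash
# Collect total and used memory (in KiB, as reported by free) and push both to the Pushgateway.
total_memory=$(free | awk '/Mem/{print $2}')
used_memory=$(free | awk '/Mem/{print $3}')
job_name="custom_memory_monitor"
# Use this host's eth0 IPv4 address as the instance label.
instance_name=$(ifconfig eth0 | grep -w inet | awk '{print $2}')
pushgateway_server="http://10.2.0.24:9091/metrics/job"
# Send both samples in a single request under job=<job_name>, instance=<instance_name>.
cat <<EOF | curl --data-binary @- "${pushgateway_server}/${job_name}/instance/${instance_name}"
# TYPE custom_memory_total gauge
custom_memory_total $total_memory
# TYPE custom_memory_used gauge
custom_memory_used $used_memory
EOF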
```

Run the script on different hosts to verify that the metrics are collected and pushed:

```
root@prometheus-node1:~# bash memory_monitor.sh
root@prometheus-node2:~# bash memory_monitor.sh
```

Verify that the Prometheus server can scrape the data from the Pushgateway:

[![pushgateway_data](https://shackles.cn/Learning_pictures/Prometheus/pushgateway_data.png "pushgateway_data")](https://shackles.cn/Learning_pictures/Prometheus/pushgateway_data.png)

pushgateway_data

## Deleting metrics from the Pushgateway

1. Delete through the API:

```
root@prometheus-node2:~# curl -X DELETE http://10.2.0.24:9091/metrics/job/custom_memory_monitor/instance/10.2.0.24
```

2. Delete through the web console:
[![delete_pushgateway](https://shackles.cn/Learning_pictures/Prometheus/delete_pushgateway.png "delete_pushgateway")](https://shackles.cn/Learning_pictures/Prometheus/delete_pushgateway.png)

delete_pushgateway

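# Prometheus Federation (federated cluster)

10.2.0.18 collects data from node 10.2.0.21 (ShangHai), 10.2.0.19 collects data from node 10.2.0.22 (BeiJing), and 10.2.0.20 collects data from node 10.2.0.23 (ShenZhen). 10.2.0.17 uses federation (`/federate`) to scrape the metrics that those three servers have collected, that is, the metrics of the ShangHai, BeiJing, and ShenZhen nodes.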
[![Federation](https://shackles.cn/Learning_pictures/Prometheus/Federation.png "Federation")](https://shackles.cn/Learning_pictures/Prometheus/Federation.png)

Federation

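## Deploying Prometheus Server and node\_exporter

These steps are covered in the binary installation walkthrough and are not repeated here.

## Configuring the federation server (10.2.0.17) to collect node-exporter metrics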

```
- job_name: 'prometheus-federate-2.0.18'
  scrape_interval: 10s
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job="prometheus-ShangHai"}'
      - '{__name__=~"job:.*"}'
      - '{__name__=~"node.*"}'
  static_configs:
    - targets:
        - '10.2.0.18:9090'
- job_name: 'prometheus-federate-2.0.19'
  scrape_interval: 10s
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job="prometheus-BeiJing"}'
      - '{__name__=~"job:.*"}'
      - '{__name__=~"node.*"}'
  static_configs:
    - targets:
        - '10.2.0.19:9090'
- job_name: 'prometheus-federate-2.0.20'
  scrape_interval: 10s
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job="prometheus-ShenZhen"}'
      - '{__name__=~"job:.*"}'
      - '{__name__=~"node.*"}'
  static_configs:
    - targets:
        - '10.2.0.20:9090'
root@prometheus-server3:/apps/prometheus# systemctl restart prometheus.service
```
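Each `match[]` entry in the configuration ends up as a repeated query parameter on the `/federate` request. The sketch below builds such a URL with the standard library so the encoding is visible; it only constructs and decodes the URL, and the helper name is made up for the illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(server, matchers):
    """Build http://<server>/federate with one match[] parameter per selector."""
    query = urlencode([("match[]", m) for m in matchers])
    return f"http://{server}/federate?{query}"


matchers = [
    '{job="prometheus-ShangHai"}',
    '{__name__=~"job:.*"}',
    '{__name__=~"node.*"}',
]
url = federate_url("10.2.0.18:9090", matchers)
print(url)

# Decoding the query string gives the original selectors back.
decoded = parse_qs(urlsplit(url).query)["match[]"]
print(decoded)
```

Opening a URL of this form against one of the lower-level servers is a quick way to check which series a given set of selectors would federate.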

## Verifying the Prometheus targets status

[![federate_targets](https://shackles.cn/Learning_pictures/Prometheus/federate_targets.png "federate_targets")](https://shackles.cn/Learning_pictures/Prometheus/federate_targets.png)

federate_targets

## Verifying the node-exporter metrics collected through federation

[![federate_date](https://shackles.cn/Learning_pictures/Prometheus/federate_date.png "federate_date")](https://shackles.cn/Learning_pictures/Prometheus/federate_date.png)

federate_date

Last modified: May 4, 2025
