可观测性最佳实践
使用 Prometheus 进行生产规模监控
使用 分层联邦 并结合一系列 录制规则 是在生产环境中使用 Prometheus 监控 Istio 服务网格的推荐方法。
虽然默认情况下安装 Istio 不会部署 Prometheus,但 入门 指南中安装了在 Prometheus 集成指南 中描述的 选项 1:快速入门
部署的 Prometheus。此 Prometheus 部署故意配置了非常短的保留窗口(6 小时)。快速入门 Prometheus 部署还配置为从网格中运行的每个 Envoy 代理收集指标,并使用一组关于其来源的标签(instance
、pod
和 namespace
)来增强每个指标。
通过录制规则进行工作负载级聚合
为了跨实例和 Pod 聚合指标,请使用以下录制规则更新默认的 Prometheus 配置
groups:
- name: "istio.recording-rules"
interval: 5s
rules:
- record: "workload:istio_requests_total"
expr: |
sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_requests_total)
- record: "workload:istio_request_duration_milliseconds_count"
expr: |
sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_duration_milliseconds_count)
- record: "workload:istio_request_duration_milliseconds_sum"
expr: |
sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_duration_milliseconds_sum)
- record: "workload:istio_request_duration_milliseconds_bucket"
expr: |
sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_duration_milliseconds_bucket)
- record: "workload:istio_request_bytes_count"
expr: |
sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_bytes_count)
- record: "workload:istio_request_bytes_sum"
expr: |
sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_bytes_sum)
- record: "workload:istio_request_bytes_bucket"
expr: |
sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_bytes_bucket)
- record: "workload:istio_response_bytes_count"
expr: |
sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_response_bytes_count)
- record: "workload:istio_response_bytes_sum"
expr: |
sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_response_bytes_sum)
- record: "workload:istio_response_bytes_bucket"
expr: |
sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_response_bytes_bucket)
- record: "workload:istio_tcp_sent_bytes_total"
expr: |
sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_sent_bytes_total)
- record: "workload:istio_tcp_received_bytes_total"
expr: |
sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_received_bytes_total)
- record: "workload:istio_tcp_connections_opened_total"
expr: |
sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_connections_opened_total)
- record: "workload:istio_tcp_connections_closed_total"
expr: |
sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_connections_closed_total)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: istio-metrics-aggregation
labels:
app.kubernetes.io/name: istio-prometheus
spec:
groups:
- name: "istio.metricsAggregation-rules"
interval: 5s
rules:
- record: "workload:istio_requests_total"
expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_requests_total)"
- record: "workload:istio_request_duration_milliseconds_count"
expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_duration_milliseconds_count)"
- record: "workload:istio_request_duration_milliseconds_sum"
expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_duration_milliseconds_sum)"
- record: "workload:istio_request_duration_milliseconds_bucket"
expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_duration_milliseconds_bucket)"
- record: "workload:istio_request_bytes_count"
expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_bytes_count)"
- record: "workload:istio_request_bytes_sum"
expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_bytes_sum)"
- record: "workload:istio_request_bytes_bucket"
expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_bytes_bucket)"
- record: "workload:istio_response_bytes_count"
expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_response_bytes_count)"
- record: "workload:istio_response_bytes_sum"
expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_response_bytes_sum)"
- record: "workload:istio_response_bytes_bucket"
expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_response_bytes_bucket)"
- record: "workload:istio_tcp_sent_bytes_total"
expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_sent_bytes_total)"
- record: "workload:istio_tcp_received_bytes_total"
expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_received_bytes_total)"
- record: "workload:istio_tcp_connections_opened_total"
expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_connections_opened_total)"
- record: "workload:istio_tcp_connections_closed_total"
expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_connections_closed_total)"
使用工作负载级聚合指标进行联合
要建立 Prometheus 联邦,请修改您已准备好的生产环境中 Prometheus 的配置,使其抓取 Istio Prometheus 的联邦端点。
将以下作业添加到您的配置中
- job_name: 'istio-prometheus'
honor_labels: true
metrics_path: '/federate'
kubernetes_sd_configs:
- role: pod
namespaces:
names: ['istio-system']
metric_relabel_configs:
- source_labels: [__name__]
regex: 'workload:(.*)'
target_label: __name__
action: replace
params:
'match[]':
- '{__name__=~"workload:(.*)"}'
- '{__name__=~"pilot(.*)"}'
如果您使用的是 Prometheus Operator,请改用以下配置
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: istio-federation
labels:
app.kubernetes.io/name: istio-prometheus
spec:
namespaceSelector:
matchNames:
- istio-system
selector:
matchLabels:
app: prometheus
endpoints:
- interval: 30s
scrapeTimeout: 30s
params:
'match[]':
- '{__name__=~"workload:(.*)"}'
- '{__name__=~"pilot(.*)"}'
path: /federate
targetPort: 9090
honorLabels: true
metricRelabelings:
- sourceLabels: ["__name__"]
regex: 'workload:(.*)'
targetLabel: "__name__"
action: replace
使用录制规则优化指标收集
除了使用录制规则 跨 Pod 和实例聚合 之外,您可能还希望使用录制规则生成专门针对您现有仪表板和警报的聚合指标。以这种方式优化收集可以节省生产环境中 Prometheus 实例的大量资源消耗,并加快查询性能。
例如,假设一个自定义监控仪表板使用了以下 Prometheus 查询
过去一分钟内按目标服务名称和命名空间平均的请求总数
sum(irate(istio_requests_total{reporter="source"}[1m])) by ( destination_canonical_service, destination_workload_namespace )
过去一分钟内按源和目标服务名称以及命名空间平均的 P95 客户端延迟
histogram_quantile(0.95, sum(irate(istio_request_duration_milliseconds_bucket{reporter="source"}[1m])) by ( destination_canonical_service, destination_workload_namespace, source_canonical_service, source_workload_namespace, le ) )
可以使用以下录制规则集添加到 Istio Prometheus 配置中,使用 istio
前缀使识别这些用于联邦的指标变得简单。
groups:
- name: "istio.recording-rules"
interval: 5s
rules:
- record: "istio:istio_requests:by_destination_service:rate1m"
expr: |
sum(irate(istio_requests_total{reporter="destination"}[1m]))
by (
destination_canonical_service,
destination_workload_namespace
)
- record: "istio:istio_request_duration_milliseconds_bucket:p95:rate1m"
expr: |
histogram_quantile(0.95,
sum(irate(istio_request_duration_milliseconds_bucket{reporter="source"}[1m]))
by (
destination_canonical_service,
destination_workload_namespace,
source_canonical_service,
source_workload_namespace,
le
)
)
然后将生产实例中的 Prometheus 更新为从 Istio 实例联合,其中
匹配子句为
{__name__=~"istio:(.*)"}
指标重命名配置为:
regex: "istio:(.*)"
然后将原始查询替换为
istio_requests:by_destination_service:rate1m
avg(istio_request_duration_milliseconds_bucket:p95:rate1m)