Traffic Management Problems

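Requests are rejected by Envoy

Requests may be rejected for various reasons. The best way to understand why a request was rejected is to inspect Envoy's access logs. By default, access logs are written to the container's standard output. Run the following command to see the log: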

$ kubectl logs PODNAME -c istio-proxy -n NAMESPACE

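In the default access log format, the Envoy response flags appear right after the response code. If you use a custom log format, make sure it includes %RESPONSE_FLAGS%.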
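
As a minimal sketch of a custom format that keeps the flags, the following IstioOperator fragment sets meshConfig.accessLogFormat. The format string is a shortened, illustrative assumption and not Istio's full default format; the only point is that %RESPONSE_FLAGS% follows %RESPONSE_CODE%.

```yaml
# Sketch only: a shortened, illustrative access log format (an assumption,
# not Istio's full default format).
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    accessLogFile: /dev/stdout   # write access logs to the container's stdout
    accessLogFormat: |
      [%START_TIME%] "%REQ(:METHOD)% %REQ(:PATH)% %PROTOCOL%" %RESPONSE_CODE% %RESPONSE_FLAGS% %DURATION%
```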

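Refer to the Envoy response flags for details on each flag.

Common response flags are:

NR: No route configured. Check your DestinationRule or VirtualService.
UO: Upstream overflow with circuit breaking. Check the circuit breaker configuration in your DestinationRule.
UF: Failed to connect to upstream. If you are using Istio authentication, check for a mutual TLS configuration conflict.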
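
When UO shows up, the settings to review are the connection pool and outlier detection fields of the DestinationRule. The sketch below shows where they live; the resource name, host, and all numeric limits are hypothetical values chosen for illustration, not recommendations.

```yaml
# Sketch only: name, host, and limits are hypothetical.
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: myapp-circuit-breaker
spec:
  host: myapp.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100          # connections beyond this limit overflow
      http:
        http1MaxPendingRequests: 10  # queued requests beyond this limit overflow
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 5        # eject a host after 5 consecutive 5xx errors
      interval: 30s
      baseEjectionTime: 30s
```

Limits that are too low for the real request rate are a common reason for UO in the access log.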

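Route rules don't seem to affect traffic flow

With the current Envoy sidecar implementation, up to 100 requests may be required before the weighted version distribution becomes observable.

If route rules work perfectly for the Bookinfo sample, but similar version routing rules have no effect on your own application, your Kubernetes services may need to be changed slightly. Kubernetes services must adhere to certain restrictions in order to take advantage of Istio's L7 routing features. Refer to the Requirements for Pods and Services for details.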
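
One restriction that commonly matters here is that Istio must be able to determine the protocol of each service port, either from the port name (`<protocol>[-<suffix>]`) or from the appProtocol field. A minimal sketch, with a hypothetical service name, labels, and port numbers:

```yaml
# Sketch only: service name, labels, and port numbers are hypothetical.
apiVersion: v1
kind: Service
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  selector:
    app: myapp
  ports:
  - name: http-web      # "http" prefix declares the protocol by name
    port: 8080
    targetPort: 8080
  - name: metrics
    port: 9090
    appProtocol: http   # alternative: declare the protocol explicitly
```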

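Another potential issue is that the route rules may simply be slow to take effect. The Istio implementation on Kubernetes uses an eventually consistent algorithm to ensure that all Envoy sidecars have the correct configuration, including all route rules. A configuration change takes some time to propagate to all the sidecars. With large deployments, propagation takes longer and there may be a lag on the order of seconds.

503 errors after setting destination rule

If requests to a service immediately start generating HTTP 503 errors after you apply a DestinationRule, and the errors continue until you remove or revert the DestinationRule, then the DestinationRule is probably causing a TLS conflict for the service.

For example, if you configure mutual TLS in the cluster globally, the DestinationRule must include the following trafficPolicy: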

trafficPolicy:
  tls:
    mode: ISTIO_MUTUAL

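Otherwise, the mode defaults to DISABLE, causing client proxy sidecars to make plain HTTP requests instead of TLS encrypted requests. The requests then conflict with the server proxy, which expects encrypted requests.

Whenever you apply a DestinationRule, ensure the trafficPolicy TLS mode matches the global server configuration.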
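
For context, here is how the trafficPolicy fragment above sits inside a complete DestinationRule. This is only a sketch: the resource name, host, and the subset with its version label are hypothetical.

```yaml
# Sketch only: name, host, and subset labels are hypothetical.
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: myapp
spec:
  host: myapp.default.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL   # matches a mesh that requires mutual TLS
  subsets:
  - name: v1
    labels:
      version: v1
```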

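Route rules have no effect on ingress gateway requests

Assume you are using an ingress Gateway and a corresponding VirtualService to access an internal service. For example, your VirtualService looks something like this: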

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - "myapp.com" # or maybe "*" if you are testing without DNS using the ingress-gateway IP (e.g., http://1.2.3.4/hello)
  gateways:
  - myapp-gateway
  http:
  - match:
    - uri:
        prefix: /hello
    route:
    - destination:
        host: helloworld.default.svc.cluster.local
  - match:
    ...

You also have a VirtualService that routes traffic for the helloworld service to a particular subset:

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: helloworld
spec:
  hosts:
  - helloworld.default.svc.cluster.local
  http:
  - route:
    - destination:
        host: helloworld.default.svc.cluster.local
        subset: v1

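In this situation, you will notice that requests to the helloworld service through the ingress gateway are not directed to subset v1 but instead continue to use default round-robin routing.

The ingress requests use the gateway host (e.g., myapp.com), which activates the rules in the myapp VirtualService that route to any endpoint of the helloworld service. Only internal requests with the host helloworld.default.svc.cluster.local use the helloworld VirtualService, which directs traffic exclusively to subset v1.

To control the traffic from the gateway, you also need to include the subset rule in the myapp VirtualService: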

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - "myapp.com" # or maybe "*" if you are testing without DNS using the ingress-gateway IP (e.g., http://1.2.3.4/hello)
  gateways:
  - myapp-gateway
  http:
  - match:
    - uri:
        prefix: /hello
    route:
    - destination:
        host: helloworld.default.svc.cluster.local
        subset: v1
  - match:
    ...

Alternatively, you can combine both VirtualServices into one unit if possible:

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - myapp.com # cannot use "*" here since this is being combined with the mesh services
  - helloworld.default.svc.cluster.local
  gateways:
  - mesh # applies internally as well as externally
  - myapp-gateway
  http:
  - match:
    - uri:
        prefix: /hello
      gateways:
      - myapp-gateway #restricts this rule to apply only to ingress gateway
    route:
    - destination:
        host: helloworld.default.svc.cluster.local
        subset: v1
  - match:
    - gateways:
      - mesh # applies to all services inside the mesh
    route:
    - destination:
        host: helloworld.default.svc.cluster.local
        subset: v1
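
The configurations above all reference subset v1, which has to be defined in a DestinationRule for the helloworld host. This page does not show that resource, so the following is only a sketch; the resource name and the version: v1 pod label are assumptions.

```yaml
# Sketch only: resource name and the "version: v1" pod label are assumptions.
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: helloworld
spec:
  host: helloworld.default.svc.cluster.local
  subsets:
  - name: v1
    labels:
      version: v1   # selects the pods that make up subset v1
```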

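Envoy is crashing under load

Check your ulimit -a. Many systems have a default limit of 1024 open file descriptors, which causes Envoy to assert and crash with the following message: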

[2017-05-17 03:00:52.735][14236][critical][assert] assert failure: fd_ != -1: external/envoy/source/common/network/connection_impl.cc:58

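Make sure to raise your ulimit. For example: ulimit -n 16384

Envoy won't connect to my HTTP/1.0 service

Envoy requires HTTP/1.1 or HTTP/2 traffic for upstream services. For example, when using NGINX to serve traffic behind Envoy, you need to set the proxy_http_version directive in your NGINX configuration to "1.1", because the NGINX default is 1.0.

Example configuration: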

upstream http_backend {
    server 127.0.0.1:8080;

    keepalive 16;
}

server {
    ...

    location /http/ {
        proxy_pass http://http_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        ...
    }
}

503 error while accessing headless services

Assume Istio is installed with the following configuration:

mTLS mode set to STRICT within the mesh
meshConfig.outboundTrafficPolicy.mode set to ALLOW_ANY
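
These two settings could be expressed as in the sketch below. It assumes that istio-system is the Istio root namespace and that the mesh is installed through an IstioOperator resource; adapt it to however your installation is managed.

```yaml
# Sketch only: assumes istio-system is the root namespace.
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT            # mesh-wide strict mutual TLS
---
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    outboundTrafficPolicy:
      mode: ALLOW_ANY       # unknown destinations go through PassthroughCluster
```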

Assume nginx is deployed as a StatefulSet in the default namespace and a corresponding headless Service is defined as shown below:

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: http-web  # Explicitly defining an http port
  clusterIP: None   # Creates a Headless Service
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx
  serviceName: "nginx"
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web

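The port name http-web in the Service definition explicitly specifies the http protocol for that port.

Assume there is also a curl pod Deployment in the default namespace. When nginx is accessed from this curl pod using its Pod IP (one of the common ways to access a headless service), the request goes through the PassthroughCluster to the server side, but the server-side sidecar proxy fails to find a route entry for nginx and fails with HTTP 503 UC.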

$ export SOURCE_POD=$(kubectl get pod -l app=curl -o jsonpath='{.items..metadata.name}')
$ kubectl exec -it $SOURCE_POD -c curl -- curl 10.1.1.171 -s -o /dev/null -w "%{http_code}"
  503

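10.1.1.171 is the Pod IP of one of the nginx replicas, and the service is accessed on containerPort 80.

Here are some ways to avoid this 503 error:

Specify the correct Host header

The Host header in the curl request above is the Pod IP by default. Specifying the Host header as nginx.default in the request to nginx successfully returns HTTP 200 OK.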

$ export SOURCE_POD=$(kubectl get pod -l app=curl -o jsonpath='{.items..metadata.name}')
$ kubectl exec -it $SOURCE_POD -c curl -- curl -H "Host: nginx.default" 10.1.1.171 -s -o /dev/null -w "%{http_code}"
  200

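Set the port name to tcp, tcp-web, or tcp-<custom_name>

Here the protocol is explicitly specified as tcp. In this case, the sidecar proxies on both the client side and the server side use only the TCP Proxy network filter. The HTTP Connection Manager is not used at all, so no headers of any kind are expected in the request. A request to nginx succeeds with HTTP 200 OK whether or not the Host header is set explicitly. This is useful in scenarios where a client may not be able to include header information in the request.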
```
$ export SOURCE_POD=$(kubectl get pod -l app=curl -o jsonpath='{.items..metadata.name}')
$ kubectl exec -it $SOURCE_POD -c curl -- curl 10.1.1.171 -s -o /dev/null -w "%{http_code}"
  200
```
```
$ kubectl exec -it $SOURCE_POD -c curl -- curl -H "Host: nginx.default" 10.1.1.171 -s -o /dev/null -w "%{http_code}"
  200
```
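
A sketch of the corresponding change, assuming the same headless Service shown earlier with only the port name altered:

```yaml
# Sketch only: the earlier headless Service with the port renamed.
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: tcp-web   # "tcp" prefix: the sidecars treat this port as plain TCP
  clusterIP: None   # headless Service
  selector:
    app: nginx
```

The trade-off is that HTTP-level features such as HTTP routing rules no longer apply to this port.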

Use the domain name instead of the Pod IP

A specific instance of a headless service can also be accessed using just its domain name.

$ export SOURCE_POD=$(kubectl get pod -l app=curl -o jsonpath='{.items..metadata.name}')
$ kubectl exec -it $SOURCE_POD -c curl -- curl web-0.nginx.default -s -o /dev/null -w "%{http_code}"
  200

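Here web-0 is the pod name of one of the 3 replicas of nginx.

Refer to this traffic routing page for additional information on headless services and traffic routing behavior for different protocols.

TLS configuration mistakes

Many traffic management problems are caused by incorrect TLS configuration. The following sections describe some of the most common misconfigurations.

Sending HTTPS to an HTTP port

If your application sends an HTTPS request to a service declared as HTTP, the Envoy sidecar attempts to parse the request as HTTP while forwarding it. This fails because the payload is unexpectedly encrypted.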

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: httpbin
spec:
  hosts:
  - httpbin.org
  ports:
  - number: 443
    name: http
    protocol: HTTP
  resolution: DNS

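The above configuration may be correct if you are intentionally sending plaintext on port 443 (e.g., curl http://httpbin.org:443), but port 443 is generally dedicated to HTTPS traffic.

Sending an HTTPS request like curl https://httpbin.org, which defaults to port 443, results in an error like curl: (35) error:1408F10B:SSL routines:ssl3_get_record:wrong version number. The access logs may also show an error like 400 DPE.

To fix this, change the port protocol to HTTPS: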

spec:
  ports:
  - number: 443
    name: https
    protocol: HTTPS

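Gateway to virtual service TLS mismatch

Two common TLS mismatches can occur when binding a virtual service to a gateway:

The gateway terminates TLS while the virtual service configures TLS routing.
The gateway does TLS passthrough while the virtual service configures HTTP routing.

Gateway with TLS termination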

apiVersion: networking.istio.io/v1
kind: Gateway
metadata:
  name: gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    hosts:
      - "*"
    tls:
      mode: SIMPLE
      credentialName: sds-credential
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: httpbin
spec:
  hosts:
  - "*.example.com"
  gateways:
  - istio-system/gateway
  tls:
  - match:
    - sniHosts:
      - "*.example.com"
    route:
    - destination:
        host: httpbin.org

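In this example, the gateway is terminating TLS (the gateway's tls.mode is SIMPLE, not PASSTHROUGH) while the virtual service is using TLS-based routing. Routing rules are evaluated after the gateway terminates TLS, so the TLS rule has no effect because the request is then HTTP rather than HTTPS.

With this misconfiguration, you end up getting 404 responses because the requests are sent to HTTP routing but no HTTP routes are configured. You can confirm this using the istioctl proxy-config routes command.

To fix this problem, switch the virtual service to specify http routing instead of tls: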

spec:
  ...
  http:
  - match:
    - headers:
        ":authority":
          regex: ".*\\.example\\.com"

Gateway with TLS passthrough

apiVersion: networking.istio.io/v1
kind: Gateway
metadata:
  name: gateway
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - "*"
    port:
      name: https
      number: 443
      protocol: HTTPS
    tls:
      mode: PASSTHROUGH
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: virtual-service
spec:
  gateways:
  - gateway
  hosts:
  - httpbin.example.com
  http:
  - route:
    - destination:
        host: httpbin.org

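In this configuration, the virtual service is attempting to match HTTP traffic against TLS traffic passed through the gateway. As a result, the virtual service configuration has no effect. You can observe that the HTTP route is not applied using the istioctl proxy-config listener and istioctl proxy-config route commands.

To fix this, switch the virtual service to configure tls routing: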

spec:
  tls:
  - match:
    - sniHosts: ["httpbin.example.com"]
    route:
    - destination:
        host: httpbin.org

Alternatively, you can terminate TLS rather than passing it through by switching the tls configuration in the gateway:

spec:
  ...
    tls:
      credentialName: sds-credential
      mode: SIMPLE

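Double TLS (TLS origination for a TLS request)

When configuring Istio to perform TLS origination, make sure that the application sends plaintext requests to the sidecar, which then originates the TLS.

The following DestinationRule originates TLS for requests to the httpbin.org service, but the corresponding ServiceEntry defines the protocol as HTTPS on port 443.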

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: httpbin
spec:
  hosts:
  - httpbin.org
  ports:
  - number: 443
    name: https
    protocol: HTTPS
  resolution: DNS
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: originate-tls
spec:
  host: httpbin.org
  trafficPolicy:
    tls:
      mode: SIMPLE

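With this configuration, the sidecar expects the application to send TLS traffic on port 443 (e.g., curl https://httpbin.org), but it also performs TLS origination before forwarding requests. This causes the requests to be double encrypted.

For example, sending a request like curl https://httpbin.org results in an error: (35) error:1408F10B:SSL routines:ssl3_get_record:wrong version number.

You can fix this example by changing the port protocol in the ServiceEntry to HTTP: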

spec:
  hosts:
  - httpbin.org
  ports:
  - number: 443
    name: http
    protocol: HTTP

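Note that with this configuration your application needs to send plaintext requests to port 443, like curl http://httpbin.org:443, because TLS origination does not change the port. However, starting in Istio 1.8, you can expose HTTP port 80 to the application (e.g., curl http://httpbin.org) and then redirect requests to targetPort 443 for the TLS origination: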

spec:
  hosts:
  - httpbin.org
  ports:
  - number: 80
    name: http
    protocol: HTTP
    targetPort: 443

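404 errors occur when multiple gateways are configured with the same TLS certificate

Configuring more than one gateway with the same TLS certificate causes browsers that leverage HTTP/2 connection reuse (i.e., most browsers) to produce 404 errors when accessing a second host after a connection to another host has already been established.

For example, suppose you have 2 hosts that share the same TLS certificate like this:

Wildcard certificate *.test.com installed in istio-ingressgateway
Gateway configuration gw1 with host service1.test.com, selector istio: ingressgateway, and TLS using the gateway's mounted (wildcard) certificate
Gateway configuration gw2 with host service2.test.com, selector istio: ingressgateway, and TLS using the gateway's mounted (wildcard) certificate
VirtualService configuration vs1 with host service1.test.com and gateway gw1
VirtualService configuration vs2 with host service2.test.com and gateway gw2

Since both gateways are served by the same workload (i.e., selector istio: ingressgateway), requests to both services (service1.test.com and service2.test.com) resolve to the same IP. If service1.test.com is accessed first, it returns the wildcard certificate (*.test.com), indicating that connections to service2.test.com can use the same certificate. Browsers like Chrome and Firefox consequently reuse the existing connection for requests to service2.test.com. Since the gateway (gw1) has no route for service2.test.com, it then returns a 404 (Not Found) response.

You can avoid this problem by configuring a single wildcard Gateway instead of two (gw1 and gw2). Then, simply bind both VirtualServices to it like this:

Gateway configuration gw with host *.test.com, selector istio: ingressgateway, and TLS using the gateway's mounted (wildcard) certificate
VirtualService configuration vs1 with host service1.test.com and gateway gw
VirtualService configuration vs2 with host service2.test.com and gateway gw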
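
The single-gateway layout described above could look like the following sketch. The certificate mount paths and the backend destination hosts are assumptions made for illustration; only gw, vs1, vs2, the hosts, and the selector come from the description.

```yaml
# Sketch only: certificate paths and destination hosts are assumptions.
apiVersion: networking.istio.io/v1
kind: Gateway
metadata:
  name: gw
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    hosts:
    - "*.test.com"   # one wildcard server instead of one gateway per host
    tls:
      mode: SIMPLE
      serverCertificate: /etc/istio/ingressgateway-certs/tls.crt   # assumed mount path
      privateKey: /etc/istio/ingressgateway-certs/tls.key          # assumed mount path
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: vs1
spec:
  hosts:
  - service1.test.com
  gateways:
  - gw
  http:
  - route:
    - destination:
        host: service1.default.svc.cluster.local   # hypothetical backend
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: vs2
spec:
  hosts:
  - service2.test.com
  gateways:
  - gw
  http:
  - route:
    - destination:
        host: service2.default.svc.cluster.local   # hypothetical backend
```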

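Configuring SNI routing when not sending SNI

An HTTPS Gateway that specifies the hosts field performs an SNI match on incoming requests. For example, the following configuration only allows requests whose SNI matches *.example.com: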

servers:
- port:
    number: 443
    name: https
    protocol: HTTPS
  hosts:
  - "*.example.com"

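This may cause certain requests to fail.

For example, if you do not have DNS set up and are instead setting the Host header directly, such as curl 1.2.3.4 -H "Host: app.example.com", no SNI is set and the request fails. Instead, you can set up DNS or use the --resolve flag of curl. See the Secure Gateways task for more information.

Another common issue is a load balancer in front of Istio. Most cloud load balancers do not forward the SNI, so if you are terminating TLS in your cloud load balancer you may need to do one of the following:

Configure the cloud load balancer to pass through the TLS connection instead
Disable SNI matching in the Gateway by setting the hosts field to *

A common symptom of this is that the load balancer health checks succeed while real traffic fails.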
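
The second option might look like the following sketch, which mirrors the server fragment shown earlier in this section with only the hosts entry changed:

```yaml
# Sketch only: the earlier server fragment with SNI matching disabled.
servers:
- port:
    number: 443
    name: https
    protocol: HTTPS
  hosts:
  - "*"   # accept any SNI, including requests that carry none
```

Host-based routing can then still be done in the VirtualService bound to this gateway.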

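Unchanged Envoy filter configuration suddenly stops working

An EnvoyFilter configuration that specifies an insert position relative to another filter can be very fragile because, by default, the order of evaluation is based on the creation time of the filters. Consider a filter with the following specification: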

spec:
  configPatches:
  - applyTo: NETWORK_FILTER
    match:
      context: SIDECAR_OUTBOUND
      listener:
        portNumber: 443
        filterChain:
          filter:
            name: istio.stats
    patch:
      operation: INSERT_BEFORE
      value:
        ...

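To work properly, this filter configuration depends on the istio.stats filter having an older creation time than it does. Otherwise, the INSERT_BEFORE operation is silently ignored. Nothing in the error log indicates that this filter has not been added to the chain.

This is particularly problematic when matching filters, like istio.stats, that are version specific (i.e., that include the proxyVersion field in their match criteria). Such filters may be removed or replaced by newer ones when upgrading Istio. As a result, an EnvoyFilter like the one above may initially work perfectly, but after upgrading Istio to a newer version it is no longer included in the network filter chain of the sidecars.

To avoid this issue, you can either change the operation to one that does not depend on the presence of another filter (e.g., INSERT_FIRST), or set an explicit priority in the EnvoyFilter to override the default creation-time-based ordering. For example, adding priority: 10 to the above filter ensures that it is processed after the istio.stats filter, which has a default priority of 0.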
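
As a sketch, the explicit priority sits at the top level of the EnvoyFilter spec, next to configPatches. The rest of the filter is repeated unchanged from the specification above, including its elided value:

```yaml
# Sketch only: the filter from above with an explicit priority added.
spec:
  priority: 10   # processed after filters that keep the default priority 0
  configPatches:
  - applyTo: NETWORK_FILTER
    match:
      context: SIDECAR_OUTBOUND
      listener:
        portNumber: 443
        filterChain:
          filter:
            name: istio.stats
    patch:
      operation: INSERT_BEFORE
      value:
        ...
```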

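Virtual service with fault injection and retry/timeout policies not working as expected

Currently, Istio does not support configuring fault injection and retry or timeout policies on the same VirtualService. Consider the following configuration: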

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: helloworld
spec:
  hosts:
    - "*"
  gateways:
  - helloworld-gateway
  http:
  - match:
    - uri:
        exact: /hello
    fault:
      abort:
        httpStatus: 500
        percentage:
          value: 50
    retries:
      attempts: 5
      retryOn: 5xx
    route:
    - destination:
        host: helloworld
        port:
          number: 5000

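Given the five configured retry attempts, you would expect the user to almost never see an error when calling the helloworld service. However, because both the fault and the retries are configured on the same VirtualService, the retry configuration does not take effect, resulting in a 50% failure rate. To work around this issue, you can remove the fault configuration from your VirtualService and instead inject the fault into the upstream Envoy proxy using an EnvoyFilter: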

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: hello-world-filter
spec:
  workloadSelector:
    labels:
      app: helloworld
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND # matches inbound listeners of the workloads selected above
      listener:
        filterChain:
          filter:
            name: "envoy.filters.network.http_connection_manager"
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.fault
        typed_config:
          "@type": "type.googleapis.com/envoy.extensions.filters.http.fault.v3.HTTPFault"
          abort:
            http_status: 500
            percentage:
              numerator: 50
              denominator: HUNDRED

This works because the retry policy is now configured for the client proxy while the fault injection is configured for the upstream proxy.
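
For completeness, the VirtualService that pairs with this EnvoyFilter would be the original one with the fault block removed, leaving only the retry policy. A sketch derived from the configuration shown earlier in this section:

```yaml
# Sketch only: the earlier VirtualService with the fault block removed.
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: helloworld
spec:
  hosts:
  - "*"
  gateways:
  - helloworld-gateway
  http:
  - match:
    - uri:
        exact: /hello
    retries:
      attempts: 5    # retries are now applied by the client-side proxy
      retryOn: 5xx
    route:
    - destination:
        host: helloworld
        port:
          number: 5000
```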