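Deploy an AWS Lambda to disable a non-responding site

This guide explains how to resolve split-brain scenarios between two sites in a multi-site deployment. It also disables replication if one site fails, so that the other site can continue to serve requests.

This deployment is intended to be used with the setup described in the Concepts for multi-site deployments guide. Use it together with the other building blocks outlined in the Building blocks multi-site deployments guide.

We provide these blueprints to show a minimal, functionally complete example with good baseline performance for regular installations. You still need to adapt it to your environment and to your organization's standards and security best practices.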

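Architecture

If network communication between the sites in a multi-site deployment fails, the two sites can no longer replicate data between them. Infinispan is configured with a FAIL failure policy, which favors consistency over availability. Consequently, all user requests are answered with an error message until the failure is resolved, either by restoring the network connection or by disabling cross-site replication.

In such scenarios, a quorum is commonly used to determine which sites are marked as online or offline. However, because a multi-site deployment consists of only two sites, this is not possible. Instead, we use "fencing" to ensure that when one site cannot connect to the other, only one site remains in the load balancer configuration, and therefore only that site can serve subsequent user requests.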
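To see why a quorum cannot decide between two sites, consider a strict-majority rule. The sketch below is illustrative only; `has_quorum` is a hypothetical helper and not part of Keycloak or Infinispan.

```python
def has_quorum(reachable_sites: int, total_sites: int) -> bool:
    """Return True when the reachable sites form a strict majority."""
    return reachable_sites > total_sites // 2


# With three sites, a partition leaves one side with a majority.
print(has_quorum(2, 3))  # True
print(has_quorum(1, 3))  # False

# With two sites, each side of a split sees only itself. Neither side has
# a majority, so a quorum rule would take both sites offline.
print(has_quorum(1, 2))  # False
```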

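In addition to changing the load balancer configuration, the fencing procedure disables replication between the two Infinispan clusters so that user requests can be served from the site that remains in the load balancer configuration. As a result, the sites are out of sync once replication has been disabled.

To recover from the out-of-sync state, a manual re-sync is necessary, as described in Synchronize Sites. This is why a site that was removed by fencing is not re-added automatically when the network communication failure is resolved. The removed site should only be re-added after the two sites have been synchronized using the procedure outlined in Bring site online.

In this guide we describe how to implement fencing using a combination of Prometheus Alerts and AWS Lambda functions. A Prometheus Alert is triggered when the Infinispan server metrics detect a split-brain, which causes the Prometheus AlertManager to call the AWS Lambda based webhook. The triggered Lambda function inspects the current Global Accelerator configuration and removes the site that is reported to be offline.

In a true split-brain scenario, where both sites are still up but network communication between them is down, both sites may trigger the webhook at the same time. We guard against this by ensuring that only a single Lambda instance can execute at any given time. The logic in the AWS Lambda ensures that one site entry always remains in the load balancer configuration.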
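The decision rule described above can be sketched as a pure function. This is a simplified model rather than the deployed Lambda: the keys `site` and `health` and the name `plan_fencing` are assumptions made for illustration, whereas the real function resolves each endpoint's site from a load balancer tag.

```python
from typing import Optional


def plan_fencing(endpoints: list[dict], reporter: str, offline_site: str) -> Optional[list[dict]]:
    """Return the endpoints to keep, or None when nothing should change."""
    # Never remove the last remaining endpoint.
    if len(endpoints) < 2:
        return None
    # Ignore reports from a site whose own endpoint is unknown or unhealthy.
    reporter_endpoint = next((e for e in endpoints if e["site"] == reporter), None)
    if reporter_endpoint is None or reporter_endpoint["health"] == "UNHEALTHY":
        return None
    return [e for e in endpoints if e["site"] != offline_site]


both = [
    {"site": "site-a", "health": "HEALTHY"},
    {"site": "site-b", "health": "HEALTHY"},
]

# site-a reports site-b offline: only site-a stays in the configuration.
after_first = plan_fencing(both, reporter="site-a", offline_site="site-b")
print(after_first)

# The simultaneous report from site-b is processed afterwards and ignored,
# because only one endpoint is left.
print(plan_fencing(after_first, reporter="site-b", offline_site="site-a"))
```

Because invocations are processed one at a time, the second report always sees the result of the first, which is what keeps one site in place.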

Prerequisites

ROSA HCP based multi-site Keycloak deployment
AWS CLI installed
AWS Global Accelerator load balancer
jq tool installed

Procedure

Enable OpenShift user alert routing

Command:

kubectl apply -f - << EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    alertmanager:
      enabled: true
      enableAlertmanagerConfig: true
EOF
kubectl -n openshift-user-workload-monitoring rollout status --watch statefulset.apps/alertmanager-user-workload

Decide on a username/password combination that will be used to authenticate the Lambda webhook, and create an AWS Secret that stores the password

Command:
```
aws secretsmanager create-secret \
  --name webhook-password \ (1)
  --secret-string changeme \ (2)
  --region eu-west-1 (3)
```
1 The name of the secret

2 The password to be used for authentication

3 The AWS region that hosts the secret

Create the Role used to execute the Lambda.

Command:

FUNCTION_NAME= (1)
ROLE_ARN=$(aws iam create-role \
  --role-name ${FUNCTION_NAME} \
  --assume-role-policy-document \
  '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Service": "lambda.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }
    ]
  }' \
  --query 'Role.Arn' \
  --region eu-west-1 \ (2)
  --output text
)

1	A name of your choice to associate with the Lambda and related resources
2	The AWS region hosting your Kubernetes clusters

Create and attach the 'LambdaSecretManager' policy so that the Lambda can access AWS Secrets

Command:

POLICY_ARN=$(aws iam create-policy \
  --policy-name LambdaSecretManager \
  --policy-document \
  '{
      "Version": "2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Action": [
                  "secretsmanager:GetSecretValue"
              ],
              "Resource": "*"
          }
      ]
  }' \
  --query 'Policy.Arn' \
  --output text
)
aws iam attach-role-policy \
  --role-name ${FUNCTION_NAME} \
  --policy-arn ${POLICY_ARN}

Attach the ElasticLoadBalancingReadOnly policy so that the Lambda can query the provisioned Network Load Balancers

Command:

aws iam attach-role-policy \
  --role-name ${FUNCTION_NAME} \
  --policy-arn arn:aws:iam::aws:policy/ElasticLoadBalancingReadOnly

Attach the GlobalAcceleratorFullAccess policy so that the Lambda can update the Global Accelerator EndpointGroup

Command:

aws iam attach-role-policy \
  --role-name ${FUNCTION_NAME} \
  --policy-arn arn:aws:iam::aws:policy/GlobalAcceleratorFullAccess

Create a Lambda ZIP file containing the required fencing logic

Command:

LAMBDA_ZIP=/tmp/lambda.zip
cat << EOF > /tmp/lambda.py

from urllib.error import HTTPError

import boto3
import jmespath
import json
import os
import urllib3

from base64 import b64decode
from urllib.parse import unquote

# Prevent unverified HTTPS connection warning
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)


class MissingEnvironmentVariable(Exception):
    pass


class MissingSiteUrl(Exception):
    pass


class DecodeError(Exception):
    pass


def env(name):
    if name in os.environ:
        return os.environ[name]
    raise MissingEnvironmentVariable(f"Environment Variable '{name}' must be set")


def handle_site_offline(labels):
    a_client = boto3.client('globalaccelerator', region_name='us-west-2')

    acceleratorDNS = labels['accelerator']
    accelerator = jmespath.search(f"Accelerators[?(DnsName=='{acceleratorDNS}'|| DualStackDnsName=='{acceleratorDNS}')]", a_client.list_accelerators())
    if not accelerator:
        print(f"Ignoring SiteOffline alert as accelerator with DnsName '{acceleratorDNS}' not found")
        return

    accelerator_arn = accelerator[0]['AcceleratorArn']
    listener_arn = a_client.list_listeners(AcceleratorArn=accelerator_arn)['Listeners'][0]['ListenerArn']

    endpoint_group = a_client.list_endpoint_groups(ListenerArn=listener_arn)['EndpointGroups'][0]
    endpoints = endpoint_group['EndpointDescriptions']

    # Only update accelerator endpoints if two entries exist
    if len(endpoints) > 1:
        # If the reporter endpoint is not healthy then do nothing for now
        # A Lambda will eventually be triggered by the other offline site for this reporter
        reporter = labels['reporter']
        reporter_endpoint = [e for e in endpoints if endpoint_belongs_to_site(e, reporter)][0]
        if reporter_endpoint['HealthState'] == 'UNHEALTHY':
            print(f"Ignoring SiteOffline alert as reporter '{reporter}' endpoint is marked UNHEALTHY")
            return

        offline_site = labels['site']
        endpoints = [e for e in endpoints if not endpoint_belongs_to_site(e, offline_site)]
        del reporter_endpoint['HealthState']
        a_client.update_endpoint_group(
            EndpointGroupArn=endpoint_group['EndpointGroupArn'],
            EndpointConfigurations=endpoints
        )
        print(f"Removed site={offline_site} from Accelerator EndpointGroup")

        take_infinispan_site_offline(reporter, offline_site)
        print(f"Backup site={offline_site} caches taken offline")
    else:
        print("Ignoring SiteOffline alert only one Endpoint defined in the EndpointGroup")


def endpoint_belongs_to_site(endpoint, site):
    lb_arn = endpoint['EndpointId']
    region = lb_arn.split(':')[3]
    client = boto3.client('elbv2', region_name=region)
    tags = client.describe_tags(ResourceArns=[lb_arn])['TagDescriptions'][0]['Tags']
    for tag in tags:
        if tag['Key'] == 'site':
            return tag['Value'] == site
    return False


def take_infinispan_site_offline(reporter, offlinesite):
    endpoints = json.loads(INFINISPAN_SITE_ENDPOINTS)
    if reporter not in endpoints:
        raise MissingSiteUrl(f"Missing URL for site '{reporter}' in 'INFINISPAN_SITE_ENDPOINTS' json")

    endpoint = endpoints[reporter]
    password = get_secret(INFINISPAN_USER_SECRET)
    url = f"https://{endpoint}/rest/v2/container/x-site/backups/{offlinesite}?action=take-offline"
    http = urllib3.PoolManager(cert_reqs='CERT_NONE')
    headers = urllib3.make_headers(basic_auth=f"{INFINISPAN_USER}:{password}")
    try:
        rsp = http.request("POST", url, headers=headers)
        if rsp.status >= 400:
            raise HTTPError(url, rsp.status, f"Unexpected response status '{rsp.status}' when taking site offline", rsp.headers, None)
        rsp.release_conn()
    except HTTPError as e:
        print(f"HTTP error encountered: {e}")


def get_secret(secret_name):
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=SECRETS_REGION
    )
    return client.get_secret_value(SecretId=secret_name)['SecretString']


def decode_basic_auth_header(encoded_str):
    split = encoded_str.strip().split(' ')
    if len(split) == 2:
        if split[0].strip().lower() == 'basic':
            try:
                username, password = b64decode(split[1]).decode().split(':', 1)
            except:
                raise DecodeError
        else:
            raise DecodeError
    else:
        raise DecodeError

    return unquote(username), unquote(password)


def handler(event, context):
    print(json.dumps(event))

    authorization = event['headers'].get('authorization')
    if authorization is None:
        print("'Authorization' header missing from request")
        return {
            "statusCode": 401
        }

    expectedPass = get_secret(WEBHOOK_USER_SECRET)
    username, password = decode_basic_auth_header(authorization)
    if username != WEBHOOK_USER or password != expectedPass:
        print('Invalid username/password combination')
        return {
            "statusCode": 403
        }

    body = event.get('body')
    if body is None:
        raise Exception('Empty request body')

    body = json.loads(body)
    print(json.dumps(body))

    if body['status'] != 'firing':
        print("Ignoring alert as status is not 'firing', status was: '%s'" % body['status'])
        return {
            "statusCode": 204
        }

    for alert in body['alerts']:
        labels = alert['labels']
        if labels['alertname'] == 'SiteOffline':
            handle_site_offline(labels)

    return {
        "statusCode": 204
    }


INFINISPAN_USER = env('INFINISPAN_USER')
INFINISPAN_USER_SECRET = env('INFINISPAN_USER_SECRET')
INFINISPAN_SITE_ENDPOINTS = env('INFINISPAN_SITE_ENDPOINTS')
SECRETS_REGION = env('SECRETS_REGION')
WEBHOOK_USER = env('WEBHOOK_USER')
WEBHOOK_USER_SECRET = env('WEBHOOK_USER_SECRET')

EOF
zip -FS --junk-paths ${LAMBDA_ZIP} /tmp/lambda.py
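The handler authenticates each webhook call with HTTP Basic credentials. As a rough illustration of that check, the sketch below parses the header and rejects the request unless both the username and the password match. The names `parse_basic_auth` and `is_authorized` are hypothetical, and the constant-time comparison with `hmac.compare_digest` is a suggested hardening rather than something the deployed code does.

```python
import hmac
from base64 import b64decode, b64encode


def parse_basic_auth(header: str) -> tuple[str, str]:
    """Split an 'Authorization: Basic ...' value into (username, password)."""
    scheme, _, token = header.strip().partition(" ")
    if scheme.lower() != "basic" or not token:
        raise ValueError("not a Basic authorization header")
    username, sep, password = b64decode(token.strip()).decode().partition(":")
    if not sep:
        raise ValueError("missing ':' separator in credentials")
    return username, password


def is_authorized(header: str, expected_user: str, expected_password: str) -> bool:
    """Accept the request only when BOTH username and password match."""
    try:
        username, password = parse_basic_auth(header)
    except ValueError:  # also covers malformed base64 and undecodable bytes
        return False
    # compare_digest avoids leaking how many leading characters matched.
    user_ok = hmac.compare_digest(username.encode(), expected_user.encode())
    pass_ok = hmac.compare_digest(password.encode(), expected_password.encode())
    return user_ok and pass_ok


header = "Basic " + b64encode(b"keycloak:changeme").decode()
print(is_authorized(header, "keycloak", "changeme"))        # True
print(is_authorized(header, "keycloak", "wrong"))           # False
print(is_authorized("Bearer abc", "keycloak", "changeme"))  # False
```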

Create the Lambda function.

Command:

aws lambda create-function \
  --function-name ${FUNCTION_NAME} \
  --zip-file fileb://${LAMBDA_ZIP} \
  --handler lambda.handler \
  --runtime python3.12 \
  --role ${ROLE_ARN} \
  --region eu-west-1 (1)

1	The AWS region hosting your Kubernetes clusters
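The architecture section states that only a single Lambda instance may run at a time, but the steps in this guide do not show how that limit is applied. One possible way, offered here as an assumption and not as the authors' method, is to reserve a concurrency of 1 for the function through the Lambda `PutFunctionConcurrency` API. The helper name `limit_to_single_instance` is hypothetical.

```python
def limit_to_single_instance(lambda_client, function_name: str) -> None:
    """Reserve exactly one concurrent execution for the given function.

    `lambda_client` is expected to be a boto3 Lambda client.
    """
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=1,
    )


# Usage (requires boto3 and configured AWS credentials):
#   import boto3
#   client = boto3.client("lambda", region_name="eu-west-1")
#   limit_to_single_instance(client, "<your FUNCTION_NAME value>")
```

With a reserved concurrency of 1, a second invocation that arrives while the first is still running is throttled instead of running in parallel.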

Expose a Function URL so that the Lambda can be triggered as a webhook

Command:

aws lambda create-function-url-config \
  --function-name ${FUNCTION_NAME} \
  --auth-type NONE \
  --region eu-west-1 (1)

1	The AWS region hosting your Kubernetes clusters

Allow public invocations of the Function URL

Command:

aws lambda add-permission \
  --action "lambda:InvokeFunctionUrl" \
  --function-name ${FUNCTION_NAME} \
  --principal "*" \
  --statement-id FunctionURLAllowPublicAccess \
  --function-url-auth-type NONE \
  --region eu-west-1 (1)

1	The AWS region hosting your Kubernetes clusters

Configure the Lambda's environment variables

In each Kubernetes cluster, retrieve the exposed Infinispan URL endpoint
```
kubectl -n ${NAMESPACE} get route infinispan-external -o jsonpath='{.status.ingress[].host}' (1)
```
1 Replace ${NAMESPACE} with the namespace containing your Infinispan server

Upload the required environment variables

ACCELERATOR_NAME= (1)
LAMBDA_REGION= (2)
CLUSTER_1_NAME= (3)
CLUSTER_1_ISPN_ENDPOINT= (4)
CLUSTER_2_NAME= (5)
CLUSTER_2_ISPN_ENDPOINT= (6)
INFINISPAN_USER= (7)
INFINISPAN_USER_SECRET= (8)
WEBHOOK_USER= (9)
WEBHOOK_USER_SECRET= (10)

INFINISPAN_SITE_ENDPOINTS=$(echo "{\"${CLUSTER_1_NAME}\":\"${CLUSTER_1_ISPN_ENDPOINT}\",\"${CLUSTER_2_NAME}\":\"${CLUSTER_2_ISPN_ENDPOINT}\"}" | jq tostring)
aws lambda update-function-configuration \
    --function-name ${FUNCTION_NAME} \
    --region ${LAMBDA_REGION} \
    --environment "{
      \"Variables\": {
        \"INFINISPAN_USER\" : \"${INFINISPAN_USER}\",
        \"INFINISPAN_USER_SECRET\" : \"${INFINISPAN_USER_SECRET}\",
        \"INFINISPAN_SITE_ENDPOINTS\" : ${INFINISPAN_SITE_ENDPOINTS},
        \"WEBHOOK_USER\" : \"${WEBHOOK_USER}\",
        \"WEBHOOK_USER_SECRET\" : \"${WEBHOOK_USER_SECRET}\",
        \"SECRETS_REGION\" : \"eu-west-1\"
      }
    }"

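1	The name of the AWS Global Accelerator used by your deployment
2	The AWS region hosting your Kubernetes clusters and the Lambda function
3	The name of one of your Infinispan sites, as defined in Deploy Infinispan for HA with the Infinispan Operator
4	The Infinispan endpoint URL associated with the CLUSTER_1_NAME site
5	The name of the second Infinispan site
6	The Infinispan endpoint URL associated with the CLUSTER_2_NAME site
7	The username of an Infinispan user that has sufficient privileges to perform REST requests on the server
8	The name of the AWS secret containing the password associated with the Infinispan user
9	The username used to authenticate requests to the Lambda function
10	The name of the AWS secret containing the password used to authenticate requests to the Lambda function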
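The `jq tostring` step above turns the site-to-endpoint JSON object into a JSON string, so that it fits into a single environment variable and can be parsed again inside the Lambda. The following sketch reproduces that round trip in Python. The site names and host names are placeholders, not values from a real deployment.

```python
import json

# Placeholder site names and host names.
site_endpoints = {
    "site-a": "infinispan-external.site-a.example.com",
    "site-b": "infinispan-external.site-b.example.com",
}

# Equivalent of `jq tostring`: the JSON object becomes a JSON *string*,
# so it can be embedded as one environment variable value.
as_env_value = json.dumps(json.dumps(site_endpoints, separators=(",", ":")))

# Embed the string value into the --environment document, as the shell does.
environment = json.loads('{"Variables": {"INFINISPAN_SITE_ENDPOINTS": %s}}' % as_env_value)

# Inside the Lambda the variable arrives as plain text and is parsed once more.
decoded = json.loads(environment["Variables"]["INFINISPAN_SITE_ENDPOINTS"])
print(decoded["site-a"])  # infinispan-external.site-a.example.com
```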

Retrieve the Lambda Function URL

Command:

aws lambda get-function-url-config \
  --function-name ${FUNCTION_NAME} \
  --query "FunctionUrl" \
  --region eu-west-1 \ (1)
  --output text

1	The AWS region where the Lambda was created

Output:

https://tjqr2vgc664b6noj6vugprakoq0oausj.lambda-url.eu-west-1.on.aws

In each Kubernetes cluster, configure Prometheus Alert routing to trigger the Lambda on split-brain

Command:

NAMESPACE= # The namespace containing your deployments
kubectl apply -n ${NAMESPACE} -f - << EOF
apiVersion: v1
kind: Secret
type: kubernetes.io/basic-auth
metadata:
  name: webhook-credentials
stringData:
  username: 'keycloak' (1)
  password: 'changeme' (2)
---
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: example-routing
spec:
  route:
    receiver: default
    groupBy:
      - accelerator
    groupInterval: 90s
    groupWait: 60s
    matchers:
      - matchType: =
        name: alertname
        value: SiteOffline
  receivers:
    - name: default
      webhookConfigs:
        - url: 'https://tjqr2vgc664b6noj6vugprakoq0oausj.lambda-url.eu-west-1.on.aws/' (3)
          httpConfig:
            basicAuth:
              username:
                key: username
                name: webhook-credentials
              password:
                key: password
                name: webhook-credentials
            tlsConfig:
              insecureSkipVerify: true
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: xsite-status
spec:
  groups:
    - name: xsite-status
      rules:
        - alert: SiteOffline
          expr: 'min by (namespace, site) (vendor_jgroups_site_view_status{namespace="default",site="site-b"}) == 0' (4)
          labels:
            severity: critical
            reporter: site-a (5)
            accelerator: a3da6a6cbd4e27b02.awsglobalaccelerator.com (6)
EOF

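1	The username required to authenticate Lambda requests
2	The password required to authenticate Lambda requests
3	The Lambda Function URL
4	The namespace value should be the namespace hosting the Infinispan CR, and the site should be the remote site defined by `spec.service.sites.locations[0].name` in your Infinispan CR
5	The name of your local site, defined by `spec.service.sites.local.name` in your Infinispan CR
6	The DNS of your Global Accelerator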
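Before triggering a real alert, it can help to see what the Lambda receives. The sketch below builds a minimal Function URL event carrying an Alertmanager-style webhook body and extracts the labels the handler acts on. Only the fields the handler reads are included, and the helper names `make_event` and `site_offline_labels` are hypothetical. A real Alertmanager notification contains additional fields.

```python
import json
from base64 import b64encode


def make_event(username: str, password: str, alerts: list[dict], status: str = "firing") -> dict:
    """Build a minimal Lambda Function URL event with a webhook body."""
    token = b64encode(f"{username}:{password}".encode()).decode()
    return {
        "headers": {"authorization": f"Basic {token}"},
        "body": json.dumps({"status": status, "alerts": alerts}),
    }


def site_offline_labels(event: dict) -> list[dict]:
    """Return the labels of every SiteOffline alert in a firing notification."""
    body = json.loads(event["body"])
    if body["status"] != "firing":
        return []
    return [a["labels"] for a in body["alerts"] if a["labels"].get("alertname") == "SiteOffline"]


event = make_event("keycloak", "changeme", [
    {"labels": {"alertname": "SiteOffline", "site": "site-b", "reporter": "site-a",
                "accelerator": "a3da6a6cbd4e27b02.awsglobalaccelerator.com"}},
    {"labels": {"alertname": "SomethingElse"}},
])
for labels in site_offline_labels(event):
    print(labels["reporter"], "reports", labels["site"], "offline")
```

The `site` label comes from the metric itself through the `min by (namespace, site)` aggregation, while `reporter` and `accelerator` are the static labels set in the PrometheusRule.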

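Verify

To test that the Prometheus alert triggers the webhook as expected, perform the following steps to simulate a split-brain

In each of your clusters, execute the following

Command: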

kubectl -n openshift-operators scale --replicas=0 deployment/infinispan-operator-controller-manager (1)
kubectl -n openshift-operators rollout status -w deployment/infinispan-operator-controller-manager
kubectl -n ${NAMESPACE} scale --replicas=0 deployment/infinispan-router (2)
kubectl -n ${NAMESPACE} rollout status -w deployment/infinispan-router

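1	Scale down the Infinispan Operator so that the next step does not result in the deployment being recreated by the operator
2	Scale down the Gossip Router deployment. Replace `${NAMESPACE}` with the namespace containing your Infinispan server

Verify that the SiteOffline event has fired on a cluster by inspecting the Observe → Alerting menu in the OpenShift console
Inspect the Global Accelerator EndpointGroup in the AWS console. Only a single endpoint should be present

Scale up the Infinispan Operator and Gossip Router to re-establish the connection between the sites

Command: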

kubectl -n openshift-operators scale --replicas=1 deployment/infinispan-operator-controller-manager
kubectl -n openshift-operators rollout status -w deployment/infinispan-operator-controller-manager
kubectl -n ${NAMESPACE} scale --replicas=1 deployment/infinispan-router (1)
kubectl -n ${NAMESPACE} rollout status -w deployment/infinispan-router

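1	Replace `${NAMESPACE}` with the namespace containing your Infinispan server

Inspect the vendor_jgroups_site_view_status metric in each site. A value of 1 indicates that the site is reachable.
Update the Accelerator EndpointGroup to contain both endpoints. See the Bring site online guide for details.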
