Configuring IP failover

    IP failover manages a pool of Virtual IP (VIP) addresses on a set of nodes. Every VIP in the set is serviced by a node selected from the set. As long as a single node is available, the VIPs are served. There is no way to explicitly distribute the VIPs over the nodes, so there can be nodes with no VIPs and other nodes with many VIPs. If there is only one node, all VIPs are on it.

    IP failover monitors a port on each VIP to determine whether the port is reachable on the node. If the port is not reachable, the VIP is not assigned to the node. If the port is set to 0, this check is suppressed. The check script does the needed testing.

    IP failover uses Keepalived to host a set of externally accessible VIP addresses on a set of hosts. Each VIP is only serviced by a single host at a time. Keepalived uses the Virtual Router Redundancy Protocol (VRRP) to determine which host, from the set of hosts, services which VIP. If a host becomes unavailable, or if the service that Keepalived is watching does not respond, the VIP is switched to another host from the set. This means a VIP is always serviced as long as a host is available.

    When a node running Keepalived passes the check script, the VIP on that node can enter the master state based on its priority and the priority of the current master and as determined by the preemption strategy.

    A cluster administrator can provide a script through the OPENSHIFT_HA_NOTIFY_SCRIPT variable, and this script is called whenever the state of the VIP on the node changes. Keepalived uses the master state when it is servicing the VIP, the backup state when another node is servicing the VIP, or the fault state when the check script fails. The notify script is called with the new state whenever the state changes.

    You can create an IP failover deployment configuration on OKD. The IP failover deployment configuration specifies the set of VIP addresses, and the set of nodes on which to service them. A cluster can have multiple IP failover deployment configurations, with each managing its own set of unique VIP addresses. Each node in the IP failover configuration runs an IP failover pod, and this pod runs Keepalived.

    When using VIPs to access a pod with host networking, the application pod runs on all nodes that are running the IP failover pods. This enables any of the IP failover nodes to become the master and service the VIPs when needed. If application pods are not running on all nodes with IP failover, either some IP failover nodes never service the VIPs or some application pods never receive any traffic. Use the same selector and replication count, for both IP failover and the application pods, to avoid this mismatch.

    When using VIPs to access a service, any of the nodes can be in the IP failover set of nodes, because the service is reachable on all nodes, no matter where the application pod is running. Any of the IP failover nodes can become master at any time. The service can either use external IPs and a service port or it can use a NodePort.

    When using external IPs in the service definition, the VIPs are set to the external IPs, and the IP failover monitoring port is set to the service port. When using a node port, the port is open on every node in the cluster, and the service load-balances traffic from whatever node currently services the VIP. In this case, the IP failover monitoring port is set to the NodePort in the service definition.

    Setting up a NodePort is a privileged operation.

    Even though a service VIP is highly available, performance can still be affected. Keepalived makes sure that each of the VIPs is serviced by some node in the configuration, and several VIPs can end up on the same node even when other nodes have none. Strategies that externally load-balance across a set of VIPs can be thwarted when IP failover puts multiple VIPs on the same node.

    When you use ingressIP, you can set up IP failover to have the same VIP range as the ingressIP range. You can also disable the monitoring port. In this case, all the VIPs appear on the same node in the cluster. Any user can set up a service with an ingressIP and have it highly available.

    There are a maximum of 254 VIPs in the cluster.

    The following table contains the variables used to configure IP failover.

    Table 1. IP failover environment variables

    | Variable Name | Default | Description |
    |---|---|---|
    | OPENSHIFT_HA_MONITOR_PORT | 80 | The IP failover pod tries to open a TCP connection to this port on each Virtual IP (VIP). If a connection is established, the service is considered to be running. If this port is set to 0, the test always passes. |
    | OPENSHIFT_HA_NETWORK_INTERFACE | | The interface name that IP failover uses to send Virtual Router Redundancy Protocol (VRRP) traffic. The default value is eth0. |
    | OPENSHIFT_HA_REPLICA_COUNT | 2 | The number of replicas to create. This must match the spec.replicas value in the IP failover deployment configuration. |
    | OPENSHIFT_HA_VIRTUAL_IPS | | The list of IP address ranges to replicate. This must be provided. For example, 1.2.3.4-6,1.2.3.9. |
    | OPENSHIFT_HA_VRRP_ID_OFFSET | 0 | The offset value used to set the virtual router IDs. Using different offset values allows multiple IP failover configurations to exist within the same cluster. The default offset is 0, and the allowed range is 0 through 255. |
    | OPENSHIFT_HA_VIP_GROUPS | | The number of groups to create for VRRP. If not set, a group is created for each virtual IP range specified with the OPENSHIFT_HA_VIRTUAL_IPS variable. |
    | OPENSHIFT_HA_IPTABLES_CHAIN | INPUT | The name of the iptables chain to which an iptables rule is automatically added to allow the VRRP traffic. If the value is not set, an iptables rule is not added. If the chain does not exist, it is not created. |
    | OPENSHIFT_HA_CHECK_SCRIPT | | The full path name in the pod file system of a script that is periodically run to verify the application is operating. |
    | OPENSHIFT_HA_CHECK_INTERVAL | 2 | The period, in seconds, at which the check script is run. |
    | OPENSHIFT_HA_NOTIFY_SCRIPT | | The full path name in the pod file system of a script that is run whenever the state changes. |
    | OPENSHIFT_HA_PREEMPTION | preempt_delay 300 | The strategy for handling a new higher priority host. The nopreempt strategy does not move master from the lower priority host to the higher priority host. |
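
    The range notation used by OPENSHIFT_HA_VIRTUAL_IPS can be easier to reason about when expanded into individual addresses. The following is a minimal sketch, not code from the IP failover image: expand_vips is a hypothetical helper, and it assumes that a range such as 1.2.3.4-6 varies only the last octet.

```shell
#!/bin/sh
# expand_vips LIST -> prints one address per line.
# Hypothetical helper; assumes "a.b.c.X-Y" means last octets X through Y.
expand_vips() (
  # The function body runs in a subshell, so the IFS change stays local.
  IFS=','
  for item in $1; do
    case "$item" in
      *-*)
        base=${item%-*}      # for example "1.2.3.4"
        last=${item##*-}     # for example "6"
        prefix=${base%.*}    # for example "1.2.3"
        i=${base##*.}        # first value of the last octet
        while [ "$i" -le "$last" ]; do
          echo "$prefix.$i"
          i=$((i + 1))
        done
        ;;
      *)
        # A single address is printed unchanged.
        echo "$item"
        ;;
    esac
  done
)
```

    For the example value from the table, expand_vips '1.2.3.4-6,1.2.3.9' lists four addresses, which is the number of VIPs that the configuration would manage under the stated assumption.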

    Configuring IP failover

    As a cluster administrator, you can configure IP failover on an entire cluster, or on a subset of nodes, as defined by the label selector. You can also configure multiple IP failover deployment configurations in your cluster, where each one is independent of the others.

    The IP failover deployment configuration ensures that a failover pod runs on each of the nodes matching the constraints or the label used.

    This pod runs Keepalived, which can monitor an endpoint and use Virtual Router Redundancy Protocol (VRRP) to fail over the virtual IP (VIP) from one node to another if the first node cannot reach the service or endpoint.

    For production use, set a selector that selects at least two nodes, and set replicas equal to the number of selected nodes.

    Prerequisites

    • You are logged in to the cluster as a user with cluster-admin privileges.

    • You created a pull secret.

    Procedure

    1. Create an IP failover service account:

      $ oc create sa ipfailover

    2. Update security context constraints (SCC) for hostNetwork:

      $ oc adm policy add-scc-to-user privileged -z ipfailover
      $ oc adm policy add-scc-to-user hostnetwork -z ipfailover

    3. Create a deployment YAML file to configure IP failover:

      Example deployment YAML for IP failover configuration

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: ipfailover-keepalived
        labels:
          ipfailover: hello-openshift
      spec:
        strategy:
          type: Recreate
        replicas: 2
        selector:
          matchLabels:
            ipfailover: hello-openshift
        template:
          metadata:
            labels:
              ipfailover: hello-openshift
          spec:
            serviceAccountName: ipfailover
            privileged: true
            hostNetwork: true
            nodeSelector:
              node-role.kubernetes.io/worker: ""
            containers:
            - name: openshift-ipfailover
              image: quay.io/openshift/origin-keepalived-ipfailover
              ports:
              - containerPort: 63000
                hostPort: 63000
              imagePullPolicy: IfNotPresent
              securityContext:
                privileged: true
              volumeMounts:
              - name: lib-modules
                mountPath: /lib/modules
                readOnly: true
              - name: host-slash
                mountPath: /host
                readOnly: true
                mountPropagation: HostToContainer
              - name: etc-sysconfig
                mountPath: /etc/sysconfig
                readOnly: true
              - name: config-volume
                mountPath: /etc/keepalive
              env:
              - name: OPENSHIFT_HA_CONFIG_NAME
                value: "ipfailover"
              - name: OPENSHIFT_HA_VIRTUAL_IPS
                value: "1.1.1.1-2"
              - name: OPENSHIFT_HA_VIP_GROUPS
                value: "10"
              - name: OPENSHIFT_HA_NETWORK_INTERFACE
              - name: OPENSHIFT_HA_MONITOR_PORT
                value: "30060"
              - name: OPENSHIFT_HA_VRRP_ID_OFFSET
                value: "0"
              - name: OPENSHIFT_HA_REPLICA_COUNT
                value: "2" # Must match the number of replicas in the deployment
              - name: OPENSHIFT_HA_USE_UNICAST
              #- name: OPENSHIFT_HA_UNICAST_PEERS
              #  value: "10.0.148.40,10.0.160.234,10.0.199.110"
              - name: OPENSHIFT_HA_IPTABLES_CHAIN
                value: "INPUT"
              #- name: OPENSHIFT_HA_NOTIFY_SCRIPT
              #  value: /etc/keepalive/mynotifyscript.sh
              - name: OPENSHIFT_HA_CHECK_SCRIPT
                value: "/etc/keepalive/mycheckscript.sh"
              - name: OPENSHIFT_HA_PREEMPTION
                value: "preempt_delay 300"
              - name: OPENSHIFT_HA_CHECK_INTERVAL
                value: "2"
              livenessProbe:
                initialDelaySeconds: 10
                exec:
                  command:
                  - pgrep
                  - keepalived
            volumes:
            - name: lib-modules
              hostPath:
                path: /lib/modules
            - name: host-slash
              hostPath:
                path: /
            - name: etc-sysconfig
              hostPath:
                path: /etc/sysconfig
            # config-volume contains the check script
            # created with `oc create configmap keepalived-checkscript --from-file=mycheckscript.sh`
            - configMap:
                defaultMode: 0755
                name: keepalived-checkscript
              name: config-volume
            imagePullSecrets:
            - name: openshift-pull-secret

    About virtual IP addresses

    Keepalived manages a set of virtual IP addresses (VIPs). The administrator must make sure that all of these addresses:

    • Are accessible on the configured hosts from outside the cluster.

    • Are not used for any other purpose within the cluster.

    Keepalived on each node determines whether the needed service is running. If it is, VIPs are supported and Keepalived participates in the negotiation to determine which node serves the VIP. For a node to participate, the service must be listening on the watch port on a VIP or the check must be disabled.

    Each VIP in the set may end up being served by a different node.

    Keepalived monitors the health of the application by periodically running an optional user-supplied check script. For example, the script can test a web server by issuing a request and verifying the response.

    When a check script is not provided, a simple default script is run that tests the TCP connection. This default test is suppressed when the monitor port is 0.
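
    The TCP test described above can be approximated with a short shell function. This is a minimal sketch and not the default script shipped in the IP failover image: check_vip_port is a hypothetical name, and the sketch assumes that bash and the coreutils timeout command are available in the pod.

```shell
#!/bin/sh
# check_vip_port VIP PORT -> exit 0 if the check passes, non-zero otherwise.
# Hypothetical helper that mimics the documented behavior of the default test.
check_vip_port() (
  vip=$1
  port=$2
  # A monitor port of 0 suppresses the test, so the check always passes.
  if [ "$port" -eq 0 ]; then
    exit 0
  fi
  # Assumption: bash is installed. Its /dev/tcp pseudo-device opens a TCP
  # connection and fails when nothing accepts it; timeout bounds the wait.
  timeout 2 bash -c "exec 3<>/dev/tcp/$vip/$port" 2>/dev/null
)
```

    A custom check script could call check_vip_port with one of its VIPs and the monitor port, then exit with the function's status.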

    Each IP failover pod manages a Keepalived daemon that manages one or more virtual IPs (VIP) on the node where the pod is running. The Keepalived daemon keeps the state of each VIP for that node. A particular VIP on a particular node may be in master, backup, or fault state.

    When the check script for that VIP on the node that is in master state fails, the VIP on that node enters the fault state, which triggers a renegotiation. During renegotiation, all VIPs on a node that are not in the fault state participate in deciding which node takes over the VIP. Ultimately, the VIP enters the master state on some node, and the VIP stays in the backup state on the other nodes.

    When a node with a VIP in backup state fails, the VIP on that node enters the fault state. When the check script passes again for a VIP on a node in the fault state, the VIP on that node exits the fault state and negotiates to enter the master state. The VIP on that node may then enter either the master or the backup state.

    As cluster administrator, you can provide an optional notify script, which is called whenever the state changes. Keepalived passes the following three parameters to the script:

    • $1 - group or instance

    • $2 - Name of the group or instance

    • $3 - The new state: master, backup, or fault
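
    The three parameters above can be handled in a notify script along the following lines. This is a sketch under stated assumptions: describe_transition is a hypothetical function name, the messages are illustrative, and the state is lowercased first in case Keepalived passes it in uppercase.

```shell
#!/bin/sh
# describe_transition KIND NAME STATE -> prints one log line per state change.
# KIND is "group" or "instance", NAME is its name, STATE is the new state.
describe_transition() (
  kind=$1
  name=$2
  # Normalize case so that "MASTER" and "master" are treated alike (assumption).
  state=$(printf '%s' "$3" | tr '[:upper:]' '[:lower:]')
  case "$state" in
    master) echo "$kind $name: this node is now servicing the VIP" ;;
    backup) echo "$kind $name: another node is servicing the VIP" ;;
    fault)  echo "$kind $name: the check script failed" ;;
    *)      echo "$kind $name: unexpected state '$state'"; exit 1 ;;
  esac
)
```

    A notify script built on this sketch could end with describe_transition "$1" "$2" "$3" so that each state change produces one line in the pod log.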

    The check and notify scripts run in the IP failover pod and use the pod file system, not the host file system. However, the IP failover pod makes the host file system available under the /host mount path. When configuring a check or notify script, you must provide the full path to the script. The recommended approach for providing the scripts is to use a config map.

    The full path names of the check and notify scripts are added to the Keepalived configuration file, /etc/keepalived/keepalived.conf, which is loaded every time Keepalived starts. The scripts can be added to the pod with a config map as follows.

    Prerequisites

    • You installed the OpenShift CLI (oc).

    • You are logged in to the cluster as a user with cluster-admin privileges.

    Procedure

    1. Create the desired script and create a config map to hold it. The script has no input arguments and must return 0 for OK and 1 for fail.

      The check script, mycheckscript.sh:

      #!/bin/bash
      # Whatever tests are needed
      # E.g., send request and verify response
      exit 0

    2. Create the config map:

      $ oc create configmap mycustomcheck --from-file=mycheckscript.sh

    3. Add the script to the pod, either by using oc commands or by editing the deployment configuration. The defaultMode for the mounted config map files must allow the files to be run. A value of 0755, 493 decimal, is typical:

      $ oc set env deploy/ipfailover-keepalived \
          OPENSHIFT_HA_CHECK_SCRIPT=/etc/keepalive/mycheckscript.sh

      $ oc set volume deploy/ipfailover-keepalived --add --overwrite \
          --name=config-volume \
          --mount-path=/etc/keepalive \
          --source='{"configMap": { "name": "mycustomcheck", "defaultMode": 493}}'

      The oc set env command is whitespace sensitive. There must be no whitespace on either side of the = sign.

      You can alternatively edit the ipfailover-keepalived deployment configuration:

      1. In the spec.container.env field, add the OPENSHIFT_HA_CHECK_SCRIPT environment variable to point to the mounted script file.
      2. Add the spec.container.volumeMounts field to create the mount point.
      3. Add a new spec.volumes field to mention the config map.
      4. Set defaultMode on the config map volume to give the files run permission. When read back, it is displayed in decimal, 493.

    Configuring VRRP preemption

    When a Virtual IP (VIP) on a node leaves the fault state by passing the check script, the VIP on the node enters the backup state if it has lower priority than the VIP on the node that is currently in the master state. However, if the VIP on the node that is leaving fault state has a higher priority, the preemption strategy determines its role in the cluster.

    The nopreempt strategy does not move master from the lower priority VIP on the host to the higher priority VIP on the host. With preempt_delay 300, the default, Keepalived waits the specified 300 seconds and moves master to the higher priority VIP on the host.
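
    The two strategies can be told apart by inspecting the OPENSHIFT_HA_PREEMPTION value. The following sketch only illustrates the two values named in this section; preemption_delay is a hypothetical helper, and other values that Keepalived might accept are treated as unrecognized.

```shell
#!/bin/sh
# preemption_delay VALUE -> prints the delay in seconds before a higher
# priority VIP takes over master, or "never" for the nopreempt strategy.
preemption_delay() (
  case "$1" in
    nopreempt)
      echo "never"
      ;;
    "preempt_delay "*)
      # Strip the keyword and keep the number of seconds.
      echo "${1#preempt_delay }"
      ;;
    *)
      echo "unrecognized strategy: $1" >&2
      exit 1
      ;;
  esac
)
```

    With the default value, preemption_delay 'preempt_delay 300' prints 300, which matches the 300-second wait described above.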

    Prerequisites

    • You installed the OpenShift CLI (oc).

    Procedure

    • To specify preemption, enter oc edit deploy ipfailover-keepalived to edit the IP failover deployment configuration:

      $ oc edit deploy ipfailover-keepalived

      spec:
        containers:
        - env:
          - name: OPENSHIFT_HA_PREEMPTION
            value: preempt_delay 300
      ...

    About VRRP ID offset

    Each IP failover pod managed by the IP failover deployment configuration, 1 pod per node or replica, runs a Keepalived daemon. As more IP failover deployment configurations are configured, more pods are created and more daemons join into the common Virtual Router Redundancy Protocol (VRRP) negotiation. This negotiation is done by all the Keepalived daemons and it determines which nodes service which virtual IPs (VIP).

    Internally, Keepalived assigns a unique vrrp-id to each VIP. The negotiation uses this set of vrrp-ids. When a decision is made, the VIP corresponding to the winning vrrp-id is serviced on the winning node.

    Therefore, for every VIP defined in the IP failover deployment configuration, the IP failover pod must assign a corresponding vrrp-id. This is done by starting at OPENSHIFT_HA_VRRP_ID_OFFSET and sequentially assigning the vrrp-ids to the list of VIPs. The vrrp-ids can have values in the range 1..255.

    When there are multiple IP failover deployment configurations, you must specify OPENSHIFT_HA_VRRP_ID_OFFSET so that there is room to increase the number of VIPs in the deployment configuration and none of the vrrp-id ranges overlap.
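
    Planning offsets for several deployment configurations comes down to simple range arithmetic. The sketch below is illustrative only: vrrp_id_range and ranges_overlap are hypothetical helpers, and the sketch assumes that the first VIP receives the vrrp-id OPENSHIFT_HA_VRRP_ID_OFFSET + 1, so that an offset of 0 yields ids starting at 1.

```shell
#!/bin/sh
# vrrp_id_range OFFSET VIP_COUNT -> prints "FIRST LAST" vrrp-ids.
# Assumption: ids are assigned sequentially as OFFSET+1 .. OFFSET+VIP_COUNT.
vrrp_id_range() (
  first=$(($1 + 1))
  last=$(($1 + $2))
  # vrrp-ids must stay within 1..255.
  if [ "$last" -gt 255 ]; then
    echo "vrrp-id $last exceeds 255" >&2
    exit 1
  fi
  echo "$first $last"
)

# ranges_overlap FIRST_A LAST_A FIRST_B LAST_B -> exit 0 if any id is shared.
ranges_overlap() {
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}
```

    For example, a configuration with offset 0 and 10 VIPs would use ids 1 through 10 under this assumption, so a second configuration would need an offset of at least 10, and a larger one to leave room for more VIPs in the first.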

    IP failover management is limited to 254 groups of Virtual IP (VIP) addresses. By default OKD assigns one IP address to each group. You can use the OPENSHIFT_HA_VIP_GROUPS variable to change this so multiple IP addresses are in each group and define the number of VIP groups available for each Virtual Router Redundancy Protocol (VRRP) instance when configuring IP failover.

    Grouping VIPs creates a wider range of allocation of VIPs per VRRP in the case of VRRP failover events, and is useful when all hosts in the cluster have access to a service locally. For example, when a service is being exposed with an ExternalIP.

    As a rule for failover, do not limit services, such as the router, to one specific host. Instead, services should be replicated to each host so that in the case of IP failover, the services do not have to be recreated on the new host.

    If you are using OKD health checks, the nature of IP failover and groups means that not all instances in the group are checked. For that reason, the Kubernetes health checks must be used to ensure that services are live.

    Prerequisites

    • You are logged in to the cluster as a user with cluster-admin privileges.

    Procedure

    • To change the number of IP addresses assigned to each group, change the value for the OPENSHIFT_HA_VIP_GROUPS variable, for example:

      Example Deployment YAML for IP failover configuration

      ...
      spec:
        env:
        - name: OPENSHIFT_HA_VIP_GROUPS # (1)
          value: "3"
      ...

      (1) If OPENSHIFT_HA_VIP_GROUPS is set to 3 in an environment with seven VIPs, it creates three groups, assigning three VIPs to the first group, and two VIPs to each of the two remaining groups.

    If the number of groups set by OPENSHIFT_HA_VIP_GROUPS is fewer than the number of IP addresses set to fail over, at least one group contains more than one IP address, and all of the addresses in a group move as a single unit.
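
    The seven-VIP example can be reproduced with a little integer arithmetic. This is a sketch, not the grouping code used by IP failover: group_sizes is a hypothetical helper, and it assumes that VIPs are spread as evenly as possible with any remainder going to the earliest groups, which is consistent with the example above.

```shell
#!/bin/sh
# group_sizes VIP_COUNT GROUP_COUNT -> prints the size of each group.
# Assumption: even split, with the remainder assigned to the first groups.
group_sizes() (
  vips=$1
  ngroups=$2
  base=$((vips / ngroups))
  extra=$((vips % ngroups))
  sizes=""
  g=1
  while [ "$g" -le "$ngroups" ]; do
    size=$base
    # The first "extra" groups each take one additional VIP.
    if [ "$g" -le "$extra" ]; then
      size=$((size + 1))
    fi
    sizes="$sizes${sizes:+ }$size"
    g=$((g + 1))
  done
  echo "$sizes"
)
```

    Running group_sizes 7 3 gives one group of three VIPs and two groups of two, matching the distribution described for OPENSHIFT_HA_VIP_GROUPS set to 3.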

    High availability for ingressIP

    In non-cloud clusters, IP failover and ingressIP to a service can be combined. The result is high availability services for users that create services using ingressIP.

    The approach is to specify an ingressIPNetworkCIDR range and then use the same range in creating the ipfailover configuration.

    Because IP failover can support up to a maximum of 255 VIPs for the entire cluster, the ingressIPNetworkCIDR needs to be /24 or smaller.
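
    The size constraint can be checked from the prefix length alone. The helpers below are hypothetical and only do the arithmetic: a /24 spans 256 addresses, including the network and broadcast addresses, and "/24 or smaller" is read here as a prefix length of 24 or more.

```shell
#!/bin/sh
# cidr_address_count PREFIX_LENGTH -> number of addresses in an IPv4 block.
cidr_address_count() {
  echo $((1 << (32 - $1)))
}

# fits_ipfailover PREFIX_LENGTH -> exit 0 if the block is /24 or smaller.
# Assumption: "smaller" refers to the number of addresses, not the prefix.
fits_ipfailover() {
  [ "$1" -ge 24 ] && [ "$1" -le 32 ]
}
```

    For instance, a /28 range holds 16 addresses and passes the check, while a /16 range does not.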

    Removing IP failover

    When IP failover is initially configured, the worker nodes in the cluster are modified with an iptables rule that explicitly allows multicast packets on 224.0.0.18 for Keepalived. Because of the change to the nodes, removing IP failover requires running a job to remove the iptables rule and removing the virtual IP addresses used by Keepalived.

    Procedure

    1. Optional: Identify and delete any check and notify scripts that are stored as config maps:

      1. Identify whether any pods for IP failover use a config map as a volume:

        $ oc get pod -l ipfailover \
          -o jsonpath="\
        {range .items[?(@.spec.volumes[*].configMap)]}
        {'Namespace: '}{.metadata.namespace}
        {'Pod: '}{.metadata.name}
        {'Volumes that use config maps:'}
        {range .spec.volumes[?(@.configMap)]} {'volume: '}{.name}
        {'configMap: '}{.configMap.name}{'\n'}{end}
        {end}"

        Example output

        Namespace: default
        Pod: keepalived-worker-59df45db9c-2x9mn
        Volumes that use config maps:
        volume: config-volume
        configMap: mycustomcheck

      2. If the preceding step provided the names of config maps that are used as volumes, delete the config maps:

        $ oc delete configmap <configmap_name>

    2. Identify an existing deployment for IP failover:

      $ oc get deployment -l ipfailover

      Example output

    3. Delete the deployment:

      $ oc delete deployment <ipfailover_deployment_name>

    4. Remove the ipfailover service account:

      $ oc delete sa ipfailover

    5. Run a job that removes the iptables rule that was added when IP failover was initially configured:

      1. Create a file such as remove-ipfailover-job.yaml with contents that are similar to the following example:

        apiVersion: batch/v1
        kind: Job
        metadata:
          generateName: remove-ipfailover-
          labels:
            app: remove-ipfailover
        spec:
          template:
            metadata:
              name: remove-ipfailover
            spec:
              containers:
              - name: remove-ipfailover
                image: quay.io/openshift/origin-keepalived-ipfailover:4.13
                command: ["/var/lib/ipfailover/keepalived/remove-failover.sh"]
              nodeSelector:
                kubernetes.io/hostname: <host_name>
              restartPolicy: Never

      2. Run the job:

        $ oc create -f remove-ipfailover-job.yaml

        Example output

        job.batch/remove-ipfailover-2h8dm created
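
    Because the example job is pinned to one host name, it presumably has to be created once for each node that was configured for IP failover. The following sketch generates the manifest for a given host so that the step can be repeated; render_remove_job is a hypothetical helper, and the manifest body simply mirrors the example file.

```shell
#!/bin/sh
# render_remove_job HOST -> prints a Job manifest pinned to HOST.
# Hypothetical helper; the manifest mirrors the example job file.
render_remove_job() {
  cat <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  generateName: remove-ipfailover-
  labels:
    app: remove-ipfailover
spec:
  template:
    metadata:
      name: remove-ipfailover
    spec:
      containers:
      - name: remove-ipfailover
        image: quay.io/openshift/origin-keepalived-ipfailover:4.13
        command: ["/var/lib/ipfailover/keepalived/remove-failover.sh"]
      nodeSelector:
        kubernetes.io/hostname: $1
      restartPolicy: Never
EOF
}
```

    Each rendered manifest could then be piped to oc create -f - for the corresponding node, for example render_remove_job <host_name> | oc create -f -.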

    Verification

    • Confirm that the job removed the initial configuration for IP failover.

      $ oc logs job/remove-ipfailover-2h8dm

      Example output

      remove-failover.sh: OpenShift IP Failover service terminating.
        - Removing ip_vs module ...
        - Releasing VIPs (interface eth0) ...