SCM High Availability

    To avoid any single point of failure the metadata-manager nodes also should have a HA setup.

    Both Ozone Manager and Storage Container Manager supports HA. In this mode the internal state is replicated via RAFT (with Apache Ratis)

    This document explains the HA setup of Storage Container Manager (SCM), please check this page for HA setup of Ozone Manager (OM). While they can be setup for HA independently, a reliable, full HA setup requires enabling HA for both services.

    HA mode of Storage Container Manager can be enabled with the following settings in :

    One Ozone configuration (ozone-site.xml) can support multiple SCM HA node set, multiple Ozone clusters. To select between the available SCM nodes a logical name is required for each of the clusters which can be resolved to the IP addresses (and domain names) of the Storage Container Managers.

    This logical name is called serviceId and can be configured in the ozone-site.xml

    Most of the time you need to set only the values of your current cluster:

    1. <property>
    2. <name>ozone.scm.service.ids</name>
    3. <value>cluster1</value>
    4. </property>

    For each of the defined serviceId a logical configuration name should be defined for each of the servers

    1. <property>
    2. <name>ozone.scm.nodes.cluster1</name>
    3. <value>scm1,scm2,scm3</value>
    4. </property>

    The defined prefixes can be used to define the address of each of the SCM services:

    1. <name>ozone.scm.address.cluster1.scm1</name>
    2. <value>host1</value>
    3. </property>
    4. <property>
    5. <name>ozone.scm.address.cluster1.scm2</name>
    6. <value>host2</value>
    7. <property>
    8. <name>ozone.scm.address.cluster1.scm3</name>
    9. <value>host3</value>
    10. </property>

    For reliable HA support choose 3 independent nodes to form a quorum.

    Bootstrap

    The initialization of the first SCM-HA node is the same as a none-HA SCM:

    1. bin/ozone scm --bootstrap

    In some environment – such as containerized / K8s environment – we need to have a common, unified way to initialize SCM HA quorum. As a remained, the standard initialization flow is the following:

    1. On the first, “primordial” node, call scm --init
    2. On second/third nodes call scm --bootstrap

    This can be changed with using ozone.scm.primordial.node.id. You can define the primordial node. After setting this node, you should execute both scm --init and scm --bootstrap on all nodes.

    Based on the ozone.scm.primordial.node.id, the init process will be ignored on the second/third nodes and bootstrap process will be ignored on all nodes except the primordial one.

    SCM HA Security

    In a secure SCM HA cluster on the SCM where we perform init, we call this SCM as a primordial SCM. Primordial SCM starts root-CA with self-signed certificates and is used to issue a signed certificate to itself and other bootstrapped SCM’s. Only primordial SCM can issue signed certificates for other SCM’s. So, primordial SCM has a special role in the SCM HA cluster, as it is the only one that can issue certificates to SCM’s.

    The primordial SCM takes a root-CA role, which signs all SCM instances with a sub-CA certificate. The sub-CA certificates are used by SCM to sign certificates for OM/Datanodes.

    When bootstrapping a SCM, it gets a signed certificate from the primary SCM and starts sub-CA.

    Sub-CA on the SCM’s are used to issue signed certificates for OM/DN in the cluster. Only the leader SCM issues a certificate to OM/DN.

    1. <property>
    2. <config>ozone.security.enable</config>
    3. <value>true</value>
    4. </property>
    5. <config>hdds.grpc.tls.enabled</config>
    6. </property>

    Above configs are needed in addition to normal SCM HA configuration.

    Primordial SCM is determined from the config ozone.scm.primordial.node.id. The value for this can be node id or hostname of the SCM. If the config is not defined, the node where init is run is considered as the primordial SCM.

    1. bin/ozone scm --init

    This will set up a public,private key pair and self-signed certificate for root CA and also generate public, private key pair and CSR to get a signed certificate for sub-CA from root CA.

    Note: Make sure to run –init only on one of the SCM host if primordial SCM is not defined. Bring up other SCM’s using –bootstrap.

    1. When primordial SCM is down, new SCM’s cannot be bootstrapped and join the quorum.
    2. Secure cluster upgrade to ratis-enable secure cluster is not supported.

    SCM HA uses Apache Ratis to replicate state between the members of the SCM HA quorum. Each node maintains the block management metadata in local RocksDB.

    This replication process is a simpler version of OM HA replication process as it doesn’t use any double buffer (as the overall db thourghput of SCM requests are lower)

    Datanodes are sending all the reports (Container reports, Pipeline reports…) to all the Datanodes parallel. Only the leader node can assign/create new containers, and only the leader node sends command back to the Datanodes.

    Verify SCM HA setup

    After starting an SCM-HA it can be validated if the SCM nodes are forming one single quorum instead of 3 individual SCM nodes.

    First, check if all the SCM nodes store the same ClusterId metadata:

    1. cat /data/metadata/scm/current/VERSION

    ClusterId is included in the VERSION file and should be the same in all the SCM nodes:

    1. #Tue Mar 16 10:19:33 UTC 2021
    2. cTime=1615889973116
    3. clusterID=CID-130fb246-1717-4313-9b62-9ddfe1bcb2e7
    4. nodeType=SCM
    5. scmUuid=e6877ce5-56cd-4f0b-ad60-4c8ef9000882
    6. layoutVersion=0

    You can also create data and double check with ozone debug tool if all the container metadata is replicated.

    1. bin/ozone freon randomkeys --numOfVolumes=1 --numOfBuckets=1 --numOfKeys=10000 --keySize=524288 --replicationType=RATIS --numOfThreads=8 --factor=THREE --bufferSize=1048576
    2. // use debug ldb to check scm db on all the machines
    3. bin/ozone debug ldb --db=/tmp/metadata/scm.db/ ls
    4. bin/ozone debug ldb --db=/tmp/metadata/scm.db/ scan --with-keys --column_family=containers

    SCM HA can be turned on on any Ozone cluster. First enable Ratis (ozone.scm.ratis.enable) and configure only one node for the Ratis ring ( should have one element).

    Start the cluster and test if it works well.

    If everything is fine, you can extend the cluster configuration with multiple nodes, restart SCM node, and initialize the additional nodes with scm --bootstrap command.