High Availability
To avoid any single point of failure the leader nodes also should have a HA setup.
- HA of Ozone Manager is implemented with the help of RAFT (Apache Ratis)
- HA of Storage Container Manager is under implementation
A single Ozone Manager uses to persiste metadata (volumes, buckets, keys) locally. HA version of Ozone Manager does exactly the same but all the data is replicated with the help of the RAFT consensus algorithm to follower Ozone Manager instances.
Client connects to the Leader Ozone Manager which process the request and schedule the replication with RAFT. When the request is replicated to all the followers the leader can return with the response.
One Ozone configuration (ozone-site.xml
) can support multiple Ozone HA cluster. To select between the available HA clusters a logical name is required for each of the clusters which can be resolved to the IP addresses (and domain names) of the Ozone Managers.
This logical name is called serviceId
and can be configured in the ozone-site.xml
<property>
<value>cluster1,cluster2</value>
</property>
For each of the defined a logical configuration name should be defined for each of the servers.
The defined prefixes can be used to define the address of each of the OM services:
<property>
<name>ozone.om.address.cluster1.om1</name>
<value>host1</value>
</property>
<property>
<value>host2</value>
<property>
<name>ozone.om.address.cluster1.om3</name>
<value>host3</value>
</property>
For example with o3fs://
Or with ofs://
:
Raft can guarantee the replication of any request if the request is persisted to the RAFT log on the majority of the nodes. To achieve high throughput with Ozone Manager, it returns with the response even if the request is persisted only to the RAFT logs.
RocksDB instance are updated by a background thread with batching transactions (so called “double buffer” as when one of the buffers is used to commit the data the other one collects all the new requests for the next commit.) To make all data available for the next request even if the background process is not yet wrote them the key data is cached in the memory.
The details of this approach discussed in a separated design doc but it’s integral part of the OM HA design.
- Ozone distribution contains an example OM HA configuration, under the
compose/ozone-om-ha
directory which can be tested with the help of .