Metadata Design Document

FE: Frontend, the front-end node of Doris. Mainly responsible for receiving and returning client requests, metadata, cluster management, query plan generation and so on.
BE: Backend, the back-end node of Doris. Mainly responsible for data storage and management, query plan execution and other work.
bdbje: Oracle Berkeley DB Java Edition (opens new window). In Doris, we use bdbje to persist metadata operation logs and high availability of FE.

As shown above, Doris’s overall architecture is divided into two layers. Multiple FEs form the first tier, providing lateral expansion and high availability of FE. Multiple BEs form the second layer, which is responsible for data storage and management. This paper mainly introduces the design and implementation of metadata in FE layer.

There are two different kinds of FE nodes: follower and observer. Leader election and data synchronization are taken among FE nodes by bdbje ().
The observer node only synchronizes metadata from the leader node and does not participate in the election. It can be scaled horizontally to provide the extensibility of metadata reading services.

Doris’s metadata is in full memory. A complete metadata image is maintained in each FE memory. Within Baidu, a cluster of 2,500 tables and 1 million fragments (3 million copies) occupies only about 2GB of metadata in memory. (Of course, the memory overhead for querying intermediate objects and various job information needs to be estimated according to the actual situation. However, it still maintains a low memory overhead.

At the same time, metadata is stored in the memory as a whole in a tree-like hierarchical structure. By adding auxiliary structure, metadata information at all levels can be accessed quickly.

The following figure shows the contents stored in Doris meta-information.

User data information. Including database, table Schema, fragmentation information, etc.
All kinds of job information. For example, import jobs, Clone jobs, SchemaChange jobs, etc.
User and permission information.
Cluster and node information.

The data flow of metadata is as follows:

Only leader FE can write metadata. After modifying leader’s memory, the write operation serializes into a log and writes to bdbje in the form of key-value. The key is a continuous integer, and as log id, value is the serialized operation log.
After the log is written to bdbje, bdbje copies the log to other non-leader FE nodes according to the policy (write most/write all). The non-leader FE node modifies its metadata memory image by playback of the log, and completes the synchronization with the metadata of the leader node.
When the number of log bars of the leader node reaches the threshold (default 10W bars), the checkpoint thread is started. Checkpoint reads existing image files and subsequent logs and replays a new mirror copy of metadata in memory. The copy is then written to disk to form a new image. The reason for this is to regenerate a mirror copy instead of writing an existing image to an image, mainly considering that the write operation will be blocked during writing the image plus read lock. So every checkpoint takes up twice as much memory space.
After the image file is generated, the leader node notifies other non-leader nodes that a new image has been generated. Non-leader actively pulls the latest image files through HTTP to replace the old local files.
The logs in bdbje will be deleted regularly after the image is completed.

The metadata directory is specified by the FE configuration item `meta_dir’.
Data storage directory for bdbje under directory.

Image.ckpt is the image file being written. If it is successfully written, it will be renamed image.[logid] and replaced with the original image file.
Thecluster_id is recorded in the file. Cluster_id uniquely identifies a Doris cluster. It is a 32-bit integer randomly generated at the first startup of leader. You can also specify a cluster ID through the Fe configuration item `cluster_id’.
The role of FE itself recorded in the ROLE file. There are only FOLLOWER and OBSERVER. Where FOLLOWER denotes FE as an optional node. (Note: Even the leader node has a role of FOLLOWER)

FE starts for the first time. If the startup script does not add any parameters, it will try to start as leader. You will eventually see in the FE startup log.
FE starts for the first time. If the -helper parameter is specified in the startup script and points to the correct leader FE node, the FE first asks the leader node about its role (ROLE) and cluster_id through http. Then pull up the latest image file. After reading image file and generating metadata image, start bdbje and start bdbje log synchronization. After synchronization is completed, the log after image file in bdbje is replayed, and the final metadata image generation is completed.
FE is not the first startup. If the startup script does not add any parameters, it will determine its identity according to the ROLE information stored locally. At the same time, according to the cluster information stored in the local bdbje, the leader information is obtained. Then read the local image file and the log in bdbje to complete the metadata image generation. (If the roles recorded in the local ROLE are inconsistent with those recorded in bdbje, an error will be reported.)
FE is not the first boot, and the -helper parameter is specified in the boot script. Just like the first process started, the leader role is asked first. But it will be compared with the ROLE stored by itself. If they are inconsistent, they will report errors.

Metadata Read-Write and Synchronization

Users can use Mysql to connect any FE node to read and write metadata. If the connection is a non-leader node, the node forwards the write operation to the leader node. When the leader is successfully written, it returns a current and up-to-date log ID of the leader. Later, the non-leader node waits for the log ID it replays to be larger than the log ID it returns to the client before returning the message that the command succeeds. This approach guarantees Read-Your-Write semantics for any FE node.
The metadata of each FE only guarantees the final consistency. Normally, inconsistent window periods are only milliseconds. We guarantee the monotonous consistency of metadata access in the same session. But if the same client connects different FEs, metadata regression may occur. (But for batch update systems, this problem has little impact.)

When the leader node goes down, the rest of the followers will immediately elect a new leader node to provide services.
Metadata cannot be written when most follower nodes are down. When metadata is not writable, if a write operation request occurs, the current process is that the FE process exits. This logic will be optimized in the future, and read services will still be provided in the non-writable state.
The downtime of observer node will not affect the state of any other node. It also does not affect metadata reading and writing at other nodes.