Disk subsystem of a cluster aka YDB BlobStorage

Lets you store blobs (binary fragments from 1 byte to 10 megabytes in size) with a unique identifier.

Each blob has a 192-bit ID consisting of the following fields (in the order used for sorting):

TabletId (64 bits): ID of the blob owner tablet.
Generation (32 bits): Generation in which the tablet that captured this blob was run.
Step (32 bits): Blob group internal ID within Generation.
Cookie (24 bits): ID to use if Step is insufficient.
CrcMode (2 bits): Selects a mode for redundant blob integrity verification at the BlobStorage level.
PartId (4 bits): Fragment number when using blob erasure coding. At the “BlobStorage <-> tablet” communication level, this parameter is always 0 referring to the entire blob.

Two blobs are considered different if at least one of the first five parameters (TabletId, Channel, Generation, Step, or Cookie) differs in their IDs. So it is impossible to write two blobs that only differ in BlobSize and/or CrcMode.

For debugging purposes, there is string blob ID formatting that has interactions , for example, .

When writing a blob, the tablet selects the Channel, Step, and Cookie parameters. TabletId is fixed and must point to the tablet performing the write operation, while Generation must indicate the generation that the tablet performing the operation is running in.

When performing reads, the blob ID is specified, which can be arbitrary, but preferably preset.

Blobs are written in a logical entity called group. A special actor called DS proxy is created on every node for each group that is written to. This actor is responsible for performing all operations related to the group. The actor is created automatically through the NodeWarden service that will be described below.

This scheme is shown in the figure below.

VDisks from different groups are shown as multicolored squares; one color stands for one group.

A group can be treated as a set of VDisks:

Each VDisk within a group has a sequence number, and disks are numbered 0 to N-1, where N is the number of disks in the group.

In addition, disks are combined into fail domains, and fail domains are combined into fail realms. As a rule, each fail domain comprises exactly one disk (although, in theory, it may have more but this has found no practical application), and multiple fail realms are only used for groups that host their data in three datacenters at once. Thus, in addition to a group sequence number, each VDisk is assigned an ID that consists of a fail realm index, the index that a fail domain has in a fail realm, and the index that a VDisk has in the fail domain. In string form, this ID is written as .

Each PDisk has an ID that consists of the number of the node that it is running on and the internal number of the PDisk inside this node. This ID is usually written as NodeId:PDiskId. For example, 1:1000. If you know the PDisk ID, you can calculate the service ActorId of this disk and send it a message.

Each VDisk runs on top a specific PDisk and has a slot ID comprising three fields (NodeID:PDiskId:VSlotId) as well as the above-mentioned VDisk ID. Strictly speaking, there are different concepts: a slot is a reserved location on a PDISK occupied by a VDisk while a VDisk is an element of a group that occupies a certain slot and performs operations with the slot. Similarly to PDisks, if you know the slot ID, you can calculate the service ActorId of the running VDisk and send it a message. To send messages from the DS proxy to the VDisk, an intermediate actor called BS_QUEUE is used.

The composition of each group is not constant. It may change while the system is running. Hence the concept of a group generation. Each “GroupId:GroupGeneration” pair corresponds to a fixed set of slots (a vector that consists of N slot IDs, where N is equal to group size) that stores the data of an entire group. Group generation is not to be confused with tablet generation since they are not in any way related.

As a rule, groups of two adjacent generations differ by no more than one slot.

A special concept of a subgroup is introduced for each blob. It is an ordered subset of group disks with a strictly constant number of elements that will store the blob’s data and that depends on the encoding type (the number of elements in a group must be the same or greater). For single-datacenter groups with conventional encoding, this subset is selected as the first N elements of a cyclic disk permutation in the group, where the permutation depends on the BlobId hash.

Each disk in the subgroup corresponds to a disk in the group, but is limited by the allowed number of stored blobs. For example, for block-4-2 encoding with four data parts and two parity parts, the functional purpose of the disks in a subgroup is as follows:

In this case, PartID=1..4 corresponds to the data parts (which are obtained by splitting the original blob into 4 equal parts), and PartID=5..6 are parity fragments. Disks numbered 6 and 7 in the subgroup are called handoff disks. Any part, either one or more, can be written to them. Disks 0..5 can only store the corresponding blob parts.