RocksDB has a built-in mechanism to overcome these limitations of the POSIX file system: it keeps a transactional log of RocksDB state changes, called the MANIFEST. The MANIFEST is used to restore RocksDB to the latest known consistent state on restart.

  • MANIFEST refers to the system that keeps track of RocksDB state changes in a transactional log
  • Manifest log refers to an individual log file that contains RocksDB state snapshot/edits
  • CURRENT refers to the latest manifest log

MANIFEST is a transactional log of RocksDB state changes. It consists of manifest log files and a pointer to the latest manifest file. Manifest logs are rolling log files named MANIFEST-(seq number), where the sequence number always increases. CURRENT is a special file that identifies the latest manifest log file.

On system (re)start, the latest manifest log contains the consistent state of RocksDB. Any subsequent change to RocksDB state is logged to the manifest log file. When a manifest log file exceeds a certain size, a new manifest log file is created with a snapshot of the RocksDB state. The latest manifest file pointer is then updated and the file system is synced. After the CURRENT file is updated successfully, the redundant manifest logs are purged.
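The role of the CURRENT pointer can be illustrated with a small sketch. It assumes that CURRENT is a text file holding the manifest log file name followed by a newline, and it updates the pointer by writing a temporary file and renaming it over CURRENT. The helper names and the `CURRENT.tmp` file name are hypothetical; this is not RocksDB's own code.

```python
import os
import tempfile

def latest_manifest_path(db_dir):
    """Return the path of the manifest log file that CURRENT points to."""
    with open(os.path.join(db_dir, "CURRENT"), "r", encoding="ascii") as f:
        name = f.read().strip()
    if not name.startswith("MANIFEST-"):
        raise ValueError(f"unexpected CURRENT content: {name!r}")
    return os.path.join(db_dir, name)

def set_current(db_dir, manifest_name):
    """Point CURRENT at a new manifest log (hypothetical helper)."""
    tmp = os.path.join(db_dir, "CURRENT.tmp")  # assumed temporary name
    with open(tmp, "w", encoding="ascii") as f:
        f.write(manifest_name + "\n")
        f.flush()
        os.fsync(f.fileno())  # make the new pointer durable before the switch
    # The rename replaces CURRENT in one step, so a reader sees either
    # the old pointer or the new one, never a half-written file.
    os.replace(tmp, os.path.join(db_dir, "CURRENT"))

def demo():
    with tempfile.TemporaryDirectory() as db_dir:
        set_current(db_dir, "MANIFEST-000005")
        first = os.path.basename(latest_manifest_path(db_dir))
        set_current(db_dir, "MANIFEST-000009")  # roll over to a newer log
        second = os.path.basename(latest_manifest_path(db_dir))
        return first, second
```

Only after the pointer switch succeeds would older manifest logs become safe to delete, which matches the purge order described above.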

The state of RocksDB at any given time is referred to as a version (aka snapshot). Any modification to a version is considered a version edit. A version (or RocksDB state snapshot) is constructed by applying a sequence of version edits in order. Essentially, a manifest log file is a sequence of version edits.

  version           = { version-edit* }
  manifest-log-file = { version, version-edit* }
                    = { version-edit* }
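As a toy model of the relation above, a version can be treated as a set of live files and a version edit as a pair of "added" and "deleted" file lists. The dictionary layout and the `(level, file number)` tuples are assumptions made for illustration only; real version edits carry more state.

```python
def apply_edit(version, edit):
    """Return the version obtained by applying one version edit."""
    files = set(version)
    files -= set(edit.get("deleted", ()))
    files |= set(edit.get("added", ()))
    return frozenset(files)

def replay(edits):
    """Rebuild a version by applying every edit of a manifest log in order."""
    version = frozenset()
    for edit in edits:
        version = apply_edit(version, edit)
    return version

# Each file is modeled as (level, file number).
manifest_log = [
    {"added": [(0, 7), (0, 8)]},                       # snapshot at the head of the log
    {"added": [(1, 9)], "deleted": [(0, 7), (0, 8)]},  # a compaction
    {"added": [(0, 10)]},                              # a flush
]
current_version = replay(manifest_log)
```

The snapshot written at the head of a new manifest log is itself expressed as edits, which is why the whole file reduces to a sequence of version edits.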

Manifest log is a sequence of version edit records. The version edit record type is identified by the edit identification number.

We use the following datatypes for encoding/decoding.

Simple data types

  1. VarX - Variable-length encoding of intX
  2. FixedX - Fixed-length encoding of intX

Complex data types

  1. String - Length-prefixed string data

     +-----------+--------------------+
     | size (n)  | content of string  |
     +-----------+--------------------+
     |<- Var32 ->|<------- n -------->|
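The following sketch shows one way to implement these data types. It assumes the LevelDB-style layout: VarX stores 7 bits per byte, least significant group first, with the high bit marking continuation, and FixedX is little-endian. The function names are illustrative, not RocksDB's API.

```python
import struct

def put_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("varints encode non-negative integers only")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        value >>= 7
    out.append(value)
    return bytes(out)

def get_varint(buf, pos=0):
    """Decode a varint at pos; return (value, position after it)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]  # raises IndexError on truncated input
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7

def put_fixed32(value):
    """Encode a 32-bit unsigned integer in 4 little-endian bytes."""
    return struct.pack("<I", value)

def put_string(data):
    """Encode bytes as a Var32 length followed by the content."""
    return put_varint(len(data)) + data

def get_string(buf, pos=0):
    """Decode a length-prefixed string; return (content, position after it)."""
    size, pos = get_varint(buf, pos)
    content = buf[pos:pos + size]
    if len(content) != size:
        raise ValueError("truncated string")
    return content, pos + size
```

Small values take a single byte as a varint, which is why record IDs and levels are stored as Var32 rather than Fixed32.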

Version edit records have the following format. The decoder identifies the record type using the record identification number.

  +-------------+------ ......... ----------+
  | Record ID   | Variable size record data |
  +-------------+------ .......... ---------+
  <-- Var32 --->|<-- varies by type       -->

Comparator edit record:

  Captures the comparator name

  +-------------+----------------+
  | kComparator | data           |
  +-------------+----------------+
  <-- Var32 --->|<-- String   -->|
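A comparator record can be encoded and decoded as sketched below. The numeric tag value `K_COMPARATOR = 1` is an assumption that should be checked against the RocksDB source, and the helper names are hypothetical.

```python
K_COMPARATOR = 1  # assumed numeric value of the kComparator record ID

def put_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)

def get_varint(buf, pos=0):
    """Decode a varint at pos; return (value, position after it)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7

def encode_comparator_record(name):
    """Record ID (Var32) followed by the name as a length-prefixed string."""
    return put_varint(K_COMPARATOR) + put_varint(len(name)) + name

def decode_comparator_record(buf, pos=0):
    """Return (comparator name, position after the record)."""
    tag, pos = get_varint(buf, pos)
    if tag != K_COMPARATOR:
        raise ValueError(f"not a comparator record: tag {tag}")
    size, pos = get_varint(buf, pos)
    name = buf[pos:pos + size]
    if len(name) != size:
        raise ValueError("truncated comparator name")
    return name, pos + size

record = encode_comparator_record(b"leveldb.BytewiseComparator")
name, end = decode_comparator_record(record)
```

Recording the comparator name lets a reopening process detect that it was given a different comparator than the one the database was created with.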

Log number edit record:

  Latest WAL log file number

  +-------------+----------------+
  | kLogNumber  | log number     |
  +-------------+----------------+
  <-- Var32 --->|<-- Var64    -->|

Previous File Number edit record:

Next File Number edit record:

  Next manifest file number

  +------------------+----------------+
  | kNextFileNumber  | log number     |
  +------------------+----------------+
  <---- Var32 ------>|<-- Var64    -->|

Last Sequence Number edit record:

  Last sequence number of RocksDB

  +------------------+----------------+
  | kLastSequence    | log number     |
  +------------------+----------------+
  <---- Var32 ------>|<-- Var64    -->|

Max Column Family edit record:

  Adjust the maximum number of column families allowed.

  +---------------------+----------------+
  | kMaxColumnFamily    | log number     |
  +---------------------+----------------+
  <------ Var32 ------->|<-- Var32    -->|

Deleted File edit record:

  Mark a file as deleted from the database.

  +-----------------+-------------+--------------+
  | kDeletedFile    | level       | file number  |
  +-----------------+-------------+--------------+
  <---- Var32 ----->|<-- Var32 -->|<-- Var64  -->|
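Because every record starts with its ID, a decoder can walk a version edit with a simple dispatch loop. The sketch below handles only the numeric records described so far. The tag values are assumptions to verify against the RocksDB source, and the returned dictionary layout is hypothetical.

```python
# Assumed numeric record IDs; verify against the RocksDB source before relying on them.
K_LOG_NUMBER = 2
K_NEXT_FILE_NUMBER = 3
K_LAST_SEQUENCE = 4
K_DELETED_FILE = 6

def get_varint(buf, pos=0):
    """Decode a varint at pos; return (value, position after it)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]  # raises IndexError on truncated input
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7

def decode_edit(buf):
    """Decode a buffer of back-to-back records into a dict."""
    edit = {"deleted_files": []}
    pos = 0
    while pos < len(buf):
        tag, pos = get_varint(buf, pos)
        if tag == K_LOG_NUMBER:
            edit["log_number"], pos = get_varint(buf, pos)
        elif tag == K_NEXT_FILE_NUMBER:
            edit["next_file_number"], pos = get_varint(buf, pos)
        elif tag == K_LAST_SEQUENCE:
            edit["last_sequence"], pos = get_varint(buf, pos)
        elif tag == K_DELETED_FILE:
            level, pos = get_varint(buf, pos)    # Var32
            number, pos = get_varint(buf, pos)   # Var64
            edit["deleted_files"].append((level, number))
        else:
            raise ValueError(f"unknown record ID {tag}")
    return edit

# log number 12, delete file 9 at level 1, last sequence 300
sample = bytes([2, 12, 6, 1, 9, 4, 0xAC, 0x02])
decoded = decode_edit(sample)
```

Python integers are unbounded, so the same `get_varint` serves both Var32 and Var64 fields here; a production decoder would also enforce the width limits.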

New File edit record:

Mark a file as newly added to the database and provide RocksDB meta information.

  • File edit record with compaction information
  +--------------+-------------+--------------+------------+----------------+--------------+----------------+----------------+
  | kNewFile4    | level       | file number  | file size  | smallest_key   | largest_key  | smallest_seqno | largest_seq_no |
  +--------------+-------------+--------------+------------+----------------+--------------+----------------+----------------+
  |<-- var32  -->|<-- var32 -->|<-- var64  -->|<- var64  ->|<--  String  -->|<-- String -->|<--  var64   -->|<--  var64   -->|
  +--------------+------------------+---------+------+----------------+--------------------+---------+------------+
  | CustomTag1   | Field 1 size n1  | field1  | ...  | CustomTag(m)   | Field m size n(m)  | field(m)| kTerminate |
  +--------------+------------------+---------+------+----------------+--------------------+---------+------------+

Several optional customized fields are supported:

  • kNeedCompaction: Whether the file should be compacted to the next level.
  • kMinLogNumberToKeepHack: WAL file number that is still needed for recovery after this entry.
  • kPathId: The path ID in which the file lives. This field can't be ignored by an old release.
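The customized fields can be read with a loop that stops at kTerminate. In this sketch the numeric tag values and the "not safely ignorable" bit (`1 << 6`) are assumptions to verify against the RocksDB source, and the field payloads are returned as raw bytes without interpretation.

```python
# Assumed numeric values; verify against the RocksDB source.
K_TERMINATE = 1
K_NEED_COMPACTION = 2
K_MIN_LOG_NUMBER_TO_KEEP_HACK = 3
K_PATH_ID = 65
NON_SAFE_IGNORE_MASK = 1 << 6  # assumed: tags with this bit must be understood

KNOWN_FIELDS = {
    K_NEED_COMPACTION: "need_compaction",
    K_MIN_LOG_NUMBER_TO_KEEP_HACK: "min_log_number_to_keep",
    K_PATH_ID: "path_id",
}

def get_varint(buf, pos=0):
    """Decode a varint at pos; return (value, position after it)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7

def parse_custom_fields(buf, pos=0):
    """Read (tag, size, field) triples until kTerminate."""
    fields = {}
    while True:
        tag, pos = get_varint(buf, pos)
        if tag == K_TERMINATE:
            return fields, pos
        size, pos = get_varint(buf, pos)
        value = buf[pos:pos + size]
        if len(value) != size:
            raise ValueError("truncated custom field")
        pos += size
        if tag in KNOWN_FIELDS:
            fields[KNOWN_FIELDS[tag]] = value
        elif tag & NON_SAFE_IGNORE_MASK:
            raise ValueError(f"custom field {tag} cannot be ignored")
        # Otherwise: an unknown but ignorable field, skipped via its size.
```

Because every field carries its own size, a reader can skip fields it does not recognize, which is what allows new optional fields to be added without breaking older releases.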

  • File edit record backward compatible
  +--------------+-------------+--------------+------------+----------------+--------------+----------------+----------------+
  | kNewFile2    | level       | file number  | file size  | smallest_key   | largest_key  | smallest_seqno | largest_seq_no |
  +--------------+-------------+--------------+------------+----------------+--------------+----------------+----------------+
  |<-- var32  -->|<-- var32 -->|<-- var64  -->|<- var64  ->|<--  String  -->|<-- String -->|<--  var64   -->|<--  var64   -->|
  • File edit record with path information

Column family status edit record:

  Note the status of column family feature (enabled/disabled)

  +------------------+----------------+
  | kColumnFamily    | 0/1            |
  +------------------+----------------+
  <---- Var32 ------>|<-- Var32    -->|

Column family add edit record:

  Add a column family

  +---------------------+----------------+
  | kColumnFamilyAdd    | cf name        |
  +---------------------+----------------+
  <------ Var32 ------->|<-- String   -->|

Column family drop edit record:

  Drop a column family

  +---------------------+
  | kColumnFamilyDrop   |
  +---------------------+
  <------ Var32 ------->|

Record as part of an atomic group (since RocksDB 5.16):

There are cases in which an ‘all-or-nothing’, multi-column-family version change is desirable. For example, atomic flush ensures that either all or none of the column families are flushed successfully, and external SST ingestion across multiple column families guarantees that either all or none of the column families ingest SSTs successfully. Since writing multiple version edits is not atomic, we need to take extra measures to achieve atomicity (not necessarily instantaneity from the user’s perspective). Therefore we introduce a new record field, kInAtomicGroup, to indicate that a record is part of a group of version edits that follow the ‘all-or-none’ property. The format is as follows.

  +-----------------+--------------------------------------------+
  | kInAtomicGroup  | #remaining version edits in the same group |
  +-----------------+--------------------------------------------+
  |<--- Var32 ----->|<----------------- Var32 ------------------>|

During recovery, RocksDB buffers version edits of an atomic group without applying them until the last version edit of the atomic group is decoded successfully from the MANIFEST file. Then RocksDB applies all the version edits in this atomic group. RocksDB never applies partial atomic groups.
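The buffering rule can be sketched as follows. Edits are modeled as dictionaries with a hypothetical `in_atomic_group` key holding the number of remaining edits in the group (absent for ordinary edits). This is a simplified model, not the RocksDB recovery code.

```python
def recover(edits):
    """Return the names of the edits that recovery would apply, in order."""
    applied = []
    group = []  # buffered edits of the atomic group being read
    for edit in edits:
        remaining = edit.get("in_atomic_group")
        if remaining is None:
            applied.append(edit["name"])  # ordinary edit: apply immediately
            continue
        group.append(edit["name"])
        if remaining == 0:
            applied.extend(group)  # last edit of the group decoded: apply all
            group = []
    # Anything still buffered belongs to an incomplete group and is dropped.
    return applied

log = [
    {"name": "flush-cf0", "in_atomic_group": 1},
    {"name": "flush-cf1", "in_atomic_group": 0},
    {"name": "compaction"},
    {"name": "ingest-cf0", "in_atomic_group": 2},
    {"name": "ingest-cf1", "in_atomic_group": 1},
    # The log ends here; the third edit of the group was never written.
]
recovered = recover(log)
```

In this model the first group is complete and applied as a unit, while the trailing group is missing its final edit and is discarded entirely.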

We reserve a special bit in the record type. If the bit is set, a reader that does not understand the record can safely ignore it. A safely ignorable record has a standard general format:

  +---------+----------------+----------------+
  | kTag    | field length n | fields ...     |
  +---------+----------------+----------------+
  <- Var32->|<--  var32   -->|<---   n    --->|
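A reader could skip such records as sketched below. The position of the reserved bit (`1 << 13`) is an assumption to verify against the RocksDB source, and the function name is hypothetical.

```python
SAFE_IGNORE_MASK = 1 << 13  # assumed position of the reserved bit

def get_varint(buf, pos=0):
    """Decode a varint at pos; return (value, position after it)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7

def skip_ignorable(buf, pos):
    """Return the position after a safely ignorable record at pos.

    Returns None when the record at pos does not have the reserved bit set,
    meaning the caller must understand it.
    """
    tag, after_tag = get_varint(buf, pos)
    if not tag & SAFE_IGNORE_MASK:
        return None
    length, body = get_varint(buf, after_tag)
    if body + length > len(buf):
        raise ValueError("truncated ignorable record")
    return body + length

# Tag (1 << 13) | 5 encodes as the varint bytes 0x85 0x40; it is followed by
# a 3-byte payload, then an ordinary record with tag 2.
sample = b"\x85\x40\x03xyz\x02\x0c"
next_pos = skip_ignorable(sample, 0)
```

The explicit length field is what makes skipping possible: the reader does not need to know the layout of the payload, only how many bytes to pass over.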

This was introduced in RocksDB 6.0, and no customized ignorable record has been created yet.

DB ID edit record: introduced in RocksDB 6.5. If options.write_dbid_to_manifest is true, RocksDB writes the DB ID edit record to the MANIFEST file in addition to storing the DB ID in the IDENTITY file.

  1. +-----------+------------+
  2. | kDbId | db id |