Implementation Details - How we keep track of live SST files - 《RocksDB v6.14 Document》

The list of files in an LSM tree is kept in a data structure called version. In the end of a compaction or a mem table flush, a new version is created for the updated LSM tree. At one time, there is only one “current” version that represents the files in the up-to-date LSM tree. New get requests or new iterators will use the current version through the whole read process or life cycle of iterator. All versions used by get or iterators need to be kept. An out-of-date version that is not used by any get or iterator needs to be dropped. All files not used by any other version need to be deleted. For example,

If we start with a version with three files:

and now an iterator is created with it:

files on disk: f1, f2, f3

Now a compaction happened compact f2, f3 and f4 into a new file f5 with a new v3 created:

v3={f1, f5} (current)
v1={f1, f2, f3} (used by iterator1)
files on disk: f1, f2, f3, f4, f5

Now v2 is neither up-to-date, nor used by anyone, so it qualifies to be removed, together with f4. While v1 still cannot be removed for it is still needed by iterator1:

Assuming now iterator1 is destroyed:

v1={f1, f2, f3}
files on disk: f1, f2, f3, f5

This logic is implemented using reference counts. Both of an SST file and a version have a reference count. While we create a , we incremented the reference counts for all files. If a version is not needed, all files’ of the version have their reference counts decremented. If a file’s reference count drops to 0, the file can be deleted.

In a similar way, each version has a reference count. When a version is created, it is an up-to-date one, so it has reference count 1. If the version is not up-to-date anymore, its reference count is decremented. Anyone who needs to work on the version has its reference count incremented by 1, and decremented by 1 when finishing using it. When a version’s reference count is 0, it should be removed. Either a version is up-to-date or someone is using it, its reference count is not 0, so it will be kept.

Sometimes a reader holds reference of a version directly, like the source version for a compaction. More often, a reader holds it indirectly through a data structure called super version, which holds reference counts for list of mem tables and a version — a whole view of the DB. A reader only needs to increase and decrease one reference count, while it is the super version that holds the reference count of version. It also enables further optimization to avoid locking for the reference counting in most of the time. See the .