As introduced by multiple authors and systems, there are two main types of LSM-tree compaction strategies:

  • leveled compaction, which is the default compaction style in RocksDB
  • an alternative compaction strategy, sometimes called “size tiered” [1] or “tiered” [2].

The key difference between the two strategies is that leveled compaction tends to aggressively merge a smaller sorted run into a larger one, while “tiered” waits for several sorted runs of similar size and merges them together.

It is generally regarded that the second strategy provides far better write amplification at the cost of worse read amplification [2][3]. An intuitive way to think about it: in tiered compaction, every time an update is compacted, it tends to be moved from a smaller sorted run to a much larger one. Each compaction is likely to bring the update exponentially closer to the final sorted run, which is the largest. In leveled compaction, however, an update is compacted more often as part of the larger sorted run that a smaller one is merged into than as part of the smaller sorted run. As a result, most of the times an update is compacted, it is not moved to a larger sorted run, so it makes little progress towards the final, largest run.

The benefit of “tiered” compaction is not without downsides. The worst-case number of sorted runs is far higher than in leveled compaction, which may cause higher I/O costs and/or higher CPU costs during reads. The lazy nature of the compaction scheduling also makes the compaction traffic much more spiky, and the number of sorted runs varies greatly over time, hence a large variation in performance.

Nevertheless, RocksDB provides a Universal compaction in the “tiered” family. Users may try this compaction style if leveled compaction is not able to handle the required write rate.

[1] The term is used by Cassandra. See their doc.

[2] N. Dayan, M. Athanassoulis, and S. Idreos, “,” in ACM SIGMOD International Conference on Management of Data, 2017.

[3] https://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html

Overview And the Basic Idea

When using this compaction style, all the SST files are organized as sorted runs covering the whole key ranges. One sorted run covers data generated during a time range. Different sorted runs never overlap on their time ranges. Compaction can only happen among two or more sorted runs of adjacent time ranges. The output is a single sorted run whose time range is the combination of input sorted runs. After any compaction, the condition that sorted runs never overlap on their time ranges still holds. A sorted run can be implemented as an L0 file, or a “level” in which data is stored as key range partitioned files.

The basic idea of the compaction style: given a threshold N on the number of sorted runs, compaction starts only when the number of sorted runs reaches N. When that happens, we try to pick files to compact so that the number of sorted runs is reduced in the most economical way: (1) picking starts from the smallest file; (2) one more sorted run is included if its size is no larger than the total size already picked for the compaction. The strategy assumes, and itself tries to maintain, that a sorted run containing more recent data is smaller than the ones containing older data.

In universal style compaction, a full compaction is sometimes needed. In this case, the output data size is similar to the input size. During the compaction, both the input files and the output files need to be kept, so the DB temporarily uses double the disk space. Be sure to keep enough free space for full compaction.

When using Universal Compaction with num_levels = 1, all data of the DB (or Column Family, to be precise) is sometimes compacted into one single SST file. There is a limit on the size of a single SST file. In RocksDB, a block cannot exceed 4GB (to allow the size to be a uint32). The index block can exceed this limit if the single SST file is too big. The size of the index block depends on your data. In one of our use cases, we observed the DB reaching this limit when it grew to about 250GB, using a 4K data block size. To mitigate this limitation you can use .

This problem is mitigated if users set num_levels to be much larger than 1. In that case, larger “files” will be put in larger “levels” with files divided into smaller files (more explanation below). L0->L0 compaction can still happen for parallel compactions, but most likely files in L0 are much smaller.

Data Layout and Placement

As mentioned above, data is organized as sorted runs. Sorted runs are laid out by the update time of the data in them and are stored as either files in L0 or a whole “level”.

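Here is an example of a typical file layout:

  1. Level 0: File0_0, File0_1, File0_2
  2. Level 1: (empty)
  3. Level 2: (empty)
  4. Level 3: (empty)
  5. Level 4: File4_0, File4_1, File4_2, File4_3
  6. Level 5: File5_0, File5_1, File5_2, File5_3, File5_4, File5_5, File5_6, File5_7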

Levels with a larger number contain older sorted runs than levels with smaller numbers. In this example, there are 5 sorted runs: the three files in level 0, level 4 and level 5. Level 5 is the oldest sorted run, level 4 is newer, and the level 0 files are the newest.

Compaction is always scheduled for sorted runs with consecutive time ranges and the outputs are always another sorted run. We always place compaction outputs to the highest possible level, following the rule of older data on levels with larger numbers.

Use the same example shown above. We have the following sorted runs:

  1. File0_0
  2. File0_1
  3. File0_2
  4. Level 4: File4_0, File4_1, File4_2, File4_3
  5. Level 5: File5_0, File5_1, File5_2, File5_3, File5_4, File5_5, File5_6, File5_7

If we compact all the data, the output sorted run will be placed in level 5, so the layout becomes:

  1. Level 5: File5_0', File5_1', File5_2', File5_3', File5_4', File5_5', File5_6', File5_7'

Starting from the original layout (before the full compaction above), let’s see where the output sorted runs are placed if we schedule different compactions:

If we compact File0_1, File0_2 and Level 4, the output sorted run will be placed in level 4.

  1. Level 0: File0_0
  2. Level 1: (empty)
  3. Level 2: (empty)
  4. Level 3: (empty)
  5. Level 4: File4_0', File4_1', File4_2', File4_3'
  6. Level 5: File5_0, File5_1, File5_2, File5_3, File5_4, File5_5, File5_6, File5_7

If we compact File0_0, File0_1 and File0_2, the output sorted run will be placed in level 3.

  1. Level 0: (empty)
  2. Level 1: (empty)
  3. Level 2: (empty)
  4. Level 3: File3_0, File3_1, File3_2
  5. Level 4: File4_0, File4_1, File4_2, File4_3
  6. Level 5: File5_0, File5_1, File5_2, File5_3, File5_4, File5_5, File5_6, File5_7

If we compact File0_0 and File0_1, the output sorted run will still be placed in level 0.

  1. Level 0: File0_0', File0_2
  2. Level 1: (empty)
  3. Level 2: (empty)
  4. Level 3: (empty)
  5. Level 4: File4_0, File4_1, File4_2, File4_3
  6. Level 5: File5_0, File5_1, File5_2, File5_3, File5_4, File5_5, File5_6, File5_7

If options.num_levels = 1, we still follow the same placement rule. This means all the files will be placed in level 0 and each file is a sorted run. The behavior is the same as the original universal compaction, so it can be used as a backward compatible mode.
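The placement rule illustrated by the examples above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not RocksDB's implementation: the function name `output_level` and the list-of-levels representation are hypothetical, and the rule is read as "put the output just above the next older sorted run, or in the bottommost level if no older run remains".

```python
from typing import List


def output_level(run_levels: List[int], last_input: int, num_levels: int) -> int:
    """Return the level where a compaction output would be placed.

    run_levels lists the level of every sorted run from newest to oldest.
    Each L0 file is its own run (level 0); each non-empty level >= 1 is one run.
    last_input is the index of the oldest sorted run included in the compaction.
    """
    if last_input + 1 == len(run_levels):
        # No older sorted run remains, so the output can use the bottommost level.
        return num_levels - 1
    next_older = run_levels[last_input + 1]
    if next_older == 0:
        # An older L0 file remains, so the output has to stay in level 0.
        return 0
    # Otherwise place the output just above the next older sorted run.
    return next_older - 1


# Layout from the example: File0_0, File0_1, File0_2, Level 4, Level 5.
layout = [0, 0, 0, 4, 5]
print(output_level(layout, 4, 6))  # compact everything
print(output_level(layout, 3, 6))  # compact File0_1, File0_2 and Level 4
print(output_level(layout, 2, 6))  # compact all three L0 files
print(output_level(layout, 1, 6))  # compact File0_0 and File0_1
```

Under these assumptions the four calls give levels 5, 4, 3 and 0, matching the layouts shown above. With num_levels = 1 every run is an L0 file and the result is always 0.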

Assume we have the following sorted runs, where R1 contains the most recent data and Rn contains the oldest:

  R1, R2, R3, ..., Rn

Here is how compactions are picked:

Precondition: n >= options.level0_file_num_compaction_trigger

Unless number of sorted runs reaches this threshold, no compaction will be triggered at all.

(Note that although the option name uses the word “file”, the trigger applies to “sorted runs” for historical reasons. For the names of all options mentioned below, “file” also means sorted run for the same reason.)

If the precondition is satisfied, each of the following conditions can trigger a compaction:

1. Compaction Triggered by Space Amplification

If the estimated size amplification ratio is larger than options.compaction_options_universal.max_size_amplification_percent / 100, all files will be compacted to one single sorted run.

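Here is how the size amplification ratio is estimated:

  size amplification ratio = (size(R1) + size(R2) + ... + size(Rn-1)) / size(Rn)

Please note that the size of Rn is not included in the numerator. Expressed as a percentage, 0 is the perfect size amplification, 100 means the DB size is double the size of live data, and 200 means triple.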

The reason we estimate size amplification in this way: in a DB of stable size, the incoming rate of deletions should be similar to the rate of insertions, which means every sorted run except Rn should include a similar number of insertions and deletions. After applying R1, R2 … Rn-1 to Rn, their size effects cancel each other out, so the output should also have the size of Rn. That is the size of live data, which is used as the base of size amplification.

For example, options.compaction_options_universal.max_size_amplification_percent = 25 means we keep the total space of the DB below 125% of the total size of live data. Let’s use this value in an example. Assume compaction is only triggered by space amplification, options.level0_file_num_compaction_trigger = 1, the file size after each memtable flush is always 1, and the compacted size always equals the total input size. After two flushes we have two files of size 1, and since 1/1 > 25% we need to do a full compaction:

  1 1 => 2

After another memtable flush we have

  1 2 => 3

which again triggers a full compaction because 1/2 > 25%. And in the same way:

  1 3 => 4

But after another flush, the compaction is not triggered:

  1 4

because 1/4 <= 25%. Another memtable flush will trigger another compaction:

  1 1 4 => 6

because (1+1) / 4 > 25%.

And it keeps going like this:

  1 1 => 2
  1 2 => 3
  1 3 => 4
  1 4
  1 1 4 => 6
  1 6
  1 1 6 => 8
  1 8
  1 1 8
  1 1 1 8 => 11
  1 11
  1 1 11
  1 1 1 11 => 14
  1 14
  1 1 14
  1 1 1 14
  1 1 1 1 14 => 18
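The sequence above can be reproduced with a small simulation. This is a minimal sketch under the same assumptions as the example (every flush has size 1, compacted size equals total input size, only the space amplification trigger is active); the function name `simulate_space_amp` is hypothetical, and the ratio is computed as the total size of all runs except the oldest divided by the size of the oldest, as in the checks "1/1 > 25%" and "(1+1) / 4 > 25%".

```python
from typing import List


def simulate_space_amp(num_flushes: int, max_size_amp_percent: int = 25) -> List[int]:
    """Return the sorted run sizes (newest first) after num_flushes flushes of size 1."""
    runs: List[int] = []
    for _ in range(num_flushes):
        runs.insert(0, 1)  # a flush adds a new, newest sorted run of size 1
        newer, oldest = sum(runs[:-1]), runs[-1]
        # Trigger when (size(R1) + ... + size(Rn-1)) / size(Rn) > percent / 100.
        # Integer arithmetic avoids floating point rounding at the boundary.
        if newer * 100 > max_size_amp_percent * oldest:
            runs = [newer + oldest]  # full compaction into one sorted run
    return runs


for flushes in (2, 5, 10, 18):
    print(flushes, simulate_space_amp(flushes))
```

With these settings, 5 flushes leave the runs "1 4", 10 flushes leave "1 1 8", and 18 flushes end with a single run of size 18, as in the list above.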

2. Compaction Triggered by Individual Size Ratio

We calculate the size ratio trigger as

  size ratio trigger = (100 + options.compaction_options_universal.size_ratio) / 100

Usually options.compaction_options_universal.size_ratio is close to 0, so the size ratio trigger is close to 1.

We start from R1. If size(R2) / size(R1) <= size ratio trigger, then (R1, R2) qualify to be compacted together. We continue from there to determine whether R3 can be added too. If size(R3) / size(R1 + R2) <= size ratio trigger, we include (R1, R2, R3). Then we do the same for R4. We keep comparing the total size picked so far to the next sorted run until the size ratio trigger condition no longer holds.

Here is an example to make it easier to understand. Assume options.compaction_options_universal.size_ratio = 0, the memtable flush size is always 1, the compacted size always equals the total input size, compaction is only triggered by size ratio, and options.level0_file_num_compaction_trigger = 5. Starting from an empty DB, after 5 memtable flushes we have 5 files of size 1, which triggers a compaction of all files because 1/1 <= 1, 1/(1+1) <= 1, 1/(1+1+1) <= 1 and 1/(1+1+1+1) <= 1:

  1 1 1 1 1 => 5

Four more memtable flushes make it 5 files again. The first 4 files qualify for merging: 1/1 <= 1, 1/(1+1) <= 1, 1/(1+1+1) <= 1. The 5th one doesn’t: 5/(1+1+1+1) > 1:

  1 1 1 1 5 => 4 5

This goes on for several rounds:

  1 1 1 1 1 => 5
  1 5 (no compaction triggered)
  1 1 5 (no compaction triggered)
  1 1 1 5 (no compaction triggered)
  1 1 1 1 5 => 4 5
  1 4 5 (no compaction triggered)
  1 1 4 5 (no compaction triggered)
  1 1 1 4 5 => 3 4 5
  1 3 4 5 (no compaction triggered)
  1 1 3 4 5 => 2 3 4 5

Another flush brings it to

  1 2 3 4 5

No compaction is triggered here because 2/1 > 1, so we hold the compaction. Only when another flush comes do all files qualify to be compacted together:

  1 1 2 3 4 5 => 16

Because 1/1 <=1, 2/(1+1) <= 1, 3/(1+1+2) <= 1, 4/(1+1+2+3) <= 1 and 5/(1+1+2+3+4) <= 1. And we continue from there:

  1 1 16 (no compaction triggered)
  1 1 1 16 (no compaction triggered)
  1 1 1 1 16 => 4 16
  1 4 16 (no compaction triggered)
  1 1 4 16 (no compaction triggered)
  1 1 1 4 16 => 3 4 16
  1 3 4 16 (no compaction triggered)
  1 1 3 4 16 => 2 3 4 16
  1 2 3 4 16 (no compaction triggered)
  1 1 2 3 4 16 => 11 16

Compaction is only triggered when the number of input sorted runs is at least options.compaction_options_universal.min_merge_width, and the number of input sorted runs is capped at options.compaction_options_universal.max_merge_width.
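The size ratio picking walked through above can be sketched as follows. This is a simplified illustration, not RocksDB's code: the function names are hypothetical, picking only ever starts from R1 as in the description, the size ratio trigger is assumed to be (100 + size_ratio) / 100 (which is 1 when size_ratio is 0, as in the example), and the default merge widths are placeholder values chosen for the sketch.

```python
from typing import List


def pick_by_size_ratio(runs: List[int], size_ratio: int = 0,
                       min_merge_width: int = 2,
                       max_merge_width: int = 2**31 - 1) -> int:
    """Return how many runs, starting from R1 (runs[0]), to compact; 0 means none."""
    if not runs:
        return 0
    picked, total = 1, runs[0]
    while picked < len(runs) and picked < max_merge_width:
        # Stop when size(next) / total > (100 + size_ratio) / 100.
        if runs[picked] * 100 > total * (100 + size_ratio):
            break
        total += runs[picked]
        picked += 1
    return picked if picked >= min_merge_width else 0


def simulate_size_ratio(num_flushes: int, trigger: int = 5) -> List[int]:
    """Return sorted run sizes (newest first) after num_flushes flushes of size 1."""
    runs: List[int] = []
    for _ in range(num_flushes):
        runs.insert(0, 1)  # a flush adds a new, newest sorted run of size 1
        if len(runs) >= trigger:  # precondition on the number of sorted runs
            k = pick_by_size_ratio(runs)
            if k:
                runs = [sum(runs[:k])] + runs[k:]  # merge the k newest runs
    return runs


for flushes in (5, 9, 15, 16, 27):
    print(flushes, simulate_size_ratio(flushes))
```

With trigger = 5 and size_ratio = 0, the simulation passes through the same states as the example: "4 5" after 9 flushes, "1 2 3 4 5" (no compaction) after 15, a single run of 16 after 16, and "11 16" after 27.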

3. Compaction Triggered by number of sorted runs

“Try to schedule” here means an attempt to pick a compaction, which happens after a memtable is flushed or a compaction finishes. Sometimes duplicate attempts are scheduled.

See Universal Style Compaction Example for an example of what output sorted run sizes look like for a common setting.

Parallel compactions are possible if options.max_background_compactions > 1. Same as all other compaction styles, parallel compactions will not work on the same sorted run.

4. Compaction triggered by age of data

For universal style compaction, the age-based triggering criterion has the highest priority since it is a hard requirement. When trying to pick a compaction, this condition is checked first. If the condition is met (there are files older than options.periodic_compaction_seconds), RocksDB proceeds to pick sorted runs for compaction. It picks sorted runs from oldest to youngest until it encounters a sorted run that is being compacted by another compaction. These files are compacted to the bottommost level unless the bottommost level is reserved for ingestion behind, in which case they are compacted to the second bottommost level.
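The picking order described above can be illustrated with a short sketch. It assumes the age check against options.periodic_compaction_seconds has already passed; the `SortedRun` type, the function name and the `ingest_behind` flag are hypothetical names introduced only for this illustration.

```python
from typing import List, NamedTuple, Tuple


class SortedRun(NamedTuple):
    name: str
    being_compacted: bool


def pick_for_periodic_compaction(runs: List[SortedRun], num_levels: int,
                                 ingest_behind: bool = False) -> Tuple[List[str], int]:
    """Pick runs oldest first; runs are given newest to oldest.

    Returns the picked run names (oldest first) and the output level.
    Assumes num_levels >= 2 when ingest_behind is True.
    """
    picked: List[str] = []
    for run in reversed(runs):  # walk from the oldest run to the youngest
        if run.being_compacted:
            break  # stop at the first run owned by another compaction
        picked.append(run.name)
    bottommost = num_levels - 1
    # Keep the bottommost level free when it is reserved for ingestion behind.
    output = bottommost - 1 if ingest_behind else bottommost
    return picked, output


runs = [SortedRun("R1", False), SortedRun("R2", True),
        SortedRun("R3", False), SortedRun("R4", False)]
print(pick_for_periodic_compaction(runs, num_levels=6))
```

In this example R4 and R3 are picked and R2 stops the scan because another compaction already owns it; the output goes to level 5, or to level 4 if the bottommost level is reserved.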

Subcompaction

Subcompaction is supported in universal compaction. If the output level of a compaction is not “level” 0, we will try to range-partition the inputs and use up to options.max_subcompactions threads to compact them in parallel. This helps with the problem that a full compaction in universal compaction takes too long.

Following are options affecting universal compactions:

  • options.compaction_options_universal: various options mentioned above
  • options.level0_file_num_compaction_trigger: the triggering condition of any compaction. It also means that after all compactions finish, the number of sorted runs will be under options.level0_file_num_compaction_trigger+1
  • options.level0_slowdown_writes_trigger: if number of sorted runs exceeds this value, writes will be artificially slowed down.
  • options.level0_stop_writes_trigger: if number of sorted runs exceeds this value, writes will stop until compaction finishes and number of sorted runs turns under this threshold.
  • options.num_levels: if this value is 1, all sorted runs will be stored as level 0 files. Otherwise, we will try to fill non-zero levels as much as possible. The larger num_levels is, the less likely we will have large files on level 0.
  • options.target_file_size_base: effective if options.num_levels > 1. Files of levels other than level 0 will be cut to file size not larger than this threshold.
  • options.target_file_size_multiplier: it is effective, but we don’t know a way to use this option in universal compaction that makes sense, so we don’t recommend tuning it.

Following options DO NOT affect universal compactions:

  • options.max_bytes_for_level_base: only for level-based compaction
  • options.level_compaction_dynamic_level_bytes: only for level-based compaction
  • options.max_bytes_for_level_multiplier and options.max_bytes_for_level_multiplier_additional: only for level-based compaction
  • options.expanded_compaction_factor: only for level-based compactions
  • options.source_compaction_factor: only for level-based compactions
  • options.max_grandparent_overlap_factor: only for level-based compactions
  • options.soft_rate_limit and options.hard_rate_limit: deprecated
  • options.hard_pending_compaction_bytes_limit: only for level-based compaction
  • options.compaction_pri: only for level-based compaction

Estimate Write Amplification

Estimating write amplification would be very helpful for users tuning universal compaction. This, however, is hard. Since universal compaction always makes locally optimized decisions, the shape of the LSM-tree is hard to predict. You can see this from the examples above. We still don’t have a good mathematical model to predict the write amplification.

Here is a not-so-good estimation.

The estimation is based on the simplifying assumption that every time an update is compacted, the size of the output sorted run is double that of the original one (this is a wild, unproven estimate), with the exception of the first and last compactions it experiences, where sorted runs of similar sizes are compacted together.

For example, suppose options.compaction_options_universal.max_size_amplification_percent = 25, the last sorted run’s size is 256GB, the SST file size is 256MB after being flushed from the memtable, and options.level0_file_num_compaction_trigger = 11. Then in a stable state, the file sizes look like this:

  256MB
  256MB
  256MB
  256MB
  2GB
  4GB
  8GB
  16GB
  16GB
  16GB
  256GB

The compaction stages, with the write amplification each one contributes, look like the following:

  256MB
  256MB
  256MB (write amp 1)
  256MB
  --------
  2GB (write amp 1)
  --------
  4GB (write amp 1)
  --------
  8GB (write amp 1)
  --------
  16GB
  16GB (write amp 1)
  16GB
  256GB (write amp 4)

So the total write amp is estimated to be 9.

Here is how the write amplification is estimated:

options.compaction_options_universal.max_size_amplification_percent always introduces write amplification by itself if it is much lower than 100. This write amplification is estimated to be

WA1 = 100 / options.compaction_options_universal.max_size_amplification_percent.

If it is not lower than 100, estimate

WA1 = 0

Estimate total data size other than the last sorted run, S. If options.compaction_options_universal.max_size_amplification_percent < 100, estimate it using

S = total_size * (options.compaction_options_universal.max_size_amplification_percent / 100)

Otherwise

S = total_size

Let M be the estimated size of an SST file flushed from the memtable. We estimate the maximum number of compactions an update goes through before reaching the maximum size as:

p = log(2, S/M)

It is recommended that options.level0_file_num_compaction_trigger > p. And then we estimate the write amplification because of individual size ratio using:

And then the total estimated write amplification would be WA1 + WA2.
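As a rough aid, the quantities WA1, S and p defined above can be computed with a few lines of Python. This is only a sketch of the estimate as described: the function and parameter names are hypothetical, S is read as total_size multiplied by max_size_amplification_percent / 100, log(2, x) is taken to mean the base-2 logarithm, and WA2 is left out.

```python
import math
from typing import Tuple


def estimate_wa_inputs(total_size: float, flush_size: float,
                       max_size_amp_percent: int) -> Tuple[float, float, float]:
    """Return (WA1, S, p) for the rough write amplification estimate.

    total_size and flush_size (M) must use the same unit, for example MB.
    """
    if max_size_amp_percent < 100:
        # Write amplification caused by the space amplification trigger itself.
        wa1 = 100 / max_size_amp_percent
        # Estimated total data size other than the last sorted run.
        s = total_size * (max_size_amp_percent / 100)
    else:
        wa1 = 0.0
        s = total_size
    # Estimated maximum number of compactions for an update: p = log(2, S/M).
    p = math.log2(s / flush_size)
    return wa1, s, p


wa1, s, p = estimate_wa_inputs(total_size=1024, flush_size=1, max_size_amp_percent=25)
print(wa1, s, p)
```

For a total size of 1024 units, a flush size of 1 unit and max_size_amplification_percent = 25, this gives WA1 = 4, S = 256 and p = 8. The value of p can then be compared against options.level0_file_num_compaction_trigger as recommended above.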