There are two cache implementations in RocksDB, namely
ClockCache. Both types of the cache are sharded to mitigate lock contention. Capacity is divided evenly to each shard and shards don’t share capacity. By default each cache will be sharded into at most 64 shards, with each shard has no less than 512k bytes of capacity.
Out of box, RocksDB will use LRU-based block cache implementation with 8MB capacity. To set a customized block cache, call
NewClockCache() to create a cache object, and set it to block based table options. Users can also have their own cache implementation by implementing the
To set compressed block cache:
table_options.block_cache_compressed = another_cache;
RocksDB will create the default block cache if
block_cache is set to
nullptr. To disable block cache completely:
Out of box, RocksDB will use LRU-based block cache implementation with 8MB capacity. Each shard of the cache maintains its own LRU list and its own hash table for lookup. Synchronization is done via a per-shard mutex. Both lookup and insert to the cache would require a locking mutex of the shard. User can create a LRU cache by calling
NewLRUCache(). The function provides several useful options to set to the cache:
capacity: Total size of the cache.
num_shard_bits: The number of bits from cache keys to be use as shard id. The cache will be sharded into
strict_capacity_limit: In rare case, block cache size can go larger than its capacity. This is when ongoing reads or iterations over DB pin blocks in block cache, and the total size of pinned blocks exceeds the capacity. If there are further reads which try to insert blocks into block cache, if
strict_capacity_limit=false(default), the cache will fail to respect its capacity limit and allow the insertion. This can create undesired OOM error that crashes the DB if the host don’t have enough memory. Setting the option to
truewill reject further insertion to the cache and fail the read or iteration. The option works on per-shard basis, means it is possible one shard is rejecting insert when it is full, while another shard still have extra unpinned space.
high_pri_pool_ratio: The ratio of capacity reserved for high priority blocks. See Caching Index, Filter, and Compression Dictionary Blocks section below for more information.
The benefit over
LRUCache is it has finer-granularity locking. In case of LRU cache, the per-shard mutex has to be locked even on lookup, since it needs to update its LRU-list. Looking up from a clock cache won’t require locking per-shard mutex, but only looking up the concurrent hash map, which has fine-granularity locking. Only inserts needs to lock the per-shard mutex. With clock cache we see boost of read throughput over LRU cache in contented environment (see inline comments in
cache/clock_cache.cc for benchmark setup):
Size Index/Filter Throughput(MB/s) Hit Throughput(MB/s) Hit
32 2GB yes 466.7 85.9% 433.7 86.5%
32 2GB no 529.9 72.7% 532.7 73.9%
32 64GB yes 649.9 99.9% 507.9 99.9%
32 64GB no 740.4 99.9% 662.8 99.9%
16 2GB yes 278.4 85.9% 283.4 86.5%
16 2GB no 318.6 72.7% 335.8 73.9%
16 64GB yes 391.9 99.9% 353.3 99.9%
16 64GB no 433.8 99.8% 419.4 99.8%
To create a clock cache, call
NewClockCache(). To make clock cache available, RocksDB needs to be linked with library. Again there are several options users can set when creating a clock cache:
capacity: Same as LRUCache.
strict_capacity_limit: Same as LRUCache.
Caching Index, Filter, and Compression Dictionary Blocks
By default index, filter, and compression dictionary blocks (with the exception of the partitions of partitioned indexes/filters) are cached outside of block cache, and users won’t be able to control how much memory should be used to cache these blocks, other than setting . Users can opt to cache index and filter blocks in block cache, which allows for better control of memory used by RocksDB. To cache index, filter, and compression dictionary blocks in block cache:
Note that the partitions of partitioned indexes/filters are as a rule stored in the block cache, regardless of the value of the above option.
By putting index, filter, and compression dictionary blocks in block cache, these blocks have to compete against data blocks for staying in cache. Although index and filter blocks are being accessed more frequently than data blocks, there are scenarios where these blocks can be thrashing. This is undesired because index and filter blocks tend to be much larger than data blocks, and they are usually of higher value to stay in cache (the latter is also true for compression dictionary blocks). There are two options to tune to mitigate the problem:
pin_l0_filter_and_index_blocks_in_cache: Pin level-0 file’s index and filter blocks in block cache, to avoid them from being evicted. Starting with RocksDB version 6.4, this option also affects compression dictionary blocks. Level-0 index and filters are typically accessed more frequently. Also they tend to be smaller in size so hopefully pinning them in cache won’t consume too much capacity.
pin_top_level_index_and_filter: only applicable to partitioned indexes/filters. If
true, the top level of the partitioned index/filter structure will be pinned in the cache, regardless of the LSM tree level (that is, unlike the previous option, this affects files on all LSM tree levels, not just L0).
SimCache is an utility to predict cache hit rate if cache capacity or number of shards is changed. It wraps around the real
Cache object that the DB is using, and runs a shadow LRU cache simulating the given capacity and number of shards, and measure cache hits and misses of the shadow cache. The utility is useful when user wants to open a DB with, say, 4GB cache size, but would like to know what the cache hit rate will become if cache size enlarge to, say, 64GB. To create a simulated cache:
// This cache is the actual cache use by the DB.
std::shared_ptr<Cache> cache = NewLRUCache(capacity);
// This is the simulated cache.
std::shared_ptr<Cache> sim_cache = NewSimCache(cache, sim_capacity, sim_num_shard_bits);
table_options.block_cache = sim_cache;
The extra memory overhead of the simulated cache is less than 2% of
A list of block cache counters can be accessed through if it is non-null.