With buffered I/O, data is copied twice between storage and memory because the page cache acts as a proxy between the two. In most cases the page cache improves performance. But a self-caching application such as RocksDB understands the logical semantics of its data better than the OS does, which gives the application a chance to implement a more efficient cache-replacement algorithm, with any application-defined data block as the unit, by leveraging its knowledge of data semantics. On the other hand, in some situations we want certain data to opt out of the system cache entirely. In those cases, direct I/O is the better choice.

The way to enable direct I/O depends on the operating system, and support for direct I/O depends on the file system. Before using this feature, please check whether your file system supports direct I/O. RocksDB handles these OS-dependent complications for you, but we are glad to share some implementation details here.

  1. File R/W

API

It is easy to use direct I/O, as two new options are provided in Options: use_direct_reads and use_direct_io_for_flush_and_compaction.
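As a minimal sketch (the option names come from RocksDB's public Options struct; the database path is illustrative), enabling direct I/O looks like this:

```cpp
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  // Bypass the OS page cache for SST file reads.
  options.use_direct_reads = true;
  // Bypass the OS page cache for flush and compaction writes.
  options.use_direct_io_for_flush_and_compaction = true;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s =
      rocksdb::DB::Open(options, "/tmp/rocksdb_direct_io_example", &db);
  if (!s.ok()) return 1;
  delete db;
  return 0;
}
```

Both options default to false; setting them has no effect on WAL or MANIFEST I/O, as noted below.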

Recent releases have these options automatically set if direct I/O is enabled.

  1. allow_mmap_reads cannot be used with use_direct_reads, and allow_mmap_writes cannot be used with use_direct_io_for_flush_and_compaction, i.e., each pair cannot be set to true at the same time.
  2. use_direct_io_for_flush_and_compaction and use_direct_reads apply only to SST file I/O, not to WAL or MANIFEST I/O. Direct I/O for WAL and MANIFEST files is not supported yet.
  3. After enabling direct I/O, compaction writes will no longer be in the OS page cache, so the first read will do real I/O. Some users may know that RocksDB has a feature called compressed block cache, which is supposed to be able to replace the page cache when direct I/O is enabled. But please read the following caveats before enabling it:
  • The OS page cache provides readahead. By default this is turned off in RocksDB, but users can choose to turn it on. This is useful in workloads dominated by range scans. The RocksDB compressed cache has nothing to match this functionality.
  • Possible bugs. The RocksDB compressed block cache code has not been widely exercised. We did see external users reporting bugs against it, but we never took further steps to improve this component.
  4. Automatic readahead is enabled for iterators in direct I/O mode as well. With this, long-range and full-table scan benchmarks in Sysbench (via a MyRocks build) match those of buffered I/O mode.
  5. It is advisable to turn on the mid-point insertion strategy for the block cache if your workload is a mix of point and range queries, by setting LRUCacheOptions.high_pri_pool_ratio = 0.5. (Note that this depends on cache_index_and_filter_blocks_with_high_priority as well.)
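The cache configuration above can be sketched as follows (a configuration sketch assuming RocksDB's LRUCacheOptions and BlockBasedTableOptions APIs; the cache capacity is illustrative):

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

int main() {
  // Reserve half of the LRU cache for high-priority entries
  // (the mid-point insertion strategy).
  rocksdb::LRUCacheOptions cache_opts;
  cache_opts.capacity = 1 << 30;  // 1 GiB, illustrative
  cache_opts.high_pri_pool_ratio = 0.5;

  rocksdb::BlockBasedTableOptions table_opts;
  table_opts.block_cache = rocksdb::NewLRUCache(cache_opts);
  // Needed so index/filter blocks go through the block cache
  // and can land in the high-priority pool.
  table_opts.cache_index_and_filter_blocks = true;
  table_opts.cache_index_and_filter_blocks_with_high_priority = true;

  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_opts));
  return 0;
}
```

With this setup, index and filter blocks are inserted into the high-priority half of the cache, so large range scans are less likely to evict them.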