Error Handling

Iterator::status() returns the error, if any, encountered during iteration. Possible errors include I/O errors, checksum mismatches, unsupported operations, internal errors, and others.

If there is no error, the status is Status::OK(). If the status is not OK, the iterator is invalidated too. In other words, if Iterator::Valid() is true, status() is guaranteed to be OK(), so it is safe to proceed with other operations without checking status().

On the other hand, if Iterator::Valid() is false, there are two possibilities: (1) the iterator reached the end of the data, in which case status() is OK(); (2) there was an error, in which case status() is not OK(). It is always good practice to check status() once the iterator becomes invalid.
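The check-after-loop pattern described above can be sketched as follows. To keep the example compilable without RocksDB, FakeStatus and FakeIterator are hypothetical stand-ins invented for this illustration, not RocksDB classes; only the method names SeekToFirst(), Valid(), Next(), key() and status() mirror the real Iterator interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for rocksdb::Status: just an ok flag and a message.
struct FakeStatus {
  bool is_ok;
  std::string message;
  bool ok() const { return is_ok; }
};

// Hypothetical stand-in for rocksdb::Iterator over an in-memory key list.
// If fail_at is non-negative, stepping onto that position simulates an error.
class FakeIterator {
 public:
  FakeIterator(std::vector<std::string> keys, int fail_at)
      : keys_(std::move(keys)), fail_at_(fail_at) {}

  void SeekToFirst() {
    // Assumption in this sketch: repositioning clears the previous status.
    status_ = FakeStatus{true, ""};
    pos_ = 0;
    CheckFailure();
  }
  void Next() {
    if (!Valid()) return;
    ++pos_;
    CheckFailure();
  }
  // Valid() is true only when positioned on an entry AND the status is OK.
  bool Valid() const { return status_.ok() && pos_ < keys_.size(); }
  const std::string& key() const { return keys_[pos_]; }
  FakeStatus status() const { return status_; }

 private:
  void CheckFailure() {
    if (fail_at_ >= 0 && pos_ == static_cast<std::size_t>(fail_at_)) {
      status_ = FakeStatus{false, "simulated I/O error"};
    }
  }
  std::vector<std::string> keys_;
  int fail_at_;
  std::size_t pos_ = 0;
  FakeStatus status_{true, ""};
};

// Collects every key, then tells "end of data" apart from "error".
// Returns true if the scan reached the end, false if it stopped on an error.
bool ScanAll(FakeIterator& it, std::vector<std::string>* out) {
  for (it.SeekToFirst(); it.Valid(); it.Next()) {
    out->push_back(it.key());  // no status() check needed while Valid()
  }
  return it.status().ok();  // the one check that matters, after the loop
}
```

With a real iterator the loop has the same shape: iterate while Valid() is true, and consult status() once after the loop exits to find out why it ended.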

Seek() and SeekForPrev() discard previous status.

  • In older releases, Seek() and SeekForPrev() didn't always discard previous status, and Next() and Prev() didn't always preserve a non-OK status.

Iterating upper bound and lower bound

You can specify an upper bound for a range query by setting ReadOptions.iterate_upper_bound in the read options passed to NewIterator(). With this option set, RocksDB doesn't have to find the next key after the upper bound, so some I/O or computation can be avoided. In some workloads the improvement is significant. Note that the bound applies to both forward and backward iteration. The behavior is undefined if you call SeekForPrev() with a seek key higher than the upper bound, or SeekToLast() when the last key is higher than the iterator's upper bound, although RocksDB will not crash.

Similarly, ReadOptions.iterate_lower_bound can be used with backward iteration to help RocksDB optimize performance.

See the comments on these options for more information.
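To illustrate what a bounded scan returns, here is a small simulation that uses std::map as a stand-in for the database; it does not call RocksDB. ForwardScan and BackwardScan are hypothetical helper names. The sketch assumes the semantics given in the option comments as I understand them: the upper bound is exclusive and the lower bound is inclusive.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using KeyValueMap = std::map<std::string, std::string>;

// Forward scan over [lower, upper): starts at the first key >= lower and
// stops before the first key >= upper, never visiting keys past the bound.
std::vector<std::string> ForwardScan(const KeyValueMap& data,
                                     const std::string& lower,
                                     const std::string& upper) {
  std::vector<std::string> keys;
  if (!(lower < upper)) return keys;  // empty or inverted range
  auto end = data.lower_bound(upper);
  for (auto it = data.lower_bound(lower); it != end; ++it) {
    keys.push_back(it->first);
  }
  return keys;
}

// Backward scan over the same range: starts at the last key < upper and
// stops once it has visited the first key >= lower.
std::vector<std::string> BackwardScan(const KeyValueMap& data,
                                      const std::string& lower,
                                      const std::string& upper) {
  std::vector<std::string> keys;
  if (!(lower < upper)) return keys;  // empty or inverted range
  auto first = data.lower_bound(lower);
  auto it = data.lower_bound(upper);
  while (it != first) {
    --it;
    keys.push_back(it->first);
  }
  return keys;
}
```

For keys a through e with bounds "b" and "d", the forward scan yields b, c and the backward scan yields c, b; the key d itself is never returned.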

Resource pinned by iterators and iterator refreshing

Iterators by themselves don't use much memory, but they can prevent some resources from being released. These include:

  • Memtables and SST files as of the creation time of the iterator. Even if some memtables and SST files are removed after a flush or compaction, they are preserved as long as an iterator pins them.

Prefix Iterating

A prefix iterator lets users take advantage of a bloom filter or hash index during iteration in order to improve performance. However, the feature has limitations and, if misused, may return wrong results without reporting an error, so we recommend using it carefully. For how to use the feature, see the prefix seek documentation. The options total_order_seek and prefix_same_as_start are only applicable to prefix iterating.

Read-ahead

RocksDB performs automatic readahead and prefetches data once it notices more than 2 I/Os for the same table file during iteration. The readahead size starts at 8 KB and increases exponentially with each additional sequential I/O, up to a maximum of 256 KB. This cuts down the number of I/Os needed to complete a range scan. Automatic readahead is enabled only when ReadOptions.readahead_size = 0 (the default value). On Linux, the readahead syscall is used in buffered I/O mode, and an AlignedBuffer is used in direct I/O mode to store the prefetched data. (Automatic iterator readahead is available starting in 5.12 for buffered I/O and 5.15 for direct I/O.)
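As a rough illustration of that growth, the sketch below computes the readahead size used for each successive sequential I/O once automatic readahead has kicked in. The text only says the size grows exponentially; doubling at each step is an assumption made here, and ReadaheadSchedule is a hypothetical helper, not a RocksDB function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the readahead size in bytes for each of `sequential_ios` I/Os,
// starting at 8 KB and capped at 256 KB.
// Assumption: "exponentially increased" is modeled as doubling per I/O.
std::vector<std::size_t> ReadaheadSchedule(int sequential_ios) {
  const std::size_t kInitialSize = 8 * 1024;
  const std::size_t kMaxSize = 256 * 1024;
  std::vector<std::size_t> sizes;
  std::size_t current = kInitialSize;
  for (int i = 0; i < sequential_ios; ++i) {
    sizes.push_back(current);
    current = std::min(current * 2, kMaxSize);
  }
  return sizes;
}
```

Under the doubling assumption the sequence is 8, 16, 32, 64, 128, 256 KB, so the cap is reached on the sixth sequential I/O and every later I/O stays at 256 KB.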

If your entire use case is dominated by iterating and you rely on the OS page cache (i.e., you use buffered I/O), you can choose to turn on readahead manually. This is more helpful on hard drives or remote storage, and may have little actual effect on directly attached SSD devices.

ReadOptions.readahead_size provides read-ahead support in RocksDB for very limited use cases. The limitation of this feature is that, once it is turned on, the constant cost of each iterator is much higher. You should therefore use it only if you iterate over a very large range of data and can't work around it using other approaches. A typical use case is remote storage with very long latency, where the OS page cache is not available and a large amount of data will be scanned. With this feature enabled, every read of an SST file reads ahead according to this setting. Note that one iterator can have one file open per level, as well as all L0 files, at the same time, so you need to budget read-ahead memory for all of them. The memory used by the read-ahead buffers can't be tracked automatically.
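The budgeting advice can be turned into a back-of-the-envelope estimate. The sketch below assumes, for illustration only, that each open file holds one buffer of readahead_size bytes; ReadaheadBudgetBytes is a hypothetical helper and the result is a rough upper bound, not a figure reported by RocksDB.

```cpp
#include <cassert>
#include <cstddef>

// Rough worst-case read-ahead memory for ONE iterator:
// all L0 files plus one file per non-empty lower level,
// each assumed to hold a single buffer of readahead_size bytes.
std::size_t ReadaheadBudgetBytes(std::size_t num_l0_files,
                                 std::size_t num_nonempty_lower_levels,
                                 std::size_t readahead_size) {
  return (num_l0_files + num_nonempty_lower_levels) * readahead_size;
}
```

For example, with 4 L0 files, 5 non-empty lower levels, and a readahead_size of 2 MB, one iterator could need up to 9 buffers, or 18 MB. Multiply by the number of concurrently open iterators to size the total budget.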