Prefix Seek API

options.prefix_extractor is a shared_pointer of a SliceTransform instance. By calling SliceTransform.Transform(), we extract a Slice representing a substring of the Slice, usually the prefix part. In this wiki page, we use "prefix" to refer to the output of options.prefix_extractor.Transform() of a key. You can use fixed length prefix transformer, by calling NewFixedPrefixTransform(prefix_len), or you can implement your own prefix transformer in the way you want and pass it to options.prefix_extractor.

When options.prefix_extractor is not nullptr, iterators are not guaranteed a total order of all keys, but only keys for the same prefix. When doing Iterator.Seek(lookup_key), RocksDB will extract the prefix of lookup_key. If there is one or more keys in the database matching prefix of lookup_key, RocksDB will place the iterator to the key equal or larger than lookup_key of the same prefix, as for total ordering mode. If no key of the prefix equals or is larger than lookup_key, or after calling one or more Next(), we finish all keys for the prefix, we might return Valid()=false, or any key that is larger than the previous key, depending on whether ReadOptions.prefix_same_as_start=true. From release 4.11, we support Prev() in prefix mode, but only when the iterator is still within the range of all the keys for the prefix. The output of Prev() is not guaranteed to be correct when the iterator is out of the range.

When prefix seek mode is enabled, RocksDB will freely organize the data or build look-up data structures that can locate keys for specific prefix or rule out non-existing prefixes quickly. Here are some supported optimizations for prefix seek mode: prefix bloom for block based tables and mem tables, , as well as PlainTable format. One example setting:

  1. Options options;
  2.  
  3. // Enable prefix bloom for mem tables
  4. options.prefix_extractor.reset(NewFixedPrefixTransform(3));
  5. options.memtable_prefix_bloom_bits = 100000000;
  6.  
  7. // Enable prefix bloom for SST files
  8. BlockBasedTableOptions table_options;
  9. table_options.filter_policy.reset(NewBloomFilterPolicy(10, true));
  10. options.table_factory.reset(NewBlockBasedTableFactory(table_options));
  11.  
  12. DB* db;
  13. Status s = DB::Open(options, "/tmp/rocksdb", &db);
  14.  
  15. ......
  16.  
  17. auto iter = db->NewIterator(ReadOptions());
  18. iter->Seek("foobar"); // Seek inside prefix "foo"

From release 3.5, we support a read option to allow RocksDB to use total order even if options.prefix_extractor is given. To enable the feature set ReadOption.total_order_seek=true to the read option passed when doing NewIterator(), example:

Performance might be worse in this mode. Please, be aware that not all implementations of prefix seek support this option. For example, some implementations of doesn't support it and you'll see an error in status code when you try to use it. Hash-based mem tables might do a very expensive online sorting if you use it. This mode is supported in prefix bloom and hash index of block based tables.

Limitation

One common bug of using prefix iterating is to use prefix mode to iterate in reverse order. But it is not yet supported. If reverse iterating is your common query pattern, you can reorder the data to turn your iterating order to be forward. You can do it through implementing a customized comparator, or encode your key in a different way.

API change from 2.8 -> 3.0

In this section, we explain the API as of 2.8 release and the change in 3.0.

As of RocksDB 2.8, there are 3 seek modes:

This is the traditional seek behavior you'd expect. The seek performs on a total ordered key space, positioning the iterator to a key that is greater or equal to the target key you seek.

  1. Slice key = "foo_bar";
  2. iter->Seek(key);

Not all table formats support total order seek. For example, the newly introduced format only supports prefix-based seek() unless it is opened in total order mode (Options.prefix_extractor == nullptr).

Options.prefix_extractor is a prerequisite. The is constrained to the prefix provided by ReadOptions, which means you will need to create a new iterator to seek a different prefix. The benefit of this approach is that irrelevant files are filtered out at the time of building the new iterator. So if you want to seek multiple keys with the same prefix, it might perform better. However, we consider this is a very rare use case.

This mode is more flexible than ReadOption.prefix. No pre-filtering is done at iterator creation time. As a result, the same iterator can be reused for seek of different key/prefix.

  1. ReadOptions ro;
  2. ro.prefix_seek = true;
  3. auto iter = db->NewIterator(ro);
  4. Slice key = "foo_bar";
  5. iter->Seek(key);

Same as ReadOptions.prefix, Options.prefix_extractor is a prerequisite.

It becomes obvious that 3 modes of seek are confusing:

  • One mode would require another option to be set (e.g. );
  • It is not obvious to our users which mode of the last two is preferred under different circumstancesThis change tries to address this issue and makes things straight: by default, Seek() is performed in prefix mode if Options.prefix_extractor is defined and vice versa. The motivation is simple: if is provided, it is a very clear signal that underlying data can be sharded and prefix seek is a natural fit. Usage becomes unified:

Transition to the new style should be simple: remove the assignment to Options.prefix or Options.prefix_seek, since they are deprecated. Now, seek directly with your target key or prefix. Since can go across the boundary to a different prefix, you will need to check the end condition:

  1. auto iter = DB::NewIterator(ReadOptions());
  2. for (iter.Seek(prefix); iter.Valid() && iter.key().starts_with(prefix); iter.Next()) {
  3. // do something
  4. }