    • Pluggable WAL
    • Data encryption
    • Warm the block cache after flushes and compactions in a smart way
    • Queryable Backup
    • Improve sub-compactions by partitioning data more evenly
    • Tools to collect the operations issued to a database and replay them
    • Build a new compaction style optimized for time series data
    • Implement YCSB benchmark scenarios in db_bench
    • Improve DB recovery speed when WAL files are large (replay the WAL in parallel)
    • Use thread pools to do readahead + decompress and compress + write-behind (Igor started on this). Once manual compaction is multi-threaded, RocksDB can be used as a fast external sorter: load keys in random order with auto compactions disabled, then run a manual compaction (see the sketch after this list).
    • Expose the merge operator in MongoDB + RocksDB
    • SQLite + RocksDB
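    A minimal sketch of the external-sorter idea from the thread-pool item above, assuming a local scratch path and a toy key set; the manual compaction step is the part that multi-threading would speed up:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.disable_auto_compactions = true;  // keep loading cheap: no background compactions

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/external_sort_db", &db);
  assert(s.ok());

  // Load keys in arbitrary order.
  for (const std::string& key : {"banana", "apple", "cherry"}) {
    s = db->Put(rocksdb::WriteOptions(), key, "");
    assert(s.ok());
  }

  // One manual compaction produces a fully sorted run; this is the step the
  // item above proposes to multi-thread.
  s = db->CompactRange(rocksdb::CompactRangeOptions(), nullptr, nullptr);
  assert(s.ok());

  // Iterate to read the keys back in sorted order: apple, banana, cherry.
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions()));
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // consume it->key() here
  }
  it.reset();  // release the iterator before closing the DB

  delete db;
  return 0;
}
```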

    Research Projects

    Tuning workloads by splitting indexes into different column families would take too many engineering resources. When there is variety within a single index, there is not much that can be done today. A clever algorithm should be able to handle such cases automatically.
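    For context, a rough sketch of the manual approach this describes (the column family names and option values are purely illustrative, not recommendations): two indexes with different access patterns are mapped to separate column families with different options, and the research question is how to get this effect without an engineer picking the numbers by hand.

```cpp
#include <cassert>
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/per_index_db", &db);
  assert(s.ok());

  // Write-heavy index: a larger memtable to absorb bursts (illustrative value).
  rocksdb::ColumnFamilyOptions write_heavy;
  write_heavy.write_buffer_size = 256 * 1024 * 1024;

  // Read-mostly index: a smaller memtable, relying more on the block cache.
  rocksdb::ColumnFamilyOptions read_mostly;
  read_mostly.write_buffer_size = 16 * 1024 * 1024;

  rocksdb::ColumnFamilyHandle* cf_orders = nullptr;
  rocksdb::ColumnFamilyHandle* cf_users = nullptr;
  s = db->CreateColumnFamily(write_heavy, "idx_orders", &cf_orders);
  assert(s.ok());
  s = db->CreateColumnFamily(read_mostly, "idx_users", &cf_users);
  assert(s.ok());

  // Each index now flushes and compacts on its own schedule.
  db->Put(rocksdb::WriteOptions(), cf_orders, "order#1", "...");
  db->Put(rocksdb::WriteOptions(), cf_users, "user#1", "...");

  db->DestroyColumnFamilyHandle(cf_orders);
  db->DestroyColumnFamilyHandle(cf_users);
  delete db;
  return 0;
}
```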

    Self-tuning DBs

    There is a long tail of users with small workloads who are OK with sub-optimal performance but do not have the engineering resources to fine-tune it. A list of default configuration values does not always work for them either, since it can give unacceptably bad performance on particular workloads. An LSM tree that has no tuning knobs yet provides reasonable performance for any given workload (even the corner cases) is much more practical for this silent, long tail of users.

    Bloom filter for range queries

    How can we optimize RocksDB when it uses non-volatile memory (NVM) as the storage device?

    RocksDB on hierarchical storage

    How can we optimize RocksDB based on the particular characteristics of time series databases? For example, can we do better encoding of keys knowing that they are integers in ascending order? Can we do better encoding of values knowing that they are floating-point numbers with high correlation between adjacent numbers? What is an optimal compaction strategy given the patterns of data distribution in a time series DB? Does such data show different characteristics at different levels of the LSM tree, and can we leverage that information for a more efficient data layout at each level?
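    As a toy illustration of the key-encoding question (not an existing RocksDB feature): ascending integer keys such as timestamps can be stored as deltas from their predecessors, which are small and compress far better than the raw 8-byte values; a similar delta/XOR idea applies to correlated floating-point values.

```cpp
#include <cstdint>
#include <vector>

// Encode ascending keys as the first key followed by successive deltas.
std::vector<uint64_t> DeltaEncode(const std::vector<uint64_t>& keys) {
  std::vector<uint64_t> out;
  uint64_t prev = 0;
  for (uint64_t k : keys) {
    out.push_back(k - prev);  // small numbers for dense, ascending timestamps
    prev = k;
  }
  return out;
}

// Recover the original keys with a running sum over the deltas.
std::vector<uint64_t> DeltaDecode(const std::vector<uint64_t>& deltas) {
  std::vector<uint64_t> out;
  uint64_t prev = 0;
  for (uint64_t d : deltas) {
    prev += d;
    out.push_back(prev);
  }
  return out;
}

int main() {
  // Timestamps one second apart: the deltas become {t0, 1, 1, 1}, which a
  // varint encoding or a general-purpose compressor shrinks dramatically.
  std::vector<uint64_t> keys = {1500000000, 1500000001, 1500000002, 1500000003};
  std::vector<uint64_t> deltas = DeltaEncode(keys);
  std::vector<uint64_t> roundtrip = DeltaDecode(deltas);
  return roundtrip == keys ? 0 : 1;
}
```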