    • Pluggable WAL
    • Data encryption
    • Warm the block cache after flushes and compactions in a smart way
    • Queryable Backup
    • Improve sub-compactions by partitioning data more evenly
    • Tools to collect the operations issued to a database and replay them
    • Build a new compaction style optimized for time series data
    • Implement YCSB benchmark scenarios in db_bench
    • Improve DB recovery speed when WAL files are large (replay the WAL in parallel)
    • Use thread pools to do readahead + decompress and compress + write-behind (Igor started on this). Once manual compaction is multi-threaded, RocksDB can be used as a fast external sorter: load keys in random order with auto compactions disabled, then run a manual compaction (see the sketch after this list).
    • Expose the merge operator in MongoDB + RocksDB
    • SQLite + RocksDB
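    A minimal sketch of the external-sorter idea from the thread-pool item above, assuming a local scratch path and a toy key set; the manual compaction step is the part that multi-threading would speed up:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.disable_auto_compactions = true;  // keep loading cheap: no background compactions

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/external_sort_db", &db);
  assert(s.ok());

  // Load keys in arbitrary order.
  for (const std::string& key : {"banana", "apple", "cherry"}) {
    s = db->Put(rocksdb::WriteOptions(), key, "");
    assert(s.ok());
  }

  // One manual compaction produces a fully sorted run; this is the step the
  // item above proposes to multi-thread.
  s = db->CompactRange(rocksdb::CompactRangeOptions(), nullptr, nullptr);
  assert(s.ok());

  // Iterate to read the keys back in sorted order: apple, banana, cherry.
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions()));
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // consume it->key() here
  }
  it.reset();  // release the iterator before closing the DB

  delete db;
  return 0;
}
```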

    Research Projects

    Tuning workloads by splitting indexes into different column families would take too many engineering resources. When there is variety within a single index, there is not much that can be done today. A clever algorithm should be able to handle such cases automatically.
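    For context, a rough sketch of the manual approach this describes (the column family names and option values are purely illustrative, not recommendations): two indexes with different access patterns are mapped to separate column families with different options, and the research question is how to get this effect without an engineer picking the numbers by hand.

```cpp
#include <cassert>
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/per_index_db", &db);
  assert(s.ok());

  // Write-heavy index: a larger memtable to absorb bursts (illustrative value).
  rocksdb::ColumnFamilyOptions write_heavy;
  write_heavy.write_buffer_size = 256 * 1024 * 1024;

  // Read-mostly index: a smaller memtable, relying more on the block cache.
  rocksdb::ColumnFamilyOptions read_mostly;
  read_mostly.write_buffer_size = 16 * 1024 * 1024;

  rocksdb::ColumnFamilyHandle* cf_orders = nullptr;
  rocksdb::ColumnFamilyHandle* cf_users = nullptr;
  s = db->CreateColumnFamily(write_heavy, "idx_orders", &cf_orders);
  assert(s.ok());
  s = db->CreateColumnFamily(read_mostly, "idx_users", &cf_users);
  assert(s.ok());

  // Each index now flushes and compacts on its own schedule.
  db->Put(rocksdb::WriteOptions(), cf_orders, "order#1", "...");
  db->Put(rocksdb::WriteOptions(), cf_users, "user#1", "...");

  db->DestroyColumnFamilyHandle(cf_orders);
  db->DestroyColumnFamilyHandle(cf_users);
  delete db;
  return 0;
}
```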

    Self-tuning DBs

    There is a long tail of users with small workloads who are OK with sub-optimal performance but do not have the engineering resources to fine-tune it. A list of default configuration values does not always work for them either, since it can give unacceptably bad performance on particular workloads. An LSM tree that has no tuning knobs yet provides reasonable performance for any given workload (even the corner cases) is much more practical for this silent, long tail of users.

    Bloom filter for range queries

    How can we optimize RocksDB when it uses non-volatile memory (NVM) as the storage device?

    RocksDB on hierarchical storage

    How can we optimize RocksDB based on the particular characteristics of time series databases? For example, can we do better encoding of keys knowing that they are integers in ascending order? Can we do better encoding of values knowing that they are floating-point numbers with high correlation between adjacent numbers? What is an optimal compaction strategy given the patterns of data distribution in a time series DB? Does such data show different characteristics at different levels of the LSM tree, and can we leverage that information for a more efficient data layout at each level?
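    As a toy illustration of the key-encoding question (not an existing RocksDB feature): ascending integer keys such as timestamps can be stored as deltas from their predecessors, which are small and compress far better than the raw 8-byte values; a similar delta/XOR idea applies to correlated floating-point values.

```cpp
#include <cstdint>
#include <vector>

// Encode ascending keys as the first key followed by successive deltas.
std::vector<uint64_t> DeltaEncode(const std::vector<uint64_t>& keys) {
  std::vector<uint64_t> out;
  uint64_t prev = 0;
  for (uint64_t k : keys) {
    out.push_back(k - prev);  // small numbers for dense, ascending timestamps
    prev = k;
  }
  return out;
}

// Recover the original keys with a running sum over the deltas.
std::vector<uint64_t> DeltaDecode(const std::vector<uint64_t>& deltas) {
  std::vector<uint64_t> out;
  uint64_t prev = 0;
  for (uint64_t d : deltas) {
    prev += d;
    out.push_back(prev);
  }
  return out;
}

int main() {
  // Timestamps one second apart: the deltas become {t0, 1, 1, 1}, which a
  // varint encoding or a general-purpose compressor shrinks dramatically.
  std::vector<uint64_t> keys = {1500000000, 1500000001, 1500000002, 1500000003};
  std::vector<uint64_t> deltas = DeltaEncode(keys);
  std::vector<uint64_t> roundtrip = DeltaDecode(deltas);
  return roundtrip == keys ? 0 : 1;
}
```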