EloqKV: Achieving Predictable P99.99 Latency on NVMe with Redis API

Posted by hubertzhang 1 day ago


Comments

Comment by hubertzhang 1 day ago

Most Redis alternatives that use disk for persistence struggle with tail latency (P99.99) because of background maintenance or OS filesystem overhead. We built EloqKV on a custom storage engine, EloqStore, to solve this.

Key Architectural Choices:

- Custom B-tree Variant: Unlike the LSM-trees used in many disk-backed stores, our B-tree variant avoids the "compaction stalls" that typically cause high tail latency during heavy writes.

- Coroutines & io_uring: We leverage io_uring for asynchronous I/O and use coroutines to manage thousands of concurrent I/O requests without thread context-switching overhead.

- Object Storage Integration (optional): EloqStore can use object storage as the primary persistent layer, with NVMe acting as a high-speed cache/tier, providing durability without sacrificing speed.
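To make the coroutine point concrete, here is a minimal sketch in Python of many in-flight reads multiplexed on a single thread. It is an illustration only, not EloqStore code: `read_page`, `read_many`, and `max_in_flight` are hypothetical names, and `asyncio.sleep` stands in for an io_uring submission and completion.

```python
import asyncio


async def read_page(page_id: int, delay: float = 0.01) -> bytes:
    # Hypothetical stand-in for an asynchronous NVMe read. A real engine
    # would submit the request to io_uring and resume this coroutine when
    # the completion arrives; here we only simulate the wait.
    await asyncio.sleep(delay)
    return page_id.to_bytes(8, "little")


async def read_many(page_ids, max_in_flight: int = 256):
    # Bound the number of outstanding requests, much like a fixed-size
    # submission queue would.
    sem = asyncio.Semaphore(max_in_flight)

    async def bounded(page_id: int) -> bytes:
        async with sem:
            return await read_page(page_id)

    # All coroutines share one thread; waiting on I/O suspends a coroutine
    # instead of blocking an OS thread.
    return await asyncio.gather(*(bounded(p) for p in page_ids))


def run_demo(n: int = 2000):
    return asyncio.run(read_many(range(n)))
```

Calling `run_demo(2000)` issues 2000 simulated reads with at most 256 outstanding at once, all from one thread.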

We’ve reached a point where we can provide predictable P99.99 latency even when the working set is primarily on NVMe. We’d love to answer any questions about the storage internals or our benchmarking process.

Comment by the_precipitate 1 day ago

With DRAM prices this high, this is certainly a welcome feature. But how do you control write latency? B+ trees are pretty bad at updates, and LMDB, another B-tree-based storage engine, is lightning fast on reads but quite bad on writes compared with RocksDB.

Comment by iamlintaoz 1 day ago

The disk storage EloqKV uses (EloqStore [1]) is optimized for batch updates because the upper Data Substrate layer manages buffering and the Write-Ahead Log (WAL), absorbing writes and guaranteeing durability. When durability is not required, the WAL can be disabled.

[1] github.com/eloqdata/eloqstore

Disclaimer: I am the CEO of EloqData
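For readers unfamiliar with the pattern, a rough sketch of "WAL plus write buffer in front of a batch-updated store" might look like the following. This is not the Data Substrate implementation: the class `BufferedStore`, its parameters, and the JSON log format are all assumptions made for illustration, and a plain dict stands in for the on-disk tree.

```python
import json
import os
import tempfile


class BufferedStore:
    """Toy write path: append to a WAL, buffer in memory, apply in batches."""

    def __init__(self, wal_path, batch_size=4, use_wal=True):
        self.wal_path = wal_path
        self.batch_size = batch_size
        self.use_wal = use_wal      # the WAL is optional, as in the comment above
        self.buffer = {}            # absorbs individual writes
        self.disk = {}              # stand-in for the on-disk B-tree
        self.batches_applied = 0

    def put(self, key, value):
        if self.use_wal:
            # Durability comes from the log, not from the tree update.
            with open(self.wal_path, "a", encoding="utf-8") as wal:
                wal.write(json.dumps([key, value]) + "\n")
                wal.flush()
                os.fsync(wal.fileno())
        self.buffer[key] = value
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # One sorted batch update instead of many random single-key updates.
        for key in sorted(self.buffer):
            self.disk[key] = self.buffer[key]
        self.buffer.clear()
        self.batches_applied += 1
        if self.use_wal:
            # Logged entries are now in the store, so the log can be truncated.
            open(self.wal_path, "w", encoding="utf-8").close()

    def get(self, key):
        # Reads check the buffer first, then the store.
        if key in self.buffer:
            return self.buffer[key]
        return self.disk.get(key)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        store = BufferedStore(os.path.join(tmp, "wal.log"), batch_size=4)
        for i in range(5):
            store.put(f"k{i}", f"v{i}")
        return store.batches_applied, len(store.buffer), store.get("k4")
```

In `demo()`, the first four writes are applied to the store as one batch and the fifth stays in the buffer, protected by the log until the next flush.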

Comment by hubertzhang 1 day ago

We rely on batch-write optimization: a copy-on-write B-tree variant enables high-throughput batch writes without blocking concurrent reads. The MVCC-based design eliminates lock contention and provides predictable write amplification.
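As a rough illustration of why copy-on-write lets readers proceed during a batch write, here is a minimal path-copying sketch in Python. It uses a plain binary search tree for brevity rather than a B-tree, and it is not EloqStore's code: `Node`, `insert`, `lookup`, and `apply_batch` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    # Immutable node: an update never modifies a node in place.
    key: str
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], key: str, value: str) -> Node:
    # Copy only the nodes on the path from the root to the key; every
    # untouched subtree is shared between the old and new versions.
    if root is None:
        return Node(key, value)
    if key < root.key:
        return root._replace(left=insert(root.left, key, value))
    if key > root.key:
        return root._replace(right=insert(root.right, key, value))
    return root._replace(value=value)


def lookup(root: Optional[Node], key: str) -> Optional[str]:
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


def apply_batch(root: Optional[Node], batch: dict) -> Optional[Node]:
    # The writer builds a new root; readers holding the old root keep
    # seeing a consistent snapshot and never need a lock.
    for key, value in batch.items():
        root = insert(root, key, value)
    return root
```

A reader that captured the old root before `apply_batch` runs keeps seeing the old values, while new readers pick up the new root once the writer publishes it. That is the MVCC-style behavior described above, in miniature.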
