Learning #1: Log Structured Merge Trees
Yesterday while reading through the Stellar CAPs (Core Advancement Proposals), I came across the concept of Log Structured Merge Trees in the rationale behind Graydon Hoare’s performance improving CAP-0020. LSM Trees serve as the basis of the Bucket List Data Structure, a replication of the current state, or ledger used for:
- efficient hashing over the entire state following every change
- optimal diffing for nodes catching up to the current ledger
If you are interested in learning how the Bucket List fits into the data flow of Stellar Core, I totally recommend checking out this wonderful presentation Graydon put together.
So having never come across LSM trees in my university computer science courses, today I decided I would take some time to really figure out what an LSM tree is and how it works.
Log Structured Merge Tree (LSMT) Basics
– Credits to Source 1
- disk and memory access is always faster when accessed sequentially than randomly
- logging, or appending data to a file, is fully sequential so it provides very fast write performance
- however, reading data from a log is slow (optimal only in small workloads where data accessed in entirety or by a known offset)
- to get better read, we can structure the data (think binary search, hash, B+ tree, or external file), but we lose the write performance
- the issue with structuring data on write is that two IO’s are needed per write, one to read and one to write whereas in the logging approach we just have a write
- even worse is update-in-place, requiring even slower, random IO
- one solution is HashTable, but then it is possible for the index to be larger than the data
LSMT In a Nutshell
Updates are added to an in-memory buffer which preserves order. As the memtable is filled, its contents are flushed to a new file on disk. These files on disk are small, immutable, and sorted (obviously leading to some data redundancy since new updates go to new files). Thus not only does this make reading a bit easier later, it makes the periodic compaction and merge process much smoother.
Reading is done by first checking the memtable and then checking each file in reverse chronological order. Clearly as there are more and more files, reading will be slower. Two optimizations here are holding a page-index in memory and use of a bloom filter. However, even with these optimizations, LSM’s are slower on read than other alternatives and thus are good for high write throughput systems.
–Credits to Source 1
Merging, Compaction, and Deletion
Merging and compaction seeks to improve read efficiency by reducing the number of files and the redundant data. The most basic form of compaction is size-based where the compaction may take five, ten row files and compact them to one fifty row file, five, fifty row files and compact them into one 250 row file, etc. LevelDB, RocksDB and Cassandra utilize a better, level-based approach. This approach “reduces the number of files that must be consulted for the worst case read, as well as reducing the relative impact of a single compaction” [Source 1]. At each level, there are many files, however none of the files have overlapping keys (where the first level is an exception). Merging is done when a level fills up by merging a single file into an upper level at a time.
So how does deletion work? For this, we use a tombstone value. As noted in [Source 4], “because we check the indexes in sequence, future reads will find the updated or the tombstone record without ever reaching the older values!”
Back to Stellar
Since Stellar utilizes the redundant Bucket List not for reading (that is what the SQL database is for), it makes sense that LSM Trees would be used since we only really need a high write throughput. Furthermore, [Source 1] points out that “cheaper SSDs and mechanical drives are better suited to LSM.” Since Stellar needs to function properly on low power CPU’s just as well as high power CPU’s, this is yet another reason to use LSM’s.
Going back full circle, CAP-0020 invesitigates the redundancy of dead entries, as noted in [Source 5]: “a substantial amount of space can be wasted since there is no guarantee as to how quickly obsolete columns will be merged out of existance; this is particularly noticeable when there is a high ratio of deletes.” The issue with Stellar is that “the current bucket list consists of a set of entries that can be in one of two states: live or dead. Dead entries (so-called “tombstones”) exist strictly to override the live-ness of a live entry in an older level of the bucket list. If a tombstone is present in the bucket list for which no live entry exists in an older level, that tombstone is redundant: there is nothing for it to override.The original design of the bucket list assumed that most entries in the ledger would be relatively long-lived (such as accounts) and therefore the presence of redundant tombstones would not be a major performance issue. In practice, we observe that the great majority of ledger entries have turned out to be short-lived (primarily offers) and therefore buckets have accumulated many redundant tombstones.” [CAP-0020]. This improvement will merge redundant, dead entries out of newer buckets much more effectively, making newer buckets smaller. How exciting! If you are interested in seeing the implementation details, check out this PR.
- Source 1 – Ben Stopford on LSM Trees
- Source 2 – Databass on LSM Trees
- Source 3 – Wikipedia on LSM Trees
- Source 4 – SSTables and LevelDB
- Source 5 – Leveled Compaction in Apache Cassandra
- Source 6 – CAP-0020