01. Complex systems
are built from simple ideas
Complex software like databases, compilers, and browsers are treated
like black boxes. You use them every day as a user, but you
probably don’t understand them as a programmer, even though
they are nothing but code. Why?
- They have little in common with programmers’ daily task.
- Their code bases are so large, so discouraging.
But that doesn’t mean you can’t learn these things. People have built
small, toy versions of these things for learning purposes, such as C in 4 Functions, Crafting Interpreters, Web Browser Engineering. It
turns out that if you minimize uninteresting details and focus on the
essentials, you can recreate seemingly complex software and learn its
core ideas.
For compilers, some of the core ideas include:
- Parsing stuff with recursion.
- Representing a program as a tree and simulate it.
- Representing a program as a list of instructions with gotos.
- Stack machine or register machine.
None of these ideas are complicated. So a compiler is a viable and
useful exercise for programmers.
02. What are the core
ideas of databases?
I’ve built a small database in 3000 lines
from scratch in Go to learn the core ideas of databases. It’s not that
complicated if you approach it in a certain way.
2.1 Power-loss atomicity
A database stores data on disk, but why not just use files? Why is
the filesystem not a database? Because databases eliminate a common
source of data corruption: partially written files caused by a crash or
power loss.
Video games warn you not to turn off the power while saving, because
a partially written save file can destroy all your progress. Building a
database will teach you how to eliminate this concern. The solution is
an append-only log with a checksum for each item.
- Append-only means that writes will not destroy existing data.
- Checksum means that partial writes will be detected and discarded,
making appends power-loss atomic.
╔═══╤════╦═══╤════╦══╌╌╌══╦═══╤════╗
║sum│data║sum│data║more...║sum│data║
╚═══╧════╩═══╧════╩══╌╌╌══╩═══╧════╝
╰───┬────╯
possible partial write
These are the first 2 ideas. But they are not enough to build a
database.
2.2 Index data with data
structures
Logs have uses for systems like Kafka, but you cannot just put
everything in a log and forget about it. You have to find stuff in it,
so you need data structures like B+trees, LSM-trees, hashtables. That’s
where I started, I built a B+tree in 366 lines.
But if you just put a data structure on disk, you have the partial
write problem. Is it possible to apply the log-based ideas to data
structures? There are 2 ideas:
The first idea is to make the data structure append-only, like a log.
There are ways to make a B+tree append-only, also called a
copy-on-write B+tree. A copy-on-write tree does not
overwrite existing nodes, it creates new nodes from leaf to root for the
entire path. New nodes can be appended like a log.
d d D*
/ \ / \ / \
b e ──► b e + B* e
/ \ / \ / \
a c a c a C*
original updated
The other idea is to use both a log and a main data structure:
- Before updating the main data structure, store the intended updates
in the log. Each log record contains a list of “write this data to that
position”. - Then apply the updates to the main data structure.
- When the databases start, it can always apply the last log record to
fix possible partial writes in the main data structure, regardless of
its status.
I chose the copy-on-write B+tree because it doesn’t require a log,
which reduces the LoC. Moving the B+tree to disk involves interfacing
with the filesystem, including the fsync()
. In 601 lines, I
have an append-only KV store on disk.
2.3 Recycle and reuse unused
space
I have to handle the consequence of append-only because storage is
not infinite. I simply added a free list to recycle unused B+tree nodes.
In 731 lines, I have a practical KV store.
first_item
↓
list_head ─► [ next | xxxxx ]
↓
[ next | xxxxxxxx ]
↓
list_tail ─► [ NULL | xxxx ]
↑
last_item
2.4 Relational DB on KV
A SQL query can be arbitrarily complex, but it boils down to 2 data
structure operations: point query and range query. That’s why I started
with B+tree data structures. Databases are built on data structures, not
on some theories about relations.
I added a layer on top of KV to store records with primary keys and
columns. Each record is encoded as KV pair. The LoC increased to 1107.
But it’s still just a KV. So I added 2 more features:
- Range query, 1294 lines.
- Secondary index, 1438 lines.
I didn’t bother with SQL at this point because SQL is just a user
interface. I built APIs for databases which is called a storage
engine.
2.5 Concurrent control
I can ignore concurrency and make everything serialized. But using a
copy-on-write B+tree means that readers get snapshot isolation for free,
because updates do not destroy old versions. So readers will not be
blocked by the writer.
I went a step further to allow write transactions to run
concurrently, and then merge the updates at the end of the transaction.
This requires detecting and rejecting conflicting updates. In 1702
lines, I have a transactional interface.
2.6 SQL-like query language
My user interface is just a set of APIs to operate the database. The
next step is a query language. I chose a SQL-like syntax. I only
implemented simple queries with a single point/range query, so there is
no query planner.
Parsing SQL is not a database topic, but a compiler topic. It’s done
by applying recursion in a specific way, which is hardly an idea once
you’ve learned it.
A query language is not just about data, it can do computations, like
SELECT a + b FROM c
. So I need an interpreter.
Then I need lots of glue code between the query language and the
database. The lines of code increased from 1702 to 2795, most of them
uninteresting.
03. Database in 3000
lines, incrementally
This is my minimalist attempt to build a database from scratch. Each
step adds an essential part while minimizing uninteresting details.
366 | B+tree data structure. |
601 | Append-only KV. |
731 | Practical KV with a free list. |
1107 | Tables on KV. |
1294 | Range queries. |
1438 | Secondary indexes. |
1461 | Transactional interfaces. |
1702 | Concurrency control. |
2795 | SQL-like query language. |
The lessons as a book
I’ve turned this into a book so that you can
follow my steps. The first half of the book is free online and contains
the most interesting parts of the storage engine, which is usable as a
KV on its own.
3000 lines is not much, but it’s an effective exercise. The first
edition took me only 2 months of free time. The book is updated whenever
I learn something new, so this is the second edition.
Premium IPTV Experience with line4k
Experience the ultimate entertainment with our premium IPTV service. Watch your favorite channels, movies, and sports events in stunning 4K quality. Enjoy seamless streaming with zero buffering and access to over 10,000+ channels worldwide.
