Skip to content

A Distributed SQL Database - Building the Database in the Public to Learn Database Internals

Notifications You must be signed in to change notification settings

TypicalDefender/entangleDB

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

36 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Table of Contents

Overview

I'm working on creating entangleDB, a project that's all about really getting to know how databases work from the inside out. My aim is to deeply understand everything about databases, from the big picture down to the small details. It's a way for me to build a strong foundation in database.

The name "entangleDB" is special because it's in honor of a friend who loves databases just as much as I do.

The plan is to write the database in Rust. My main goal is to create something that's not only useful for me to learn from but also helpful for others who are interested in diving deep into how databases work. I'm hoping to make it postgresSQL compatible.

Usage

Pre-requisite is to have the Rust compiler; follow this doc to install the Rust compiler

entangledb cluster can be started on localhost ports 3201 to 3205:

(cd husky/cloud && ./build.sh)

Client can be used to connect with the node on localhost port 3205:

cargo run --release --bin entanglesql

Connected to EntangleDB node "5". Enter !help for instructions.
entangledb> SELECT * FROM dishes;
poha
breads
korma

TODO

  1. Make the isolation level configurable; currently, it is set to repeatable read (snapshot).
  2. Implement partitions, both hash and range types.
  3. Utilize generics throughout in Rust, thereby eliminating the need for std::fmt::Display + Send + Sync.
  4. Consider the use of runtime assertions instead of employing Error::Internal ubiquitously.
  5. Revisit the implementation of time-travel queries

MVCC in entangleDB

image

SQL Query Execution in entangleDB

image

entangleDB Raft Consensus Engine

image

What I am trying to build

1. Distributed Consensus Engine

The design for entangleDB centers around a custom-built consensus engine, intended for high availability in distributed settings. This engine will be crucial in maintaining consistent and reliable state management across various nodes.

A key focus will be on linearizable state machine replication, an essential feature for ensuring data consistency across all nodes, especially for applications that require strong consistency.

2. Transaction Engine

The proposed transaction engine for entangleDB is committed to adhering to ACID properties, ensuring reliability and integrity in every transaction.

The plan includes the implementation of Snapshot Isolation and Serializable Isolation, with the aim of optimizing transaction handling for enhanced concurrency and data integrity.

3. Storage Engine

The planned storage engine for entangleDB will explore a variety of storage formats to find and utilize the most efficient methods for data storage and retrieval.

The storage layer is being designed for flexibility, to support a range of backend technologies and meet diverse storage requirements.

4. Query Engine

The development of the query engine will focus on rapid and effective query processing, utilizing advanced optimization algorithms.

A distinctive feature of entangleDB will be its ability to handle time-travel queries, allowing users to access and analyze data from different historical states.

5. SQL Interface and PostgreSQL Compatibility

The SQL interface for entangleDB is intended to support a wide array of SQL functionalities, including complex queries, joins, aggregates, and window functions.

Compatibility with PostgreSQL’s wire protocol is a goal, to facilitate smooth integration with existing PostgreSQL setups and offer a solid alternative for database system upgrades or migrations.

Proposed Architecture

Screenshot 2023-12-02 at 1 26 15 PM

SQL Engine

The SQL Engine is responsible for the intake and processing of SQL queries. It consists of:

  • SQL Session: The processing pipeline within a session includes:
    • Parser: Interprets SQL queries and converts them into a machine-understandable format.
    • Planner: Devises an execution plan based on the parsed input.
    • Executor: Carries out the plan, accessing and modifying the database.

Adjacent to the session is the:

  • SQL Storage Raft Backend: This component integrates with the Raft consensus protocol to ensure distributed transactions are consistent and resilient.

Raft Engine

The Raft Engine is crucial for maintaining a consistent state across the distributed system:

  • Raft Node: This consensus node confirms that all database transactions are in sync across the network.
  • Raft Log: A record of all transactions agreed upon by the Raft consensus algorithm, which is crucial for data integrity and fault tolerance.

Storage Engine

The Storage Engine is where the actual data is stored and managed:

  • State Machine Driver: Comprising of:
    • State Machine Interface: An intermediary that conveys state changes from the Raft log to the storage layer.
    • Key Value Backend: The primary storage layer, consisting of:
      • Bitcask Engine: A simple, fast on-disk storage system for key-value data.
      • MVCC Storage: Handles multiple versions of data for read-write concurrency control.

entangleDB Peers

  • interaction between multiple database instances or "peers".

Example SQL Queries that you will be able to execute in entangleDB

-- Transaction example with a table creation, data insertion, and selection
BEGIN;

CREATE TABLE employees (id INT PRIMARY KEY, name VARCHAR, department VARCHAR);
INSERT INTO employees VALUES (1, 'Alice', 'Engineering'), (2, 'Bob', 'HR');
SELECT * FROM employees;

COMMIT;

-- Aggregation query with JOIN
SELECT department, AVG(salary) FROM employees JOIN salaries ON employees.id = salaries.emp_id GROUP BY department;

-- Time-travel query
SELECT * FROM employees AS OF SYSTEM TIME '-5m';

Learning Resources I've been using for building the database

For a comprehensive list of resources that have been learning what to build in a distributed database, check out the Learning Resources page.

About

A Distributed SQL Database - Building the Database in the Public to Learn Database Internals

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages