DDIA Reading Notes - Chapter 2

Chapter Overview

In chapter 2 of the book "Designing Data-Intensive Applications", the author Martin Kleppmann dives deep into the concepts of data models and query languages. Kleppmann mainly compares three prominent data models: the relational model, the document model, and the graph data model.

  1. The Relational Model: This model, pioneered by Edgar F. Codd in the 1970s, represents data as tables with rows and columns. It forms the basis for SQL databases like PostgreSQL, MySQL, and Oracle.

  2. The Document Model: Here, data is stored as self-contained documents, often in a JSON or XML format. This model is used by NoSQL databases like MongoDB and CouchDB.

  3. The Graph Data Model: In this model, data is represented as a graph, with nodes (entities) and edges (relationships). Graph databases like Neo4j excel at handling highly interconnected data.
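
To make the relational model concrete, here is a minimal sketch in Python using the standard-library `sqlite3` module. The table and column names (`users`, `positions`) are illustrative choices for the résumé-style data discussed later, not schemas from the book:

```python
import sqlite3

# In-memory database; schema and data are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE positions (
        position_id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(user_id),  -- foreign key back to users
        job_title TEXT
    );
    INSERT INTO users VALUES (1, 'Alice');
    INSERT INTO positions VALUES (10, 1, 'Data Engineer');
""")

# Normalized data is split across tables; a join reassembles it at query time.
row = conn.execute("""
    SELECT u.name, p.job_title
    FROM users u JOIN positions p ON p.user_id = u.user_id
""").fetchone()
print(row)  # ('Alice', 'Data Engineer')
```

The point of the sketch: each fact lives in exactly one row, and relationships are expressed through foreign keys rather than by nesting data.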

After exploring data models, the chapter shifts focus to query languages, which are the interfaces we use to interact with databases and retrieve or manipulate data. Kleppmann covers two prominent query languages:

  1. SQL (Structured Query Language): This declarative language is used with relational databases. You specify what data you want, and the database's query optimizer decides how to retrieve it.

  2. MapReduce Querying: This programming model is for processing large amounts of data in bulk. The MapReduce model consists of two main phases: the Map phase, where each record is transformed into key-value pairs, and the Reduce phase, where all values sharing a key are combined and aggregated.
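
The two phases above can be sketched in plain Python. This is a single-machine toy, not a real distributed framework; the dataset and field names are invented for illustration:

```python
from itertools import groupby
from operator import itemgetter

# Toy dataset: which skill appears on which user's profile (made up).
records = [
    {"user": "alice", "skill": "sql"},
    {"user": "bob", "skill": "sql"},
    {"user": "alice", "skill": "python"},
]

def map_phase(record):
    # Map: emit a (key, value) pair for each input record.
    yield (record["skill"], 1)

def reduce_phase(key, values):
    # Reduce: combine all values emitted under one key.
    return (key, sum(values))

# Shuffle: sort and group intermediate pairs by key,
# as the framework would do between the two phases.
intermediate = sorted(
    (pair for r in records for pair in map_phase(r)),
    key=itemgetter(0),
)
results = [
    reduce_phase(k, (v for _, v in group))
    for k, group in groupby(intermediate, key=itemgetter(0))
]
print(results)  # [('python', 1), ('sql', 2)]
```

Because `map_phase` looks at one record at a time and `reduce_phase` looks at one key at a time, both phases can be spread across many machines, which is what makes the model attractive for bulk processing.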

Mind Map

Key Concepts

  1. Normalization and Denormalization: The concepts of normalization (reducing redundancy) and denormalization (introducing redundancy) in database design are critical trade-offs that significantly affect query performance, storage efficiency, and data consistency.

  2. Query Optimization: The concept of query optimization, including techniques like indexing, execution planning, and distributed query processing, is crucial for achieving good performance in data-intensive applications.

  3. Replication and Partitioning: The concepts of replication (creating redundant copies of data) and partitioning (splitting data across multiple nodes) are fundamental for building scalable, fault-tolerant, and performant distributed data systems.

  4. Data Locality: The concept of keeping related data as close together as possible, either on the same machine or in the same partition, to minimize the need for remote data access and improve query performance.
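
The normalization trade-off in item 1 can be shown with two tiny in-memory layouts. The records below are invented for illustration (the region name echoes the kind of résumé data the book discusses):

```python
# Normalized: the region name is stored once; user rows reference it by id.
# Renaming the region means updating a single entry.
regions = {1: "Greater Seattle Area"}
users_normalized = [
    {"name": "Alice", "region_id": 1},
    {"name": "Bob", "region_id": 1},
]

# Denormalized: the region name is copied into every user row.
# Reads need no extra lookup, but an update must touch every copy.
users_denormalized = [
    {"name": "Alice", "region": "Greater Seattle Area"},
    {"name": "Bob", "region": "Greater Seattle Area"},
]

# Read path comparison: the normalized layout needs a lookup (a "join"),
# the denormalized layout does not.
region_via_join = regions[users_normalized[0]["region_id"]]
region_inline = users_denormalized[0]["region"]
print(region_via_join, region_inline)
```

Both reads return the same value; the difference is purely in how much work reads and writes each have to do, which is exactly the trade-off the chapter highlights.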

Interesting example

In Chapter 2, the author Martin Kleppmann uses the example of LinkedIn's system for storing and querying data about users, their professional connections, jobs, skills, and other domain-specific entities and relationships.

In the relational model, data is normalized into separate tables with relationships established via foreign keys. The document model embeds related data within a single document structure. The graph model represents entities as nodes and relationships as edges directly connecting them.

The choice between these models depends on factors like the query patterns, how interconnected the data is, and the scalability requirements of the LinkedIn résumé use case.
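
The document and graph representations of that résumé data can be sketched side by side. The field names, node labels, and edge labels below are illustrative, not taken from the book or any particular database:

```python
# Document model: the whole résumé is one self-contained document,
# with positions and skills nested inside it (field names are made up).
resume_doc = {
    "user_id": 1,
    "name": "Alice",
    "positions": [{"job_title": "Data Engineer", "company": "examplecorp"}],
    "skills": ["sql", "python"],
}

# Graph model: entities are nodes, relationships are labeled edges
# connecting them directly (labels are made up).
nodes = {"alice": "Person", "bob": "Person", "examplecorp": "Company"}
edges = [
    ("alice", "WORKS_AT", "examplecorp"),
    ("alice", "CONNECTED_TO", "bob"),
]

# A one-hop traversal: everything directly connected to Alice.
neighbors = [(rel, dst) for src, rel, dst in edges if src == "alice"]
print(neighbors)  # [('WORKS_AT', 'examplecorp'), ('CONNECTED_TO', 'bob')]
```

The document keeps everything about one user in one place (good locality for loading a whole résumé), while the graph makes multi-hop questions like "connections of connections" natural to express.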