2. Query Models & Languages (DDIA)

This chapter covers a range of general-purpose data models for data storage and querying. Before going to data models, we have to first cover a bit of normalization.

Normalization

Let's consider the example of LinkedIn profile. A candidate profile contains an industry type and region of work. We can model both of them as free text or represent it as a standardized list (candidate for normalization). The most important advantages of standardizing are

Ease of update : Can correct the values in a single place
Better Search : Easy to find people belonging to a single region or industry

This is referred to as normalization in databases. The values are assigned an id and referred in all other relationships.

Hierarchical Model

This was the first database model as part of IBM's Information Management system. Data was organized as tree of records and each record was connected to other records through links. The major limitation of this model was that it modelled only one to many relationships.

Considering the LinkedIn example, in hierarchical model we would have to represent person record containing industry and region as text inside the record. This would be inefficient for search as we have to scan all the person records to find people belonging to a particular industry.

Network Model

The network model was standardized by a committee called the Conference on Data Systems Languages (CODASYL) and implemented by several database vendors. This supported children with multiple parents i.e many to many relationships.

Now for the above example, we can make industry and region as parents for the person record which leads to efficient searches. This addressed the modelling problem for search but lead to access path problem. Each record can have multiple links and querying records meant understanding the entire navigational path of the record.

Relational Model

Relational model laid out the data as collection of tuples(rows) under a relation (table). Some columns are marked as indexes or foreign keys. The main idea is to hide the implementation behind a clean declarative interface with the use of Query optimizer. Query optimizer automatically decides which parts of the query to execute in which order, and which indexes to use. The access path problem is offloaded to the query optimizer from the application developer.

NoSQL

Object Relational mismatch was the main factor towards the NoSQL movement. Considering LinkedIn profile, the person's resume is a single object and it is better to represent the data as a single record rather than a set of relationships. There were other driving forces also beyond the birth of NoSQL movement

Greater Scalability and higher throughput beyond relational databases
Open Source over commercial products
Schema Flexibility

Even NoSQL databases are not completely schema less. Relational Databases enforce schema on write whereas NoSQL enforces schema on read. We will cover two subsets of data models under NoSQL movement.

Document Databases

These are in similar vein to Hierarchical databases except they used the concept of foreign key or identifier for related records. That identifier is resolved at read time by using a join or follow-up queries. The main advantage here is the data locality i.e. an entire document or record is accessed when needed.

Graph Databases

Even relational databases are not great for many to many relationships. Graph databases evolved to address these needs. A graph consists of two kinds of objects: vertices (also known as nodes or entities) and edges (also known as relationships or arcs). Many applications such as social network, website require graph model and there are well known algorithms operating on graph model. This looks similar to the network model but it differs in important ways

Schema flexibility than CODASYL
Refer vertex by ID and not by access path
Use of declarative language like Cypher or SPARQL than imperative programming

All three models (document, relational, and graph) are widely used today, and serve good purpose. One model can be emulated in terms of another model(converging use cases) but the result is often awkward. This is the reason for the drift towards polyglot persistence and there is not one-size-fits-all solution