In this article we will discuss transactional and analytical workloads.
An application typically looks up a few records by a key, and records are inserted or updated based on the user's input. Because these applications are interactive and require low latency, this access pattern became known as online transaction processing (OLTP).
The other distinctive access pattern comes from analytic queries, which scan a huge number of records and aggregate over certain columns. These queries are often directed towards creating reports to aid in decision-making. This access pattern is known as online analytic processing (OLAP).
Initially, the same database was used for both OLTP and OLAP, but over time OLAP came to be run on a dedicated database called a data warehouse. A data warehouse holds data replicated from OLTP databases and is typically tuned for analytic workloads. The process of getting data into the warehouse is known as Extract–Transform–Load (ETL). The indexing patterns discussed earlier typically cover OLTP workloads.
The data models for OLAP databases are different. Typically, a star schema is used, involving fact tables and dimension tables. Facts are the events that actually happen; for example, in a retail store each sales transaction is a fact. Dimensions hold additional contextual information about the fact, such as product details: the fact table contains the product id, while the dimension table contains the product information. A variation of the star schema, known as the snowflake schema, breaks dimensions down further into sub-dimension tables.
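The fact/dimension split above can be sketched with plain Python dicts. This is a toy illustration, not a real warehouse schema; the table and column names (`fact_sales`, `dim_product`) are made up for the example.

```python
# Dimension table: one row per product, keyed by product_id.
dim_product = {
    1: {"name": "Widget", "category": "Hardware"},
    2: {"name": "Gadget", "category": "Electronics"},
}

# Fact table: one row per sales event, referencing the dimension by id.
fact_sales = [
    {"product_id": 1, "quantity": 3, "amount": 30.0},
    {"product_id": 2, "quantity": 1, "amount": 15.0},
    {"product_id": 1, "quantity": 2, "amount": 20.0},
]

def sale_with_product(sale):
    """Resolve a fact row's product_id to its contextual details."""
    product = dim_product[sale["product_id"]]
    return {**sale, "product_name": product["name"]}

print(sale_with_product(fact_sales[0]))
# {'product_id': 1, 'quantity': 3, 'amount': 30.0, 'product_name': 'Widget'}
```

Note that the fact table stores only the compact id; everything descriptive about the product lives once in the dimension table.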
Fact tables can grow to petabytes of data, and querying them efficiently is a challenge. But if we analyze the typical access patterns, even when a table has 100 columns, a typical query accesses only a few of them at a time. This leads to an interesting perspective: column storage.
Traditionally, row storage makes sense for OLTP queries, e.g. "get me all details of product id 123", whereas an OLAP query looks like "get me the revenue of sales for a particular product over a timeframe", touching only the sales amount, product, and timestamp columns.
A column storage layout relies on each column file containing the rows in the same order, so that a row is easy to reassemble. Column storage also compresses well: each column is stored separately, and every value within it has the same data type. One such compression technique is bitmap encoding.
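The contrast between the two layouts, and why each suits its workload, can be sketched in a few lines of Python. The data is invented for illustration; real column stores keep each column in its own file on disk.

```python
# Row layout: each record kept together, as an OLTP engine would.
rows = [
    {"product_id": 123, "amount": 10.0, "ts": "2024-01-01"},
    {"product_id": 456, "amount": 20.0, "ts": "2024-01-02"},
    {"product_id": 123, "amount": 5.0,  "ts": "2024-01-03"},
]

# Column layout: one list per column, all in the same row order,
# so position i across the lists reassembles row i.
columns = {
    "product_id": [r["product_id"] for r in rows],
    "amount":     [r["amount"] for r in rows],
    "ts":         [r["ts"] for r in rows],
}

# OLTP-style query: all details of product 123 (row layout is natural).
details = next(r for r in rows if r["product_id"] == 123)

# OLAP-style query: revenue for product 123 reads only two columns.
revenue = sum(a for p, a in zip(columns["product_id"], columns["amount"])
              if p == 123)
print(revenue)  # 15.0
```

The OLAP query never touches the `ts` column at all, which is the whole point: with 100 columns on disk, it would still read only the two it needs.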
For example, consider a column with three possible values — Yes, No and Maybe — and three rows holding Yes, No and Maybe in that order. Each distinct value gets its own bitmap, with one bit per row, set where that row holds the value:

| value | bitmap |
|-------|--------|
| Yes   | 100    |
| No    | 010    |
| Maybe | 001    |
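The bitmap construction above can be sketched directly, using a slightly longer column so the bitmaps are less trivial. The column values are invented for the example.

```python
column = ["Yes", "No", "Maybe", "Yes"]

# One bitmap per distinct value: bit i is 1 where row i holds that value.
bitmaps = {}
for value in set(column):
    bitmaps[value] = [1 if v == value else 0 for v in column]

print(bitmaps["Yes"])  # [1, 0, 0, 1]

# A filter like "WHERE value IN ('Yes', 'Maybe')" becomes a bitwise OR
# of the matching bitmaps, with no scan of the original values.
matches = [a | b for a, b in zip(bitmaps["Yes"], bitmaps["Maybe"])]
print(matches)  # [1, 0, 1, 1]
```

With few distinct values and many rows, these bitmaps are highly repetitive, which is why they compress so well (e.g. with run-length encoding).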
The other aspect is improved query execution. The query engine can take a chunk of compressed column data that fits comfortably in the CPU's L1 cache and iterate through it in a tight loop; compression lets more values from a column fit in the same amount of cache. This technique is known as vectorized processing.
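The chunk-at-a-time idea can be roughed out as follows. This is only a shape sketch: Python itself does not give you L1-resident tight loops, and the chunk size here is illustrative rather than tied to any real cache size.

```python
# One decompressed column of values (invented data).
amounts = list(range(1000))
CHUNK = 256  # illustrative chunk size, not a real L1 measurement

total = 0
for start in range(0, len(amounts), CHUNK):
    chunk = amounts[start:start + CHUNK]
    subtotal = 0
    for v in chunk:      # tight inner loop over one cache-sized chunk
        subtotal += v
    total += subtotal

print(total)  # 499500
```

A real engine would run that inner loop over packed, compressed values using SIMD instructions; the structure (outer loop over chunks, branch-free inner loop) is what the term "vectorized" refers to.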
As we have seen, column storage lends itself to very good read performance but makes writes costly. It is therefore better suited to OLAP-style workloads, where reads far outnumber writes.
Not all data warehouses use column-oriented architectures, but the approach is gaining popularity. Yet another aspect of data warehouses and analytical workloads is data cubes: data pre-grouped by dimensions to serve common aggregate queries. A cube cannot handle every use case, but it answers the common ones with better performance.
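A data cube can be sketched as precomputed totals keyed by dimension values. This toy version has two dimensions (product and date); the sales data is invented for the example.

```python
from collections import defaultdict

# Raw fact data: (product, date, amount) per sales event.
sales = [
    ("widget", "2024-01-01", 10.0),
    ("widget", "2024-01-01", 5.0),
    ("gadget", "2024-01-02", 20.0),
]

# Build the cube once: total amount grouped by (product, date).
cube = defaultdict(float)
for product, date, amount in sales:
    cube[(product, date)] += amount

# A common aggregate query becomes a lookup, with no scan of the facts.
print(cube[("widget", "2024-01-01")])  # 15.0
```

The trade-off is exactly the one noted above: queries that match the precomputed grouping are fast, but any slice the cube was not built for still has to fall back to the fact table.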
Storage and retrieval is a big topic, and we have covered it across two posts. Refer to the previous post for details.