<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ayokunle Adeniyi</title>
    <description>The latest articles on Forem by Ayokunle Adeniyi (@ayokunle).</description>
    <link>https://forem.com/ayokunle</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1277820%2F6791ec7d-944f-4d66-ba85-5647077dd609.png</url>
      <title>Forem: Ayokunle Adeniyi</title>
      <link>https://forem.com/ayokunle</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ayokunle"/>
    <language>en</language>
    <item>
      <title>Learn about the history of the data stack here</title>
      <dc:creator>Ayokunle Adeniyi</dc:creator>
      <pubDate>Thu, 05 Mar 2026 10:46:24 +0000</pubDate>
      <link>https://forem.com/ayokunle/-1lcc</link>
      <guid>https://forem.com/ayokunle/-1lcc</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/aws-builders" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__org__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2794%2F88da75b6-aadd-4ea1-8083-ae2dfca8be94.png" alt="AWS Community Builders " width="350" height="350"&gt;
      &lt;div class="ltag__link__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1277820%2F6791ec7d-944f-4d66-ba85-5647077dd609.png" alt="" width="420" height="420"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/aws-builders/the-evolution-of-the-modern-data-stack-from-rdbms-to-the-lakehouse-588a" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;The evolution of the Modern Data Stack: From RDBMS to the LakeHouse&lt;/h2&gt;
      &lt;h3&gt;Ayokunle Adeniyi for AWS Community Builders  ・ Jan 13&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#architecture&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#dataengineering&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#learning&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#discuss&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>architecture</category>
      <category>dataengineering</category>
      <category>learning</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Learnings from Pursuing High Data Quality: A Reflective Piece</title>
      <dc:creator>Ayokunle Adeniyi</dc:creator>
      <pubDate>Thu, 05 Mar 2026 10:00:00 +0000</pubDate>
      <link>https://forem.com/aws-builders/learnings-from-pursuing-high-data-quality-a-reflective-piece-57af</link>
      <guid>https://forem.com/aws-builders/learnings-from-pursuing-high-data-quality-a-reflective-piece-57af</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Cover Photo by &lt;a href="https://unsplash.com/@purzlbaum?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Claudio Schwarz&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/a-close-up-of-a-window-with-a-building-in-the-background-fyeOxvYvIyY?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Before diving in, let us discuss what data quality entails. IBM puts it well: data quality measures how well a dataset meets criteria for accuracy, completeness, validity, consistency, uniqueness, timeliness and fitness for purpose, and it is critical to all data governance initiatives within an organisation. Data quality is not only about the data being pristine, but also about it being fit for its intended use, meaning that data quality is context-specific. The domain in which the data is collected and used is as important as the other checks. In fact, in many situations, it provides the foundation for defining checks for accuracy, validity, consistency, timeliness and uniqueness. Furthermore, data quality can build or destroy trust within a team.&lt;/p&gt;

&lt;p&gt;This is a reflective piece that encapsulates my experience setting up a roadmap for maintaining high-quality data. When asked previously about data quality enforcement and implementation, I always responded with “we can use this tool, or that tech, to achieve this”. In practice, I was faced with a rude awakening about how limited that response was.&lt;/p&gt;

&lt;p&gt;To provide better context, consider a scenario in which a group of skilled data analysts and scientists were asked to calculate the same metric for a product over a given time window - a month, week, day, etc. The result was that everyone came up with a different number for that metric. These inconsistencies erode trust, and the output of any data process is predicated on trustworthiness. It matters because if people get different numbers for a metric, the questions become, “Is the data being collected good? Are we introducing errors during processing?”&lt;/p&gt;

&lt;p&gt;This was the point at which I realised that data quality is not just about the tools. It is about key elements that must be present for the effective delivery of data products. I categorise them into 3 main elements: process, people, and technology, all working together in harmony. I will expand on each of them in the following sections of this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  People
&lt;/h2&gt;

&lt;p&gt;As long as the data is going to be read and interpreted by more than one person, the people element must be considered. While this may seem like a downstream activity, it should be addressed as early as possible in any workflow or request, whether automated, routine, or ad hoc.&lt;br&gt;
To make this practical, the quality of the data starts with understanding the request being made. There must be a free flow of knowledge among all stakeholders in the delivery of any analysis. This means that when a senior executive asks, “What is the day 3 retention for product A?”, instead of going ahead to write fancy SQL and Python scripts, it is worth responding with clarifying questions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do you mean classic retention or rolling retention?&lt;/li&gt;
&lt;li&gt;For a global product, do you want regional retention that may show regional patterns?&lt;/li&gt;
&lt;li&gt;Should the retention be in UTC 24-hour cycles, and so on?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What this does is either give you clarity on what exactly needs to be calculated or give leeway for assumptions to be made. Overall, the quality of the resulting data and its interpretation depends on people communicating effectively. Moreover, in a team where analysts and engineers work collaboratively, clear definitions must be documented and made available to produce good downstream data.&lt;/p&gt;
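
&lt;p&gt;To see why those questions matter, here is a small, hedged sketch in Python (the event log and column names are hypothetical): two perfectly reasonable definitions of “day 3 retention” give different numbers on the same data.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Hypothetical event log: one row per user per active day since install.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "day":     [0, 3, 0, 5, 0, 1],
})

cohort = events.loc[events["day"] == 0, "user_id"].nunique()

# Classic day-3 retention: users active exactly on day 3.
classic = events.loc[events["day"] == 3, "user_id"].nunique() / cohort

# Rolling day-3 retention: users active on day 3 or any day after it.
rolling = events.loc[events["day"] &gt;= 3, "user_id"].nunique() / cohort

print(classic, rolling)  # 0.33... vs 0.66... - same data, two answers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;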

&lt;h2&gt;
  
  
  Technology
&lt;/h2&gt;

&lt;p&gt;The tooling list is ever-increasing these days. With tools like &lt;a href="https://greatexpectations.io/" rel="noopener noreferrer"&gt;Great Expectations&lt;/a&gt;, &lt;a href="https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/" rel="noopener noreferrer"&gt;Amazon Deequ&lt;/a&gt;, &lt;a href="https://soda.io/" rel="noopener noreferrer"&gt;SODA&lt;/a&gt;, and platform-embedded tools like the validation rules in &lt;a href="https://www.getdbt.com/product/build-trust-in-data-and-data-teams" rel="noopener noreferrer"&gt;DBT&lt;/a&gt; and &lt;a href="https://aws.amazon.com/glue/features/data-quality/" rel="noopener noreferrer"&gt;AWS Glue Data Quality&lt;/a&gt;, data quality checks are a solved problem technologically. The only questions worth asking are about cost, the competencies of the team, and the best fit with the existing tech stack; basically, the checks that occur during any tool assessment. These tools do a good job of letting teams create valid definitions of what the data should contain, and they provide consistent ways to create, store and report the results of those checks.&lt;br&gt;
Additionally, it is now commonplace to treat data processing and analysis efforts like software development. Therefore, writing maintainable, readable, and modifiable modular code becomes a requirement for collaboration and longevity, rather than a luxury, and using a version control system like Git becomes non-negotiable.&lt;/p&gt;
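
&lt;p&gt;To make the idea concrete, here is a minimal, hand-rolled sketch in Python of the kind of declarative checks these tools formalise (the dataset and rule names are illustrative, not any particular tool's API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 3],
    "amount":   [9.99, 15.00, None, 4.50],
})

# Declarative expectations, in the spirit of Great Expectations or Deequ:
# each rule is a named, reusable assertion about the data.
checks = {
    "order_id is unique":  orders["order_id"].is_unique,
    "amount has no nulls": orders["amount"].notna().all(),
    "amount is positive":  bool((orders["amount"].dropna() &gt; 0).all()),
}

for rule, passed in checks.items():
    print("PASS" if passed else "FAIL", "-", rule)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;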

&lt;h2&gt;
  
  
  Process
&lt;/h2&gt;

&lt;p&gt;We have seen how people working together in harmony play a vital role in aligning on expectations. I consider ‘process’ to be the wrapper around people and technology. Good processes foster seamless interaction among people and with tools to achieve defined goals. For instance, a group of data engineers and data analysts might define a process around the Write-Audit-Publish (WAP) pattern, with data quality and validation tests at the audit layer, so that no data product is published without passing all defined tests. Alternatively, large datasets might get preliminary checks to leverage fail-fast mechanisms.&lt;br&gt;
Building effective processes is not always straightforward. Too many steps make achieving a goal tedious, while processes that are too simple and lenient may lack the robustness to define the safe boundaries and guidelines required for consistent and sustainable results. A good process strikes that balance, and it may take several iterations to build.&lt;/p&gt;
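
&lt;p&gt;As a hedged, minimal sketch of that WAP flow in Python (paths, columns and checks are hypothetical; real implementations typically stage data in a separate table or branch):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

def audit(df):
    # Audit: every defined test must pass before anything is published.
    failures = []
    if df["order_id"].duplicated().any():
        failures.append("duplicate order_id")
    if df["amount"].isna().any():
        failures.append("null amount")
    return failures

# Write: land the new batch in a staging area, away from consumers.
batch = pd.read_parquet("staging/orders_batch.parquet")

# Audit: fail fast so bad data never reaches the published table.
failures = audit(batch)
if failures:
    raise ValueError(f"audit failed: {failures}")

# Publish: only fully audited data becomes visible downstream.
batch.to_parquet("published/orders.parquet")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;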

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;It is tempting to argue that data quality can be achieved with tooling alone, but without the right processes, tools become useless, and without the right people and a commitment to uphold a structure, processes are easily circumvented. This is what really matters: technology, process and people must all work in harmony, and the absence of any one of them creates a fragile data quality framework for any organisation or project. Furthermore, the ideas discussed may not be seen as data quality but rather as an aspect of data governance, and there is a strong alignment between these principles and the concept behind data contracts. Overall, the application of data governance, data contracts, data quality frameworks, or whatever it may be called in an organisation, will rely on these and more.&lt;/p&gt;

</description>
      <category>data</category>
      <category>learning</category>
      <category>discuss</category>
      <category>datascience</category>
    </item>
    <item>
      <title>The evolution of the Modern Data Stack: From RDBMS to the LakeHouse</title>
      <dc:creator>Ayokunle Adeniyi</dc:creator>
      <pubDate>Tue, 13 Jan 2026 11:30:00 +0000</pubDate>
      <link>https://forem.com/aws-builders/the-evolution-of-the-modern-data-stack-from-rdbms-to-the-lakehouse-588a</link>
      <guid>https://forem.com/aws-builders/the-evolution-of-the-modern-data-stack-from-rdbms-to-the-lakehouse-588a</guid>
      <description>&lt;p&gt;This post aims to provide a historical picture of the evolution of the typical data stack over a span of about 5 decades. A lot has happened, but I will try to keep it as simple and digestible as possible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1mnzj33if0itk7biif9y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1mnzj33if0itk7biif9y.png" alt="Evolution of the tools, concepts and technologies in the modern data stack" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fig 1: Evolution of the tools, concepts and technologies in the modern data stack&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1970 - 1980s: The relational model, SQL, ACID and the RDBMS.
&lt;/h3&gt;

&lt;p&gt;It all started with the relational model, introduced by Edgar F. Codd in his 1970 &lt;a href="https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf" rel="noopener noreferrer"&gt;paper&lt;/a&gt;. Codd proposed the following: that data can be structured as a set of tuples (rows) where each value belongs to a field/attribute (column), altogether making a table whose primary key can relate it to other tables by appearing as a foreign key on those tables. He also proposed that users should not have to care about how the data is stored, and that changes to the data should not break the applications using it. Lastly, he introduced normalisation and the idea that a language should exist through which users can consistently access and modify data. Within the next decade, specifically in 1983, Theo Härder and Andreas Reuter established the popular &lt;a href="https://cs-people.bu.edu/mathan/reading-groups/papers-classics/recovery.pdf" rel="noopener noreferrer"&gt;ACID&lt;/a&gt; (atomicity, consistency, isolation and durability) principles. These principles are important to mention as they are still relevant today and are at the centre of many of the innovations happening in the data stack infrastructure space. The relational model and ACID principles found very high adoption and utility in OLTP databases with their transactional workloads. I have introduced these concepts briefly here, as they will be revisited later in this post in the Lakehouse section.&lt;/p&gt;

&lt;h3&gt;
  
  
  1980 - 2000s: The proliferation of the traditional Data Warehouse.
&lt;/h3&gt;

&lt;p&gt;OLTP databases performed very well in scenarios where fast data retrieval was essential, as well as for modifying and deleting operations on a record (row-by-row) basis. Analytical operations, such as aggregations, were considered expensive. They were expensive because they were scan-heavy when they did not need to be. Take, for example, calculating the total revenue from a set of products: one would have to scan every product detail, such as product_name, product_id, product_category, order_number, amount, discount, and quantity. Ideally, the only fields required are amount and quantity (for amount * quantity), plus product_category for filtering.&lt;/p&gt;

&lt;p&gt;Between 1980 and 2000, the innovation that helped with this was the column-oriented architecture. This architecture allowed data to be stored by columns (fields) rather than as rows. Additionally, other data modelling techniques arose, such as Kimball data modelling and One Big Table modelling. The column-oriented architecture, in conjunction with the new data modelling techniques, was found to yield performant results by scanning less data (less I/O), allowing vectorisation and yielding better data compression as more and more data was being collected. Systems that used all these innovations were termed ‘the data warehouse’. At the time they were new, but now they are referred to as &lt;strong&gt;the traditional data warehouse&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2003 - 2006: Distributed Systems, MapReduce and cheaper storage
&lt;/h3&gt;

&lt;p&gt;In 2003, the Google File System was developed. The Google File System is a distributed, fault-tolerant storage system that runs on commodity hardware, built to meet Google’s rapidly growing data needs. The GFS is optimised for handling massive datasets and batch processing. Although it is not really used today, having been replaced by Google’s Colossus, the GFS was pivotal to the big data revolution. It powered Google's search engine, allowing for efficient storage and fast data access, and it was able to store multimedia files.&lt;/p&gt;

&lt;p&gt;In 2004, Google developed &lt;a href="https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf" rel="noopener noreferrer"&gt;MapReduce&lt;/a&gt;, a processing framework for large-scale data. It enables parallel operations on massive datasets by splitting the data into chunks, mapping tasks to their respective chunks and applying a reduce function to create a final output. These operations are carried out on clusters with a master-worker architecture, with the master coordinating the operations and orchestrating shuffle operations among worker nodes where necessary. The GFS, being a distributed file system, had a profound impact on the success of MapReduce by providing a reliable and scalable storage infrastructure that enabled data locality and fault tolerance.&lt;/p&gt;
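
&lt;p&gt;As a toy, single-machine sketch of the model (real MapReduce distributes each phase across a cluster), the canonical word-count example looks roughly like this in Python:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import defaultdict
from itertools import chain

documents = ["the cat sat", "the cat ran", "a dog ran"]

# Map: turn every input chunk into (key, value) pairs.
mapped = chain.from_iterable(
    ((word, 1) for word in doc.split()) for doc in documents
)

# Shuffle: group all values that share a key.
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce: apply a reduce function per key to produce the final output.
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts)  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 2, 'a': 1, 'dog': 1}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;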

&lt;p&gt;Following GFS and MapReduce was the development of &lt;strong&gt;Hadoop&lt;/strong&gt; in 2006. I have used just the word ‘Hadoop’ intentionally, as it is more like an ecosystem than just one thing. The Hadoop ecosystem consists mainly of 3 components: the Hadoop Distributed File System, MapReduce and YARN (Yet Another Resource Negotiator). Strongly inspired by both the GFS and Google's MapReduce, the HDFS and Hadoop MapReduce were created as open source implementations of these ideas, and YARN was developed to manage cluster resources.&lt;/p&gt;

&lt;p&gt;Before we continue with the Hadoop ecosystem, here is a slight detour on data lakes and the Variety dimension of Big Data.&lt;/p&gt;

&lt;h3&gt;
  
  
  2010: The Variety Big Data Problem
&lt;/h3&gt;

&lt;p&gt;In the previous sections, the data around a subject was structured, predictable, stayed in the same shape almost throughout its lifetime, and could be modelled to fit the tabular structure described in the relational model and the Kimball data model. Using the relational model and traditional databases and warehouses required designing and carefully modelling the data to accurately describe an entity such that it was still accessible and analysis-ready. These modelling exercises took time and required expertise. One of the outcomes of this data modelling effort was the &lt;strong&gt;schema&lt;/strong&gt;, and schemas had to be known before any data was stored. From a database perspective, this is referred to as schema-on-write.&lt;/p&gt;

&lt;p&gt;There are five V’s of big data: Volume, Variety, Velocity, Veracity and Value. Between 2000 and 2010, the variety of data stored began to shift from data that could easily be defined in a tabular structure to non-structured data, leading to the emergence of &lt;strong&gt;data lakes in 2010&lt;/strong&gt;, a term coined by &lt;strong&gt;James Dixon&lt;/strong&gt;. Simply put, a &lt;strong&gt;data lake&lt;/strong&gt; is a central location that holds data in its native, raw format, whether structured or unstructured, and at any scale. Data lakes did not need any predefined schema for the data stored. In so doing, data lakes deferred the structuring of the data from the point where it was stored to the point where it was accessed. The schema was defined, or rather inferred, when the data was retrieved: schema-on-read.&lt;/p&gt;
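
&lt;p&gt;A small sketch of the contrast, using only the Python standard library (the table and event are made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import sqlite3

# Schema-on-write: the table's shape must exist before any data lands.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE clicks (user_id INTEGER, url TEXT)")
db.execute("INSERT INTO clicks VALUES (?, ?)", (42, "/home"))

# Schema-on-read: raw events are stored as-is; structure is inferred
# only at the moment the data is read and used.
raw_event = '{"user_id": 42, "url": "/home", "tags": ["promo"]}'
event = json.loads(raw_event)  # shape discovered at read time
print(event["tags"])           # new fields appear without any migration
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;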

&lt;h3&gt;
  
  
  2006 - 2013: SQL-on-Hadoop - From batch to interactive analysis
&lt;/h3&gt;

&lt;p&gt;Analysts were fond of SQL. SQL had already become very widely used at this time, especially with the proliferation of the technologies discussed earlier in this post (the RDBMS and the data warehouse specifically). As a result, Hive was developed to cater to the needs of analysts by enabling SQL-like queries on massive datasets in the Hadoop Distributed File System without having to write complex Java programs. It also provided a centralised metadata store for all datasets, in what is referred to today as the Hive Metastore. Development began in 2006; Hive was open-sourced in 2008 and became a top-level Apache project, and in 2009 Facebook released the paper introducing &lt;a href="https://research.facebook.com/publications/hive-a-warehousing-solution-over-a-map-reduce-framework/" rel="noopener noreferrer"&gt;Hive&lt;/a&gt;. This development brought SQL-like queries to HDFS, but not the interactive analysis that current-day analysts are used to, or that the typical data warehouse provided.&lt;/p&gt;

&lt;p&gt;While the initial version of Hive allowed for SQL-like operations, these operations were really only suitable for batch workloads due to latency. Before going into what technology allowed for interactive analysis in the Hadoop ecosystem, it is worth introducing &lt;strong&gt;Dremel&lt;/strong&gt;. &lt;a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf" rel="noopener noreferrer"&gt;Dremel&lt;/a&gt; revolutionised how SQL operations were run on large-scale storage. It was developed by Google from about 2006 but made public in 2010, so it can be argued that Dremel pioneered the “interactive era”. It is worth noting that Dremel is sometimes referred to as a query engine only, and sometimes as both a query engine and a specific columnar storage format; we will talk a bit more about formats shortly. Dremel is still widely used today, as it powers Google's BigQuery.&lt;/p&gt;

&lt;p&gt;In response to Dremel, and strongly inspired by it, many developments occurred to enable interactive analysis in the Hadoop ecosystem. From 2013-2015, there was a focus on efficiency, leading to the development of &lt;a href="https://tez.apache.org/" rel="noopener noreferrer"&gt;Apache Tez&lt;/a&gt; as a replacement for traditional MapReduce; Apache Tez reduced the time for many queries from minutes to seconds. In the same time window, massively parallel processing engines like &lt;a href="https://impala.apache.org/" rel="noopener noreferrer"&gt;Apache Impala&lt;/a&gt; and &lt;a href="https://research.facebook.com/publications/presto-sql-on-everything/" rel="noopener noreferrer"&gt;Presto&lt;/a&gt; (SQL on Everything), now known as Trino, were developed. Trino, today, is one of the core technologies powering AWS Athena.&lt;/p&gt;

&lt;h3&gt;
  
  
  2013 and Ongoing: Columnar file formats on data lakes
&lt;/h3&gt;

&lt;p&gt;As briefly introduced earlier, Dremel is sometimes referred to as both a query engine and a nested columnar data format. That data model was essential to the performance gains Google was able to achieve by eliminating processing overhead. Similarly, other columnar data formats began to emerge. One of them is the very popular &lt;a href="https://parquet.apache.org/" rel="noopener noreferrer"&gt;Apache Parquet&lt;/a&gt;, released in 2013. Parquet is a column-oriented storage format designed by Twitter and Cloudera to improve on Hadoop's existing storage format. To learn more about the Parquet file format, see this &lt;a href="https://vutr.substack.com/p/the-overview-of-parquet-file-format?r=2rj6sg&amp;amp;utm_campaign=post&amp;amp;utm_medium=web&amp;amp;triedRedirect=true" rel="noopener noreferrer"&gt;blog post&lt;/a&gt;. Another well-known columnar file format is the &lt;a href="https://orc.apache.org/docs/" rel="noopener noreferrer"&gt;ORC (Optimised Row Columnar) file format&lt;/a&gt;, also built for the Hadoop ecosystem to improve storage and analysis efficiency.&lt;/p&gt;

&lt;p&gt;Today, there are more columnar formats; however, Parquet is still very widely used. These columnar file formats helped speed up queries, lent themselves excellently to data compression, supported parallel processing, and improved storage efficiency. They also supported schema evolution.&lt;/p&gt;
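
&lt;p&gt;A brief, hedged example with PyArrow (file name and columns are made up) of the columnar benefit described earlier: an aggregation can read just the fields it needs instead of whole rows.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "product_category": ["IT", "IT", "HR"],
    "amount":           [10.0, 25.0, 5.0],
    "quantity":         [1, 2, 4],
})

# Columnar on disk: values of each field are stored together and compressed.
pq.write_table(table, "sales.parquet", compression="snappy")

# Scan only the fields an aggregation needs, instead of whole rows.
subset = pq.read_table("sales.parquet", columns=["amount", "quantity"])
print(subset.num_rows, subset.column_names)  # 3 ['amount', 'quantity']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;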

&lt;h3&gt;
  
  
  2015: Streaming Semantics, Unification of Batch and Stream
&lt;/h3&gt;

&lt;p&gt;It is worth calling out Apache Spark and Apache Flink. Apache Spark was originally developed in 2009 at UC Berkeley's AMPLab, and Flink in 2010 as part of a project named Stratosphere. Data processing can typically be categorised into streaming or batch, and this categorisation usually determines the tools and technologies used. For batch processing, Apache Spark was a go-to tool, while for streaming, Apache Flink was a common choice, and as a result, teams maintained separate codebases for batch and stream data processing. This architecture was referred to as the &lt;a href="https://www.geeksforgeeks.org/system-design/what-is-lambda-architecture-system-design/" rel="noopener noreferrer"&gt;Lambda Architecture&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In 2015, Google published a research paper on &lt;a href="https://research.google/pubs/the-dataflow-model-a-practical-approach-to-balancing-correctness-latency-and-cost-in-massive-scale-unbounded-out-of-order-data-processing/" rel="noopener noreferrer"&gt;the Dataflow Model&lt;/a&gt;. This research provided the theoretical foundation for treating batch and streaming as the same problem. The overarching idea is to treat continuous, messy data as permanently “incomplete” and provide simple building blocks so developers can decide, for each job, how much accuracy, speed, and cost they want, while letting the system handle ordering, windowing, and updates. In the same year, Google released a commercial product offering that was an implementation of their dataflow idea. Conveniently, the product was called Google Cloud Dataflow. In 2016, the Google Cloud Dataflow SDK was made open source and donated to the Apache Foundation; that open source SDK is today known as Apache Beam. This provided a standardised, engine-agnostic way to write unified pipelines that could run on multiple runners like Spark and Flink.&lt;/p&gt;
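
&lt;p&gt;A minimal Apache Beam sketch of such a unified pipeline: the same definition can run on different runners (Spark, Flink, Dataflow) and over bounded (batch) or unbounded (streaming) sources; here a small bounded source stands in for either.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import apache_beam as beam

# Runs on the local DirectRunner by default; other runners execute the
# same pipeline definition unchanged.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" &gt;&gt; beam.Create(["the cat sat", "the cat ran"])
        | "Split" &gt;&gt; beam.FlatMap(str.split)
        | "Count" &gt;&gt; beam.combiners.Count.PerElement()
        | "Print" &gt;&gt; beam.Map(print)
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;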

&lt;p&gt;Within this same period, streaming support arrived in Apache Spark, and Apache Flink programs could be written to treat a batch of data as a finite stream. By 2018, there was consolidation at the engine level, with the same internal logic handling both stream and batch processing. The unification of batch and stream is crucial for the modern Data Lakehouse, which will be discussed in a later section of this post.&lt;/p&gt;

&lt;h3&gt;
  
  
  2016 - 2021: Open table formats - ACID on object stores
&lt;/h3&gt;

&lt;p&gt;Remember the columnar file formats, the data lake and the query engines such as Presto? If you do, we can proceed, as they are essential to the remaining sections: they are key parts of the system being put together in this part of the blog.&lt;/p&gt;

&lt;p&gt;There was still a big challenge with data lakes: maintaining transactional integrity, one thing traditional databases and data warehouses handled well. Other limitations of data lakes included vendor lock-in at the data engine layer, with engines such as Spark, Trino, etc. To address this, Open Table Formats (OTF) were created. Open Table Formats brought database-like ACID operations to Open File Formats in data lakes.&lt;/p&gt;

&lt;p&gt;It is key to distinguish Open File Formats from Open Table Formats. Open File Formats are file formats such as Parquet, while Open Table Formats are, simply put, a metadata layer over an Open File Format. Common Open Table Formats are Apache Hudi, developed by Uber in 2016, Apache Iceberg, developed by Netflix in 2017, and Delta Lake, built by Databricks. They are all open source and continue to be a critical component of the Lakehouse by enabling ACID operations, schema evolution, better partition management, in many cases time travel (point-in-time recovery), and interoperability across the numerous data processing engines mentioned in this blog, such as Spark, Flink, Trino and so on.&lt;/p&gt;
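
&lt;p&gt;As a hedged PySpark sketch (it assumes a Spark session already configured with Delta Lake, and the path and columns are hypothetical), this is what ACID writes and time travel look like on top of plain files:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake is configured

df = spark.createDataFrame([(1, "open")], ["ticket_id", "status"])

# Each write is an ACID transaction recorded in the table's log.
df.write.format("delta").mode("overwrite").save("/tmp/tickets")

update = spark.createDataFrame([(1, "closed")], ["ticket_id", "status"])
update.write.format("delta").mode("overwrite").save("/tmp/tickets")

# Time travel: point-in-time recovery by reading an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/tickets")
v0.show()  # still shows the ticket as "open"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;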

&lt;p&gt;In the next section, we will look at how all this innovation now integrates together to form what is called the Lakehouse.&lt;/p&gt;

&lt;h3&gt;
  
  
  2021 - Ongoing: The LakeHouse
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf" rel="noopener noreferrer"&gt;CIDR 2021 Lakehouse paper&lt;/a&gt; argued that implementing warehouse features (ACID tables, schema enforcement/evolution, indexing, cache, governance) on open data lake formats could meet or approach cloud‑warehouse performance without data duplication across lake + warehouse tiers. Lakehouses support SQL analytics &lt;strong&gt;and&lt;/strong&gt; advanced machine learning directly, eliminate multiple ETL steps, reduce lock-in and staleness, and can reach competitive performance using new metadata layers. From the historical developments, one can argue it was the right time for all the technologies to be tightly integrated, such as it formed one big new product. To prove this argument, we look at Databricks Platform. A unified platform using the low-cost storage with an open file format, a metadata layer (delta-lake open file format), a performance layer known as the delta engine, declarative dataFrame APIs for Machine Learning and Data Science and a Multi-API Support Layer. The performance of the entire system was optimal in comparison to other traditional data warehouses using the &lt;a href="https://medium.com/hyrise/a-summary-of-tpc-ds-9fb5e7339a35" rel="noopener noreferrer"&gt;TPC-DS&lt;/a&gt; Power Test, making a very strong case for the viability of the data lakehouse.&lt;/p&gt;

&lt;p&gt;So far, the data lakehouse has seen success over the past few years. There have been other implementations from cloud providers, such as GCP BigLake, AWS Lake Formation, Microsoft Fabric and so on. Most of the cloud provider options are not natively lakehouse solutions, but they integrate some or most of the components. One major area of the lakehouse that is becoming more robust is its data governance features and capabilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;p&gt;Over the last five decades, analytics and data systems have evolved enormously, with a great deal of tooling and technology. There have been pivotal moments, such as the relational model and the traditional OLTP databases. In that same period, ACID principles were established to maintain the integrity of databases. Row-level access was good, but aggregations were expensive; this led to the OLAP data warehouse, optimised for analytical workloads.&lt;/p&gt;

&lt;p&gt;Then came the great decoupling, with the invention of the Google File System, MapReduce, and distributed systems. The once-inseparable components of a tightly knit data system began to become standalone, robust components. This storage system further allowed the compute layer to be designed and built as a separate but integrable component, as seen in Hadoop, Dremel, Presto, etc.&lt;/p&gt;

&lt;p&gt;Now we had storage and compute, but not efficiency. This led to a focus on performance, driving the development of columnar file formats and then the unification of batch and streaming semantics: shifting from the Lambda architecture, via the Dataflow model, towards a single engine with the internal functions to handle both, instead of separate codebases.&lt;/p&gt;

&lt;p&gt;I consider us to be in a consolidation stage: a stage where the best of all worlds is coming together to form a big, well-thought-out product that has the performance of the tightly knit data warehouses and the flexibility and interoperability that came from the decoupling, as seen in data lakes and the query engines. This consolidation is the Lakehouse.&lt;/p&gt;

&lt;h3&gt;
  
  
  Other Worthy Mentions.
&lt;/h3&gt;

&lt;p&gt;Data Catalogs - Hive Catalog, Unity Catalog, etc.&lt;/p&gt;

&lt;p&gt;Interoperability layers for open file formats - Apache XTable&lt;/p&gt;

&lt;h3&gt;
  
  
  Useful Reads - to be completed
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://medium.com/@himalgodage2759/understanding-mapreduce-googles-revolution-in-data-processing-d95a44bcb289" rel="noopener noreferrer"&gt;Understanding MapReduce&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.onehouse.ai/blog/comprehensive-data-catalog-comparison" rel="noopener noreferrer"&gt;Comprehensive data catalog comparison&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/" rel="noopener noreferrer"&gt;Kimball Data Modelling&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dl.acm.org/doi/10.1145/2523616.2523633" rel="noopener noreferrer"&gt;YARN (Yet Another Resource Negotiator).&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>dataengineering</category>
      <category>learning</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Data Compression, Types and Techniques in Big Data</title>
      <dc:creator>Ayokunle Adeniyi</dc:creator>
      <pubDate>Tue, 25 Feb 2025 10:00:00 +0000</pubDate>
      <link>https://forem.com/aws-builders/data-compression-types-and-techniques-in-big-data-pl6</link>
      <guid>https://forem.com/aws-builders/data-compression-types-and-techniques-in-big-data-pl6</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Cover Photo by &lt;a href="https://unsplash.com/@timmossholder?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Tim Mossholder&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/a-large-white-tank-o4mmo5S-55k?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;This article will discuss compression in the Big Data context, covering the types and methods of compression. I will also highlight why and when each type and method should be used.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Diving in&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What is compression?
&lt;/h2&gt;

&lt;p&gt;The general English definition of compression refers to reducing something so that it occupies a smaller space. In Computer Science, compression is correspondingly the process of reducing data to a smaller size. Data, in this case, could be text, audio, video files and so on; think of it as anything you store on the hard drive of your computer, represented in different formats. To provide a more technical definition, compression is the process of encoding data to use fewer bits.&lt;/p&gt;

&lt;p&gt;There are multiple reasons to compress data. The most common and intuitive reason is to save storage space. The other reasons follow from the data simply being smaller. The benefits of working with smaller data include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Quicker Data Transmission Time: Compressed data are smaller in size and take less time to be transmitted from source to destination.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reduced bandwidth consumption: This reason is strongly linked to the advantage of quicker data transmission. Compressed data uses less of the network bandwidth, therefore increasing the throughput and reducing the latency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Improved performance for digital systems that rely heavily on data: This is evident in systems that rely on processing data. Those systems leverage compression to improve the performance of the systems by reducing the volume of data that needs to be processed. Please note that this might be system-specific and will rely on using the appropriate compression technique. Compression techniques will be discussed later in this article.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cost Efficiency: Cloud services charge for data storage. By using less storage, cost savings are realised, especially in Big Data systems.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Other reasons for compression depend on the compression technique and format used. Some compression tools and formats also support encryption, adding a layer of security to the earlier discussed reasons to compress data. Additionally, using common compression formats brings compatibility and room for extensibility when integrating with external systems.&lt;/p&gt;

&lt;p&gt;It is worth noting that the reasons for compression read like pure benefits. However, compression is not without trade-offs. One common trade-off of compression is the need for decompression, which can be a concern for resource-constrained systems. Other trade-offs depend on the compression technique and the type of data being used.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Systems in the article refer to digital systems that make use of data and can take advantage of compression techniques. The word systems is used quite loosely and should be interpreted in context with what is being discussed in that section.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Compression Types
&lt;/h2&gt;

&lt;p&gt;To discuss the different techniques used to compress data, I will first categorise compression into 2 main categories. This article will then discuss the techniques relevant to each category. Compression can be broadly grouped into &lt;strong&gt;Lossy&lt;/strong&gt; and &lt;strong&gt;Lossless&lt;/strong&gt; compression.&lt;/p&gt;

&lt;p&gt;As the names already give away what they mean, &lt;strong&gt;lossy compression&lt;/strong&gt; techniques do not preserve the full fidelity of the data. Simply put, some data is discarded, but not enough to make what the data represents unrecognisable. Hence, lossy compression can offer a very high level of compression compared to lossless compression, which will be introduced shortly.&lt;/p&gt;

&lt;p&gt;A characteristic of lossy compression is that it is irreversible, i.e. when presented with the compressed file, one cannot restore the raw data at its original fidelity. Certain files and file formats are suitable for lossy compression; it is typically used for images, audio and videos. For instance, JPEG-formatted images lend themselves well to compression, and when compressing a JPEG image, the creator or editor can choose how much loss to introduce.&lt;/p&gt;

&lt;p&gt;On the other hand, &lt;strong&gt;lossless compression&lt;/strong&gt; is reversible, meaning that when data is compressed, all of it is preserved and restored fully during decompression. This makes lossless compression suitable for text-like files, and in the data warehouse and lakehouse world, it is the only relevant type to use. Some audio (FLAC and ALAC) and image (GIF, PNG, etc.) file formats also work well with this compression type.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choosing a method
&lt;/h3&gt;

&lt;p&gt;There is no single best compression method. Different factors go into choosing which method is suitable on a case-by-case basis. To buttress this with examples: a data engineer in the finance industry working on tabular data would tend to use lossless compression, given the impact missing data would have on accurate reporting. Alternatively, lossy compression could be the way to go when optimising a web page with a lot of images, since compressing the images makes the page lighter and quicker to load. Therefore, it is crucial to conduct an assessment to determine the most appropriate compression method that aligns with business requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compression Techniques
&lt;/h2&gt;

&lt;p&gt;This section will only cover the common compression techniques for both lossy and lossless compression. Please note that this is not in any way exhaustive. Furthermore, the techniques discussed may have slight variations to enhance their performance, as backed by different research.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lossless compression techniques
&lt;/h3&gt;

&lt;p&gt;Three common lossless techniques are Run-Length Encoding (RLE), Huffman Coding and Lempel-Ziv-Welch (LZW) coding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run-Length Encoding&lt;/strong&gt;: RLE encodes data by replacing each sequence of repeated values with a single value and the count of its repetitions. It is effective for long runs of repeated data. Datasets whose dimensions (fields) are sorted from a low level to a high level of cardinality also benefit from RLE.&lt;/p&gt;

&lt;p&gt;For example, take a simple string like &lt;code&gt;AAAAABBCDDD&lt;/code&gt;. RLE compresses the data to become &lt;code&gt;A(5)B(2)C(1)D(3)&lt;/code&gt;. To be more practical, take a table in the image below.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 1 - before RLE. It is important to observe that the cardinality of the fields increases from left to right&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn72utu9v991ugw21eskk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn72utu9v991ugw21eskk.png" alt="Before RLE - Raw Data" width="800" height="254"&gt;&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 2 - After RLE&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ezfque0zmaw9jr46ul7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ezfque0zmaw9jr46ul7.png" alt="After RLE" width="800" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because RLE depends on runs of repeated values and, in the second example, on the cardinality and sort order of the data, the &lt;code&gt;Mouse&lt;/code&gt; record in the item column cannot be compressed to just &lt;code&gt;Mouse (3)&lt;/code&gt;, because the preceding column splits the values into &lt;code&gt;IT, Mouse&lt;/code&gt; and &lt;code&gt;HR, Mouse&lt;/code&gt;. Certain file formats are compatible with RLE, such as bitmap file formats like TIFF and BMP. Parquet files also support RLE, making it very useful in modern data lakehouses using object storage like S3 or GCS.&lt;/p&gt;
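
&lt;p&gt;For completeness, a minimal sketch of the string example above in Python:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from itertools import groupby

def rle_encode(data):
    # Replace each run of a repeated character with the character and its count.
    return "".join(f"{char}({len(list(run))})" for char, run in groupby(data))

print(rle_encode("AAAAABBCDDD"))  # A(5)B(2)C(1)D(3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;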

&lt;p&gt;&lt;strong&gt;Huffman Coding&lt;/strong&gt;: Huffman coding is based on statistical modelling that assigns variable-length codes to values in the raw data based on the frequency at which they occur. The representation of this modelling can be referred to as a Huffman tree, which is similar to a binary tree. This tree is then used to create a Huffman code for each value in the raw data. The algorithm prioritises encoding the most frequent values with the fewest possible bits.&lt;/p&gt;

&lt;p&gt;Let's take the same data used in the RLE example &lt;code&gt;AAAAABBCDDD&lt;/code&gt;. The corresponding Huffman tree looks like this.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Huffman Tree&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgrgx2gi0bz0mip7xa67c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgrgx2gi0bz0mip7xa67c.png" alt="Huffman Tree" width="800" height="734"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the tree, we can see that the letter &lt;code&gt;A&lt;/code&gt; is represented by &lt;code&gt;0&lt;/code&gt;, and likewise &lt;code&gt;D&lt;/code&gt; is represented by &lt;code&gt;10&lt;/code&gt;. Compared to the letters &lt;code&gt;B: 111&lt;/code&gt; and &lt;code&gt;C: 110&lt;/code&gt;, we observe that A and D are represented by fewer bits. This is because they have a higher frequency; the Huffman algorithm represents more frequent values with fewer bits by design. The resulting compressed data becomes &lt;code&gt;00000111111110101010&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Huffman Coding uses the &lt;strong&gt;prefix rule&lt;/strong&gt;, which states that &lt;em&gt;the code representing a character should not be present as the prefix of any other code&lt;/em&gt;. For example, a valid Huffman code cannot have the letters C and D represented using &lt;code&gt;C: 00&lt;/code&gt; and &lt;code&gt;D: 000&lt;/code&gt;, because the representation of &lt;code&gt;C&lt;/code&gt; is a prefix of that of &lt;code&gt;D&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To see this in action, the Computer Science Field Guide has a &lt;a href="https://www.csfieldguide.org.nz/en/interactives/huffman-tree/" rel="noopener noreferrer"&gt;Huffman Tree Generator&lt;/a&gt; you could play with.&lt;/p&gt;
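
&lt;p&gt;For the curious, here is a compact sketch of the algorithm itself, using Python's heapq to repeatedly merge the two least frequent nodes (the exact 0/1 labels per branch may differ from the tree above, but the code lengths will match):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import heapq
from collections import Counter

def huffman_codes(data):
    # Seed a min-heap with (frequency, tiebreaker, node), one entry per character.
    heap = [(freq, i, char) for i, (char, freq) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    # Repeatedly merge the two least frequent nodes into one subtree.
    while len(heap) &gt; 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (left, right)))
        next_id += 1
    # Walk the finished tree: left edges append "0", right edges append "1".
    codes = {}
    def walk(node, prefix):
        if isinstance(node, str):
            codes[node] = prefix or "0"
        else:
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("AAAAABBCDDD")
print(codes)                                     # A: 0, D: 10, C: 110, B: 111
print("".join(codes[c] for c in "AAAAABBCDDD"))  # 00000111111110101010 - 20 bits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;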

&lt;p&gt;&lt;strong&gt;Lempel–Ziv–Welch Coding&lt;/strong&gt;: It was created by Abraham Lempel, Jacob Ziv and Terry Welch in 1984 and is named after its creators, obviously 😅. Similar to RLE and Huffman Coding, LZW works well with data that contains lots of repetition. The LZW algorithm is dictionary-based: it creates a dictionary of key-value pairs for commonly seen patterns in the raw data. Such a dictionary can also be referred to as the code table. To illustrate how this technique works, let's take our raw data to be &lt;code&gt;ABBABABABA&lt;/code&gt;. When passed through the algorithm using a configuration of A-Z as the initial values, the resulting code table looks like this:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;LZW Code Table&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcskdjw7u73re689vlcks.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcskdjw7u73re689vlcks.png" alt="LZW Code Table" width="800" height="228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the above code table, there is a key-value pair for every letter A-Z and key-value pairs for patterns such as AB, BB, BA, and ABA. By having a shorter representation of these patterns, the LZW algorithm can compress the raw data by encoding it into fewer bits. Hence, using the code table generated from that input, the compressed version is &lt;code&gt;0 1 1 26 29 28&lt;/code&gt;. It is key to notice the spaces in the compressed data: think of them as separators, so the decoder will not interpret a &lt;code&gt;1,0&lt;/code&gt; as a &lt;code&gt;10&lt;/code&gt;, as they mean different things.&lt;/p&gt;
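
&lt;p&gt;A small sketch of the encoder (seeding the dictionary with A-Z, as in the example configuration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def lzw_encode(data):
    # Start, per the example's configuration, with A-Z mapped to codes 0-25.
    dictionary = {chr(ord("A") + i): i for i in range(26)}
    next_code = 26
    current, output = "", []
    for symbol in data:
        if current + symbol in dictionary:
            current += symbol                         # keep growing the match
        else:
            output.append(dictionary[current])
            dictionary[current + symbol] = next_code  # learn the new pattern
            next_code += 1
            current = symbol
    if current:
        output.append(dictionary[current])
    return output

print(lzw_encode("ABBABABABA"))  # [0, 1, 1, 26, 29, 28]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;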

&lt;p&gt;LZW is general-purpose and widely used today. It is integrated into many Unix/Linux-based operating systems behind the &lt;code&gt;compress&lt;/code&gt; shell command. Common file formats compatible with LZW are GIF, TIFF and PDF. Other applications of LZW compression can be seen in the field of Natural Language Processing, as discussed in this paper on &lt;a href="https://arxiv.org/html/2410.21548v1" rel="noopener noreferrer"&gt;tokenization in NLP&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;RLE, Huffman Coding, and LZW coding are only common examples; lossless compression techniques go beyond the three (3) described above. Other techniques include &lt;strong&gt;DEFLATE&lt;/strong&gt;, which combines Huffman coding with Lempel-Ziv compression (specifically LZ77).&lt;/p&gt;

&lt;h3&gt;
  
  
  Lossy compression techniques
&lt;/h3&gt;

&lt;p&gt;In this section, we will look into two types of lossy compression. Recall that lossy compression introduces a loss to the original data, meaning that not all data is kept.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Discrete Cosine Transform (DCT)&lt;/strong&gt;: This compression method is used mainly on audio, image and video files and is also commonly referred to as block compression. It uses a mathematical function - the cosine function, as the name implies - to convert blocks of the original data into frequencies. The blocks of data are usually small matrices, such as 8x8 or 4x4.&lt;/p&gt;

&lt;p&gt;The compression comes in when dealing with the high frequencies in the data, once the raw data has been translated into the frequency domain using the mathematical function. The overall process of using DCT for compression is as follows (a small sketch follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Break down raw data into chunks. For instance, in image compression, this could be 8x8 pixels.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Apply the mathematical function to convert the chunks of data to frequencies. This will result in some high frequencies and low frequencies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The high frequencies are then reduced or removed depending on the acceptable degree of loss one is willing to introduce. This is where it really becomes lossy compression.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To convert back to representable data, all the remaining frequencies are passed through an Inverse Discrete Cosine Transform - IDCT - to restore the data from the frequencies.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
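
&lt;p&gt;As a toy one-dimensional illustration of the middle two steps, using SciPy (the block values are arbitrary):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
from scipy.fft import dct, idct

block = np.array([52.0, 55.0, 61.0, 66.0, 70.0, 61.0, 64.0, 73.0])

coeffs = dct(block, norm="ortho")    # into the frequency domain
coeffs[4:] = 0                       # drop the highest frequencies (the lossy step)
approx = idct(coeffs, norm="ortho")  # back to the sample domain

print(np.round(approx, 1))  # close to, but not exactly, the original block
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;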

&lt;p&gt;DCT is widely used in different fields today, not only in compression but also in signal processing. Common file formats compatible with DCT are JPEG (images), MP3 (audio), and MPEG (video). Additionally, DCT can achieve high compression ratios making it suitable for digital systems with lots of images like web pages on the Internet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fractal Compression&lt;/strong&gt;: A fractal is an infinite pattern that repeats at different scales; viewed from any point on the scale, the pattern looks similar. Because the patterns are similar at any scale, fractal compression exploits this self-similarity, reducing the scale of 'big' fractals to reduce the size of the data.&lt;/p&gt;

&lt;p&gt;Examples of Fractals&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv26xssh8nnw89hcfkf1p.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv26xssh8nnw89hcfkf1p.PNG" alt="Fractal Example" width="640" height="541"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fractal Compression was introduced by Michael Barnsley in the 1980s. The general idea, using an image as an example, is this: if an image contains several parts that look alike, why store them twice? To achieve this, fractal compression does the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Partitions the image into non-overlapping blocks known as &lt;em&gt;range blocks&lt;/em&gt;. This could be range blocks of 8x8, 16x16 pixels, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It scans the image for self-repeating patterns (fractal patterns). Using the range blocks, the algorithm finds larger sections of the image that are similar to these range blocks. These larger sections are referred to as the &lt;em&gt;domain blocks&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A transform function is then applied to the domain block to approximate the range blocks. These transform functions are mathematical functions such as scaling, translation, rotation etc. They can also be referred to as transformations. These transformations are called &lt;em&gt;fractal codes&lt;/em&gt; with respect to Fractal Compression.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The data is then encoded to those transform functions. Instead of storing the pixel-pixel data, the transformations are stored. These transformations are the rules that describe how to reconstruct the image from domain blocks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With the fractal codes, the image is reconstructed using an iterative process. This process can be computationally expensive, but fractal compression can achieve a high compression ratio compared to other compression techniques. Due to its reliance on self-repeating patterns, it performs best on data that contains such patterns; examples would be landscape photographs (images of nature) and DNA images.&lt;/p&gt;

&lt;p&gt;There are other lossy compression techniques, such as the Discrete Wavelet Transform and quantisation. These techniques are usually used on image, audio and video files and are suited to particular file formats, such as JPEG or MP3.&lt;/p&gt;

&lt;p&gt;Lossy compression generally achieves higher compression ratios than lossless compression and sometimes expects the user to know beforehand how much loss to introduce. It is pertinent to emphasise that the choice of compression method and technique depends on several factors; at the core of these factors are the data format and the desired outcome.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Overall, this post discusses compression in the world of data. It relies strongly on the existing body of knowledge in computer science and information theory. To compress means to reduce the volume an entity occupies, and in the field of data, volume refers to storage space. Compression in digital systems has many advantages when done right. The obvious one is that it reduces the space used and gives room to store more data. Other advantages include quicker transmission, lower bandwidth usage and a general improvement in the efficiency of the system. Remember, this is when it is done right.&lt;/p&gt;

&lt;p&gt;To leverage the advantages of compression, it is key to know what type to use. Compression is either lossy or lossless. Lossy compression introduces a loss to the original data that is usually irreversible, while lossless compression compresses the data and retains all the information contained in the original. Furthermore, there is discourse on hybrid compression types, but I think a combination of lossy and lossless is just lossy. Let me know what you think in the comments.&lt;/p&gt;

&lt;p&gt;Lastly, different techniques were introduced for both lossy and lossless compression. The list of techniques and explanations of these techniques are neither exhaustive nor comprehensive. I consider them only a good start in giving you an idea of how each technique works. To wrap up, I have added additional resources to help you investigate further and read deeper about compression in big data.&lt;/p&gt;




&lt;h2&gt;
  
  
  Additional Resources
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://youtu.be/hFFP2OYFlTA?t=303" rel="noopener noreferrer"&gt;Video: Data Lake fundamentals - RLE encoding with Parquet in practice&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.researchgate.net/publication/322557949_A_review_of_data_compression_techniques" rel="noopener noreferrer"&gt;Paper: A review of data compression techniques&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ieeexplore.ieee.org/document/8229810" rel="noopener noreferrer"&gt;Paper: lossless compression techniques&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://books.google.com/books?hl=en&amp;amp;lr=&amp;amp;id=mnpeizY0btYC&amp;amp;oi=fnd&amp;amp;pg=PA2&amp;amp;dq=Data+compression+papers&amp;amp;ots=zvCCDiJ0In&amp;amp;sig=2XCl9MbqOOGbgTa49a8unYwAkJM" rel="noopener noreferrer"&gt;A concise introduction to Data Compression by David Salomon&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.csjournals.com/IJCSC/PDF6-2/12.%20Ravi.pdf" rel="noopener noreferrer"&gt;Paper: A Study of Various Data Compression Techniques&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/towards-data-engineering/advanced-file-formats-and-compression-techniques-55c7c7c1a396" rel="noopener noreferrer"&gt;Blog Post: Compression in open file formats&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.starburst.io/data-glossary/open-file-formats/" rel="noopener noreferrer"&gt;Article: Open file formats&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://risingwave.com/blog/understanding-database-compression-techniques/" rel="noopener noreferrer"&gt;Article: Compression in databases&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03658-4" rel="noopener noreferrer"&gt;Lossy Compression for Genomic data (RNA)&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>data</category>
      <category>learning</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Custom Data Types in SQL</title>
      <dc:creator>Ayokunle Adeniyi</dc:creator>
      <pubDate>Tue, 04 Feb 2025 09:05:00 +0000</pubDate>
      <link>https://forem.com/aws-builders/custom-data-types-94</link>
      <guid>https://forem.com/aws-builders/custom-data-types-94</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Cover Photo by &lt;a href="https://unsplash.com/@xavi_cabrera?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Xavi Cabrera&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/yellow-red-blue-and-green-lego-blocks-kn-UmDZQDjM?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Before we dive in, note that this post focuses on Oracle databases. PL SQL is a strongly typed language: every variable used in a program or subprogram must be declared with a data type. PL SQL comes with some data types already defined, such as VARCHAR, VARCHAR2, NUMBER and DATE. Generally, data types can be grouped into 4 main buckets: scalars, LOB (Large OBject) types, reference types, and composite data types. Scalars are atomic data types such as NUMBER, BOOLEAN and VARCHAR, while composite data types consist of one or more scalars. Examples of composite data types are record types, collection types, and object types.&lt;/p&gt;

&lt;p&gt;Think about small pieces of Lego coming together, much like a puzzle, to build a Lego Spider-Man. The same can be said about data types in databases. The inbuilt data types may not always be well suited to your needs, but a combination of multiple data types can be made to fit into the puzzle or, in this case, your application. To that end, Oracle PL SQL allows you to create custom data types that other programs and subprograms in the database can use. This is precisely what composite data types are.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This article will focus mainly on scalars and composite data types. Let's dive in.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Scalars
&lt;/h3&gt;

&lt;p&gt;You cannot create your own scalar because scalars are base types, but you can create a subtype. Subtypes do not introduce a new type; they place optional constraints on a base type. Generally, subtypes improve the readability of your code by indicating the intended use of a variable. For instance, a user-defined currency subtype indicates that the variable will be used for finance-related activities. Subtypes also improve reliability through their constraints.&lt;/p&gt;

&lt;p&gt;Subtypes are defined in the declarative part of any PL SQL block, subprogram, or package, using the syntax below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/* syntax to define subtypes */

SUBTYPE subtype_name IS base_type[(constraint)] [NOT NULL];

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Examples of subtypes can be seen in the code snippet below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DECLARE

    /* Number must be 9 digits */
    SUBTYPE T_NatIDNum IS PLS_INTEGER RANGE 100000000 .. 999999999; 
    SUBTYPE T_BirthDate IS DATE;

    /* Numbers will have maximum precision of 2 decimal places */
    SUBTYPE T_Height_weight IS NUMBER(10,2); 

    v_b_date T_BirthDate;
    v_height T_Height_weight;
    v_weight T_Height_weight;
    v_nat_id_number T_NatIDNum;

BEGIN

......

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the above snippet, the v_height and v_weight variables can only have a maximum precision of 2 decimal places, while the v_nat_id_number variable must be between 100000000 and 999999999.&lt;/p&gt;
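
&lt;p&gt;To see these constraints in action, here is a minimal, hedged sketch (the block below is my own illustration, not part of the snippet above); assigning a value outside the subtype's range raises an exception:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DECLARE
    SUBTYPE T_NatIDNum IS PLS_INTEGER RANGE 100000000 .. 999999999;
    v_nat_id_number T_NatIDNum;
BEGIN
    v_nat_id_number := 123456789;   /* OK: within the declared range */
    v_nat_id_number := 42;          /* raises VALUE_ERROR: outside the declared range */
EXCEPTION
    WHEN VALUE_ERROR THEN
        DBMS_OUTPUT.PUT_LINE('Value violates the subtype constraint');
END;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;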

&lt;h2&gt;
  
  
  Composite Data Types
&lt;/h2&gt;

&lt;p&gt;Composite data types (also known as user-defined types) can be created by a user. They usually take 3 forms: record types, object types, and collection types. All composite data types have internal components, which can be scalars or other composite data types, and these internal components are usually accessed using dot notation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Record Types
&lt;/h3&gt;

&lt;p&gt;A record type is similar to a row in a database table. Its internal components can be of different data types and are referred to as fields. The simple snippet below shows how a record type is declared and used in a subprogram. The internal components are accessed using dot notation, as indicated in the executable part.&lt;/p&gt;

&lt;p&gt;Note: Record types are usually declared and used inside packages and subprograms, and are therefore not preceded by the CREATE keyword, as shown in the example below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DECLARE
    /* Record Type declaration */
    TYPE emp_contact IS RECORD (
        /* internal components */
        emp_id hr.employees.employee_id%TYPE,
        emp_email hr.employees.email%TYPE,
        emp_phone_no hr.employees.phone_number%TYPE
    );

    /* Variable declaration */
    v_emp_contact EMP_CONTACT;

BEGIN
    SELECT employee_id, email, phone_number INTO v_emp_contact 
    FROM hr.employees
    WHERE employee_id = 101;

    DBMS_OUTPUT.PUT_LINE('Employee ID: ' || TO_CHAR(v_emp_contact.emp_id)); 
    DBMS_OUTPUT.PUT_LINE('Employee Email: ' || LOWER(v_emp_contact.emp_email) || '@learnplsql.com'); 
    DBMS_OUTPUT.PUT_LINE('Employee Phone Number: ' || TO_CHAR(v_emp_contact.emp_phone_no)); 

END;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Object types
&lt;/h3&gt;

&lt;p&gt;Object types allow you to create abstractions of real-world objects, just like in other object-oriented programming languages. Object types have 3 components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Attributes: These can be user-defined types or the default scalars. They structure the object&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Name: This is the name of the object. It must be unique in the schema.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Methods: Methods are functions or procedures that model the behavior of an object, just like a real-world entity. They are usually preceded with the keyword &lt;strong&gt;MEMBER&lt;/strong&gt; when specified as a component of the object. Methods can also be declared with the &lt;strong&gt;STATIC&lt;/strong&gt; keyword, or as comparison methods with &lt;strong&gt;MAP&lt;/strong&gt; or &lt;strong&gt;ORDER&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;NOTE: Methods preceded with the MEMBER keyword have an implicit SELF parameter as the first parameter.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Object types are created as stand-alone objects and can be used just like the built-in data types. Object types require a body, just like packages in PL SQL, if the object declares methods. Let's define 2 objects to model external parties in a company (visitors and vendors). The visitor object will be very simple while the vendor object will have methods defined as components.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/* Visitor object */
CREATE OR REPLACE TYPE obj_visitor AS OBJECT (
    visitor_id   NUMBER(4),
    first_name   VARCHAR2(30),
    last_name    VARCHAR2(30),
    whom_to_see  VARCHAR2(50)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the vendor object, which has methods, we have:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/* Vendor Object */

/* object type specification */
CREATE OR REPLACE TYPE obj_vendor AS OBJECT (
    visitor_id NUMBER(4),
    first_name VARCHAR2(30),
    last_name VARCHAR2(30),
    whom_to_see VARCHAR2(50),
    company VARCHAR2(40),

    /* Constructor Method [optional] */ 
    CONSTRUCTOR FUNCTION obj_vendor(visitor_id NUMBER, first_name VARCHAR2, last_name VARCHAR2, company VARCHAR2) RETURN SELF AS RESULT,

    /* Other methods */
    MEMBER PROCEDURE insert_vendor (SELF IN OUT NOCOPY obj_vendor), 
    MEMBER FUNCTION display_vendor_details RETURN VARCHAR2
);


/* object type body */
CREATE OR REPLACE TYPE BODY obj_vendor AS
    CONSTRUCTOR FUNCTION obj_vendor (
        visitor_id  NUMBER,
        first_name  VARCHAR2,
        last_name   VARCHAR2,
        company     VARCHAR2
    ) RETURN SELF AS RESULT AS
    BEGIN
        dbms_output.put_line('object constructor function fired ==&amp;gt;');
        self.visitor_id := visitor_id;
        self.first_name := first_name;
        self.last_name := last_name;
        self.company := company;
        RETURN;
    END;

    MEMBER PROCEDURE insert_vendor (
        self IN OUT NOCOPY obj_vendor
    ) AS
    BEGIN
        /* assumes a visitors(visitor_id, first_name, last_name, company, visit_date) table exists */
        INSERT INTO visitors VALUES (
            visitor_id,
            upper(first_name),
            upper(last_name),
            upper(company),
            sysdate
        );

        COMMIT;
    END;

    MEMBER FUNCTION display_vendor_details RETURN VARCHAR2 AS
        v_details VARCHAR2(300);
    BEGIN
        --dbms_output.put_line('Vendor visitor details are');
        --dbms_output.put_line('Visitor ID: ' || to_char(visitor_id));
        --dbms_output.put_line('Visitor Name: ' || first_name || ' ' || last_name);

        v_details := 'Vendor with visitor ID: '
                     || TO_CHAR(visitor_id)
                     || ' and Fullname: '
                     || INITCAP(first_name)
                     || ' '
                     || INITCAP(last_name)
                     || ' from '
                     || UPPER(company)
                     || ' company.';

        RETURN v_details;
    END;

END;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Note: For member procedures, you can pass parameters like regular procedures with the IN and OUT keywords. If the SELF parameter is not explicitly declared, its parameter mode defaults to IN OUT. However, for performance reasons, you can declare SELF IN OUT NOCOPY. You can read more on this &lt;a href="https://docs.oracle.com/en/database/oracle/oracle-database/18/adobj/design-considerations-for-methods.html#GUID-D9E253CB-59F5-4517-82E8-AD71E2C6F6CC" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now that we have created two (2) objects, let's use them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DECLARE
    v_vendor       obj_vendor; /* object is automatically null at this point */
    v_vendor_info  VARCHAR2(200);
BEGIN

    /* Instantiating the object and invoking the constructor */
    v_vendor := obj_vendor(v_id_seq.nextval, 'john', 'doe', 'pl/sql academy');

    /* Manually displaying the vendor ID */
    dbms_output.put_line('Vendor id is: ' || to_char(v_vendor.visitor_id));

    /* Calling the member method (function) */
    v_vendor_info := v_vendor.display_vendor_details();
    dbms_output.put_line(v_vendor_info);

    /* Calling the member method (procedure) */
    v_vendor.insert_vendor();
END;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Collection types
&lt;/h3&gt;

&lt;p&gt;In very simplified terms, collections are basically arrays in PL SQL. There are 3 types: associative arrays, variable-sized arrays (VARRAYs), and nested tables. The sample below shows a simple associative array. Collections have built-in methods like &lt;strong&gt;COUNT&lt;/strong&gt;, which returns the number of elements; a comprehensive list of the other methods can be found &lt;a href="https://docs.oracle.com/database/121/LNPLS/composites.htm#GUID-0452FBDC-D9C1-486E-B432-49AF84743A9F" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Elements in a collection are accessed using their indexes, starting from 1 rather than 0 as in languages like Python.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DECLARE
    TYPE t_emp_info IS RECORD (
        emp_f_name  hr.employees.first_name%TYPE,
        emp_l_name  hr.employees.last_name%TYPE,
        emp_dept    hr.departments.department_name%TYPE
    );

    /* associative array type (collection)*/
    TYPE t_emp_array IS
        TABLE OF t_emp_info INDEX BY PLS_INTEGER;

    /* variable of collection type */
    v_emp_array      t_emp_array;

    /* misc */
    v_first_element  t_emp_info;
    v_last_element   t_emp_info;
    v_new_rec        t_emp_info;
    v_arr_length     NUMBER(5);
BEGIN
    SELECT
        first_name,
        last_name,
        department_name
    BULK COLLECT
    INTO v_emp_array
    FROM
        hr.employees      e,
        hr.departments    d
    WHERE
        e.department_id = d.department_id
    ORDER BY
        e.employee_id;

    v_arr_length := v_emp_array.count;
    dbms_output.put_line('Collection length: ' || to_char(v_arr_length));
    v_first_element := v_emp_array(1);
    dbms_output.put_line('First element: ' || v_first_element.emp_f_name);
    v_last_element := v_emp_array(v_arr_length);
    dbms_output.put_line('Last element: ' || v_last_element.emp_f_name);

    /* delete first element */
    v_emp_array.DELETE(1);

    /* First element becomes empty */
    BEGIN
        dbms_output.put_line('New First element: ' || v_emp_array(1).emp_f_name);
    EXCEPTION
        WHEN no_data_found THEN
            dbms_output.put_line('Element is null');
    END;

    /* Re assign first element */
    v_new_rec.emp_f_name := 'Jane';
    v_new_rec.emp_l_name := 'Doe';
    v_new_rec.emp_dept := 'Special-Ops';
    v_emp_array(1) := v_new_rec;

    /* print new first record */
    dbms_output.put_line('New First element: ' || v_emp_array(1).emp_f_name || ' of ' || v_emp_array(1).emp_dept || ' department.');
END;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above PL SQL block produces the following output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkq0183ard7cpizg1mpe4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkq0183ard7cpizg1mpe4.png" alt="Image description" width="800" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Composite data types can be fun and handy. Although it may not be very common to see developers use objects and OOP in PL SQL, records and collections can be instrumental in making your code better performing and neater to read.&lt;/p&gt;

&lt;p&gt;I hope it is now pretty clear what custom data types are in PL SQL and what you can do with them. Don't forget that practice makes better, and you can use the Live SQL platform &lt;a href="https://livesql.oracle.com" rel="noopener noreferrer"&gt;here&lt;/a&gt; to start practicing right away.&lt;/p&gt;

</description>
      <category>oracle</category>
      <category>sql</category>
      <category>data</category>
      <category>learning</category>
    </item>
    <item>
      <title>Understanding Database Indexes And Their Data Structures: Hashes, SS-Tables, LSM Trees And B-Trees</title>
      <dc:creator>Ayokunle Adeniyi</dc:creator>
      <pubDate>Tue, 05 Mar 2024 09:01:00 +0000</pubDate>
      <link>https://forem.com/aws-builders/understanding-database-indexes-and-their-data-structures-hashes-ss-tables-lsm-trees-and-b-trees-2dk5</link>
      <guid>https://forem.com/aws-builders/understanding-database-indexes-and-their-data-structures-hashes-ss-tables-lsm-trees-and-b-trees-2dk5</guid>
      <description>&lt;p&gt;There's often a huge fuss about making data-driven decisions, leveraging data analytics, and using data science and data-centred thinking. From a technological point of view, data is usually stored and accessed using databases. Relational databases, often referred to as RDBMSs, are sophisticated systems that abstract away the fairly complex logic and engine behind data storage on disk. Several databases exist in today's tech landscape, but this article will focus on something common to data storage and retrieval. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Drum roll 🥁🥁🥁🥁🥁🥁🥁. We will be discussing indexes. The topic gives it away anyway.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Log-based Data Structures
&lt;/h2&gt;

&lt;p&gt;My approach takes a slightly historical perspective on database indexes. To begin, we look at the simplest form of a database: a file. Think of this file as a &lt;em&gt;log&lt;/em&gt; file. Every time we need to store something, we append the data to the log file. To retrieve data, we traverse each entry until we find the information we want. &lt;/p&gt;

&lt;p&gt;The above illustration sounds straightforward and is very efficient for storing data (write operations) because all it needs to do is &lt;em&gt;an append&lt;/em&gt; operation. However, it introduces a huge challenge to retrieve data (read operations) when the data grows in volume. This is because the program has to go through all entries all the time. In computer science, the big-O notation for this sort of operation is O(n), n being the number of records. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For clarity, we can think of each entry as having a key and value. The indexes in this article will refer to the key in each entry.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is where indexes come in. Imagine if we had a separate data structure that tells us where the information is, something like the famous index at the back of a dictionary or textbook. An index in this context is a data structure, derived from the primary data, that helps retrieve information quickly. There are multiple variations of indexes, and we will get into them in the next sections.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hash indexes
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;A hash index is represented using a similar data structure to the dictionary data type in Python or a hashMap in Java.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let's refer back to our simple example database where we have multiple entries appended to a log-based file. A hash index would be akin to having an in-memory key-value pair of every key in the data appended and its byte offset in the data file. In so doing, when we look up some data, we search the index for the key and find a pointer to the location of the actual values on the data file. Also, every time a value is added to the database, the hash index is updated to include that new entry using its key.&lt;/p&gt;

&lt;p&gt;Before going into the pros and cons of hash indexes, I will build on the illustration of this database. Since the existing database is log-based (append-only), when an existing value is to be updated, the database does not seek out the data and modify the values in place. Rather, it appends an entirely new entry, and read operations are built to always fetch the latest value for any given key. This bolsters the need for an index, because the hash map will also be updated to point to the most recent location for that entry. &lt;/p&gt;

&lt;p&gt;Assume we have a service that is transactional in nature, meaning new entries will be added and existing entries will be changed frequently. For entirely new entries, this is no problem. However, every change to an existing entry will append a new entry and make the older values redundant. Even if we use hash indexes to solve the issue of long lookup times, reclaiming the disk space used by those redundant values remains a significant challenge. &lt;/p&gt;

&lt;p&gt;But how do we prevent the disk from running out of space? A simple solution is to split the log files. Each split can be referred to as a &lt;em&gt;segment&lt;/em&gt;. After splitting the files into segments, a &lt;em&gt;&lt;strong&gt;compaction&lt;/strong&gt;&lt;/em&gt; process can be run in the background. This &lt;em&gt;compaction&lt;/em&gt; takes one or more segments and merges them: only the latest value for each key is kept, the rest are discarded, and the result is written into a new segment. New operations are then redirected from the older segments to the newly compacted segments. Note that the database is still split into files, but compaction reduces the number of redundant entries in the database.&lt;/p&gt;

&lt;p&gt;In relation to hash indexes, each segment has its own in-memory hash map, and these are also updated after merging and compaction. When a lookup is done, it first checks the hash map of the most recent segment, then traverses backwards to the next most recent, and so on.&lt;/p&gt;

&lt;h4&gt;
  
  
  Limitations of hash indexes
&lt;/h4&gt;

&lt;p&gt;In practice, log-based databases and hash indexes are very efficient but still have limitations. The core limitations of the example above are poor concurrency control, difficult crash recovery, partial writes, no support for delete operations, and inefficient range queries. Because the index lives in memory (RAM), if the server is restarted, all hash maps are lost. Additionally, the entire hash index must fit in memory. These limitations do not fit the requirements of how we interact with databases and the deluge of data we work with today.&lt;/p&gt;

&lt;p&gt;To address these limitations, enhancements and changes are made to the existing data structure housing the hash map indexes per segment. Referring back to the current state of the database, we recall that we now have our data split into different segments and segments undergo compaction. &lt;/p&gt;

&lt;h3&gt;
  
  
  SSTables
&lt;/h3&gt;

&lt;p&gt;We make a fairly simple change to how the data is stored in these segment files: we sort the &lt;em&gt;&lt;u&gt;data (key-value pairs) by the key&lt;/u&gt;&lt;/em&gt;. By doing this, the data is stored on disk in key-sorted order. This is referred to as a &lt;strong&gt;Sorted String Table (SSTable)&lt;/strong&gt;. The obvious limitation it solves, compared to plain hash maps, is that we can now fully support range queries. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The term was coined in &lt;a href="https://research.google/pubs/bigtable-a-distributed-storage-system-for-structured-data/"&gt;Google's Bigtable paper&lt;/a&gt;, along with the term &lt;em&gt;memtable&lt;/em&gt;. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In comparison to log-based storage with hash indexes, SSTables add an extra step to &lt;em&gt;merging&lt;/em&gt; and &lt;em&gt;compaction&lt;/em&gt;: they use an algorithm very similar to the popular &lt;strong&gt;&lt;em&gt;mergesort&lt;/em&gt;&lt;/strong&gt; algorithm to maintain the order of all entries in the SSTable.&lt;/p&gt;

&lt;p&gt;It is worth noting that SSTables are sometimes described not as indexes but as a data structure in their own right, which seems a better description of what an SSTable is. In that case, the accompanying index can be referred to as an SSIndex, SSTable index, or memtable. However, for the purpose of this article, SSTable will refer to the combination of the data file (sorted key-value pairs on disk) and its corresponding in-memory index containing the keys and their byte offsets.&lt;/p&gt;

&lt;h4&gt;
  
  
  SSTables vs Hash Indexes
&lt;/h4&gt;

&lt;p&gt;All the advantages of hash indexes are preserved in SSTables. That is, &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it is still efficient for write operations since it is log-based (append-only)&lt;/li&gt;
&lt;li&gt;the in-memory index will act as a pointer to the actual location of the data on the disk &lt;/li&gt;
&lt;li&gt;The &lt;em&gt;compaction&lt;/em&gt; process in the background makes it efficient from a storage perspective&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  Key Advantages
&lt;/h5&gt;

&lt;p&gt;The additional advantages compared to hash indexes are twofold. First, we can now &lt;em&gt;&lt;u&gt;query ranges&lt;/u&gt;&lt;/em&gt; because our data is sorted by the key. Second, the index can be &lt;em&gt;&lt;u&gt;sparse&lt;/u&gt;&lt;/em&gt;. To explain the second point, suppose we have keys from A0 to A1000. The index does not need to hold all the keys; it can hold every other one, so the keys could be A0, A2, A4, A6 and so on with their corresponding pointers to their locations on disk. When retrieving the values associated with key A5, even though it is not in the index, we know that it lies between A4 and A6, and we can begin our search from there. Thus, the index is sparse without trading off read performance.&lt;/p&gt;

&lt;h5&gt;
  
  
  Drawbacks of SSTables
&lt;/h5&gt;

&lt;p&gt;However, it is not without its limitations. The entire index still needs to fit within the memory of the server, and if the server crashes, the index is lost. In a very busy transactional database, keeping the SSTable up to date is a lot of work. In the next section, we continue to build on our knowledge of log-based databases and indexes with LSM Trees.&lt;/p&gt;

&lt;h3&gt;
  
  
  Log-Structured Merge (LSM) Trees
&lt;/h3&gt;

&lt;p&gt;We have established that log-based approaches to data storage can be very efficient. Just like SSTables, LSM Trees are log-based in the way they store data on disk and have an in-memory data structure akin to the SSTable's &lt;em&gt;memtable&lt;/em&gt;. In fact, LSM Trees make use of SSTables.&lt;/p&gt;

&lt;p&gt;LSM Trees are layered collections of memtables and SSTables. The first layer is the &lt;em&gt;memtable&lt;/em&gt;, stored in memory. The following layers are cascaded SSTables stored on disk. Its major characteristic can be observed in how it handles write operations: entries are initially added to the memtable and are then flushed to SSTables after an interval or when the memtable reaches a certain size. This mechanism makes writes very fast but can slow down reads. &lt;/p&gt;

&lt;h4&gt;
  
  
  Implication on read and write operations
&lt;/h4&gt;

&lt;p&gt;Reads have to look up the key in the &lt;em&gt;memtable&lt;/em&gt; first for the most recent entries, then traverse through the layers of SSTables. Looking up a key that does not exist in the LSM tree is therefore a painful operation, because it ends up searching through all the layers of data available. &lt;em&gt;&lt;u&gt;Bloom filters&lt;/u&gt;&lt;/em&gt; help mitigate this by quickly determining whether a key might exist in an SSTable at all.&lt;/p&gt;

&lt;p&gt;Overall, LSM Trees seem to provide superior performance for workloads that involve a lot of write operations, while plain SSTables are preferable when quicker reads are essential.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Databases like Cassandra and LevelDB use LSM Trees.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To address the limitation of losing the index when the server crashes, it sounds intuitive to store the index on disk, since it is small; being small, it should be easy to read and update. However, that is not the case, because of how storage disks (SSDs and HDDs) are designed. In practice, the index is written to a file that is read back into memory when the database server is up and running again. Furthermore, for every write operation, the key is added to the &lt;em&gt;&lt;strong&gt;memtable&lt;/strong&gt;&lt;/em&gt; - remember, the &lt;em&gt;memtable&lt;/em&gt; is the in-memory data structure - and to a &lt;em&gt;&lt;strong&gt;Write Ahead Log (WAL)&lt;/strong&gt;&lt;/em&gt;, which is persisted on disk. Recall that write operations are very efficient. The essence of having a persistent WAL is that, in the event of a server crash, the WAL has all that is required to rebuild the in-memory index. The WAL technique applies to SSTables, LSM Trees and other indexing strategies.&lt;/p&gt;

&lt;h2&gt;
  
  
  B-Trees
&lt;/h2&gt;

&lt;p&gt;All the indexes and data structures discussed above have something in common: they are all log-based. Here, I discuss a very popular and mature data structure that is widely used in the most popular databases today. B-Trees are data structures that keep and maintain sorted data such that they allow searches and sequential access in logarithmic time. They are self-balancing and, if you are familiar with programming, very similar in nature to a binary search tree. &lt;/p&gt;

&lt;p&gt;They are tree-like and essentially break down the database into fixed-size blocks referred to as &lt;em&gt;&lt;strong&gt;pages&lt;/strong&gt;&lt;/em&gt;. These pages are commonly 4KB in size by default, although many RDBMSs offer the option to change the page size. In comparison to log-based approaches, which use append-only &lt;em&gt;segments&lt;/em&gt;, B-Trees allow us to access and manipulate the data in place on disk using references to those &lt;em&gt;pages&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;Because it is tree-like, a B-Tree has a root, inner nodes and leaf nodes. One node is designated as the root of the B-Tree and holds pointers to child nodes. Every lookup by key starts there and traverses down the hierarchy to the key being looked for, guaranteeing access in logarithmic time. &lt;/p&gt;
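
&lt;p&gt;From the SQL user's point of view, all of this machinery hides behind a single statement, since B-Trees (or the B+Tree variant) back the default index type in most relational databases. A hedged sketch, where the table and column names are purely illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Most RDBMSs build a B-Tree index by default
CREATE INDEX idx_employees_last_name ON employees (last_name);

-- Point lookups and range scans on the indexed key can now use the B-Tree
SELECT * FROM employees WHERE last_name BETWEEN 'A' AND 'D';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;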

&lt;p&gt;Remember, we assumed that our data entries are key-value pairs. It is worth noting that of the three node types - root, inner and leaf - only the leaf nodes contain the actual information (values); the others contain references to further nodes, and so on, until they point to the corresponding leaf node. Additionally, within the tree hierarchy, the keys in a node act as boundaries, indicating which child reference covers a given key range. &lt;/p&gt;

&lt;p&gt;When a write operation (an insert or update in SQL terminology) is done, the idea is to locate the page where the value should live on disk and write the value to that page. We must therefore consider what happens when a page becomes full. In this case, a split operation occurs: the page is split in two, and the parent nodes above must be updated to reflect the change.&lt;/p&gt;

&lt;p&gt;A B-Tree's depth and width are inversely proportional to one another; the wider the B-Tree, the shallower it is. The technical term for the width is the &lt;em&gt;&lt;strong&gt;branching factor&lt;/strong&gt;&lt;/em&gt;, defined as the number of references to child nodes within a single node. Linking back to a write operation that causes a page to split: this will, in turn, require several updates if the B-Tree is deep. Additionally, careful measures must be put in place to protect the tree's data structure during splits and concurrent operations, and to achieve this, &lt;em&gt;internal locks (latches)&lt;/em&gt; are held while nodes are being updated.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;B-Trees are not log-based but they also use Write-Ahead logs to recover from crashes and for redundancy.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  B-Trees in comparison to LSM Trees
&lt;/h3&gt;

&lt;p&gt;Writes are slower in B-Trees in comparison to LSM Trees. On the other hand, reads are faster when using B-Trees. It is, however, important to experiment and test extensively for any use case. Benchmarking is essential when choosing the database and indexing strategy that would best support your workload. &lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Each data structure supporting an index has its strong points as well as its areas of weakness. In this article, we discussed 4 main data structures that power indexes: hash indexes, SSTables, LSM Trees and B-Trees. The first 3 are log-based, meaning they are append-only. We mentioned their respective limitations and how some address the limitations of others; for instance, SSTables support range queries because the data is sorted. We also covered some general optimizations, such as the Write Ahead Log for crash recovery, compaction and merging for saving disk space, and latches for concurrency control. Lastly, we briefly compared the performance of different pairs of data structures at the tail end of each section.&lt;/p&gt;

&lt;p&gt;This article is strongly inspired by my current read: &lt;em&gt;&lt;a href="https://www.amazon.co.uk/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321"&gt;Designing Data-Intensive Applications by Martin Kleppmann&lt;/a&gt;&lt;/em&gt;. I wholeheartedly recommend it if you want to broaden your understanding of data systems. Please share in the comments section any interesting books, articles, and posts that have inspired you or helped you understand a concept better.&lt;/p&gt;

</description>
      <category>database</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Functions in SQL</title>
      <dc:creator>Ayokunle Adeniyi</dc:creator>
      <pubDate>Tue, 27 Feb 2024 09:01:00 +0000</pubDate>
      <link>https://forem.com/ayokunle/functions-in-sql-179h</link>
      <guid>https://forem.com/ayokunle/functions-in-sql-179h</guid>
      <description>&lt;p&gt;Functions are very similar to procedures in databases. In this article, I will try to break down functions in SQL and also mention the differences between functions and procedures in a database.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Remember! The RDBMS used in this article is the Oracle database. Let's dive in!!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The common differences between functions and procedures are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A function is a block of code called to perform a task and must return a value, while a procedure is a block of code called to perform a task that does not have to return a value.&lt;/li&gt;
&lt;li&gt;A function can be called in a procedure but a procedure cannot be called in a function.&lt;/li&gt;
&lt;li&gt;A function can be called in a select statement, provided it does not have OUT or IN OUT parameters, while procedures cannot be executed in a select statement.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In PL SQL (Oracle), there are two (2) types of functions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In-built functions: these are functions that come with the database at installation. An example of an in-built function is NVL, which is used to handle null values.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT FIRST_NAME, NVL(HAS_SIBLINGS, 0) FROM STUDENTS;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;The sample code above returns zero (0) where the &lt;code&gt;HAS_SIBLINGS&lt;/code&gt; column has a &lt;code&gt;null&lt;/code&gt; value.&lt;/em&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;User-defined functions: As the name implies, these are functions created by developers and users of the database.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Just like procedures, functions have two (2) main parts: the header and the body. The header has the name of the function and a RETURN clause specifying the data type to be returned by the function. The header looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- header

CREATE OR REPLACE FUNCTION get_emp_email(v_id IN NUMBER)
RETURN VARCHAR2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function body has three (3) main parts: the declarative section, the executable section and the exception handling section. The exception handling section is optional and does not have to be included. The syntax follows the format below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IS

[declarative section]

BEGIN

[executable section]

[EXCEPTION]

[exception-handling section]

END;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Variables, custom data types, cursors, etc. are declared in the declarative part. The executable part contains the actual piece of code to be executed. It sits between the BEGIN and END clauses and must contain at least one RETURN statement. Exceptions are handled in the exception handling section.&lt;/p&gt;

&lt;p&gt;Let's create a function that will return an employee's email in the format &lt;strong&gt;&lt;em&gt;&lt;a href="mailto:firstname.lastname@company.com"&gt;firstname.lastname@company.com&lt;/a&gt;&lt;/em&gt;&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE OR REPLACE FUNCTION get_emp_email(v_id IN NUMBER)
RETURN VARCHAR2
IS 
v_email VARCHAR2(150);
BEGIN
    SELECT LOWER(first_name  || '.' || last_name || '@learnplsql.com')
    INTO v_email
    FROM hr.employees
    WHERE employee_id = v_id;


    RETURN v_email;
END get_emp_email;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because the function created above does not have OUT or IN OUT parameters, it can be used in a select statement. Let's get the email of employees with employee ID between 130 and 150.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT EMPLOYEE_ID, FIRST_NAME, LAST_NAME, GET_EMP_EMAIL(EMPLOYEE_ID) EMAIL 
FROM HR.EMPLOYEES
WHERE EMPLOYEE_ID BETWEEN 130 AND 150;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkz5hrw5otd8acfzq039b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkz5hrw5otd8acfzq039b.png" alt="Output table" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I find functions convenient for inexpensive lookups like the example above; used this way, they generally make my code neater. Another use case for functions is to perform computations.&lt;/p&gt;
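
&lt;p&gt;As a hedged sketch of that second use case (the function name and formula are mine, reusing the same hr.employees columns), a small computation function could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE OR REPLACE FUNCTION get_annual_comp(v_salary IN NUMBER, v_comm_pct IN NUMBER)
RETURN NUMBER
IS
BEGIN
    /* Twelve monthly salaries plus commission; NVL guards against a null commission */
    RETURN (v_salary * 12) * (1 + NVL(v_comm_pct, 0));
END get_annual_comp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Since it only has IN parameters, it too can be called directly in a select statement.&lt;/p&gt;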

&lt;blockquote&gt;
&lt;p&gt;BONUS&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can practice what you have learnt on Oracle's Live SQL platform. The link is &lt;a href="https://livesql.oracle.com/"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>database</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>sql</category>
    </item>
    <item>
      <title>Database Triggers in PostgreSQL: A Deep Dive using AWS RDS</title>
      <dc:creator>Ayokunle Adeniyi</dc:creator>
      <pubDate>Tue, 20 Feb 2024 01:07:04 +0000</pubDate>
      <link>https://forem.com/ayokunle/elevating-database-functionality-with-triggers-a-deep-dive-using-aws-rds-4ieo</link>
      <guid>https://forem.com/ayokunle/elevating-database-functionality-with-triggers-a-deep-dive-using-aws-rds-4ieo</guid>
      <description>&lt;p&gt;&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@jrkorpa?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash"&gt;Jr Korpa&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/a-window-with-rain-drops-on-the-glass-E2i7Hftb_rI?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Database triggers are special types of database objects. They look like database procedures and functions, but they are executed only in response to a certain event. The procedural code in the body is "triggered" by specified DML or DDL operations on the database. The permitted operations are &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt; and &lt;code&gt;DELETE&lt;/code&gt; (DML) and &lt;code&gt;TRUNCATE&lt;/code&gt; (DDL), and the procedural code attached to the trigger can be executed &lt;code&gt;BEFORE&lt;/code&gt;, &lt;code&gt;AFTER&lt;/code&gt; or &lt;code&gt;INSTEAD OF&lt;/code&gt; the operation, exactly once &lt;strong&gt;per SQL statement&lt;/strong&gt; or &lt;strong&gt;per modified row&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This blog post uses Postgres and PL/pgSQL as the underlying technology. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Accordingly, there are two main ways to classify triggers: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Based on WHEN you want the trigger function to be executed: These are &lt;code&gt;BEFORE&lt;/code&gt;, &lt;code&gt;AFTER&lt;/code&gt; or &lt;code&gt;INSTEAD OF&lt;/code&gt; triggers.&lt;/li&gt;
&lt;li&gt;Per-row or Per-statement triggers: The keywords that represent per-row triggers are &lt;code&gt;FOR EACH ROW&lt;/code&gt; while the per-statement keyword is &lt;code&gt;FOR EACH STATEMENT&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: &lt;code&gt;FOR EACH STATEMENT&lt;/code&gt; is the default, if it is not declared.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A combination of both classes is required when building the trigger. That is, the trigger must have the WHEN clause (before, after or instead of) and must also either be &lt;em&gt;&lt;strong&gt;per row&lt;/strong&gt;&lt;/em&gt; or &lt;em&gt;&lt;strong&gt;per statement&lt;/strong&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The comprehensive documentation on how to combine both classifications can be found &lt;a href="https://www.postgresql.org/docs/current/sql-createtrigger.html"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Before we dive into the syntax and how to create a trigger in Postgres, let's consider the use cases for database triggers, along with their advantages and disadvantages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use cases for database triggers
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Auditing changes on DML events&lt;/li&gt;
&lt;li&gt;Enforcing data and referential integrity that cannot be easily defined using constraints.&lt;/li&gt;
&lt;li&gt;Enhancing security. For instance, preventing DML operations on a table after regular business hours&lt;/li&gt;
&lt;li&gt;Gathering Statistics&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Advantages
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Triggers are efficient when used appropriately. For instance, to carry out an automated action, eliminating the need for manual intervention. &lt;/li&gt;
&lt;li&gt;The functions attached to the trigger can be called in other code or attached to other objects. Therefore, potentially saving development time. A good example would be a trigger function to log all insertions on a table. This sort of function could be table-agnostic and highly reusable.
&lt;/li&gt;
&lt;li&gt;Triggers can offer a high level of control.&lt;/li&gt;
&lt;li&gt;Just like other database objects, triggers can improve overall performance by moving workloads from the application layer to the database layer.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Disadvantages
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Triggers can be complex to write and troubleshoot.&lt;/li&gt;
&lt;li&gt;They introduce a performance overhead, adding an extra workload to every DML operation that fires them.&lt;/li&gt;
&lt;li&gt;They are programmatic and easy to alter or disable. Therefore, they can not be fully relied on as security mechanisms and must be used with caution in this case.&lt;/li&gt;
&lt;li&gt;When used as constraints, they are more error-prone because they have to be developed.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Syntax
&lt;/h2&gt;

&lt;p&gt;In Postgres, a trigger is made up of 2 separately defined parts: the function/procedure (basically the block of code that will be executed) and the trigger definition itself. We will get into the syntax shortly using code snippets, but before that, I will describe both the trigger function and the trigger specification.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Trigger Function
&lt;/h3&gt;

&lt;p&gt;The function must be defined before the trigger definition. It does not differ from a typical database function, except that the return type in the function specification must be &lt;code&gt;TRIGGER&lt;/code&gt;. Below is an abridged version of a function specification.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE [ OR REPLACE ] FUNCTION
    name ( [ [ argmode ] [ argname ] argtype [ { DEFAULT | = } default_expr ] [, ...] ] )
    [ RETURNS rettype ]
  { LANGUAGE lang_name
    | sql_body
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where the &lt;code&gt;rettype&lt;/code&gt; is replaced with the keyword &lt;code&gt;TRIGGER&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Trigger Specification
&lt;/h3&gt;

&lt;p&gt;To define the trigger, we use the &lt;code&gt;CREATE TRIGGER&lt;/code&gt; keywords alongside other options. It is also here that the function created earlier is executed.&lt;/p&gt;

&lt;p&gt;Trigger Syntax&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE [ CONSTRAINT ] TRIGGER name { BEFORE | AFTER | INSTEAD OF } { event [ OR ... ] }
    ON table_name
    [ FROM referenced_table_name ]
    [ NOT DEFERRABLE | [ DEFERRABLE ] [ INITIALLY IMMEDIATE | INITIALLY DEFERRED ] ]
    [ FOR [ EACH ] { ROW | STATEMENT } ]
    [ WHEN ( condition ) ]
    EXECUTE PROCEDURE function_name ( arguments )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where the event can be one of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    INSERT
    UPDATE [ OF column_name [, ... ] ]
    DELETE
    TRUNCATE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Example: Audit use case
&lt;/h2&gt;

&lt;p&gt;Let's create a trigger for one of my favourite use cases: auditing operations on a certain table.&lt;br&gt;
Here, we will act as data professionals setting up a trigger to audit changes to employee details. &lt;/p&gt;

&lt;p&gt;First, we create 2 tables: one to house employee details (&lt;code&gt;employees&lt;/code&gt;) and another to capture relevant audit details (&lt;code&gt;employees_audit&lt;/code&gt;). All tables will be under the &lt;code&gt;hr&lt;/code&gt; schema.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Create demo tables
CREATE TABLE IF NOT EXISTS hr.employees
( 
    employee_id    INTEGER CONSTRAINT emp_emp_id_pk PRIMARY KEY, 
    first_name     VARCHAR(20),
    last_name      VARCHAR(25)  CONSTRAINT emp_last_name_nn NOT NULL, 
    email          VARCHAR(25)  CONSTRAINT emp_email_nn NOT NULL, CONSTRAINT emp_email_uk UNIQUE (email),  
    phone_number   VARCHAR(20),
    hire_date      DATE  CONSTRAINT emp_hire_date_nn NOT NULL,
    job_id         VARCHAR(10)  CONSTRAINT emp_job_nn NOT NULL,  
    salary         NUMERIC(8,2)  CONSTRAINT emp_salary_min CHECK (salary &amp;gt; 0),
    commission_pct NUMERIC(4,2),
    manager_id     INTEGER CONSTRAINT emp_manager_fk REFERENCES hr.employees(employee_id),
    department_id  INTEGER
);



CREATE TABLE IF NOT EXISTS hr.employees_audit
( 
    employee_id    integer,
    first_name     VARCHAR(20),  
    last_name      VARCHAR(25),  
    email          VARCHAR(25),
    phone_number   VARCHAR(20),  
    hire_date      DATE,
    job_id         VARCHAR(10),
    salary         NUMERIC(8,2),  
    commission_pct NUMERIC(4,2),  
    manager_id     INTEGER,  
    department_id  integer, 
    date_changed   DATE constraint emp_aud_date_change not null,
    client_ip    VARCHAR(25),
    client_host_name VARCHAR(25),
    client_db_username VARCHAR(30),
    client_application varchar(80)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remember! To create a trigger in Postgres, we must create the function that is going to be executed by the trigger. If you need a refresher on writing database functions, please visit the &lt;a href="https://co.hashnode.dev/functions"&gt;functions&lt;/a&gt; post in this series. Here we go!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE OR REPLACE FUNCTION hr.log_employee_changes_function()
  RETURNS TRIGGER 
  LANGUAGE PLPGSQL
  AS
$$
declare
    date_of_change date := current_date;
    client_ip varchar(25);
    hostname varchar(25);
    client_db_user varchar(25);
    client_application varchar(80);
begin
    /* NULL-safe row comparison: plain &amp;lt;&amp;gt; can yield NULL when a column is NULL */
    IF new IS DISTINCT FROM old THEN 

        select pg_catalog.inet_client_addr() 
        into client_ip;

        /* one lookup of this backend's session info instead of three separate queries */
        select client_hostname, usename, application_name
        into hostname, client_db_user, client_application
        from pg_catalog.pg_stat_activity
        where pid = pg_backend_pid();

        insert
            into
            hr.employees_audit
        values(
            old.employee_id, 
            old.first_name,
            old.last_name,
            old.email,
            old.phone_number,
            old.hire_date,
            old.job_id,
            old.salary,
            old.commission_pct, 
            old.manager_id,
            old.department_id,
            date_of_change,
            client_ip,
            hostname,
            client_db_user,
            client_application
        ) ;

    END IF;

    RETURN NEW;
END;
$$
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above function returns a &lt;code&gt;TRIGGER&lt;/code&gt;. In its body, it compares the new and old row values to see if there is a change. If there is, it collects some additional information specific to that session, such as the database user, the client's IP address and the client application name, and computes the date of the operation using the &lt;code&gt;CURRENT_DATE&lt;/code&gt; built-in function. &lt;/p&gt;

&lt;p&gt;All these are inserted into the &lt;code&gt;employees_audit&lt;/code&gt; table. But when are they inserted? Is it before or after a change to the employee data? We can only tell once we define the trigger itself using the &lt;code&gt;CREATE TRIGGER&lt;/code&gt; keywords.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE OR REPLACE TRIGGER log_employee_changes_trigger 
  BEFORE UPDATE
  ON hr.employees
  FOR EACH ROW
  EXECUTE PROCEDURE hr.log_employee_changes_function();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;One thing to notice: the trigger name does not carry the hr schema as a prefix when being defined. This is because the trigger inherits the schema of its table, as specified in the Postgres docs &lt;a href="https://www.postgresql.org/docs/current/sql-createtrigger.html"&gt;here&lt;/a&gt;. &lt;/p&gt;
&lt;/blockquote&gt;
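
&lt;p&gt;To see the trigger in action, here is a quick, hedged test (it assumes the employees table already contains a row with employee_id 100):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Fires log_employee_changes_trigger once for the modified row
UPDATE hr.employees SET salary = salary * 1.10 WHERE employee_id = 100;

-- The pre-update values should now appear in the audit table
SELECT employee_id, salary, date_changed, client_db_username
FROM hr.employees_audit;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;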

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;br&gt;
Triggers are special objects in a database that are executed when certain events happen. There are several cases where a database trigger can be useful, such as enforcing constraints, tracking changes on tables that closely relate to business processes, and more. &lt;br&gt;
Although useful, they have their pros and cons. Some advantages are that they provide reusable code and can be highly efficient; on the flip side, they cannot be relied on as a security mechanism and may be complex to troubleshoot.&lt;/p&gt;

&lt;p&gt;Overall, we have created a fairly simple database trigger and we can be proud of ourselves for understanding the syntax. Please leave a comment on what you think database triggers can be used for in your organization, project etc.&lt;/p&gt;

</description>
      <category>database</category>
      <category>programming</category>
      <category>postgres</category>
      <category>aws</category>
    </item>
    <item>
      <title>How to write Database Procedures</title>
      <dc:creator>Ayokunle Adeniyi</dc:creator>
      <pubDate>Thu, 15 Feb 2024 13:51:43 +0000</pubDate>
      <link>https://forem.com/ayokunle/database-procedures-3bd7</link>
      <guid>https://forem.com/ayokunle/database-procedures-3bd7</guid>
      <description>&lt;h2&gt;
  
  
  What are Database Procedures
&lt;/h2&gt;

&lt;p&gt;Procedures are commonplace across many occupations and processes. Simply put, a 'procedure' is a series of actions conducted in a certain order or manner to achieve a particular result. A familiar example is a surgical procedure, or 'operation': the steps a surgeon takes to operate on a person. In this series, the focus is on procedures as they relate to database operations. I am assuming we all know what a database is, but for those who don't, you can read up on it &lt;a href="https://en.wikipedia.org/wiki/Database"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Procedures, also commonly known as stored procedures, can be defined simply as reusable blocks of code, stored in a schema, that represent a specific piece of business logic. As a quick example, imagine you are an HR manager responsible for staff remuneration and payroll; periodically, you might want to increase staff salaries, change pay grades, and run various related sub-processes. Achieving this manually would be very time-consuming, cumbersome, and error-prone. With a simple procedure, the operation can be performed quite easily whenever the need arises.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Let's get right into it. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As we all know, there are different types of databases, but this article will focus on procedures as they relate to the Oracle RDBMS. No need to worry: across all databases running standard SQL, you will only encounter minor syntactical differences. The examples in this article use the common HR schema, and you can practice &lt;a href="https://livesql.oracle.com/"&gt;here&lt;/a&gt;. We are going to start by writing a procedure that takes in an employee ID and returns the employee's name.&lt;/p&gt;

&lt;h2&gt;
  
  
  Syntax
&lt;/h2&gt;

&lt;p&gt;Generally, procedures have two (2) parts: the &lt;strong&gt;specification&lt;/strong&gt; (spec for short) and the &lt;strong&gt;body&lt;/strong&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Procedure Specification
&lt;/h2&gt;

&lt;p&gt;The specification of the procedure serves as the entry point into the procedure. It contains the procedure name, the parameters related to that procedure (inputs and outputs), and other optional clauses used to describe the procedure. The specification always starts with the keyword &lt;code&gt;PROCEDURE&lt;/code&gt; and ends with an optional parameter list. The parameter list contains the input and output variables for that procedure. Procedures that do not require a parameter list have their specifications written without parentheses.&lt;/p&gt;

&lt;p&gt;As opposed to functions, procedures can have multiple outputs or none at all. Most of the time, procedures have parameters because we need to do similar things with different data. The sample specification below has 2 parameters: an input parameter (the employee ID, a number) and an output parameter (the employee name, a varchar2). The IN and OUT indicators specify whether a parameter is an input into the procedure or an output from it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- procedure specification sample

procedure get_emp_name(v_id in number, v_emp_name out varchar2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code snippet above shows a procedure specification with a parameter list; each parameter is separated by a comma. The snippet below shows a procedure specification without a parameter list.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- procedure specification without a parameter list

procedure get_all_emp_names
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;NOTE: The general syntax is (parameter_name parameter_mode parameter_datatype), where the parameter name is chosen by the developer and the parameter mode is either '&lt;code&gt;IN&lt;/code&gt;' or '&lt;code&gt;OUT&lt;/code&gt;', followed by the corresponding datatype of the parameter.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Procedure Body
&lt;/h2&gt;

&lt;p&gt;The procedure body starts after the specification with the keyword '&lt;code&gt;IS&lt;/code&gt;' or '&lt;code&gt;AS&lt;/code&gt;'. The body usually has three parts, namely:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The declarative part (optional)&lt;/li&gt;
&lt;li&gt;The executable part&lt;/li&gt;
&lt;li&gt;The exception-handling part (optional)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The declarative part: All the variables, data types, and cursors to be used throughout the procedure are declared in this section of the procedure body, following the keyword '&lt;code&gt;IS&lt;/code&gt;' or '&lt;code&gt;AS&lt;/code&gt;'.&lt;/p&gt;

&lt;p&gt;The executable part contains the program to be executed. This is where the logic of the procedure is written; it uses the parameters specified, as well as the variables declared, in order to achieve the purpose of the procedure. Each valid SQL statement must end with '&lt;code&gt;;&lt;/code&gt;', which identifies it as a single statement and syntactically separates it from the statements that follow in the procedure body.&lt;/p&gt;

&lt;p&gt;Exceptions are handled in the exception-handling part of the procedure body. Exception handling is very important in general programming, as it prevents our code from crashing when it runs into an unexpected error. A very simple exception handler is used as the example in this article.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- declarative part

IS
/* declaring one variable called l_name */ 

l_name VARCHAR2(20);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- executable part (the real logic of the procedure)
begin

    select first_name  || ' ' || last_name 
    into l_name                            -- Variable to hold the result of the query
    from employees
    where employee_id = v_id;   -- Input specified in the specification of the procedure

    v_emp_name := l_name;       -- Output from the procedure.

end get_emp_name;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: The exception handler sits inside the '&lt;code&gt;BEGIN&lt;/code&gt;' and '&lt;code&gt;END&lt;/code&gt;' code block. The snippet below returns 'exception block, employee does not exist' on any exception encountered while executing the code between &lt;code&gt;BEGIN&lt;/code&gt; and the &lt;code&gt;EXCEPTION&lt;/code&gt; keyword.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEGIN

    /* insert executable code here */

    -- exception handler
    exception 
        when others then
        v_emp_name := 'exception block, employee does not exist';

END get_emp_name;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Putting it all together
&lt;/h2&gt;

&lt;p&gt;Putting it all together, we have&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE OR REPLACE PROCEDURE get_emp_name(v_id IN NUMBER, v_emp_name OUT VARCHAR2)
IS  
    -- declarative part
    l_name VARCHAR2(20);

    -- executable part
BEGIN
    SELECT first_name  || ' ' || last_name 
    INTO l_name
    FROM employees
    WHERE employee_id = v_id;

    v_emp_name := l_name;

    -- exception handler
    EXCEPTION 
        WHEN OTHERS THEN
            v_emp_name := 'exception block, employee does not exist';

END get_emp_name;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above procedure takes in one parameter (&lt;code&gt;v_id&lt;/code&gt;) and returns the full name of the employee whose employee ID matches &lt;code&gt;v_id&lt;/code&gt;. Where the employee ID does not exist, the exception block is triggered and the procedure returns 'exception block, employee does not exist' as its output.&lt;/p&gt;
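
&lt;p&gt;As a quick usage sketch (assuming the procedure above has been compiled; &lt;code&gt;DBMS_OUTPUT.PUT_LINE&lt;/code&gt; is used here only to print the result, and 100 is a sample employee ID from the HR schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- calling the procedure from an anonymous PL/SQL block
declare
    l_result varchar2(20);   -- holds the OUT parameter
begin
    get_emp_name(100, l_result);
    dbms_output.put_line(l_result);   -- prints the employee's full name
end;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;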

&lt;p&gt;To conclude, procedures are perfect for implementing business logic that requires multiple statements which must all be executed together. A simple and common example is a transaction in a banking system, where one account is debited, another account is credited, and the transaction is recorded in a journal. In SQL, this operation requires two UPDATE statements and an INSERT statement before the transaction can be called successful; if any step fails, we want the statements that already executed to be rolled back, as the sketch below shows. Procedures are also good for preventing SQL injection, and they usually offer a performance improvement compared to connecting and executing each statement individually against the database.&lt;/p&gt;
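
&lt;p&gt;A minimal sketch of that banking example (the &lt;code&gt;accounts&lt;/code&gt; and &lt;code&gt;journal&lt;/code&gt; tables and all column names here are hypothetical, purely for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- hypothetical sketch: debit one account, credit another, record the transfer
create or replace procedure transfer_funds(v_from in number, v_to in number, v_amount in number)
is
begin
    update accounts set balance = balance - v_amount where account_id = v_from;
    update accounts set balance = balance + v_amount where account_id = v_to;
    insert into journal (from_account, to_account, amount)
    values (v_from, v_to, v_amount);
    commit;   -- all three statements succeed together
exception
    when others then
        rollback;   -- undo every executed statement if any step fails
        raise;
end transfer_funds;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;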

</description>
      <category>database</category>
      <category>sql</category>
      <category>programming</category>
      <category>oracle</category>
    </item>
    <item>
      <title>Mastering Common Database Objects - Series 1 .. n</title>
      <dc:creator>Ayokunle Adeniyi</dc:creator>
      <pubDate>Thu, 15 Feb 2024 13:32:38 +0000</pubDate>
      <link>https://forem.com/ayokunle/common-database-objects-series-1-n-36i3</link>
      <guid>https://forem.com/ayokunle/common-database-objects-series-1-n-36i3</guid>
      <description>&lt;p&gt;In this series, I hope to cover some common database objects such as procedures, functions, and triggers, while also providing frequent real-life scenarios where these objects are used. The scope of the series will be limited to Oracle PL/SQL syntax. However, most of these objects exist in other database technologies, and I am pretty sure the knowledge gained here will transfer to other RDBMSs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Get ready!!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Each coming article will focus on one database object only and will contain code snippets. The objects that will be covered are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://dev.to/ayokunle/database-procedures-3bd7"&gt;Procedures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/ayokunle/functions-in-sql-179h"&gt;Functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://co.hashnode.dev/packages" rel="noopener noreferrer"&gt;Packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/ayokunle/elevating-database-functionality-with-triggers-a-deep-dive-using-aws-rds-4ieo"&gt;Database Triggers&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;strong&gt;Coming Soon&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cursors&lt;/li&gt;
&lt;li&gt;Jobs&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>database</category>
      <category>sql</category>
      <category>programming</category>
      <category>awsbigdata</category>
    </item>
    <item>
      <title>Energy Forecasting with LSTM Neural Network</title>
      <dc:creator>Ayokunle Adeniyi</dc:creator>
      <pubDate>Mon, 12 Feb 2024 00:09:00 +0000</pubDate>
      <link>https://forem.com/ayokunle/energy-forecasting-with-lstm-neural-network-2mo1</link>
      <guid>https://forem.com/ayokunle/energy-forecasting-with-lstm-neural-network-2mo1</guid>
      <description>&lt;p&gt;Several methods have been used in energy forecasting over the years: methods from different disciplines, such as ARMA and ARIMA models from econometrics, and probabilistic and regression models from statistics, a domain that also intersects with the symbolic AI field, to name a few. This experiment forecasts energy demand using Long Short-Term Memory (LSTM) neural network models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Source
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://data.nationalgrideso.com/data-groups/demand"&gt;National Grid Electricity System Operators (National Grid ESO)&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Google Colab Code
&lt;/h2&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;a href="https://colab.research.google.com/drive/1QEZOsjiF1Wls0AePxQmaJoGpFlkX2Fm4?usp=sharing" rel="noopener noreferrer"&gt;
      colab.research.google.com
    &lt;/a&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;Gers, F. A., Schmidhuber, J., &amp;amp; Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10), 2451–2471.&lt;/p&gt;

&lt;p&gt;Hochreiter, S., &amp;amp; Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780. &lt;a href="https://doi.org/10.1162/neco.1997.9.8.1735"&gt;https://doi.org/10.1162/neco.1997.9.8.1735&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Yu, Y., Si, X., Hu, C., &amp;amp; Zhang, J. (2019). A Review of Recurrent Neural Networks: LSTM Cells and Network Architectures. Neural Computation, 31(7), 1235–1270. &lt;a href="https://doi.org/10.1162/neco_a_01199"&gt;https://doi.org/10.1162/neco_a_01199&lt;/a&gt;&lt;/p&gt;

</description>
      <category>lstm</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>datascience</category>
    </item>
    <item>
      <title>What to think about when designing, building, managing and operating data systems.</title>
      <dc:creator>Ayokunle Adeniyi</dc:creator>
      <pubDate>Sat, 10 Feb 2024 21:17:07 +0000</pubDate>
      <link>https://forem.com/aws-builders/what-to-think-about-when-designing-building-managing-and-operating-data-systems-19m0</link>
      <guid>https://forem.com/aws-builders/what-to-think-about-when-designing-building-managing-and-operating-data-systems-19m0</guid>
      <description>&lt;p&gt;This post draws extensively from a book I am currently reading and my experience managing data systems (primarily, core data infrastructure). The book is Designing Data-Intensive Applications by Martin Kleppmann.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hey! What is even a Data System?
&lt;/h2&gt;

&lt;p&gt;First, I would like to define a data system in my own words. But before that: I used to think of data systems as databases only, or in a simpler sense, anything that serves as a database, from physical sheets of paper and the popular Microsoft Excel to more complex databases like Oracle, MySQL and MSSQL. In essence, I only thought of data systems as systems that store data. As I continue to grow in my career, I have begun to realise that almost all systems are, in some way, data systems. To put out a definition, I would say a data system is any system, whether digital or analogue, that needs data, processes data and/or stores data. It could do one of these things or a combination of them. I say this because a system that needs data influences the choices of every other system that is coupled with it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A data system is a system that cares about the shape, size, type, and form of the data it needs to function as designed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Fundamental concerns when designing data systems
&lt;/h2&gt;

&lt;p&gt;There are certain things one must consider when building efficient systems. These concerns apply to almost all types of digital systems; in this blog, I discuss them in the context of data systems specifically. Anyone designing a data system will have questions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What happens when we have double the traffic that we have now?&lt;/li&gt;
&lt;li&gt;When something goes wrong, can we see when the fault happened, what caused it and who caused it, and can we re-create it?&lt;/li&gt;
&lt;li&gt;How do we make sure that the system can recover from failure, either manually or automatically (preferably), without losing or corrupting data?&lt;/li&gt;
&lt;li&gt;Can we make changes to the system in the future?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are building a similar system and have these questions in mind, you are thinking in the right direction. These typical questions can largely be grouped into three main buckets, which we will delve into shortly. They are also very broad, and far from fully covering all the considerations when designing a data system, or even when managing an already-built one.&lt;/p&gt;

&lt;p&gt;Back to the buckets 😅. The main areas of concern are &lt;strong&gt;reliability&lt;/strong&gt;, &lt;strong&gt;scalability&lt;/strong&gt; and &lt;strong&gt;maintainability&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Let's get cracking!!!!!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Reliability
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffsj89feaskcmo2jt6kf6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffsj89feaskcmo2jt6kf6.jpeg" alt="What do you mean we do not have reliability metrics meme!!!" width="430" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When referring to people, being reliable is akin to being trustworthy. That is, do people trust you to do what you are tasked and expected to do, even with the numerous distractions life throws at us daily? We expect the same from the digital systems we use, even more so the mundane everyday ones. We all want our messaging apps to keep our chats no matter what happens to our phones or the servers hosting them. Reliability has become an expectation in our society.&lt;/p&gt;

&lt;p&gt;Reliability refers to the correct functioning of a system during its life span according to its design expectations. It also entails the ability of a system to tolerate and recover from &lt;strong&gt;&lt;em&gt;faults&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;failures&lt;/em&gt;&lt;/strong&gt;. Other aspects of reliability involve ensuring access to only authorised personnel and keeping the system secure.&lt;/p&gt;

&lt;p&gt;'Fault' and 'failure' are often used interchangeably; however, they differ: a fault is an issue that causes the system to deviate from its expected behaviour, while a failure occurs when the system fails to provide the required service, usually as a result of several faults.&lt;/p&gt;

&lt;p&gt;Typical categories of faults that occur in data systems are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware faults:&lt;/strong&gt; These include disk crashes, power disruptions, cable wear, faulty RAM, etc. You will typically hear terms such as mean time between failures (MTBF) and mean time to failure (MTTF); IBM has a good blog post explaining both terms &lt;a href="https://www.ibm.com/blog/mttr-vs-mtbf/#"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Software errors:&lt;/strong&gt; These are also referred to as bugs. Sometimes, software errors can trigger hardware faults; for example, a bug at the kernel level could cause the hard disk to crash.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human errors:&lt;/strong&gt; The most common source of errors is humans. Even with the best intentions, we still make numerous errors. In cybersecurity, it is often said that humans are the weakest link, which shows how error-prone we are, even though we design and build the most robust systems. Since they are the most common source of errors, here are some tips for dealing with human errors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Decouple the components and aspects of the system where humans tend to make the most errors, e.g. make configuration modular instead of one big configuration.&lt;/li&gt;
&lt;li&gt;Monitor, monitor, monitor. Collect telemetry data relevant to the state of the system, both in parts and as a whole.&lt;/li&gt;
&lt;li&gt;Employ thorough and robust testing. Use unit tests and integration tests to make sure the system acts as intended. Netflix uses an interesting testing method called &lt;a href="https://netflix.github.io/chaosmonkey/"&gt;Chaos Monkey&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Use well-designed abstractions, which ensure that we interact with systems appropriately by minimizing what we can get wrong, reducing the room for human error.&lt;/li&gt;
&lt;li&gt;Design to recover quickly. Make error logs understandable, and document commonly faced errors and their recovery steps.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Scalability
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7br4x8hi6o9vq103qgcx.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7br4x8hi6o9vq103qgcx.jpeg" alt="Cosmic imagery from unsplash" width="800" height="760"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by NASA on Unsplash&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Scale can refer to different things in their respective contexts. Concerning data systems, we will focus on one fairly broad concept central to many of the questions asked when scaling vertically (up or down) or horizontally (out or in): defining your &lt;strong&gt;load&lt;/strong&gt;. The scalability of a system can be narrowed down to the question, "How will this system perform if we increase or decrease the load by a factor of X?"&lt;/p&gt;

&lt;p&gt;Essentially, load is the amount of computational work done by a system. To define load appropriately, it must be described by numbers referred to as &lt;em&gt;&lt;u&gt;load parameters&lt;/u&gt;&lt;/em&gt;. These parameters differ per system: they could be the number of users, requests per second, the size of each request, writes versus reads per second, the number of cache hits and misses, and so on. The idea I am trying to convey is that the right load parameters often depend on the architecture of the system, and that a good way to measure performance is by using percentiles rather than averages, because percentiles let you intuitively estimate the number of affected users.&lt;/p&gt;

&lt;p&gt;Take, for instance, an application serving 1 million unique users, connecting to a database that processes most requests in between 10ms and 1s. That is not a long time in isolation, but it becomes critical when chaining different requests to provide a single piece of functionality. If the median response time is 30ms and the 95th percentile is 50ms, we can easily say that 50,000 users (5%) experience response times above 50ms over a given period, and that half of the user base sees response times of 30ms and below.&lt;/p&gt;
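
&lt;p&gt;As an illustration (a plain SQL sketch, assuming a hypothetical &lt;code&gt;request_logs&lt;/code&gt; table with one &lt;code&gt;response_ms&lt;/code&gt; row per request), the median and the 95th percentile can be computed directly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- hypothetical request_logs(user_id, response_ms): one row per request
select
    percentile_cont(0.5)  within group (order by response_ms) as median_ms,  -- half of requests are at or below this
    percentile_cont(0.95) within group (order by response_ms) as p95_ms      -- 5% of requests are slower than this
from request_logs;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;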

&lt;h3&gt;
  
  
  Ways to deal with varying load.
&lt;/h3&gt;

&lt;p&gt;When it comes to adapting to load, there is usually a dichotomy between increasing the number of machines (scaling horizontally) and adding resources such as CPU and memory to existing machines (scaling vertically). In practice, both approaches are likely to be used within a single system, because horizontal scaling can become incredibly complex to manage when there are many members, especially for stateful systems like data systems.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I like to think of stateful applications as systems where all members must be aware of the current state of the system to act on anything. Therefore, in multi-node architecture where data is across several nodes, all nodes must have a way to know the current state of the entire system.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One key thing to note when considering the multiple approaches is the operability of the system. In choosing your architecture, it is essential to make sure that it is relatively easy to manage, optimize and change, and that it is resilient.&lt;/p&gt;

&lt;h2&gt;
  
  
  Maintainability
&lt;/h2&gt;

&lt;p&gt;This aspect is so important that there is no point in building a system that cannot be maintained. Just don't do it. In my experience, I spend a lot of time thinking about how to make any system I work on easier to maintain. In fact, most of the cost of software lies in maintaining it: integrating with new systems, adapting to changes in environment and technology, bug fixes, vulnerability fixes, etc.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2s4e4klijitzk6ick41.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2s4e4klijitzk6ick41.jpeg" alt="Who caused this bug!!! ooops!!!" width="640" height="780"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Maintainability is a big word, so let us define it and deconstruct what it means for a system to be maintainable. &lt;a href="https://www.gasq.org/files/content/gasq/downloads/certification/IQBBA/IQBBA_Standard_glossary_of_terms_used_in_Software_Engineering_1.0.pdf"&gt;IEEE Standard Glossary of Software Engineering Terminology&lt;/a&gt; defines maintainability as: "The ease with which a software system or component can be modified to correct faults, improve performance or other attributes, or adapt to a changed environment."&lt;/p&gt;

&lt;p&gt;Making a system maintainable takes a lot of effort and planning. It is essential to think about maintenance as early as possible when designing a system. There is no one-size-fits-all method to do this, but there are principles that can guide you to achieving this. These design principles for data systems are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simplicity&lt;/strong&gt;: this refers to building your system such that new engineers and operators find it easy to understand. Remember, today's system is potentially tomorrow's legacy system, and we all like a properly built legacy system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auditability&lt;/strong&gt;: In my years working on data platforms, the most frequent requests were about who did what on the system and when they did it. Fine-grained access control and visibility into what happens can play a vital role in understanding internal user patterns, tracking and reproducing bugs, etc.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evolvability&lt;/strong&gt;: No system is likely to stay the same forever; it must evolve with rapidly changing user needs, business needs and other factors such as technological advancement. Systems must be malleable enough to cater to these changing needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operability&lt;/strong&gt;: Systems must be easy to operate, and good abstractions can help them run smoothly.&lt;/p&gt;

&lt;p&gt;The principles mentioned above may not be exhaustive, but they will cover most of your maintenance needs if you brainstorm each one carefully. It is also important to involve other stakeholders when designing systems.&lt;/p&gt;

&lt;p&gt;If you have reached this part of the article, I must say a &lt;strong&gt;big thank you&lt;/strong&gt; 🎉 for getting to the end. To recap: data systems are more than just databases and analytical systems; they extend to any system that cares about the data it receives and gives. Because these systems are so prominent, they must be designed and built carefully, and we discussed three main areas to think about when doing so: &lt;strong&gt;Reliability, Scalability and Maintainability&lt;/strong&gt;. Reliability is concerned with making the system fault-tolerant, scalability deals with ensuring the system performs optimally even when the load changes rapidly, and maintainability refers to a system being simple to operate and able to evolve.&lt;/p&gt;

&lt;p&gt;Feel free to drop a comment, feedback or question. It will be much appreciated.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>dataengineering</category>
      <category>discuss</category>
      <category>learning</category>
    </item>
  </channel>
</rss>
