Knowing the data in designing software systems

Most modern software systems, particularly consumer products such as social media and advertising platforms, have data as one of their most important components. Many applications are built on top of analytics over various sensor data. In their book "The Second Machine Age", Erik Brynjolfsson and Andrew McAfee identify data science as a driver of technological innovation in the second decade of the 21st century.

A data-driven software system is one that processes raw data to generate more meaningful interpretations or information from it. This activity can be as simple as cleaning the raw data to produce a structured view, or as complex as running analytics to find whether two people are likely to meet for coffee. To get any useful information or analytical outcome from the data, we should first know its intricacies. Among the many things we should know are how the data is organized within the software system and database, how the data behaves when it is modified, and which parts of the data are frequently modified (I call these the "hot spots"). These are generic but important considerations that apply to well-designed software systems. In this post, I briefly touch upon these three characteristics of data to keep in mind while designing software. These are lessons from my own experience and may be limited in their scope. However, I have found that if we neglect these aspects of the data, the software system generally runs into deeper design problems and bugs.

Organization of the data within the software

Data is often read from a database such as MySQL, where it is organized in relational tables. While it is important to understand the schema of a table and the data structure used in each of its columns, it is even more important to consider how your software system uses the data present in the table. For instance, a table that stores network addresses in network byte order forces every application that reads from it to convert each address into host byte order. Frequent conversion from network byte order to host byte order may impose unnecessary computational load on the software system. If we cannot change the schema to store addresses in host byte order, we could cache the host-byte-order addresses after reading the network-byte-order addresses from the table. This saves the time the software would otherwise spend fetching the record from the database and performing the conversion. Hence we should understand how our software uses the data in order to design better data structures and program flows that interface well with the data source.

How does the data behave when it is modified, created or deleted?

Traditional databases allow users to specify alerts on database tables. An alert is a notification sent to the user program when some entry in a table of the program's interest changes. The change can be the creation, update, or deletion of a row in the table. In all three events, the program is expected to set its state and act accordingly. Responding to a create or update event is usually straightforward: the program inspects what was added or modified and updates its internal state accordingly. Handling the deletion of a record is, however, non-trivial in some databases. Modern database management protocols like OVSDB (the Open vSwitch Database Management Protocol) usually do not specify which rows of a table were deleted; they provide the current snapshot of the database tables. Hence, to identify which entries were deleted, the program has to maintain a shadow copy of the table in memory and compare that copy against the current database snapshot. This deletion-handling procedure is computationally intensive, as we have to walk all the table entries to figure out which rows were deleted.
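The shadow-copy comparison can be sketched as follows. This is an illustrative example, not OVSDB-specific code; rows are represented as dictionaries, and row identity via an "id" column is an assumption for the sketch:

```python
def find_deleted_rows(shadow, snapshot):
    """Return rows present in the shadow copy but missing from the
    current snapshot, i.e. rows deleted since the last sync.
    Identity is the hypothetical primary-key column 'id'."""
    current_ids = {row["id"] for row in snapshot}
    return [row for row in shadow if row["id"] not in current_ids]

# Example: row 2 was deleted between the two snapshots.
shadow = [{"id": 1, "addr": "10.0.0.1"},
          {"id": 2, "addr": "10.0.0.2"}]
snapshot = [{"id": 1, "addr": "10.0.0.1"}]
```

Note that the cost is proportional to the size of both copies, which is exactly the computational burden described above: every sync walks the whole table even if nothing was deleted.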

Hence, it is important to design your software keeping in mind how the CRUD operations affect the database and the interfacing software. Each of the create, read, update and delete operations could have different implications for computation time and memory utilization.

Parts of the data that are frequently modified, or hot spots in the data

Another important property of data to keep in mind is which parts of the data are more frequently modified. This knowledge about the data is useful for designing efficient and scalable software systems. If we know which parts of the data get modified more frequently, we will be able to design the data structures of our program so that we can efficiently handle the frequent churn caused by these hot spots in the data.

Consider a dummy example. Let there be a two-column database table of tuples <A, B>, where both A and B are positive integers. Say we store an index for faster look-ups on the tuple <A, B> in a hash table, with column A acting as the key. If we search for a tuple with a given value of A, we can get it in constant time (assuming the world's greatest hash function spreads the tuples evenly across the hash table). However, if we were to search the same hash table for a tuple with a given value of B, we would be required to walk the entire hash table (too bad, even the world's best hash function didn't help us out in this case!). If a program using this design were to process many queries that search on column B, it would walk the entire hash table every time such a query is made. Hence, a hash table keyed on column A is not suitable for this program; keying the hash table on column B instead would give better performance. Since column B is the more frequently searched and queried element, it may be referred to as the hot spot in the data for this program. Hence, knowing which columns of a table are more likely to be accessed or modified frequently is useful in designing better data structures for your software.
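The <A, B> example can be sketched with plain dictionaries. This is a toy illustration assuming the values in each column are unique, so each column can serve directly as a key:

```python
# Toy table of <A, B> tuples with unique values in both columns.
rows = [(1, 10), (2, 20), (3, 30)]

# An index keyed on column A: constant-time look-ups by A only.
index_by_a = {a: (a, b) for a, b in rows}

def find_by_b_scan(table, b_value):
    """Search an A-keyed index for a given B value -- requires a
    full walk of the table, as described above (O(n))."""
    for tup in table.values():
        if tup[1] == b_value:
            return tup
    return None

# An index keyed on column B restores constant-time look-ups
# for the hot column.
index_by_b = {b: (a, b) for a, b in rows}
```

If column B is the hot spot, maintaining `index_by_b` (either instead of, or alongside, `index_by_a`) trades a little extra memory for constant-time queries on the frequently accessed column.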