Sunday, February 10, 2013

The Importance of Normalcy

In most cases, I champion being unique, and in most cases, that means not being normal. In the case of databases, however, that isn’t the case. In the lingo of database land, being Normal is to be Unique.

Early Databases

In the earliest times, the idea of a database was primitive. If you even had something called a database, it didn’t look much like our databases today. For the most part, programs stored their information in clever flat-file structures. Let’s imagine that you wanted a list of employees at a company. Some are developers, others are managers, some are in the Android department, some in the iOS department, some in the design department, and some in the web development department. An early application might have treated this data as a fancy sort of CSV -- merely separating each of the employee’s fields with a delimiter.

While this works fine for smaller numbers of people, you begin to encounter severe performance problems when you want to manipulate the data based on certain fields. At first, this isn’t a problem. Finding all the people who work under a manager as as simple as matching the “manager” field to the manager’s name. Of course, the wacky new employee (Moreena) who decided to enter their manager’s name as “Kymberlie” instead of “Kimberly” isn’t going to show up. It would make things simpler if the system instead showed options for existing managers so that Moreena can simply select her manager from a list. Generating a list of all existing managers now requires scanning every existing record in the document, a time consuming operation.

The Relational Revolution

Eventually, a new concept, the relational database, was born. Enter: Normalcy. A relational database was so called because it took these spreadsheet like structures, called them tables, and allowed you to specify relationships between them. Along with this came the concept of normalization, meaning that the database was structured in such a way that data was repeated as little as possible, and organized in a way that was as efficient as possible.

If we return to our example earlier, we would no longer store each employee as a single record with all of the information; at least, not exactly. Since multiple people work for the same manager, that manager’s name is duplicated data. We then take that data and “Normalize” it, by putting it in a separate table. The same is true of departments. Each table gets a special column called a primary key. This key is a unique internal way to identify each item. By convention, this is called the “id”. What’s important is that the unique key is internal to the database, and is linked but not equivalent to the actual text. To associate an employee with a manager and with a department, we define two fields for each employee which are a special type called a foreign key which points to the identification column in the table with the information we want to associate. There is now a simple, concise, table which lists the company’s managers, one of whom is Kimberly. When Moreena goes to select a manager, the database no longer needs to scan an entire document. Instead, it simply lists the small list of managers, and when she selects Kimberly from the list, inserts the unique identifier into her manager field as a foreign key, directly linking her to her manager in an efficient way.

Being Normal

Keeping your databases normalized is absolutely important both because it improves the efficiency of your queries, and also because it maintains the integrity of your data. Given our example, if Kimberly decides that she actually likes the way Moreena spelled her name, and gets it legally changed, it would be difficult to apply this to a monolithic structure. It would require the equivalent of a massive find-and-replace operation. Of course, that may result in Kimberly-the-housekeeper who takes out the trash getting her name changed as well. In a normalized database, since only the unique internal identifier links the manager to the employee, you can simply change the manager’s name, and leave the id the same. Thus, the next time it lists your manager, it shows the new name, as referenced by the id. Additionally, modern databases provide functions that further speed up queries on normalized data, allowing even a moderately powerful server to handle literally millions of data elements and return complex queries in a fraction of a second.

No comments:

Post a Comment