
Naturalization vs. Normalization
The only logical choice

The process used in preparing data to be loaded into a relational database is called normalization. This process involves taking a single-source file and breaking it up into several related tables. The goal for each table is to contain only those data fields (columns, in RDBMS terms) that depend on "the key," "the whole key" and "nothing but the key." In addition, repeating fields are forbidden. As a result, queries normally need to join several tables to produce a useful result.
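As a rough illustration of the normalization step described above, the sketch below splits a small single-source file into two related tables and then joins them back together to answer a query. The table names, column names and sample rows are hypothetical, and SQLite stands in for any conventional RDBMS.

```python
import sqlite3

# Hypothetical single-source rows:
# (customer_id, customer_name, city, order_id, item)
flat_rows = [
    (1, "Ada Smith", "Denver", 100, "modem"),
    (1, "Ada Smith", "Denver", 101, "cable"),
    (2, "Bo Jones", "Austin", 102, "modem"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customer(customer_id), item TEXT)"
)

# Break the flat file up: customer facts are stored once per customer,
# order facts once per order.
for cust_id, name, city, order_id, item in flat_rows:
    conn.execute(
        "INSERT OR IGNORE INTO customer VALUES (?, ?, ?)",
        (cust_id, name, city),
    )
    conn.execute(
        "INSERT INTO orders VALUES (?, ?, ?)", (order_id, cust_id, item)
    )

customer_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]

# A useful result now needs a join across the two tables.
joined = conn.execute(
    "SELECT c.name, c.city, o.item "
    "FROM orders o JOIN customer c ON c.customer_id = o.customer_id "
    "ORDER BY o.order_id"
).fetchall()
```

Even in this tiny case the query has to touch both tables; with more entities the number of joins grows accordingly.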

The process used in preparing data to be loaded into MPbase is called naturalization. This process involves taking several source files and combining them into a very small number of tables. The goal for each table is to contain all of the data fields that relate to one natural order. As a result, queries normally need only to subset one or maybe two such tables to produce a useful result.
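The contrast with normalization can be sketched in a few lines. This is not MPbase's actual loader, whose internals are not described here; it only illustrates the general idea of combining several source files into one wide table kept in an assumed natural order, so that a query becomes a subset rather than a join. The file contents, field names and choice of ordering are all hypothetical.

```python
import csv
import io

# Hypothetical source files, shown inline so the sketch is self-contained.
customers_csv = "customer_id,name,city\n2,Bo Jones,Austin\n1,Ada Smith,Denver\n"
orders_csv = "order_id,customer_id,item\n100,1,modem\n101,1,cable\n102,2,modem\n"

customers = {
    row["customer_id"]: row
    for row in csv.DictReader(io.StringIO(customers_csv))
}

# Combine the sources into one wide table: every row carries all of the
# related fields, so nothing has to be looked up elsewhere at query time.
combined = []
for order in csv.DictReader(io.StringIO(orders_csv)):
    customer = customers[order["customer_id"]]
    combined.append(
        {
            "name": customer["name"],
            "city": customer["city"],
            "order_id": int(order["order_id"]),
            "item": order["item"],
        }
    )

# Assumed "natural order" for this data: customer name, then order number.
combined.sort(key=lambda row: (row["name"], row["order_id"]))

# A query is now a subset of a single table.
denver_items = [row["item"] for row in combined if row["city"] == "Denver"]
```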

While the rules defining relational databases can be beneficial to users and programmers, they are a nightmare for a DBMS's internals. MPbase's solution is to present the user with a de-coupled view of the data. What the user sees has very little to do with how the information is physically stored. This allows "views" of the data to look, act and feel relational, even though the database internally is not.

As an added advantage, the same MPbase can have different views supporting different applications and users. These views can be "relational," "hierarchical," or "multidimensional." In addition, the views can support different data formats and communication protocols. So an MVS mainframe may be looking at a hierarchical view in EBCDIC from TSO, while at the same time a Windows PC is seeing the same information as relational ASCII from a Visual Basic application.
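The idea of one stored record being presented in different data formats can be illustrated with Python's standard codecs. This is only a sketch of the encoding difference between the two clients, not of how MPbase builds its views; the `render_view` helper and the sample record are hypothetical, and `cp037` is just one common EBCDIC code page.

```python
def render_view(record: str, encoding: str) -> bytes:
    """Return the same logical record in the byte format a client expects."""
    return record.encode(encoding)


record = "SMITH, JOHN"

# What a mainframe client might receive (EBCDIC, code page 037).
mainframe_bytes = render_view(record, "cp037")

# What a PC client might receive (ASCII).
pc_bytes = render_view(record, "ascii")

# The bytes on the wire differ, but both decode to the same information.
same_information = (
    mainframe_bytes.decode("cp037") == pc_bytes.decode("ascii") == record
)
```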

The process used to prepare data for loading into MPbase is unique among computerized information systems. It most closely resembles the process used by file clerks in the days when businesses were run with people and paper. Through a set of statistical functions, the process attempts to find the most natural order for a given set of data. This ordering allows a natural segmentation of the data into a massively parallel framework.

At the same time, this process provides, as a by-product, information about any data that does not fit into the natural order. For example, this includes the data in the tails of the classic bell curve. Such data is invariably one of two types: data in error or interesting data. In either case, this is data you need to know about.
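The text does not say which statistical functions are used, so the following is only a generic sketch of the "tails of the bell curve" idea: flag any value whose z-score exceeds a chosen threshold. The sample values, the threshold of 2.0 and the function name are assumptions made for illustration.

```python
from statistics import mean, pstdev


def find_tail_values(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations
    from the mean (a simple z-score test, assumed for illustration)."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # all values identical: nothing sits in the tails
    return [v for v in values if abs(v - center) / spread > threshold]


# Hypothetical field values; one entry is clearly out of line.
ages = [50, 51, 49, 52, 48, 50, 51, 49, 50, 120]
outliers = find_tail_values(ages)
```

Whether a flagged value such as the last one is a keying error or a genuinely interesting record is a question for a person, which is the point the paragraph above makes.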

A quick example is NPSI's phone book demo. Start with 50 files, one for each state. Create two linked tables, one in natural order by name, the second by geography. Produce 80,000+ physical files supporting the two tables. Now, what you have is one logical table viewable as any type of database that makes sense to your application.
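A toy version of the phone book layout might look like the sketch below. It is not the demo's actual implementation: the listings are invented, the record layout is hypothetical, and segmenting by the first letter of the name is simply one assumed way to split an ordered table into many small physical pieces.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical per-state source files: (name, state, zip_code, phone)
state_files = {
    "CO": [
        ("Smith, Ann", "CO", "80202", "303-555-0101"),
        ("Adams, Lee", "CO", "80301", "303-555-0102"),
    ],
    "TX": [
        ("Smith, Bob", "TX", "73301", "512-555-0103"),
    ],
}

# One logical table built from all of the state files.
listings = [row for rows in state_files.values() for row in rows]

# Two orderings of the same data: by name and by geography.
by_name = sorted(listings, key=lambda row: row[0])
by_geo = sorted(listings, key=lambda row: (row[1], row[2]))
name_keys = [row[0] for row in by_name]


def lookup_by_name_prefix(prefix):
    """Subset the name-ordered table with two binary searches."""
    lo = bisect_left(name_keys, prefix)
    hi = bisect_left(name_keys, prefix + "\uffff")
    return by_name[lo:hi]


# Assumed segmentation rule: one segment per leading letter of the name.
# Each segment could be stored as its own small physical file.
segments = defaultdict(list)
for row in by_name:
    segments[row[0][0]].append(row)

smiths = lookup_by_name_prefix("Smith")
```

Because each ordering is already sorted, a lookup touches only a contiguous slice of one table, and the segments can be searched independently of one another.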

This naturalized data would make incredibly inefficient use of a conventional RDBMS; however, in MPbase this same structure is extremely efficient. In the phone book example above, the space used went from an ASCII comma-separated value (CSV) size of 12.3 gigabytes to an MPbase size of 1.7 gigabytes. It is important to remember that small size is a byproduct; the goal is speed.
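For a sense of scale, the two sizes quoted above work out as follows. The figures are taken from the text; only the variable names are introduced here.

```python
csv_gb = 12.3     # ASCII CSV size quoted above, in gigabytes
mpbase_gb = 1.7   # MPbase size quoted above, in gigabytes

ratio = csv_gb / mpbase_gb
reduction_pct = (1 - mpbase_gb / csv_gb) * 100

print(f"{ratio:.1f}x smaller, {reduction_pct:.0f}% less space")
# prints "7.2x smaller, 86% less space"
```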

The last example applies to data extracted in a denormalized form. While helpful, it does not directly address the issue of what happens to table count when converting from a traditional RDBMS to MPbase. A better example in this case is TIGER, the U.S. Census Bureau's mapping database. The document describing this database is over 270 pages long. The data comes in zipped archives by county.

The database is in a normalized form containing 17 tables. The first step was to fully de-normalize the data. This produced just two ASCII CSV files. These files were then naturalized, producing one table and two indexes. This is supported by only three physical files for each county. Seventeen files with over five Mbytes of data produced three files of less than 400 Kbytes total.

It is generally understood that any query needing hundreds of joins is inherently slow and resource-intensive. MPbase overcomes this issue by drastically reducing the number of logical tables in a database. At the same time naturalization provides a substantial performance boost over and above the reduction in join-processing. When added to the decreased storage requirements, this presents the best possible solution for any large or complex system.

© 1998-2004 NPSI