The process used in preparing data to be loaded into a relational
database is called normalization. This process involves taking
a single-source file and breaking it up into several related tables.
The goal for each table is to contain only those data fields (columns,
in RDBMS terms) that reference "the key," "the
whole key" and "nothing but the key." In addition,
repeating fields are forbidden. This results in queries normally
needing to join several tables to produce a useful result.
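As a rough illustration of the normalization step described above, the sketch below splits a flat single-source file into two related tables and then joins them back together. The customer/order layout, the field names, and the `normalize` and `join` helpers are assumptions made up for this example; they are not taken from any particular RDBMS.

```python
# Hypothetical single-source rows: (customer_id, customer_name, order_id, item)
flat_rows = [
    (1, "Ada", 100, "disk"),
    (1, "Ada", 101, "tape"),
    (2, "Bob", 102, "disk"),
]

def normalize(rows):
    """Split flat rows into a customers table and an orders table."""
    customers = {}  # customer_id -> name (depends only on the customer key)
    orders = {}     # order_id -> (customer_id, item)
    for cust_id, name, order_id, item in rows:
        customers[cust_id] = name          # the repeated name is stored once
        orders[order_id] = (cust_id, item)
    return customers, orders

def join(customers, orders):
    """Rebuild the flat rows by joining orders to customers on customer_id."""
    return sorted(
        (cust_id, customers[cust_id], order_id, item)
        for order_id, (cust_id, item) in orders.items()
    )

customers, orders = normalize(flat_rows)
rejoined = join(customers, orders)
```

Even in this tiny case, getting back a row that is useful to a person requires the join, which is the cost the paragraph above points to.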
The process used in preparing data to be loaded into MPbase
is called naturalization. This process involves taking several
source files and combining them into a very small number of tables.
The goal for each table is to contain all of the data fields that
relate to one natural order. This results in queries normally
needing to subset one or maybe two such tables to produce a useful
result.
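The contrast with normalization can be sketched in a few lines. The example below is only a simplified stand-in: it assumes that "natural order" can be approximated by sorting on a name field, and the in-memory CSV sources, the field names, and the `combine` and `subset` helpers are all hypothetical. It does not show how MPbase itself orders or stores data.

```python
import csv
import io

# Hypothetical per-state source files, held in memory as CSV text.
sources = {
    "OH": "name,phone\nSmith,555-0101\nAdams,555-0102\n",
    "TX": "name,phone\nJones,555-0201\nBaker,555-0202\n",
}

def combine(sources):
    """Merge several source files into one table ordered by name."""
    rows = []
    for state, text in sources.items():
        for rec in csv.DictReader(io.StringIO(text)):
            rec["state"] = state  # keep the origin as an ordinary column
            rows.append(rec)
    # Assumption: sorting by name stands in for the "natural order".
    rows.sort(key=lambda r: r["name"])
    return rows

def subset(table, first, last):
    """Answer a query by taking a slice of the single table; no join needed."""
    return [r for r in table if first <= r["name"] <= last]

table = combine(sources)
a_to_j = subset(table, "A", "K")
```

Here the query is a subset of one table rather than a join across several, which is the access pattern the paragraph above describes.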
While the rules defining relational databases can be beneficial
to users and programmers, they are a nightmare for a DBMS's internals.
MPbase's solution is to present the user with a de-coupled
view of the data. What the user sees has very little to do with
how the information is physically stored. This allows "views"
of the data to look, act and feel relational, even though the
database internally is not.
As an added advantage, the same MPbase can have different
views supporting different applications and users. These views
can be "relational," "hierarchical," or "multidimensional."
In addition, the views can support different data formats and
communication protocols. So an MVS mainframe may be looking at
a hierarchical view in EBCDIC from TSO, while at the same time
a Windows PC is seeing the same information as relational ASCII
from a Visual Basic application.
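The idea of one stored record being presented through different views can be sketched as follows. This is an illustration only: the record layout and the two view functions are hypothetical, and Python's standard `cp037` codec is used as one common EBCDIC code page. Nothing here reflects MPbase's actual interfaces.

```python
# One stored record, independent of how any client wants to see it.
record = {"name": "SMITH", "phone": "555-0101"}

def relational_ascii_view(rec):
    """Render the record as a flat comma-separated row in ASCII bytes."""
    return ",".join([rec["name"], rec["phone"]]).encode("ascii")

def hierarchical_ebcdic_view(rec):
    """Render the record as a parent/child path in EBCDIC (code page 037)."""
    text = rec["name"] + "/PHONE=" + rec["phone"]
    return text.encode("cp037")

ascii_row = relational_ascii_view(record)
ebcdic_node = hierarchical_ebcdic_view(record)
```

Both byte strings carry the same information; only the shape and the character encoding differ, so each client can be served in the form it expects.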
The process used to prepare data for loading into MPbase
is unique among computerized information systems. It most closely
resembles the process used by file clerks in the days when businesses
were run with people and paper. Through a set of statistical functions
the process attempts to find the most natural order for a given
set of data. This ordering allows a natural segmentation of the
data into a massively parallel framework.
At the same time, this process provides, as a by-product, information
about any data that does not fit into the natural order. For example,
this includes the data in the tails of the classic bell
curve. This information is invariably one of two types: data in
error or interesting data. In either case, this is data you need
to know about.
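The text does not say which statistical functions are used, so the following is only one plausible sketch of separating the data that fits from the data in the tails. It assumes a single numeric field and a cutoff of two standard deviations from the mean; the `split_by_tails` helper, the cutoff `k`, and the sample values are all illustrative.

```python
import statistics

def split_by_tails(values, k=2.0):
    """Separate values near the mean from those more than k std devs away."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    fits, tails = [], []
    for v in values:
        if stdev > 0 and abs(v - mean) > k * stdev:
            tails.append(v)   # candidate error or interesting data
        else:
            fits.append(v)
    return fits, tails

# Hypothetical measurements with one value far from the rest.
values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
fits, tails = split_by_tails(values)
```

A value that lands in `tails` is not automatically wrong; as the text notes, it is either an error or something worth a closer look.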
A quick example is NPSI's phone book demo. Start with 50 files,
one for each state. Create two linked tables, one in natural order
by name, the second by geography. Produce 80,000+ physical files
supporting the two tables. Now, what you have is one logical table
viewable as any type of database that makes sense to your application.
This naturalized data would produce incredibly inefficient use
of a conventional RDBMS; however, in MPbase this same structure
is extremely efficient. In the phone book example above, the space
used went from an ASCII comma-separated value (CSV) size of 12.3
gigabytes to an MPbase size of 1.7 gigabytes. It is important
to remember that small size is a by-product; the goal is speed.
The preceding example applies to data extracted in a denormalized form.
While helpful, it does not directly address the issue of what
happens to table count when converting from a traditional RDBMS
to MPbase. A better example in this case is TIGER, the
U.S. Census Bureau's mapping database. The
document describing this database is over 270 pages. The data
comes in zipped archives by county.
The database is in a normalized form containing 17 tables. The
first step was to fully de-normalize the data. This produced just
two ASCII CSV files. These files were then naturalized, producing
one table and two indexes. This is supported by only three physical
files for each county. Seventeen files with over five Mbytes of
data produced three files of less than 400 Kbytes total.
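To put the two sets of size figures side by side, the arithmetic can be worked through as below. The phone book numbers are the ones quoted earlier (12.3 gigabytes down to 1.7 gigabytes). For TIGER the text gives only bounds ("over five Mbytes" and "less than 400 Kbytes"), so the sketch assumes exactly 5,000 Kbytes and 400 Kbytes, which makes the result a lower bound on the actual reduction. The `compression` helper is just a name chosen for this example.

```python
def compression(before, after):
    """Return (size ratio, fraction of space saved) for two sizes in the same unit."""
    ratio = before / after
    saved = 1 - after / before
    return ratio, saved

# Phone book demo: 12.3 GB of CSV down to 1.7 GB.
phone_ratio, phone_saved = compression(12.3, 1.7)

# TIGER, per county: assumed 5,000 KB down to 400 KB (bounds from the text).
tiger_ratio, tiger_saved = compression(5000, 400)
```

Under these assumptions the phone book data shrinks by a factor of roughly 7 and the TIGER data by a factor of at least 12.5.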
It is generally understood that any query needing hundreds of joins is inherently slow and resource-intensive. MPbase overcomes this issue by drastically reducing the number of logical tables in a database. At the same time, naturalization provides a substantial performance boost over and above the reduction in join processing. Combined with the decreased storage requirements, this presents the best possible solution for any large or complex system.
© 1998-2004 NPSI