Tuesday, October 21, 2008

normalization

Database normalization, sometimes referred to as canonical synthesis, is a technique for designing relational database tables to minimize duplication of information and, in so doing, to safeguard the database against certain types of logical or structural problems, namely data anomalies. For example, when multiple instances of a given piece of information occur in a table, the possibility exists that these instances will not be kept consistent when the data within the table is updated, leading to a loss of data integrity. A table that is sufficiently normalized is less vulnerable to problems of this kind, because its structure reflects the basic assumptions for when multiple instances of the same information should be represented by a single instance only.

Higher degrees of normalization typically involve more tables and create the need for a larger number of joins, which can reduce performance. Accordingly, less highly normalized tables are typically used in database applications involving many isolated transactions (e.g. an automated teller machine), while more normalized tables tend to be used in database applications that need to map complex relationships between data entities and data attributes (e.g. a reporting application, or a full-text search application).[citation needed]

Database theory describes a table's degree of normalization in terms of normal forms of successively higher degrees of strictness. A table in third normal form (3NF), for example, is consequently in second normal form (2NF) as well; but the reverse is not necessarily the case.

Although the normal forms are often defined informally in terms of the characteristics of tables, rigorous definitions of the normal forms are concerned with the characteristics of mathematical constructs known as relations. Whenever information is represented relationally, it is meaningful to consider the extent to which the representation is normalized.

Normal forms

The normal forms (abbrev. NF) of relational database theory provide criteria for determining a table's degree of vulnerability to logical inconsistencies and anomalies. The higher the normal form applicable to a table, the less vulnerable it is to inconsistencies and anomalies. Each table has a "highest normal form" (HNF): by definition, a table always meets the requirements of its HNF and of all normal forms lower than its HNF; also by definition, a table fails to meet the requirements of any normal form higher than its HNF.

The normal forms are applicable to individual tables; to say that an entire database is in normal form n is to say that all of its tables are in normal form n.

Newcomers to database design sometimes suppose that normalization proceeds in an iterative fashion, i.e. a 1NF design is first normalized to 2NF, then to 3NF, and so on. This is not an accurate description of how normalization typically works. A sensibly designed table is likely to be in 3NF on the first attempt; furthermore, if it is 3NF, it is overwhelmingly likely to have an HNF of 5NF. Achieving the "higher" normal forms (above 3NF) does not usually require an extra expenditure of effort on the part of the designer, because 3NF tables usually need no modification to meet the requirements of these higher normal forms.

Edgar F. Codd originally defined the first three normal forms (1NF, 2NF, and 3NF). These normal forms have been summarized as requiring that all non-key attributes be dependent on "the key, the whole key and nothing but the key". The fourth and fifth normal forms (4NF and 5NF) deal specifically with the representation of many-to-many and one-to-many relationships among attributes. Sixth normal form (6NF) incorporates considerations relevant to temporal databases.

[edit] First normal form

Main article: First normal form

A table is in first normal form (1NF) if and only if it represents a relation.[3] Given that database tables embody a relation-like form, the defining characteristic of one in first normal form is that it does not allow duplicate rows or nulls. Simply put, a table with a unique key (which, by definition, prevents duplicate rows) and without any nullable columns is in 1NF.

Note that the restriction on nullable columns as a 1NF requirement, as espoused by Chris Date, et. al., is controversial. This particular requirement for 1NF is a direct contradiction to Dr. Codd's vision of the relational database, in which he stated that "null values" must be supported in a fully relational DBMS in order to represent "missing information and inapplicable information in a systematic way, independent of data type."[4] By redefining 1NF to exclude nullable columns in 1NF, no level of normalization can ever be achieved unless all nullable columns are completely eliminated from the entire database. This is in line with Date's and Darwen's vision of the perfect relational database, but can introduce additional complexities in SQL databases to the point of impracticality.[5]

One requirement of a relation is that every table contains exactly one value for each attribute. This is sometimes expressed as "no repeating groups"[6]. While that statement itself is axiomatic, experts disagree about what qualifies as a "repeating group", in particular whether a value may be a relation value; thus the precise definition of 1NF is the subject of some controversy. Notwithstanding, this theoretical uncertainty applies to relations, not tables. Table manifestations are intrinsically free of variable repeating groups because they are structurally constrained to the same number of columns in all rows.

Put at its simplest; when applying 1NF to a database, every record must be the same length. This means that each record has the same number of fields, and none of them contains a null value.

[edit] Second normal form

Main article: Second normal form

The criteria for second normal form (2NF) are:

  • The table must be in 1NF.
  • None of the non-prime attributes of the table are functionally dependent on a part (proper subset) of a candidate key; in other words, all functional dependencies of non-prime attributes on candidate keys are full functional dependencies.[7] For example, consider an "Employees' Skills" table whose attributes are Employee ID, Employee Name, and Skill; and suppose that the combination of Employee ID and Skill uniquely identifies records within the table. Given that Employee Name depends on only one of those attributes – namely, Employee ID – the table is not in 2NF.
  • In simple terms, a table is 2NF if it is in 1NF and all fields are dependent on the whole of the primary key, or a relation is in 2NF if it is in 1NF and every non-key attribute is fully dependent on each candidate key of the relation.
  • Note that if none of a 1NF table's candidate keys are composite – i.e. every candidate key consists of just one attribute – then we can say immediately that the table is in 2NF.
  • All columns must be a fact about the entire key, and not a subset of the key.

[edit] Third normal form

Main article: Third normal form

The criteria for third normal form (3NF) are:

  • The table must be in 2NF.
  • Every non-key attribute must be non-transitively dependent on the primary key.

All attributes must rely only on the primary key. So, if a database has a table with columns Student ID, Student, Company, and Company Phone Number, it is not in 3NF. This is because the Phone number relies on the Company. So, for it to be in 3NF, there must be a second table with Company and Company Phone Number columns; the Phone Number column in the first table would be removed.

[edit] Boyce-Codd normal form

A table is in Boyce-Codd normal form (BCNF) if and only if, for every one of its non-trivial functional dependencies X → Y, X is a superkey—that is, X is either a candidate key or a superset thereof.[8]

[edit] Fourth normal form

Main article: Fourth normal form

A table is in fourth normal form (4NF) if and only if, for every one of its non-trivial multivalued dependencies X Image:twoheadrightarrow.gif Y, X is a superkey—that is, X is either a candidate key or a superset thereof.[9]

  • For example, if you can have two phone numbers values and two email address values, then you should not have them in the same table.

[edit] Fifth normal form

Main article: Fifth normal form

The criteria for fifth normal form (5NF and also PJ/NF) are:

  • The table must be in 4NF.
  • There must be no non-trivial join dependencies that do not follow from the key constraints. A 4NF table is said to be in the 5NF if and only if every join dependency in it is implied by the candidate keys.

[edit] Domain/key normal form

Domain/key normal form (or DKNF) requires that a table not be subject to any constraints other than domain constraints and key constraints.

[edit] Sixth normal form

According to the definition by Christopher J. Date and others, who extended database theory to take account of temporal and other interval data, a table is in sixth normal form (6NF) if and only if it satisfies no non-trivial (in the formal sense) join dependencies at all,[10], meaning that the fifth normal form is also satisfied. When referring to "join" in this context it should be noted that Date et al. additionally use generalized definitions of relational operators that also take account of interval data (e.g. from-date to-date) by conceptually breaking them down ("unpacking" them) into atomic units (e.g. individual days), with defined rules for joining interval data, for instance.[11]

Sixth normal form is intended to decompose relation variables to irreducible components. Though this may be relatively unimportant for non-temporal relation variables, it can be important when dealing with temporal variables or other interval data. For instance, if a relation comprises a supplier's name, status, and city, we may also want to add temporal data, such as the time during which these values are, or were, valid (e.g. for historical data) but the three values may vary independently of each other and at different rates. We may, for instance, wish to trace the history of changes to Status.

For further discussion on Temporal Aggregation in SQL, see also Zimyani, [12] For a non-relational approach, see TSQL2.

In a different meaning, sixth normal form may also be used by some to refer to Domain/key normal form (DKNF).

No comments: