William Kent, "Fact-Based Data Analysis and Design", Entity-Relationship Approach to Software Engineering, North Holland, 1983 (Davis, Jajodia, Ng, Yeh, eds.). (Proc. Third Intl. Conf. on Entity-Relationship Approach, Oct. 6-8, 1983, Anaheim, California.) Also in Australasian Share/Guide Newsletter No. 15, March 1984. Also in Proc. Symposium on Data Base Management Systems, Nov. 15-17, 1983, Sydney, Australia. Also in Journal of Systems and Software 4(2,3), July 1984,99-122. Also IBM Technical Report TR03.235, July 1983. [48 pp]


Fact-Based Data Analysis and Design

William Kent
1983 


> 1 INTRODUCTION . . . 3
>> 1.1 Introductory Example . . . 5
>>> 1.1.1 Specify the Facts to be Maintained . . . 5
>>> 1.1.2 Generate a "Pseudo-Record" For Each Fact . . . 5
>>> 1.1.3 Identify "Pseudo-keys" For Each Pseudo-record . . . 6
>>> 1.1.4 Merge Pseudo-Records Having Compatible Keys . . . 7
>>> 1.1.5 Assign Representations . . . 8
>>> 1.1.6 Name the Fields and the Record Types . . . 9
>>> 1.1.7 Consider Alternatives . . . 10
>> 1.2 A Skeleton of the Methodology . . . 11
> 2 THE UNDERLYING IDEAS . . . 13
>> 2.1 Facts . . . 13
>>> 2.1.1 Facts Connect Things . . . 13
>>> 2.1.2 How Many Things? . . . 14
>>> 2.1.3 Why They're Connected . . . 14
>>> 2.1.4 The Roles They Play . . . 15
>>> 2.1.5 Example . . . 15
>> 2.2 Entities and Relationships . . . 16
>>> 2.2.1 Entities . . . 16
>>>> 2.2.1.1 Instances and Types . . . 16
>>>> 2.2.1.2 Subtypes . . . 17
>>> 2.2.2 Relationships . . . 17
>>>> 2.2.2.1 Participation . . . 17
>>>> 2.2.2.2 Unary Relationships . . . 18
>>>> 2.2.2.3 Attributes of Relationships (Facts About Facts) . . . 19
>>> 2.2.3 The Distinction Between Facts and Relationships . . . 19
>> 2.3 Representation . . . 19
>>> 2.3.1 Why Represent? . . . 20
>>> 2.3.2 How Represent? . . . 20
>>>> 2.3.2.1 Symbol Types . . . 21
>>>> 2.3.2.2 Symbol Facts . . . 21
>>> 2.3.3 Other Representation Techniques . . . 22
>>>> 2.3.3.1 Structured Symbols . . . 22
>>>> 2.3.3.2 Derived and Indirect Representations . . . 23
>>>> 2.3.3.3 Non-Representation . . . 23
>>>> 2.3.3.4 Compressed Representations . . . 24
>>>> 2.3.3.5 Mapping Techniques . . . 24
>>> 2.3.4 "Quality" of the Representation . . . 24
>>>> 2.3.4.1 Existence . . . 24
>>>> 2.3.4.2 Completeness . . . 25
>>>> 2.3.4.3 Uniqueness . . . 25
>>>> 2.3.4.4 Singularity . . . 25
>>>> 2.3.4.5 Other Qualities . . . 26
>>> 2.3.5 Using Combinations of Facts . . . 27
>> 2.4 Other Design Concepts . . . 28
>>> 2.4.1 Entities and Records . . . 28
>>> 2.4.2 Existence Management . . . 29
>>> 2.4.3 Complex Situations . . . 30
>>> 2.4.4 Normal Forms and Normalization . . . 30
>>> 2.4.5 Derivable Facts . . . 30
> 3 THE DESIGN PROCESS . . . 31
>> 3.1 Specify the Facts to be Maintained . . . 31
>>> 3.1.1 Examples . . . 32
>> 3.2 Generate a "Pseudo-Record" for Each Fact . . . 33
>>> 3.2.1 A Shortcut . . . 35
>> 3.3 Identify "Pseudo-keys" For Each Pseudo-record . . . 35
>> 3.4 Merge Pseudo-records Having Compatible Keys . . . 36
>>> 3.4.1 Merging on Simple Keys . . . 37
>>>> 3.4.1.1 Ideal Merges . . . 37
>>>> 3.4.1.2 Unequal Populations . . . 38
>>>> 3.4.1.3 Subtypes . . . 39
>>>> 3.4.1.4 Summary . . . 39
>>> 3.4.2 Merging on Compound Keys . . . 39
>>> 3.4.3 Pseudo-Records With Multiple Keys . . . 40
>>> 3.4.4 Merging is Optional . . . 41
>>> 3.4.5 Merge vs. Join . . . 42
>> 3.5 Assign Representations . . . 42
>>> 3.5.1 Example . . . 43
>> 3.6 Alternative Designs . . . 46
>> 3.7 Naming Fields and Record Types . . . 47
>> 3.8 Iteration . . . 48
> 4 CONCLUSION . . . 48
>> 4.1 Highlights of This Approach . . . 48
>> 4.2 Indirect Benefits . . . 49
>>> 4.2.1 Conceptual Models . . . 49
>>> 4.2.2 Data Documentation . . . 49
>>> 4.2.3 Understanding the Application . . . 49
>> 4.3 Convergence With Other Work . . . 49
> 5 REFERENCES . . . 50


1 INTRODUCTION

By focusing on the facts to be maintained in a database, we obtain a methodology for data analysis and design which is at once simpler and more powerful than other methodologies.

The methodology is made simple by:

In addition, we provide extended capabilities to deal with more difficult representation situations, and to examine alternative data designs.

We focus on the middle portion of the analysis and design process: translating entities and relationships into record designs. This paper does not describe procedures for identifying the entities and relationships, nor does it deal with physical data design.

We go a little beyond what might ordinarily be considered "logical data design". While this methodology can produce records in third or even fifth normal forms [K4], our aim is to produce the actual record designs that will be implemented in a database, even if they might be inelegant or redundant in some ways. What we do not deal with are the other aspects of resource and access path management, such as storage mappings and index schemes.

We deal only with flat files. The methodology applies to relational databases, to other "flat file" data management systems, and to the design of the segment types in a hierarchical database or the record types in a network database. Non-flat data models (e.g., hierarchies and networks) will require additional design steps to replace some fields with structural paths (pointers).

The methodology has not yet been developed to the point of a detailed procedure. This paper describes the concepts on which the methodology is based, and the major steps in the methodology.

The first step is to identify the facts that are to be maintained, in terms of entities and relationships. Thereafter, the methodology essentially

These two results are accomplished together, in the following steps:

Other steps in the methodology include:

While the approach is fundamentally simpler than others, it sometimes appears more elaborate because it deals with more complex and realistic situations, including:

In the following sections, we will:

1.1 Introductory Example

This simple example illustrates the essential framework of the methodology. Many refinements are not taken into account here.

1.1.1 Specify the Facts to be Maintained

The facts:

We present the same facts in diagram form:

has-emp# has-name has-dept# has-name ........ ........ ......... ........ : : : : : : : : EMP EMP# EMP NAME DEPT DEPT# DEPT NAME assigned earns manages ........ ....... ....... : : : : : : EMP DEPT EMP MONEY EMP DEPT

Such diagrams of the facts lead almost mechanically into the next step...

1.1.2 Generate a "Pseudo-Record" For Each Fact

has-emp# has-name has-dept# has-name ........ ........ ......... ........ : : : : : : : : -------------- -------------- ---------------- --------------- | EMP | EMP# | | EMP | NAME | | DEPT | DEPT# | | DEPT | NAME | -------------- -------------- ---------------- --------------- 1 2 3 4 5 6 7 8 assigned earns manages ........ ....... ....... : : : : : : -------------- --------------- -------------- | EMP | DEPT | | EMP | MONEY | | EMP | DEPT | -------------- --------------- -------------- 9 10 11 12 13 14

Representations will be dealt with in a general way at a later step. But when some of the facts serve to assign obvious unique identifiers to things, then we can take a shortcut at this point by replacing such things with their identifiers. We do this now for employees and departments:

has-name has-name ........ ........ : : : : --------------- ---------------- | EMP# | NAME | | DEPT# | NAME | --------------- ---------------- 3 4 7 8 assigned earns manages ........ ....... ....... : : : : : : ---------------- ---------------- ---------------- | EMP# | DEPT# | | EMP# | MONEY | | EMP# | DEPT# | ---------------- ---------------- ---------------- 9 10 11 12 13 14

1.1.3 Identify "Pseudo-Keys" For Each Pseudo-Record

The concept: if we merely assume that each of these "fields" actually contains a unique identifier, then what would be the keys in each pseudo-record?

We use "====" to indicate the pseudo-keys.

has-name has-name ........ ........ : : : : --------------- ---------------- | EMP# | NAME | | DEPT# | NAME | --====--------- --=====--------- 3 4 7 8 assigned earns manages ........ ....... ....... : : : : : : ---------------- ---------------- ---------------- | EMP# | DEPT# | | EMP# | MONEY | | EMP# | DEPT# | --====---------- --====---------- --====---=====-- 9 10 11 12 13 14

Some pseudo-records may have several such keys, when one-to-one relationships are involved. Here, we see that each department (14) is managed by (at most) one employee (13), and each employee manages (at most) one department.

1.1.4 Merge Pseudo-Records Having Compatible Keys

For this example, key fields involving the same entity types are compatible. Merging all the pseudo-records having EMP# as the key yields:

manages ..................................... : : : earns : : ........................... : : : : : : : assigned : : : : ................. : : : : : : : : : : : has-name : : : : : : ........ : : : : : : : : : : : --------------------------------- - - - - | EMP# | NAME | DEPT# | MONEY | DEPT# | --======------------------------- - - - - 3 9 11(13) 4 10 12 (14)

Merging all the pseudo-records having DEPT# as the key yields:

manages ................ : : : has-name : : ........ : : : : : ----------------------- | DEPT# | NAME | EMP# | --=====----------====-- 7 14 8 13

Note that the "manages" pseudo-record (13-14) can be merged in two ways, based on either of its keys. For the purpose of this example, we choose to keep the fact in the department record. The resulting department record could be keyed either by department or by manager, since managers are in one-to-one correspondence with their departments.

1.1.5 Assign Representations

In the general case, some restructuring could occur at this point to deal with representation problems. In the present simple example, however, we observe that all fields except MONEY (12) already contain character strings that can serve as representations in the data. All that remains here is to specify a representation for field (12). We assume here that it will be expressed in dollars, in integers.

earns ........................... : : : assigned : : ................. : : : : : : : has-name : : : : ........ : : : : : : : : --------------------------------- | EMP# | NAME | DEPT# | MONEY | --======------------------------- 3 9 11 4 10 12 Representations: EMP# NAME DEPT# INTGR-DLR manages ................ : : : has-name : : ........ : : : : : ----------------------- | DEPT# | NAME | EMP# | --=====----------====-- 7 14 8 13 Representations: DEPT# NAME EMP#

1.1.6 Name the Fields and the Record Types

---------------------------------- "EMPLOYEE RECORD": | EMPNUM | ENAME | DEPT | SALARY | --======-------------------------- 3 9 11 4 10 12 Representations: EMP# NAME DEPT# INTGR-DLR ------------------------- "DEPARTMENT RECORD": | DEPTNUM | DNAME | MGR | --=======-----------===-- 7 14 8 13 Representations: DEPT# NAME EMP#

We have assigned arbitrary illustrative names for the purposes of this example. In general practice, there is no systematic discipline for the naming of fields and records.

1.1.7 Consider Alternatives

This step would actually precede the prior one, since there is little point in naming fields until the final field configuration is chosen. However, it is easier to illustrate the design alternatives using the unique and mnemonic field names we just chose.

We show a number of alternative record designs which could be considered for representing the same set of facts:

1. The current design:

-------------------------- ------------------- |EMPNUM|ENAME|DEPT|SALARY| |DEPTNUM|DNAME|MGR| -======------------------- -=======-------===-

2. Subdivide the employee name field:

----------------------------------- ------------------- |EMPNUM|FIRST|MID|LAST|DEPT|SALARY| |DEPTNUM|DNAME|MGR| -======---------------------------- -=======-------===-

3. Keep the "manages" fact in the employee record, i.e., identify the department which the employee manages (possibly none):

----------------------------------- --------------- |EMPNUM|ENAME|DEPT|SALARY|MGS-DEPT| |DEPTNUM|DNAME| -======-------------------========- -=======-------

MGS-DEPT contains a department number or a blank. This design alternative is only valid if all departments have managers, or null values are permitted in the EMPNUM key field.

4. Put all the department information into the record of the department's manager, eliminating the separate department record (subject to the same restriction):

----------------------------------------- |EMPNUM|ENAME|DEPT|SALARY|MGS-DEPT|DNAME| -======-------------------========-------

5. If we have the additional constraint that a manager belongs to the department he manages, then the "manages" fact could be kept by a simple flag in the employee record:

----------------------------------- --------------- |EMPNUM|ENAME|DEPT|SALARY|MGR-FLAG| |DEPTNUM|DNAME| -======---------------------------- -=======-------

MGR-FLAG is a simple binary flag indicating whether the employee manages his department.

6. We could create a separate record type for employees of each department, eliminating the need for a DEPT field:

--------------------- ------------------- D01 EMPLOYEES: |EMPNUM|ENAME|SALARY| |DEPTNUM|DNAME|MGR| -======-------------- -=======-------===- --------------------- D02 EMPLOYEES: |EMPNUM|ENAME|SALARY| -======--------------
.
.
.

Some of these alternatives can also be combined, e.g., maintaining the "manages" fact in both the department record and the employee record.

1.2 A Skeleton of the Methodology

Subsequent descriptions of the methodology will be interspersed with definitions and explanations of background concepts, perhaps making it appear more elaborate then it really is. We present here a complete skeleton of the methodology, without benefit of definitions or explanations, to keep its overall extent in perspective.

2 THE UNDERLYING IDEAS

2.1 Facts

We have used the term "fact" to mean something like "employees are assigned to departments". Strictly speaking, we should use a term like "fact type" or "fact pattern" here, and reserve the term "fact" for a specific item of information, such as "Joe Smith is assigned to the Sales department". However, wherever the intent is clear, we will continue to use the term "fact" in the broader sense.

A data record usually contains a number of facts, such as an employee's department, salary, etc. In our approach to data analysis, we first identify the facts to be maintained, then aggregate them into data records.

Most methodologies suggest that data represents two kinds of facts: facts about things (attributes) and facts that connect things (relationships). Facts about an employee might include his salary, department, birthdate, and birthplace. Connecting facts might connect an employee to a project, or to his/her spouse.

If you're paying attention, you should be confused at this point. How did we decide that an employee's department was a fact about the employee, rather than a fact connecting two things? Why did we also decide that about his birthplace? I don't know the answer to that, and I don't think anyone else can give you a really effective answer either.

It doesn't have to matter.

2.1.1 Facts Connect Things

We can avoid the question altogether by treating all facts as connections between things. This may seem surprising at first, and it does take a bit of getting used to. But it simplifies things. Why shouldn't we think of some money as a thing, with the salary fact connecting that thing to an employee? We can just as easily think of dates or distances as things; then birthdays and heights become facts connecting such things to employees. Later on we shall see that identifiers can also be treated in the same way.

----- earns -------------- | |----------->| money | | | -------------- | | assigned -------------- | |----------->| department | | | -------------- | | born on -------------- | e |----------->| date | | m | -------------- | p | born at -------------- | l |----------->| place | | o | -------------- | y | works on -------------- | e |----------->| project | | e | -------------- | | married to -------------- | |----------->| person | | | -------------- | | height -------------- | |----------->| distance | ----- --------------

We seem to have the idea in mind already, in some of the ways we talk about things. When we say that "6 feet" and "72 inches" mean "the same thing", what do we mean? It isn't the character strings that are the same. Rather, there is some other thing that those character strings both stand for. The thing happens to be a distance, and the "height" fact connects that thing to some people.

That's what we focus on in this methodology: facts that connect things together.

2.1.2 How Many Things?

Not all facts involve pairs of things. Facts can connect any number of things simultaneously. Consider the fact that John bought a certain computer from Sears. That involves three things: John, the computer, and Sears. This fact can't be replaced by the two simpler facts that John bought a computer and John shops at Sears. He may have bought the computer elsewhere, and he may buy other things at Sears. Thus this fact has to be treated as a unit involving three things. In general, facts can involve any number of things.

2.1.3 Why They're Connected

We usually have to say what kind of connection the fact represents. It's not enough to say that a fact connects an employee to a department, or to a date, or to a distance. We have to say whether he's assigned to or loaned to or manages the department, whether the date is his birthdate or hire date, and whether the distance is his height or something else.

In ordinary language, we don't always have to be explicit about these things. Sometimes we can guess correctly: an employee's department almost always means the department to which he is assigned. Sometimes we use a word that tells us two things: "birthday" says the thing is a day and the connection is "born on"; "height" tells us that it is a distance and also how it relates to a person; "salary" tells us that it is money and that it is earned. Sometimes we can't even find two different words for the two ideas: "color" is the name of an entity type (kind of thing), and it is also the name of its relationship to colored things (at best, we could call the relationship "color-of", or "has-color"). But in general we need to know at least these things about a fact: what it's connecting (employee and money) and why they're connected (it's his salary).

2.1.4 The Roles They Play

Sometimes we have to clarify a third thing about a fact: what role each thing plays in the fact. This is particularly necessary when the fact connects several things of the same kind. We may have two employees connected by a "manages" relationship, or two parts connected by a "contains" relationship, but that's not enough information. We have to know who manages who, or which part contains which. One way to do this is to assign a sense of direction to the connection. We can say that the "manages" relationship goes from the manager to the employee, and the "contains" relationship goes from the assembly to the component.

However, direction is difficult to deal with when the fact involves more than two things. When John buys a computer from Sears, which is the "from" and which is the "to"?

A more general approach is to use role names, like "manager" and "managee", or "component" and "assembly", or "buyer", "seller", and "commodity". The use of role names actually covers directed relationships as well, since we can always use "from" and "to" as role names.

As before, we don't always feel the need to be explicit about roles. Very often, we can use the entity type names as default role names: in an "assigned" relationship, there is no doubt about the roles played by the employee and the department.

2.1.5 Example

Let's put this all together into an example that illustrates various aspects of a fact. In today's complex economic world, some companies own other companies, and they might sell such companies to each other. If we wanted to keep track of which sold which to which, we might need this form of the "sale" relationship:

relationship: SALE ................... : : : roles: BUYER SELLER COMMODITY : : : entity types: COMPANY COMPANY COMPANY

One additional aspect of facts will be covered later: how things are represented. We haven't said anything in this example as to how the companies are to be identified. It might be by name, by stock exchange symbol, by some taxpayer code, by some other conventional abbreviations, etc.

2.2 Entities and Relationships

We have identified the following things to say about a fact:

2.2.1 Entities

The "kind" of thing involved in a fact is what we mean by the term entity type.

This term gets used in several ways, which can be confusing. It is sometimes expected that employees and departments are entities, but money and colors and dates are not. We don't make that distinction. It gets to be too troublesome in many cases. Children and birthplaces, for example, are often not considered entities, but we're hard pressed to explain why.

We treat all these things as entities. Any person, place, thing, or idea is an entity. Anything to which a noun or noun phrase can refer is an entity. That simplifies things for us. We don't have to worry about which are and which aren't entities. In terms of data, it is convenient to say that all data fields refer to entities. We will later explore why other design approaches want to distinguish between entities and non-entities.

2.2.1.1 Instances and Types

The term entity, as we use it, refers to an individual instance, or occurrence. You and I are entities, being occurrences of the entity type "person".

In general, database design and description deal with types, e.g., "employee" and "department", while the database itself mentions the individual occurrences, namely the individual employees and departments.

As with most such generalizations, there are exceptions. Sometimes individuals are mentioned in the database description, when constraints specify the allowed values for a field, such as the set of valid colors, or the valid sexes. Furthermore, as we shall mention later, when the population happens to consist of a single individual, it may not be mentioned anywhere at all.

Conversely, we sometimes name types in the database. Most of the time, for example, when we talk about "parts" we really mean part types. "Part number" generally identifies a part type, whereas it is "serial number" that really identifies individual parts. In an inventory record, the PART field doesn't tell us which part we're counting (one part is one part, after all); it tells us the type of part we are counting.

2.2.1.2 Subtypes

Later on, when we discuss the merging of pseudo-records, we will be interested in entity subtypes. For example, we might be interested in merging data about salesmen with general records about employees. Therefore it will be useful to specify which entity types are subsets of which others. One entity type is a subtype (or subset) of another entity type whenever all members of the first are always members of the second.

One important aspect of subtypes is the inheritance of properties. Any fact involving a type automatically applies to a subtype. For example, if employees have employee numbers and departments, then salesmen have employee numbers and departments.

Types can overlap in a more general way, with neither being a subset of the other. Employees and customers are two such types. We will not need this generalized overlap concept in our design methodology.

2.2.2 Relationships

The kind of connection that binds entities together in a fact is what is meant by a relationship type.

A relationship type has a name, and a fixed number of things that it can connect together. That number is called its degree. A relationship between two things has degree 2, and is called a binary relationship. A relationship among three things has degree 3, and is called a ternary relationship. In general, a relationship connecting n things has degree n; however, the term n-ary is often reserved for relationships of degree greater than 2.

When a directional convention is used for binary relationships, a distinct name is sometimes given to each direction, e.g., "owns" and "owned by". While this may be useful at an end user interface, it is not essential to this methodology, and we do not adopt that convention here.

2.2.2.1 Participation

To determine pseudo-keys, it is necessary to know whether a relationship is one-to-one, one-to-many, or many-to-many. Furthermore, we may have to know this about combinations of roles in n-ary relationships. Consider an inventory relationship, involving part types, warehouses, and quantities on hand (QOH). Though each part type and each warehouse are likely to occur several times, the combination of a given part type and a given warehouse should occur only once.

We will use a notion of "participation" to express such characteristics.

For each role in a relationship, we will specify whether an entity can play that role in more than one occurrence of the relationship. E.g., an employee can occur in at most one occurrence of the relationship of being assigned to a department. We will also identify sets of roles in which such participation is limited to one. E.g., the combination of a part type and a warehouse may occur at most once in the inventory relationship.

Since we are mainly interested in whether the maximum participation is one or many, we will say that a role, or set of roles, has a maximum participation (MP) of 1 or n. 1 would signify a single-valued relationship, n a multi-valued one. E.g., for employees assigned to departments, MP for employees is 1 (each employee can have at most one such assignment), while MP for departments is n (a department could have many such assignments.)

In a similar way, when we plan to merge records, we will have to know whether all entities of a given type will necessarily participate in a relationship. That is, we have to know whether the relationship involves all or only some of the entities.

We will specify whether or not an eligible entity has to participate at least once by saying that a role, or set of roles, has a least participation (LP) of 0 or 1. 0 would signify an optional relationship, 1 a required one. E.g., for employees assigned to departments, LP for employees is 1 (each employee must have at least one such assignment), while LP for departments is 0 (a department might have no such assignments).

We combine the two concepts into a compact notation of the form LP*MP specified for each role in a relationship. In addition, it is useful to specify participations for sets of roles when the maximum participation is 1 for the set but not for all of the roles in the set.

The possible values of LP*MP are:

We illustrate the notation:

assigned inventory ........ ...................... : : : : : EMP DEPT PART-TYPE WAREHOUSE QUANTITY 1*1 0*n 0*n 0*n 0*n ---------------------- 0*1

Exercises: What would the participations be if each type of part had to be stored in some warehouse? In every warehouse? What if each warehouse had to stock some type of part? All types?

2.2.2.2 Unary Relationships

Although we have been treating facts as connections or relationships among things, there is a limited sense in which it might be useful to think of facts involving single things as well. This would be like a simple list of things which are of interest for some reason. We might have the list of all employees, or the list of all products, etc.

For this reason, it also makes sense for us to allow unary relationships, i.e., relationships of degree one. Thus, by analogy with the binary relationship "assigned(Smith,Sales)", we could have the unary relationship "employee(Smith)".

We have introduced this principally to permit the unary relationship "exists", which we will discuss later. We suspect that other unaries may also prove to be useful.

2.2.2.3 Attributes of Relationships (Facts About Facts)

So-called "attributes of relationships", such as the dates when people started working for companies, can sometimes be handled conveniently as n-ary relationships (in this case, a relationship among three things). But it is necessary to recognize the case when we have several attributes of the same relationship (e.g., if we also had the dates when people stopped working for companies), so that we can merge them into the same record.

For that purpose we can treat relationships and facts as entities themselves, able to participate in other relationships and facts. As a result, we can have "facts about facts", such as:

2.2.3 The Distinction Between Facts and Relationships

In most cases, facts and relationships are the same, as we have been treating them until now. However, it is convenient to let facts sometimes take on a richer structure, using relationships as a basic building block.

In the more general case, we can let facts be compositions or restrictions of relationships.

In composition, a fact is composed of several relationships. This is useful when the underlying relationships are known in the semantic model of the enterprise, but perhaps not all of the supporting details are to be maintained in the database. For example, we may have the relationships "employees have secretaries" and "secretaries have phone numbers", but the only facts we want to store in the database are "employees can be reached at phone numbers", i.e., "X is the phone number of the (unspecified) secretary of Y".

Restrictions are useful when the same relationship is to be maintained separately for different subsets of the possible entity types involved. For example, we might have "owns" as a relationship, but the facts "employees own stock" and "departments own projects" might be maintained separately. Or, we might want to separately maintain the facts "domestic employees are assigned to departments" and "foreign employees are assigned to departments". Or we might wish to use the "sales" relationship twice, once for a company's sales records and once for its purchasing records.

Facts and relationships are much the same thing, especially if we extend the concept of "relationship" to include derivable relationships, defined in terms of others. However, we tend to use "relationship" for the primitive associations, and "fact" for the more general notion of any association, primitive or derived.

We can thus describe a fact as a named composition or restriction of relationships among entity types (which may in turn themselves be facts). Facts have degrees and roles, derived from their constituent relationships.

2.3 Representation

We turn now to the fourth aspect of a fact: how things are represented. As we shall see, representation of entities can turn out to be a complex and difficult topic. One of the advantages of our methodology is that we do not allow this complexity to overshadow the whole design process. By deferring this to the end, we can first design for the facts we want maintained in the data, and then work out questions of representation.

Most methodologies assume that a simple representation is available for all entities. What we deal with in this section goes beyond the capabilities of other methodologies. The reader who wishes to deal with the simple case first, or who wishes simply to compare this with other methodologies, is encouraged to skip this entire section on first reading.

2.3.1 Why Represent?

So far, we have really been discussing facts in the real world, not how they are expressed in data. When we said that an employee is assigned to a department, we had in mind a real flesh-and-blood employee, and a real working department. We didn't say that a certain person's name was assigned to a certain department name, nor did we say that a certain employee number was assigned to a certain department number.

Unfortunately, we can't put people and departments into the machine. We have to let symbols -- typically, character strings -- serve as surrogates. To represent a fact which connects two things, we create some conjunction of surrogates for those two things, generally by putting the symbols into the same record. To do that we exploit certain kinds of facts, namely those which relate things to symbols. When we put an employee number and a department number into the same record, representing the employee's assignment to that department, we are really dealing with three facts:

2.3.2 How Represent?

An ideal representation would provide every entity with a uniquely corresponding symbol (character string) to serve as its surrogate.

We use "symbol" and "representation" in a very general way to mean any character string, particularly as it is used to stand for some other thing. The terms include the general ideas of naming, identifying, denoting, etc. "Bill Kent" is one of the symbols that can represent me, as well as several other people. "6", "6 feet", "72", "72 inches", are some of the symbols that can represent my height.

We know that idealized representation doesn't exist in a universal sense. Most entities don't have a single simple symbol that stands exclusively for them and nothing else in the whole world. For most things, we have to resort to some sort of descriptive phrase, such as "the IBM employee with employee number 999999". The best we can provide is that for some types of entities, there are some types of symbols which can serve as surrogates. Under these conditions, a symbol serves as a surrogate for an entity -- provided that we know what symbol type it belongs to, and what type of entity it is representing. Thus, "999999" represents a certain entity -- if we know whether the symbol is a social security number or an IBM employee number or an IBM part number or perhaps something else.

Representations fit into our general structure of facts. "Employees are represented by employee numbers", "people have names", and "heights are represented in inches with picture 99.9" are facts. Corresponding to this, we acknowledge that symbols are entities, and they are collected into entity types which we will also call symbol types. Employee numbers and people's names are two such symbol types. We can also think of all symbols collectively as one large type. Then, in terms of subtypes, we can say:

In other words: employee numbers are symbols, and symbols are entities.

2.3.2.1 Symbol Types

A precise treatment of symbol types quickly becomes complex, and too esoteric for our purposes. We take a pragmatic approach, adequate for a practical data design methodology.

As a first approximation, we think of a symbol type as the set of character strings permitted to occur in a data field. (It is comparable to the "domain" concept in relational database theory.) Any characteristic which determines which actual character string can represent a given entity in a given field will also serve to distinguish different symbol types. For example, a given data field will probably accept exactly one of the following as a representation of my height: "6", "06", " 6", "6.0", "6.00", "6 feet", "110", "VI", "72", etc. Each of those belongs to a different symbol type. That is, there are at the very least nine different symbol types that can represent lengths (the reader can surely identify a great many more). All of the following properties might be involved in the description of a symbol type, and might serve to distinguish one symbol type from another:

In an elegant theory, each of those properties could be expressed as a distinct "fact" about a symbol type. For the purposes of our methodology, we will simply let those be part of the informal description of distinctly named symbol types.

Different sets of character strings necessarily constitute different symbol types. But we don't force the converse. That is, different symbol types might contain the same character strings. E.g., employee numbers and part numbers might both consist of six-digit integers. It is acceptable to either treat this as one symbol type ("six digit integers"), or as two symbol types ("employee numbers" and "part numbers").

Some symbol types are "self-describing", instead of having all of their properties fixed and specified. One symbol type might include both "6 feet" and "72 inches". It is "self-describing" because the unit of measure is included with each item, rather than being specified for the entire set.

2.3.2.2 Symbol Facts

We will refer to a fact in which one of the entity types is a symbol type as a "symbol fact".

Not all symbol facts are used for representation. Some of them simply provide information, such as maiden names of employees, or a table of abbreviations (connecting two symbol types).

Others may be said to "represent", even if they are not unique. In a certain sense, people are represented by their names.

2.3.3 Other Representation Techniques

Not all representation is done by simple symbols serving as surrogates.

2.3.3.1 Structured Symbols

Some representations are structured in a form that requires multiple fields to represent a single entity. Examples:

Quite often, it is an arbitrary design decision whether such things are represented in single or multiple fields. It has little to do with the nature of the facts being maintained. Addresses may get split into multiple fields to facilitate formatting for letters and labels.

It is sometimes argued that in such situations we ought to describe components of things, or relationships among things (e.g., a date is a "relationship" between a month, a day, and a year), but that can be quite artificial. For the purposes of this design methodology, we bypass such considerations, and simply allow a symbol type to be specified as structured, along with an informal description of its structure.

Various situations can give rise to multi-field representations, including:

2.3.3.2 Derived and Indirect Representations

Sometimes an entity will not have a distinctly visible representation in the data. Such an entity will not (directly) correspond to any field.

For example, an entity might be inferable from a related entity, or from its representation. Dates are sometimes embedded in the identifiers of things (date of claim is part of claim number, birth date is part of some European person numbers). A great deal of information may be embedded in part serial numbers: part type, place of manufacture, certain technical characteristics, etc.

In some cases the inference is more computational. Part numbers in certain ranges, or employee numbers in certain ranges, imply certain company locations. In some countries, person numbers are odd or even according to sex. Day of week can be computed from a date. Winners of elections can be computed from vote counts.

In the extreme case, some entities don't have any identifiers of their own at all, but are only known indirectly via some related entity. Elections don't have identifiers, but are known by the year in which they occurred: "the 1980 election". Sales territories might only be known by their headquarters cities: "the Chicago territory".

2.3.3.3 Non-Representation

The representation of some entities is derivable from the "context". Such entities are constant over all occurrences of the record type, and hence don't have to be identified via any data field. A given company is always the "seller" in its own sales records, and always the "buyer" in its own purchasing records. The playing schedule for a particular team will not list both teams for each game.

Sometimes the identity of an implicit entity is included in the record name. E.g., if we have separate employee records for each job type, then we don't need a field for the employee's job. If we have separate employee records for each company location, then we don't need a field for the employee's location.

And sometimes entities are not represented because we don't care about identifying the individual! Such entities are only involved as intermediaries in specifying the fact. For example, we might record the phone number of an employee's secretary, without identifying the secretary. (In some environments, that changes too fast.)

2.3.3.4 Compressed Representations

The derived and indirect representations mentioned above essentially compress several entities into one field. In effect, one field represents a sales territory and its headquarters city. Another field represents a person (via his person number) and also his date of birth and his sex.

In practice, this is sometimes carried to the extreme of compressing several entities into one field simply because their representations are "too short"; letting them each occupy a single field would be inefficient.

For example, if sex and marital status were each representable by a single-character code, they might be combined into a "personal status" field containing the two characters. Even more extreme are computational forms of encoding: e.g., digits 1-4 represent the marital status of males, 5-8 the marital status of females.

While this is not considered a fit topic of discussion in logical data design, it is in fact a common practice which alters the logical design (field layout) of a record. It needs to at least be acknowledged, and documented when it does happen.

2.3.3.5 Mapping Techniques

While some representations can be expressed in terms of symbol facts, we don't require that for all cases. It makes the methodology too cumbersome.

In effect, we will simply allow any entity to be replaced by a suitable representation. This representation might cause the introduction of multiple fields, or it might cause several entities to be represented in the same field, or the representation might occupy no fields at all. Sometimes additional facts will have to be "imported" in order to achieve suitable representation.

For documentation purposes, it would be advisable to record the mappings between the data fields and their corresponding roles in the facts. As indicated, such mappings may be rather elaborate.

2.3.4 "Quality" of the Representation

We have so far described some assorted techniques for representing entities, e.g., by simple symbols, structured symbols, indirect reference, etc. We haven't yet said that such representations are necessarily unique. Uniqueness is just one of a number of qualities that a representation might have.

The qualities are not always all required. The overall process of selecting representations involves three distinct steps:

We first discuss some of the qualities of representation.

2.3.4.1 Existence

The first desirable quality of a representation is that one should in fact exist. Most things in the world around us don't have simple identifying labels. We refer to most things by a combination of pointing, complex noun phrases involving relationships to other things, and default assumptions regarding context and implicit qualifiers.

Entity types without direct representations include the earlier examples of elections or sales territories, and the individual pieces produced by some assembly lines. Animals are another example (though pets usually have representations, while wild animals usually don't).

Data representations for such entities can usually be obtained either by assigning arbitrary new identifiers (part numbers), or finding related entities suitable for indirect representation.

2.3.4.2 Completeness

A complete representation provides a representation for every member of an entity type. Employee numbers are complete for employees (of a given company), but not for people in general. Maiden names are incomplete representations.

Representations that are incomplete can't really be used for any effective sense of "representation", but would simply be recorded as facts about the entities. Military service numbers, social security numbers, and maiden names might occur in this way in employee records.

2.3.4.3 Uniqueness

A unique representation meets the common requirement for identification: a representation corresponds to at most one of the entities. People's names are generally non-unique; the same string can correspond to several people.

Non-unique representations commonly occur in data, and can still be quite useful to human users. Such users frequently make use of subtle default assumptions and local conventions. It is a natural and useful device, not to be rejected automatically as poor design.

Uniqueness is mainly required when the agent that will retrieve or interpret the data won't take advantage of such assumptions or conventions. Such an agent might be a computer (database manager) or a file clerk. In order for such an agent to retrieve a record about an employee, the employee's identifier must lead only to his record and nobody else's. If we want the agent to follow a path, perhaps to go from a department record to the employee record of its manager, then the manager's identifier must lead only to his record and nobody else's.

Uniqueness is especially important when the simple-minded agent is only able to cope with one record at a time. Non-uniqueness will disrupt the processor, if it retrieves several records when it was prepared to receive only one.

Finally, uniqueness has some significance simply because it is mechanically enforceable. The simple-minded agent can detect and reject multiple occurrences of an identifier, if and when the agent is instructed to do so.

2.3.4.4 Singularity

A singular representation provides each entity with at most one corresponding representation. This is inverse to the notion of uniqueness. Social security numbers, though unique, are not singular: a given person might have several of them.

Singularity is not always required, and is sometimes undesirable. Measurable quantities often need different representations in different contexts -- at the very least, different units of measure. Sometimes it is important to represent people by employee numbers in one context and by social security numbers in another.

Lack of singularity could lead to:

Lack of singularity could be compensated if the processing agent could recognize the equivalence of several representations of the same thing, and perform conversions. Such conversions might be computational, as for measured quantities, or supported by other stored facts, such as the mappings between social security numbers and employee numbers. Current data processing agents rarely provide such capabilities, making singularity all the more important. The ability to navigate through non-singular identifiers is beyond the capability of current query processors (relational or otherwise) [K2].

While uniqueness is a widely recognized requirement for identifiers, singularity is little mentioned in the literature. This might be for several reasons:

We mention in passing that there are at least three ways to violate singularity:

2.3.4.5 Other Qualities

We mention some other qualities of representations that need to be considered. The importance of such criteria is a matter of judgement to be exercised during the design process.

A representation should be stable. If it were subject to change, then it would have to be updated wherever the entity was mentioned. This applies to "direct" representations, e.g., names, and also to indirect and qualified representations. It would be inadvisable to name sales territories after their headquarters cities, if the headquarters are likely to move. It is only reasonable to qualify city names with their states because we don't expect state boundaries to shift, transferring cities from one state to another.

It is sometimes held that the ideal identifier contains no information. E.g., embedding dates into serial numbers is considered undesirable. However, it would seem that embedded information is acceptable provided that (1) the information is stable, and (2) the embedding of information can be documented in the dictionary or other data description.

Mnemonic identifiers are generally easier for humans to deal with.

Simplicity is generally preferable. Direct and simple representation is preferable to indirect, derived, or structured representations.

Length is a factor. Anything can be uniquely identified if we can write enough paragraphs about it. Representations that are too long, or too variable in length, are undesirable. Even representations that are too short can sometimes create problems, leading to the kind of compression discussed earlier.

Where possible, uniform representations are preferable. E.g., same representation for the same entity type, where possible. Also uniformity within representation is preferable to self-describing forms. This maximizes the amount of information that can be factored out, minimizes the amount that has to be stored, and simplifies the processes accessing the data.

2.3.5 Using Combinations of Facts

Completeness, uniqueness, and singularity can be described in terms of participations in a fact which provides representation:

represented-by ................ : : THING REPRESENTATION x*y z*w | | | | 1 if complete, else 0- | | --1 if unique, else n | | 1 if singular, else n--- -don't care (usually 0: some names aren't used)

By representations here we mean one or more related things. Normally they would be symbols, but they might not be if we are dealing with indirect or qualified representations. If they are not symbols, then they in turn will have to be provided with representations. E.g., cities can be identified by a combination of city names and states, where the states in turn have to be provided with some representation. For the unique identification of the cities, it doesn't matter whether the states are identified by full names or abbreviations; we don't have to specify that as part of the city representation.

Very often, as in the case of qualified names, the representations will be obtained from combinations of relationships, such as:

is-in ..................... :1*1 1*n: : : : has-name : : .......... : : :1*1 0*n: : : : : : CITY NAME STATE : --------------- : : :1*1 0*1: :..............: identifies

The participations for such combinations have to be specified, since they cannot be derived from the component relationships. In effect, these constitute additional constraints.

2.4 Other Design Concepts

2.4.1 Entities and Records

Choosing the record types is an essential part of data analysis and design. In the present methodology, records are chosen in an objective and systematic way, based largely on the pattern of one-to-many and many-to-many relationships in the facts to be maintained.

Earlier methodologies choose records on the basis of a more restrictive "entity" concept (which we will call "major entities"), on the assumption that a record corresponds to an entity type. There are several difficulties with this approach:

Though our approach is more direct than the "major entity" approach, the results are often quite similar, and for good reason. The major entity approach works fairly well because if an entity type is "major", then we are very likely to have single-valued facts about it, and hence we will naturally have a record for it. Perhaps the popular notion of an entity being "something about which we maintain facts" should be revised to "something about which we maintain single-valued facts".

2.4.2 Existence Management

Sometimes another significant characteristic of "major entities" is that their existence needs to be explicitly managed. Some entities can come and go freely in the database, without any particular announcement of their arrival or departure. We are usually willing to mention a length, or a color, as a fact about something, without announcing that that particular length or color exists. On the other hand, we don't want to refer to an employee without prior assurance that such an employee exists. That is, for some entities it is necessary to have some mechanism for announcing their existence, and also their erasure from our sphere of interest. This can be a criterion for identifying major entities.

Although we try to treat this as a high-level semantic concept, the only mechanism for such announcements in conventional database systems is to insert and delete records. Therefore, it becomes important to have a record type corresponding to an entity type for which such announcements are required.

This is rarely significant, in practice. It almost invariably turns out that single-valued facts will be maintained about anything whose existence needs to be managed, so they will have records anyhow. Exceptions to the rule almost have to be contrived. One case might occur if all the facts we wanted to maintain about something happened to be multi-valued. E.g., if projects had multiple managers and multiple completion dates and multiple goals and multiple budgets, then we might have many "intersection" records for these m:n relationships, but no one record keyed on project numbers. An explicit "announcement" specification would be needed in this case.

In our methodology, we can take advantage of unary relationships to accomplish this. For any entity type T for which such "existence management" is required, specify "exists(T)" as a fact to be maintained. A pseudo-record with one field will be generated, and the field will be a pseudo-key. If other records exist with the same key, then this record will vanish harmlessly during the merge step. Otherwise it will survive into the final design as an "existence list" for such entities.

2.4.3 Complex Situations

Other methodologies, based on earlier concepts of entities, relationships, and attributes, tend to have difficulty with:

These cases are dealt with naturally in our methodology.

2.4.4 Normal Forms and Normalization

Normal forms [K4] are valuable as a design objective, yielding designs with minimal redundancy and minimal exposure to update anomalies, sometimes with attendant performance penalties for retrieval operations.

However, some methodologies apply normalization as a process, in which badly-designed records get decomposed (normalized) into better records. Ours is a "synthetic" approach [B], constructing records in normal form directly.

Intuitively, it seems clear that record designs produced by this methodology are in fifth normal form [K4] -- which also implies the first four normal forms -- at least prior to the step where alternative designs are introduced. Informal support for this hypothesis is that (1) the initial pseudo-records are in fifth normal form, and (2) merging on keys preserves normal forms. Formal arguments may be hard to develop, since merges are not quite the same as relational joins (see Section 3.4.5).

2.4.5 Derivable Facts

We are implicitly assuming that the specified facts are independent of each other, in the sense that one can't be derived from the others. This might be violated, for example, if we specified the facts:

These facts are redundant. If implemented unintentionally, we would have a badly designed database. At the very least, we should require that derivable facts be separated from the others. If it is necessary that they be made available from the database, there are two major design alternatives:

A similar assumption is that the specified facts have been decomposed into independent components as much as possible. For example, the fact that

is not, in itself, redundant. But this should (assuming a normal company policy) be presented in the more elementary form:

These assumptions, if formalized, would underlie the demonstration that this methodology can yield record designs in fifth normal form.

3 THE DESIGN PROCESS

Up to this point we have been developing the underlying concepts. We are finally ready to present the design process itself, in some detail. We suspect that, as with many other things, doing it will be easier than describing, teaching, or learning it.

3.1 Specify the Facts to be Maintained

These will be specified in terms of relationships among entity types. In practice, we expect that the facts, relationships, and entity types will get defined in an iterative, concurrent fashion. Sometimes it may be easier to start listing the entities first, and then to collect the facts about them. New facts and relationships may in turn give rise to new entity types.

The set of entity types and relationships should be useful beyond a single application or database. The design of subsequent databases will probably involve many of the entities and relationships established in previous designs, although some new ones will undoubtedly be added.

In the following, we indicate the information needed for each construct. Implementations of this methodology can provide various mechanisms for specifying this information, including possible default rules.

For each fact, specify:

For each relationship, specify:

For each entity type, specify:

We need not assume that all specifications are provided at the outset. In particular, additional facts to support representations may be required at a later stage.

3.1.1 Examples

We will follow two example situations: employees taking courses, and people winning elections. We treat them as an extension to the introductory example, so that the facts presented there are also available, particularly if needed for representations.

For brevity's sake (though it's probably too late for that), we will not illustrate all aspects. We present the facts immediately in diagram form, with role names defaulting to entity type names:

covers has-off located begins ......... .......... ............ ......... : : : : : : : : COURSE SUBJECT COURSE OFFERING OFFERING LOCATION OFFERING DATE 1*1 0*n 0*n 1*1 1*1 0*n 1*1 0*n gets-grade-in ................... : : : EMP OFFERING GRADE 0*n 0*n 0*n ---------------- 0*1 held-in won-by ......... .......... : : : : ELECTION YEAR ELECTION PERSON 1*1 0*1 1*1 0*n

3.2 Generate a "Pseudo-Record" for Each Fact

A pseudo-record is just a fact in disguise. It contains one field for each role associated with the fact. When actually carrying out the methodology, we would record how each field relates to the fact, i.e., the field's corresponding relationship, role, and entity type. The least and most participations for each role or set of roles are inherited by the corresponding fields or sets of fields.

We do not as yet care anything about representations. If the role actually involves a symbol type, then we expect symbols to be in the field. Otherwise, we pretend that whatever thing is playing the role is actually sitting in the field. For example, in the pseudo-record corresponding to the fact that employees have employee numbers,

has-emp# ........ -------------- | EMP | EMP# | --------------

we imagine that the second field contains employee numbers, while the actual employees are sitting in the first field. It may be difficult to imagine some things in the fields, such as colors or days or weights (the actual heaviness, not any numerical measurement of it), but that's the appropriate mental picture at this point.

Pseudo-records could be candidate records for the final design, but they are likely to have two undesirable characteristics at this point:

Example:

covers has-off located begins ........ ......... ........... ........ : : : : : : : : ---------------- ----------------- ------------------- --------------- |COURSE|SUBJECT| |COURSE|OFFERING| |OFFERING|LOCATION| |OFFERING|DATE| ---------------- ----------------- ------------------- --------------- 15 16 17 18 19 20 21 22 1*1 0*n 0*n 1*1 1*1 0*n 1*1 0*n gets-grade-in ................... : : -------------------------- | EMP | OFFERING | GRADE | -------------------------- 23 24 25 0*n 0*n 0*n ------------------ 0*1 held-in won-by ......... .......... : : : : ------------------- --------------------- | ELECTION | YEAR | | ELECTION | PERSON | ------------------- --------------------- 26 27 28 29 1*1 0*1 1*1 0*n

3.2.1 A Shortcut

Representations will be dealt with at a later step, making allowances for a large variety of difficult situations. However, we can take a shortcut at this point which will keep the methodology simple for simple cases, and which brings our methodology more in line with others.

When an obvious and adequate representation is available for an entity type (e.g., employee numbers), then we can use that representation immediately as a "surrogate" for such entities, as follows:

In our imaginary picture, we have now actually let employee numbers occupy the fields, replacing the employees themselves.

In the present series of examples, we only take the shortcut for employees, replacing EMP with EMP#.

3.3 Identify "Pseudo-keys" For Each Pseudo-record

A pseudo-key is a sequence of fields that could be an actual key if all entity types involved had simple unique identifiers. A key is a set of one or more fields that provides uniqueness of records. A given entity or sequence of entities may only occur once in these fields. More precisely, we are only interested in minimal sets: a set of fields will not be considered a key if any subset is a key.

The main value of keys in record structures is that they enforce single-valuedness of facts, i.e., maximum participations of 1. If an employee's number is only permitted to occur in one employee record, he is thereby limited to having only one entry for his salary, his department, etc.

A pseudo-key corresponds to a role or set of roles in a fact having maximum participation of 1 (participations of 0*1 or 1*1).

Example (we use "====" to designate the pseudo-keys):

covers has-off located begins ........ ......... ........... ........ : : : : : : : : ---------------- ----------------- ------------------- --------------- |COURSE|SUBJECT| |COURSE|OFFERING| |OFFERING|LOCATION| |OFFERING|DATE| -======--------- --------========- -========---------- -========------ 15 16 17 18 19 20 21 22 1*1 0*n 0*n 1*1 1*1 0*n 1*1 0*n gets-grade-in ................... : : : --------------------------- | EMP# | OFFERING | GRADE | --===============---------- 23/2 24 25 0*n 0*n 0*n ------------------- 0*1 held-in won-by ......... .......... : : : : ------------------- --------------------- | ELECTION | YEAR | | ELECTION | PERSON | --========---=====- --========----------- 26 27 28 29 1*1 0*1 1*1 0*n

(Note: in these examples, a field number of the form 23/2 means pseudo-record field number 23 with field number 2 substituted to provide a representation. Some of the numbers refer to the first example in the paper.)

3.4 Merge Pseudo-records Having Compatible Keys

The natural thing to do at this point is to merge records wherever possible. Such merging is analogous to a join in the relational model:

--------- --------- ------------- | X | Y | + | X | Z | => | X | Y | Z | --------- --------- -------------

Intuitively, we want to merge these records to mimic the way a data record provides several facts about one "subject". More precisely, these are single-valued facts, so that the "subject" corresponds to a key of the record. Having several single-valued facts about the same subject exactly corresponds to having pseudo-records with the same key. Hence what we will merge are pseudo-records having the same key. The result of the merge will be a new pseudo-record containing all the fields of the contributing pseudo-records, without replicating the shared key.

3.4.1 Merging on Simple Keys

A simple key consists of one field.

3.4.1.1 Ideal Merges

Merging is most likely to occur on simple keys. Two pseudo-records may only be merged if their corresponding keys are going to contain the same occurrences of the same kinds of things. The strictest way to assure this is to require that the corresponding keys contain the same entity types, and that they will both contain the full population of the entity type. In other words, the least participation for each key should be 1. Since all keys have an MP of 1, these keys will have LP*MP = 1*1.

An "ideal" merge is thus possible when the two keys involve the same entity type and both have LP*MP = 1*1.

Example:

covers ........... : : -------------------- | COURSE | SUBJECT | (Record R1) --======------------ 15 16 begins ................................ : : : located : : ..................... : : : : : : : has-off : : : : ......... : : : : : : : : --------------------------------------- | OFFERING | COURSE | LOCATION | DATE | (Record R2) --========----------------------------- 18 19 21 17 20 22 gets-grade-in ................... : : : --------------------------- | EMP# | OFFERING | GRADE | (Record R3) --===============---------- 23/2 24 25 won-by .................... : : : held-in : : ......... : : : : : ---------------------------- | ELECTION | YEAR | PERSON | (Record R4) --========---====----------- 26 28 27 29

3.4.1.2 Unequal Populations

The restrictions can be weakened when we can "pad" the populations. In many cases, a pseudo-record whose key has a least participation of 0 can be "padded" to include the missing entities, with some sort of "null" values for the related entity. For instance, instead of excluding employees who have no departments from the "assignment" pseudo-record, we might include them with a blank department, or some other indication of null. Such padded pseudo-records will then have a full population of values in their keys, and they can be merged with other pseudo-records having either 1*1 keys or similarly padded keys.

Such decisions are sometimes beyond the bounds of "logical" design. It often depends on such things as the capability of the underlying data management system to support null values, or the effect of null values (system supported or otherwise) on the processing logic of various applications, or the amount of storage required if there are many null values.

3.4.1.3 Subtypes

The padding concept can be extended to allow merging of entity subtypes. The entity type in any key field can be "promoted" to some supertype, with the key having a least participation of 0. For example, the fact that "salesmen serve territories" can be promoted to "some employees serve territories".

The resulting pseudo-records can then be merged as above for unequal populations. Either one or both of the pseudo-records might be promoted. Thus this technique might be used to merge facts about salesmen with facts about employees, or it might be used to merge facts about salesmen with facts about clerks by treating them both as employees.

As before, the feasibility of such merges depends on the acceptability of null values. In practice, such promotions will generally be limited to those cases where suitable identifiers are available for the supertypes.

Such merges will sometimes require the importation of additional facts for distinguishing the subtypes, e.g., a job-type fact to distinguish salesmen from clerks.

3.4.1.4 Summary

Merging pseudo-records having simple keys is possible if:

Although these pseudo-records have simple keys, the corresponding records in the final design might have compound keys, introduced for representation purposes. A qualifier field might be introduced for uniqueness, or a single entity might be represented by multiple fields.

3.4.2 Merging on Compound Keys

A compound key contains more than one field.

In pseudo-records, ordinary entities always correspond to single fields. Compound keys correspond to facts. Merging on compound keys requires that the keys represent the same fact, which is a stronger requirement than that the keys contain matching entity types. Matching entity types could be involved in different relationships. Merging compound keys is, in effect, merging attributes of relationships, and we should only merge attributes of the same relationship. For example, we can merge pseudo-records corresponding to the following facts, all of which would have compound keys consisting of people and companies:

It would not be appropriate to merge these with some other facts involving people and companies, such as which people own which companies.

It is still necessary to be concerned with matching populations. In the example above, fact 3 would not exist for people in their current jobs. In order to merge, it must be acceptable to pad fact 3 with occurrences having some form of null dates.

3.4.3 Pseudo-Records With Multiple Keys

A pseudo-record with multiple keys corresponds to a one-to-one relationship, e.g., between department and manager (employee), or between employee and spouse. This creates an opportunity for several entities to be "subjects" in the same record, which is sometimes desirable and sometimes not. We probably wouldn't object to combining the facts

into the record:

------------------------------------------------ | EMPLOYEE | BIRTHDATE | SPOUSE | SP-BIRTHDATE | --========---------------======-----------------

But the same process allows us to merge all department information into the employee records, by putting all such data into the records of the employees who manage departments. That option appears undesirable.

Extensive merging on both subjects becomes inadvisable if the connection between the subjects is likely to change. That's one reason not to maintain information about a department in the employee record of its manager.

Another concern has to do with nulls. If one of the pseudo-keys has 0*1 participation, and the record is merged on that key, then all other fields in the record must admit null values -- including the other pseudo-keys. We can call a pseudo-key that has to admit null values a soft key. Further merging on a soft key is generally undesirable, since softness is propagated. Whenever a record is merged on a soft key, all the fields in the other record must be able to accept nulls, including the key in the resulting merged record.

For example, consider the following pseudo-records, in which we use "=-=-=" to indicate keys with 0*1 participation:

has-name earns manages ........ ....... ....... : : : : : : ---------------- ---------------- ---------------- | DEPT# | NAME | | EMP# | MONEY | | EMP# | DEPT# | --=====--------- --====---------- -=-=-=---=-=-=-- 7 8 11 12 13 14 1*1 0*n 1*1 0*n 0*1 0*1

Suppose we merge the first and third of these on the DEPT# field in order to combine information about departments. Since 14 is a 0*1 key, all other fields in that record must admit nulls after the merge, making 13 a soft key (indicated by ":"):

----------------------- ---------------- | DEPT# | NAME | EMP# | | EMP# | MONEY | --=====---------:-:-:-- --====---------- 7 14 8 13 11 12

In the new record, field 13 must admit null values, since some departments don't have managers. Suppose now we want to pursue our earlier trick of capturing all information about a department in the employee record of its manager. We can do that by merging these last two records on the EMP# fields -- provided we are willing to have nulls in both the EMP# and MONEY fields in the result, to allow for the departments having no managers. (The reader is encouraged to work this out in terms of example records.)

3.4.4 Merging is Optional

The natural and appropriate inclination is to merge as far as possible. It is preferable to aggregate all facts about an entity into a single record, creating a model in which that record "represents" that entity. That minimizes problems of "population management", e.g., making sure that the set of employees who earn salaries is the same as the set of employees assigned to departments. It provides a "master file" concept, i.e., a central list of all the currently existing entities of that type.

However, merging to the maximum degree is not always desirable. Some considerations are logical, others physical.

Merging can be carried to absurd extremes via supertypes. Almost any pair of types can be perceived as subsets of a larger type (resources, assets, products, physical objects, priced things, etc.). Theoretically, these could drive some designs to a single monstrous record with nulls permitted in most fields. In practice, such merging is limited to cases where suitable identifiers are available for the supertypes. Occasionally, new identifiers will be invented to facilitate such merges.

Other considerations are more physical, and probably should not be considered in a purely logical design. Possibly wasted space for null values is one consideration. Or, merging might be undesirable if different facts are used exclusively by different applications, or on different systems, or at different locations.

The option to merge or not merge overlaps the topic of Alternative Designs [Section 3.6].

3.4.5 Merge vs. Join

The merge step of this methodology is quite similar to the join operations of the relational model. There are a few distinctions to keep in mind.

In the first place, merging is a design step, rather than a database operation. That is, we only manipulate data descriptions, not actual data.

Nonetheless, merge operations could be visualized as actually combining record occurrences, just like a join operation. The essential difference here is that merging combines records on the basis of matching entities and facts, whereas the joins involve matching representations and/or matching attribute names. Merging is not even limited to matching entity types, since subtypes of a common supertype may be merged. Thus merging is much like the "entity join" proposed in [K2]. Furthermore, joins delete records which aren't matched ("dangling tuples"), whereas the merge retains them by matching them with null-padded copies of the other record. In this respect the merge is like an "outer join" [D].

3.5 Assign Representations

At this point, our pseudo-records are a mixture of symbols and non-symbols. In order to convert these to proper records, we have to eliminate the non-symbol fields, making sure they are adequately represented via symbolic fields. Most of the time, this simply means replacing an entity type with a suitable identifier type. But sometimes other things happen.

We repeat an earlier note: Most methodologies assume that a simple representation is available for all entities. What we deal with in this section goes beyond the capabilities of other methodologies. The reader who wishes to deal with the simple case first, or who wishes simply to compare this with other methodologies, is encouraged to skip this section on first reading.

To assign representations, do the following steps for each field in the pseudo-record, including any additional fields introduced in the process:

  1. If the entity type in the field is a structured symbol type, replace by multiple fields, and exit.
  2. If the entity type is a symbol type, leave it alone, and exit.
  3. If the entity type needs no representation, delete it, and exit.
  4. If the representation is derivable from another field (including indirect representation), delete this field, and exit.
  5. Find a set of facts whose combination gives the required qualities of completeness, uniqueness, and/or singularity (as well as other qualities).

    In the vast majority of cases, this will consist of just one simple symbol fact.

    Introduce new facts if needed.

    Facts suitable for representing a supertype may be employed. E.g., in seeking identifiers for salesmen, identifiers for employees may be used.

    If the related things are not symbols, the process may later have to be repeated for their representation.

  6. If several suitable sets of facts are available, choose one.
  7. "Import" such facts into the record, if not already there.
  8. Delete the field.
  9. Repeat the assignment of representations for any imported facts.

When finished with all fields in the record, consider compression, as discussed in Section 2.3.3.4.

When finished with all records, it may be necessary to re-merge, to eliminate redundancy of imported facts. For example, after a state field is imported into a record to provide unique qualification for a city, it may become redundant to maintain a separate record telling what state each city is in.

3.5.1 Example

We illustrate these steps in detail for record R1 (continuing from Section 3.4.1.1).

Steps 1-4 don't apply.

Step 5: we introduce two new facts needed for the representations of courses and subjects:

has-number has-name .......... .......... : : : : -------------------- ------------------ | COURSE | COURSE# | | SUBJECT | NAME | -------------------- ------------------ 30 31 32 33 1*1 1*1 1*1 0*1

Step 7: import these facts into record R1:

------------------------------------- | COURSE | COURSE# | SUBJECT | NAME | (R1) --======---=======------------------- 15 30 31 16 32 33

Step 8: delete the fields for which we have just provided representations:

------------------ | COURSE# | NAME | (R1) --======---------- 15/31 16/33

Step 9 requires that we repeat the process for the newly imported fields, but by step 2 we find that these can be left alone, since they already contain symbols. We keep track of the entity types and representations in the final result:

Entity types: COURSE SUBJECT --------------------- Representations: | COURSE# | NAME | (R1) (Symbol types) --=======------------ 15/31 16/33

For record R2 (last seen in Section 3.4.1.1), the main difficulty is in finding a unique identifier for offerings. The best we could find (by asking the people at the school) was that offerings were numbered uniquely only within course. Informally, what we will do for this record is to replace OFFERING by the offering number, recognize that the proper course is already in the record, and allow the combination of offering number and course to represent the offering. Then the course itself has to be represented, using the recently introduced fact relating them to course numbers.

Steps 1-4 don't apply. We begin at step 5 by introducing a new fact assigning offering numbers to offerings:

has-number .......... : : ------------------- | OFFERING | OFF# | ------------------- 34 35 1*1 0*n

We then recognize that the following combination of facts is required to uniquely identify offerings:

has-off ...................... :1*1 0*n: : : : has-number : : ........... : : :1*1 0*n: : : : : : OFFERING OFF# COURSE : ---------------- : : :1*1 0*1: ................. identifies

At step 7, we observe that the "has-off" fact is already in the record, but we have to import the "has-number" fact, yielding:

begins ......................................... : located : : .............................. : : : has-off : : : : ................. : : : : : has-number : : : : : : .......... : : : : : : : : : : : ---------------------------------------------- | OFFERING | OFF# | COURSE | LOCATION | DATE | (R2) --========---=============-------------------- 18 19 21 34 35 17 20 22

We eliminate the OFFERING field itself at step 8, and then apply the representation process again to find simple representations for the COURSE, LOCATION, and DATE fields. The final result:

Entity types: OFFERING COURSE LOCATION DATE ---------------------------------------- Representations: | OFF# | COURSE# | NAME |MMDDYY| (R2) (Symbol types) --==================-------------------- 18 19 21/35 17/31 20 22

In record R3, we again have to provide a representation for offerings. The process is much the same as before, except we now have to import both facts (course has offering and offering has number), since neither had been in the record. The result is the importation of a COURSE field:

Entity types: EMP COURSE OFFERING GRADE ----------------------------------------- Representations: | EMP# | COURSE# | OFF# | GRADE-LTR | (R3) (Symbol types) --=========================-------------- 23/2 17/31 18 24/35 25

In effect, the fact connecting courses (17) and offerings (18) now must be maintained redundantly in records R2 and R3, purely to solve a representational problem. To see the significance, note that this record would not mention courses at all if offerings had unique identifiers. (We might observe that this redundancy violates no normal forms.)

For record R4, we elect to identify elections indirectly by their year of occurrence, and simply eliminate the ELECTION field by step 4. After supplying suitable representations for the other fields, we have:

Entity types: YEAR PERSON ------------------ Representations: | YEAR# | NAME | (R4) (Symbol types) --=====----------- 27 29

The odd thing about this record is that, although it is all about elections, it contains no mention of elections!

3.6 Alternative Designs

This step of the methodology is not fully developed as yet. The kinds of design alternatives to be considered are described in [K3] and [K5]. [K3] has a preliminary proposal for transforms to systematically generate some of the alternatives mechanically.

Each alternative design corresponds to some different way of specifying the facts. Fact specifications are not particularly unique. Even using entity/relationship concepts, there are many ways to describe a given situation. Our objectives are:

A variation of the first goal is to allow application descriptions to be standardized without necessarily imposing a corresponding rigidity on the data designs. It is very useful to be able to document that similar information has been implemented in different ways in different databases.

Of course, even our methodology cannot escape giving some preference to some designs. How the user phrases his application description will determine which design we generate first. The others are generated later in some extra steps. It is unavoidable that some users will learn to phrase their application descriptions so that their "preferred" designs are generated first. The best we can do is to make the alternatives available, in case the user really didn't have a preference in mind, or in case he had neglected to consider some better alternative.

3.7 Naming Fields and Record Types

In general practice, there is no systematic discipline for the naming of fields and records. Hence there is nothing predictable about what such names signify. At best, there might be various guidelines and local standards.

Often, a field name includes indications of some or all of the following things, abbreviated and concatenated in various ways:

The only uniform requirement seems to be that field names be unique within a record. This precludes the use of any one of those constructs consistently as the field name. Even role names may not be unique enough. In fact, the same relationship might occur twice in one record, definitely rendering the role names non-unique within the record. Example: a record containing two birthdays, the employee's and the spouse's.

We give arbitrary names to our fields:

--------------------- "COURSE RECORD" (R1): | COURSE# | SUBJECT | --=======------------ ------------------------------------ "OFFERING RECORD" (R2): | OFF# | COURSE# | LOCATION | DATE | --==============-------------------- --------------------------------- "EDUCATION RECORD" (R3): | EMP# | COURSE# | OFF# | GRADE | --=====================---------- ----------------- "ELECTION RECORD" (R4): | DATE | WINNER | --====-----------

3.8 Iteration

The whole database doesn't have to be designed at once. You can start with any initial set of facts, relationships, and entities, and then come back and add more later, as many times as needed. The key steps in subsequent iterations are:

4 CONCLUSION

4.1 Highlights of This Approach

We focus more on what facts are to be maintained, rather than on what the facts are about. We start with one record per fact, rather than one record per entity.

The user isn't forced to make distinctions between such things as "relationships" and "attributes" (or was that "entities" and "attributes"?). While there are times when he might feel more comfortable making that distinction, there are too many cases where it is hard to decide (e.g., birthplace).

Neither is the user forced to choose between entity types and subtypes.

Other design schemes have one stage for handling relationships, another for handling attributes. But they do practically the same thing in both stages. We don't distinguish between them, and use one process for handling both.

Some design schemes ask the user to proceed in two stages, first for "important" things and later for "details". There is no criterion for deciding what's important. But such methods really expect "important" things to be entities for which there will be corresponding records. We don't have any such expectations. Our design scheme is homogeneous. The user can deal with everything at once if he wants to. Or he can iterate in as many stages as he likes; if appropriate, he can view these as increasing degrees of detail.

Some methods feel that "keys" are the key (pun intended) to entities. If something has a unique identifier, then it is an entity and should have a corresponding record. But that doesn't always hold in practice. Even if cities had unique names, we wouldn't necessarily want a distinct record type for cities just because we happened to be recording people's birthplaces. Our approach defers treatment of unique identifiers until later, after we've determined when they are needed.

We provide a rational and objective basis for choosing the record types.

We use a simplified form of entity-relationship model. It conforms more closely to the way the user perceives his enterprise, rather than to the way the data is structured. He can therefore express his requirements more easily, without being forced to make premature data design decisions.

We go beyond other methodologies in suggesting alternative designs for the application.

The approach is synthetic [B]. While the value of normal forms is recognized as a sometimes desirable design objective, we do not adopt the process of normalization, in which badly assembled records are decomposed (normalized) into good ones. We construct normalized records directly.

4.2 Indirect Benefits

4.2.1 Conceptual Models

Facts are specified in terms of entities and relationships, which themselves ought to be documented. There are various approaches to developing the necessary collection of entities and relationships:

Whatever the approach, the resulting collection of entities and relationships would constitute at least the nucleus of a conceptual model [ANSI] (business data model, enterprise description) for some portion of the business.

4.2.2 Data Documentation

Keeping a record of how the data elements express facts, in terms of the entities and relationships of the business, would constitute excellent documentation of what data is being maintained and what it means.

4.2.3 Understanding the Application

Data elements inherently do represent facts, in terms of entities and relationships, in the form we have developed. If you have trouble organizing your requirements in the form of such facts, then either the application is not suitable for implementation in a database, or you do not adequately understand the application.

A necessary part of the analysis leading to the design of a database consists of identifying the entity types involved, the relationships among them and, separately, the representations to be used. Trying to organize the requirements in that form invariably leads designers, and users, to a clearer understanding of the application.

4.3 Convergence With Other Work

This work is an outgrowth of entity/relationship theory [C] and its semantics [K1], and traces at least back to the "irreducible relations" of [HOT]. It is also an extension of "synthetic" design methodologies [B]. The aims of our work are close to those of Imielinski and Lipski [IL]: "...we believe that the atomic, self-explainable relations, in terms of which all other relations considered are explained, should constitute the input of the design process. The design process should construct more complex database relations in terms of the semantics of input relations."

The work also converges with recent developments in the universal relation branch of relational database theory. Our facts and pseudo-records approximate the objects, associations, basic facts, and base relations of [FMU, MU, MW].

The principal differences between our work and the others:

5 REFERENCES

[ANSI] The ANSI/X3/SPARC DBMS Framework, Report of the Study Group on Data Base Management Systems, (D. Tsichritzis and A. Klug, editors), AFIPS Press, 1977. Also Information Systems 3(4), 1978.

[B] P.A. Bernstein, "Synthesizing Third Normal Form Relations From Functional Dependencies", ACM Transactions on Database Systems 1 (4), Dec. 1976, pp. 277-298.

[C] P.P. Chen, Entity-Relationship Approach to Information Modelling and Analysis, North Holland, 1981.

[D] C.J. Date, "The Outer Join", IBM Technical Report TR03.181, Jan. 1982.

[FMU] R. Fagin, A.O. Mendelzon, and J.D. Ullman, "A Simplified Universal Relation Assumption and its Properties", ACM Transactions on Database Systems 7(3), Sept. 1982, pp. 343-360.

[HOT] P.A.V. Hall, J. Owlett and S.J.P. Todd, "Relations and Entities", in G.M. Nijssen, Modelling in Data Base Management Systems, North Holland, 1976.

[IL] Tomasz Imilienski and Witold Lipski, Jr., "A Systematic Approach to Relational Database Theory", ACM SIGMOD International Conference on Management of Data, June 1982, Orlando, Florida.

[K1] W. Kent, Data and Reality, North Holland, 1978. [excerpts: html]

[K2] W. Kent, "The Entity Join", Proc. Fifth Intl. Conf. on Very Large Data Bases, Oct. 3-5, 1979, Rio de Janeiro, Brazil. [html]

[K3] W. Kent, "Choices in Practical Data Design", Proc. Eighth Intl. Conf. on Very Large Data Bases, Sept. 8-10, 1982, Mexico City, Mexico.

[K4] W. Kent, "A Simple Guide to Five Normal Forms in Relational Database Theory", Communications of the ACM 2(26), Feb. 1983, 120-125. [html]

[K5] W. Kent, "A Catalog of Logical Data Design Options", IBM Technical Report TR03.224, March 1983.

[MU] David Maier and Jeffrey D. Ullman, "Maximal Objects and the Semantics of Universal Relation Databases", ACM Transactions on Database Systems 8(1), March 1983, 1-14.

[MW] David Maier and David S. Warren, "Specifying Connections for a Universal Relation", ACM SIGMOD International Conference on Management of Data, June 1982, Orlando, Florida.