William Kent, "The Entity Join", Proc. Fifth Intl. Conf. on Very Large Data Bases, Oct. 3-5, 1979, Rio de Janeiro, Brazil, pp. 232-238. Copyright VLDB, 1979. Distributed by Morgan Kaufmann Publishers, San Francisco. Reprinted with permission. [10 pp]

The Entity Join

William Kent
October 1979

> ABSTRACT
> 1 INTRODUCTION . . . 2
> 2 THE STRICT DOMAIN RULE . . . 2
> 3 THE DOMAIN RULE FOR ENTITY JOINS . . . 3
> 4 CORRELATION TECHNIQUES . . . 4
> 5 RESOLUTION TECHNIQUES . . . 4
> 6 COMPARISON OF THE JOINS . . . 5
>> 6.1 Multiple Names . . . 5
>> 6.2 Non-Unique Names . . . 8
>> 6.3 Multi-Column Names . . . 8
> 7 OVERLAPPING DOMAINS . . . 9
> 8 OTHER OPERATIONS . . . 11
> 9 CONCLUSION . . . 11
> 10 REFERENCES . . . 12

ABSTRACT

A new operator, the entity join, is proposed as an extension to the relational model. While similar to the traditional join, it is based on common entities rather than matching representations. The entity join avoids a number of problems arising from the loose correspondence between names and entities, it rationalizes the traditional join, and it extends the usability of the relational model.

1 INTRODUCTION

The traditional join (more precisely, the equi-join) of the relational model connects two tuples if specified fields in the tuples contain the same symbol. Hence we will refer to that join as the symbolic join (s-join). The entity join (e-join) proposed here connects two tuples if the specified fields refer to the same entity. The entity join directly realizes the principle on which the symbolic join might appear to rest: things related to the same thing are related to each other.

The entity join is like the symbolic join in that it generates a new relation from two relations (possibly the same) by matching on specified columns. The number of columns in the result is the sum of the number of columns in the operands. Whenever a row of one operand "matches" a row of the other operand in the specified columns, the concatenation of the two rows will appear in the result. The difference between the symbolic and entity joins lies in the concept of "matching": the entity join detects a match if and only if the specified columns in the two operand rows identify the same entity.

The difference between the two joins highlights some substantial differences between name-based and entity-based models. Entities must of necessity be designated by symbols in a database, since the entities themselves cannot physically occur there. Such symbols, acting as names or representations for the entities, provide a faithful simulation of a system of entities if each entity has a single, unique, simple name. (We use the terms "symbol", "name", and "representation" interchangeably.) When these three entity-based naming constraints are satisfied, then name-based and entity-based models are indistinguishable. These naming constraints essentially characterize what is meant by an "entity-based" model - an exact 1:1 correspondence between entities and their representations. Such representations would then serve as surrogates for the entities.

In contrast, the symbols in a name-based model have the more familiar characteristics of ordinary names - they may be in m:n correspondence with the entities they name. A name may refer to many entities, and an entity may have many names. In addition, names may have complex structure, constructed from the names of other entities. Such violations of the entity-based naming constraints introduce important differences in the behavior of name-based models, mirrored in the differences to be described below between the symbolic and entity joins. Certain information becomes unavailable, other information can only be obtained by careful navigation along name-equating paths, and the results of certain operations are in danger of being misinterpreted. Thus a name-based model can become somewhat difficult to use correctly. Furthermore, such a model might not have the desired characteristics for a conceptual model: a faithful rendition of certain characteristics of a set of real entities.

The entity join is proposed as an addition to the relational model. It does not replace the symbolic join. The latter is retained for those cases where name-based operations are in fact desired, and for those cases where entity-based operations are unrealizable. This paper provides a conceptual framework for the entity join, without pursuing implementation considerations in depth.

2 THE STRICT DOMAIN RULE

Domains play an uncertain role in relational theory, particularly in connection with the join operation. Formal definitions of the relational model identify domains as the underlying sets in terms of which relations are defined -- relations being subsets of Cartesian products of those domains. Sometimes composites of such sets are also designated "compound domains". Sometimes domains are defined on a very syntactic level, e.g., the domain of character strings, the domain of integers. Sometimes they are invested with a bit more semantic significance: department names, employee numbers, social security numbers.

Join operations are constrained in varying degrees by domain considerations. Sometimes, as in System R, there is no constraint. In other cases, a join is only permitted if the columns being matched come from "comparable" domains. Most generally the domain concept, if it is applied at all, imposes a more stringent restriction on joins, which we will refer to as the strict domain rule: a join may only be performed if the columns being joined come from the same domain. Our discussion of the symbolic join will presume that rule to be in force.

The motivation behind the strict domain rule is not clear. It seems to be a half-hearted attempt to make the symbolic join behave like an entity join. It is as though "12345" as an employee number mustn't be allowed to match "12345" as a machine serial number, because they refer to different things. But if that were the purpose, then joins should also be prevented when a given name might refer to different things even within the same domain, i.e., in a domain allowing non-unique names. It is still the user's responsibility to know what sort of naming rules are in effect for the domain, and to judge therefrom whether the symbolic join might generate spurious connections, or perhaps even miss some relevant ones (if an entity is referenced by different names).

The present proposal establishes a clean functional separation between two kinds of operations. The symbolic join is purely for symbol matching, and should be subject to no domain restrictions. The entity join is for entity matching, under domain restrictions designed to reflect the real behavior of entity names.

3 THE DOMAIN RULE FOR ENTITY JOINS

Several approaches might be considered.

If everything we dealt with in a database had single, unique, simple names, then we would have no need for domain rules on joins (nor would we have to distinguish two kinds of join). Integers provide one such naming scheme: each entity could be assigned a single unique integer as its identifier. Joins could be performed without any domain restrictions (although such restrictions would still be useful as integrity constraints for tuple insertion and modification). If a pair of fields referred to the same entity, a joined tuple would result, otherwise not. There would be no "error conditions" (i.e., mismatches) to guard against.

This ideal case does not exist, of course, to any significant extent - at least with regard to externally known names. As a next best approach, we might postulate two kinds of tables: relationships and nomenclature. Relationships would be maintained entirely in terms of surrogates (singular, unique, simple names). Where such names are externally known, they would be used in the relationships. Otherwise, surrogates would be internally generated. All correlations between surrogates (internally or externally generated) and other kinds of names would be maintained separately, in the nomenclature tables. At this level, domain restrictions on join operations are still unnecessary for the relationship tables. Users need not necessarily see such tables; by means of views defined as joins of relationship tables and nomenclature tables, they could see familiar tables entirely in terms of externally known names. Of course, the update rules for such views are difficult to define, and the view definitions would have to reflect some of the naming rules.

We will pursue an approach which does not require the internal generation of surrogates. Instead we will identify those domains in which the names do behave like surrogates, and then define the behavior of the entity join over such domains.

We permit the specification of domains as unions of other domains. A reasonable formalism is to allow a domain to be declared a sub-domain of another. Every domain is, by the usual definitions, a sub-domain of itself.

We allow compound domains, i.e., a domain may be (a subset of) the Cartesian product of other domains (this allows a domain to span multiple columns of a relation).

We will say that a domain is resolvable if either (1) all names in that domain are unique, or (2) mechanisms are supplied for resolving references to non-unique names. We will say that a domain is correlatable if either (1) all names in that domain are singular, or (2) mechanisms are supplied for equating the multiple names of an entity. We will say that a domain is well-named if it is resolvable and correlatable.

(If n1 and n2 are names for entities e1 and e2, respectively, then the names are unique if n1=n2 implies e1=e2, and the names are singular if e1=e2 implies n1=n2.)

These properties are inherited by sub-domains: if a domain is resolvable or correlatable, then so are its sub-domains. The converse doesn't hold: the union of resolvable or correlatable domains might lose those properties.

We observe in passing that the correlatable domain can include the case where different domains give different kinds of identifiers to some common entities (e.g., employee numbers and social security numbers). If a mechanism is provided for equating the two kinds of names (for those entities having both kinds), then the union of those domains constitutes a correlatable domain.

The domain rule for entity joins: An entity join may be performed over domains which are sub-domains of a well-named domain. That is: if columns X1 of relation R1 come from domain D1, and similarly for X2, R2, and D2, then R1 and R2 may be e-joined on X1 and X2 if D1 and D2 are subsets of a well-named domain.

This allows: matching on the same name only if it refers to the same entity, matching on different names if they refer to the same entity, and matching on the same entity even if different domains are involved.

Realization of the entity join presumes a formalism for declaring domains, sub-domain relationships, and identification of well-named domains. Techniques for achieving correlatability and resolvability are not part of the formal semantics of the entity join. However, some possibilities are explored in subsequent sections.

4 CORRELATION TECHNIQUES

For non-singular names, various kinds of equivalence specifications could be accommodated. Sometimes they are provided as simple name pairs, in the form of a two-column table. Sometimes they ought to be unnormalized lists of equivalent names, e.g., for social security numbers. Some equivalences are algorithmic, e.g., for certain variants of people's names (replacing first names by initials, etc.), and also for numeric conversions (we would like to allow entity joins on common quantities expressed in different units or data types). For interactive systems, the correlation might be established dynamically, perhaps by dialog when unfamiliar names are included in newly inserted tuples.

5 RESOLUTION TECHNIQUES

Mechanisms for resolving non-uniqueness will probably be the hardest to devise and the least likely to be implemented. There is, of course, little the entity join can do to deal with truly ambiguous (non-unique) names, if users provide no basis for resolving the ambiguities. In that case, the best we can do is to prevent the entity join from executing over ambiguous domains, so that the user won't be misled by apparent matches. If he is really interested in symbolic matches, he can use the symbolic join over such domains.

We define the entity join as being able to deal with non-unique names because that is a requirement for achieving entity-based operations. And the definition in that form will accommodate techniques which may be devised in the future for dealing with ambiguous names.

Such techniques are possible, and we will sketch a few in order to bolster that faith and to stimulate further work. One technique might be called "implied compound keys". It works in tables where names in a single column are non-unique, but become unique when compounded with other columns. For example, department names might only be unique within division, hence become unique when compounded with division name. The table definition could indicate that an entity join on a department column is to implicitly involve a specified division column as well, as the compound key. (The department and division columns must together be covered by a well-named domain.) Thus, when performing a join on the department columns of two tables so defined, the symbolic join would match tuples containing departments having the same name, while the entity join would match tuples referring to the same department. The table definition would cause the entity join to automatically match the divisions as well. One advantage of this technique is that it captures the user's intent without making him dependent on the mechanism of name resolution. For example, if department names were to be made unique at some later time (without dependence on division names for qualification), he would not have to rewrite his applications which contained entity joins on the department columns.

Another technique might be called "secret surrogates". The system could maintain internally-generated surrogates along with the non-unique names in the table, perhaps invisible to the user. Entity joins would be performed on the basis of such surrogates. The question arises, of course, as to how the system knows which surrogate (an old one or a new one) to associate with the non-unique names in a newly input or modified tuple. With an interactive interface, the system might presume that unfamiliar names signify new entities, while the occurrence of previously used names stimulates a dialog to determine which entity is intended.

We might have "environmental qualification", whereby something in the environment of the update can implicitly serve as a qualifier to resolve the ambiguity. E.g., a user may only be allowed to supply information about his own programs, so that program names in his updates are automatically qualified by his user-id. Possible implicit clues include: user ID, program ID, terminal ID, view name, geographic location, time of day, something in the user profile, the access path used to arrive at the modified record or the insert point, and so on. (Interactive dialog is, in effect, a form of environmental qualification.) Such clues might be recorded automatically in an adjacent column for use in a compound key, or they might guide the system in the choice of secret surrogates.

6 COMPARISON OF THE JOINS

We demonstrate here the consequences of violating each of the naming constraints, and describe the different behaviors of the symbolic and entity joins in each case.

6.1 Multiple Names

The symbolic join detects a match only if precisely the same symbol occurs in the fields being compared. Unfortunately, most things have multiple names, sometimes of the same type and sometimes of different types. People can have multiple social security numbers, and they may also be designated by employee numbers, military service numbers, "people" names, and other things. Even numeric quantities have multiple representations, depending on number system, units, etc. In order to match, symbols have to be the same in a very strict sense. Even minor spelling and typographical variations foil the match. Tuples containing any two of the following will fail to be joined: "W. Kent", "W.Kent", "W. KENT", "Kent, W", "Kent, W." (and those are obviously just a very few of the possible variants of my name).

The problem here with the symbolic join is an error of omission: tuples referring to the same entity will fail to be joined. Existing relationships will fail to be detected. The entity join, on the other hand, will detect all references to the same entity, regardless of how named, and generate all the implied relationships.

There are two mechanisms blocking the symbolic join when entities have multiple names. When the names are of the same type (e.g., multiple social security numbers), then the symbolic join will simply fail to detect a match. When the names are of different types (e.g., a social security number and an employee number), then the operation is often rendered illegal by the strict domain rule mentioned earlier.

There are ways of dealing with multiply-named entities using the symbolic join. Additional tables can be provided to specify name equivalences, and chains of symbolic joins can correlate tables in which unlike names can occur. For example, suppose we had a table of employees (EMPS, keyed on employee numbers EN) and a table of military personnel (MILS, keyed on military service numbers MN), and we needed to correlate information on people occurring in both tables. If we had a table EN-MN which contained the corresponding identifier pairs (employee number with military service number), we could perform the chain of symbolic joins

EMPS s-join EN-MN s-join MILS

to correlate employees with military personnel. (The join operator would specify the columns on which to base each match, but these are omitted here to simplify the illustrations.)

This approach becomes cumbersome in several respects. The requisite table structures can become rather elaborate, the tables themselves need to be maintained and become error-prone, users have to know the structures, users need to devise the appropriate chains of joins for correlating various tables, and the performance implications of such multiple joins can become quite significant. And, finally, tables are not the best representation for all forms of multiple representation.

For one thing, the name equivalences can sometimes be embedded in other tables, and sometimes not. If all military personnel are employees, then military service numbers can simply be included in the employee table - provided we are allowed to have null entries for non-military employees. If all employees are in the military, then employee numbers could be included in the military service records - with similar proviso. If both conditions hold (all employees are in the military and vice versa) then the correspondence might arbitrarily be included in either the employee or the military table - or in both (without, incidentally, violating any normalization rules) - or in neither. If neither condition holds, then the name correspondences should be kept in a third, separate EN-MN table, with attendant maintenance problems, e.g., synchronizations of insertions and deletions. Users must know the configuration in order to correctly specify the symbolic join chains between the employee and military tables.

Secondly, when more than two types of names are applicable to entities, questions of redundancy must be addressed. If, for example, entities might be designated by employee number (EN), military service number (MN), or social security number (SN), then we might elect to maintain tables for each possible pair: EN-MN, EN-SN, and MN-SN. Then any two tables involving such identifiers can be joined with at most one intermediate name correspondence table. But these three tables may be redundant - if each identifier type covers the same set of entities - with the usual attendant penalties of space, maintenance effort, and potential inconsistency. Any one of them could be inferred from the other two, by a join. Thus, for example, the table EN-MN could be omitted; users wishing to correlate employees with military personnel could execute the chain of symbolic joins

EMPS s-join EN-SN s-join MN-SN s-join MILS.

Again, users need to know which sets of name tables are being maintained in order to correctly specify the join chains. And if an administrator decides it is more efficient (perhaps based on usage frequency) to install the EN-MN table and drop one of the others, then users must rewrite all such join chains.

A third point is that the internal structure of some name correspondence tables can get surprisingly complex. This arises when entities might have multiple names of the same type, as with social security numbers. In order to correctly join two tables in which people might be referenced by different SN's, it is necessary for the SN-SN name correspondence table to include every permutation of equivalent SN-SN pairs -- even the pairing of each SN with itself. If one person holds SN's 100 and 200, and another person only holds SN 300, the SN-SN table must still include all the following pairs:

100	100
100	200
200	100
200	200
300	300

In general, for each person holding n SN's, the table must include n entries (even for n=1), in order for

TABLE1 s-join SN-SN s-join TABLE2

to match all people occurring in both TABLE1 and TABLE2 under possibly different SN's.

We should note now that for the present example, where people might be identified by employee numbers, military service numbers, or multiple social security numbers, we are dealing with potentially four name correspondence tables:

EN-MN
EN-SN
MN-SN
SN-SN

The analysis of conditions under which any of these tables might be omitted is not simple, involving considerations of (1) the scope of applicability of the identifiers, and (2) the storage, performance, and maintenance tradeoffs between maintaining fewer tables and executing fewer joins. Furthermore, one has to decide whether the EN-SN and MN-SN tables should contain one entry per person (i.e., with just one selected SN) or one entry per SN. Such a decision again affects the choice of join chains in certain cases.

Also, as mentioned earlier, we would rather have algorithmic procedures than tables for certain name equivalences (people's names, numeric conversions).

The entity join avoids such complexity. While an implementation of the entity join could involve such tables and/or procedures, they will be hidden from the user. He need not be aware of their existence or structure, he need not be involved in their maintenance, and he need not specify such intermediate chains of joins. To correlate employees with military personnel, it is sufficient to execute

EMPS e-join MILS

(specifying that the EN field of EMPS is to be correlated with the MN field of MILS).

6.2 Non-Unique Names

Non-unique names really bring out the name-based nature of the symbolic join. Different things having the same name will be matched with each other. It is quite important for users to be aware of this, and to not misinterpret what such joins might signify. Employee records can be joined with each other on the basis of matching spouse fields, but the size of the resulting table should not be interpreted as evidence of wholesale bigamy. The paired employees aren't (necessarily) married to the same person; their spouses might simply happen to have the same names.

The results of such a join are clearly ambiguous with respect to the entities involved. It is entirely the user's responsibility to know when the same name in different fields refers to the same or different entities. Given ten employees married to "John", we are not sure just how many such spouses are involved.

The ambiguity is resolved by choosing the correct join operator. The symbolic join will connect all employees whose spouses have the same name. The entity join (if spouses are in a well-named domain) will connect all employees having the same spouse.

6.3 Multi-Column Names

The symbolic join may be performed over sets of columns, i.e., multiple columns in one table may be compared with multiple columns in another in order to detect the matches. This is largely permitted in support of compound names, i.e., multi-part or multi-column names which include qualifiers to distinguish among non-unique simple names. Spouse names in an earlier example were non-unique, but a spouse might be uniquely identified by the combination of an employee number together with the corresponding spouse first name. This pair of columns might occur in several tables in which spouses were involved, and a symbolic join to correlate information relating to spouses would match on such pairs of columns.

The problem is that the multi-column symbolic join can be abused, by matching arbitrary sets of columns which were never intended to serve as composite identifiers for single entities. This creates another situation in which the user must carefully understand just what such relationships do or don't signify. For example, a table containing spouse identifiers might be joined with any other table containing employee numbers and (anybody's) first names - perhaps a table including a salesman's employee number and a customer's name. Such a join must not be interpreted as correlating information about "same people". The joined records might include birthdates, occupations, sexes, etc. about either the spouse or the customer, and it's up to the user to keep track of which is which. The joined record represents a purely symbolic connection, perhaps in this case revealing nothing more than the fascinating observation that some salesmen have customers and spouses with the same first name.

In order to legitimize an entity join over multiple columns, it is necessary that the table definition specify a well-named compound domain over those columns. Multi-column entity joins would only be permitted over sets of columns so defined.

Multiply-named entities might have various kinds of compound names. The same entity might have single-column names, two-column names, three-column names, and so on. Hence the entity join is not restricted to having the same number of columns specified for each operand. The entity join could match, say, three-column names in one table with single-column names in another, so long as the corresponding domains were sub-domains of a well-named domain, i.e., there existed some mechanism for determining the name correspondence.

7 OVERLAPPING DOMAINS

The strict domain rule inhibits some useful operations from being performed, as when an entity might belong to more than one type. For example, an installation might be maintaining two tables: one with a key from a domain of automobile identifiers, recording certain information about automobiles; and another with a key from a domain of vehicle identifiers, recording other facts about all sorts of vehicles - including automobiles. The strict domain restriction prevents the correlation of the two kinds of facts about automobiles. The symbolic join is not permitted to match a column from the domain of automobile identifiers with a column from the domain of vehicle identifiers. The rule interferes with useful correlations whenever entities might belong to multiple domains (i.e., when domains might overlap). Thus desserts can't be compared with foods, employees can't be matched with stockholders, U.S. presidential elections can't be matched with elections in general, and so on.

We can interpret the effects of the strict domain rule in terms of virtual names, consisting of an actual name qualified implicitly by the domain name, e.g., "EMPLOYEE.12345". By this mechanism we can perceive the strict domain rule as an attempt to convert non-unique names into unique names. That is, although 12345 may not be unique, the virtual names EMPLOYEE.12345, MACHINE.12345, and ORDER.12345 could be globally unique. In effect, the strict domain rule says that columns from different domains will always contain different virtual names, by definition, and hence there is no point in even permitting the comparison.

The failure to deal with overlapping domains can now also be explained as a multiple naming problem. An entity will acquire a different virtual name from each of its domains (e.g., AUTOMOBILE.12345 and VEHICLE.12345), which is why the symbolic join can't recognize the match. It can never detect that different names might refer to the same entity.

It's also worth noting that such qualification by domain name isn't always left implicit. It sometimes has to be made explicit, when applications are forced to deal with domains which are the unions of other domains. The sub-domain names are then explicitly maintained in a separate column, and the combination of name plus sub-domain is relied on for uniqueness.

Let us consider the following example at some length, since it illustrates a number of points. We have an installation maintaining information in three separate tables about programs, files, and terminals, with three corresponding domains for each of the keys. Names are unique within any one domain, but not across domains. Thus there can't be two programs named PAYROLL, but there can be a program and a file with that name.

Suppose now we wish to add an authorization table to indicate which users may use which programs, files, and terminals (collectively referred to as "resources"). In concept this would be a two-column table, containing user names and resource names. But since names are no longer unique in the larger domain of resources, a third column identifying the resource type (subdomain) must be included:

AUTH-TABLE
USER	RESOURCE	TYPE
Smith	payroll	program
Smith	payroll	file
...	...	...

Resources are now identified by the compound key of resource + type. What in the other tables is implied as a virtual name (e.g., PROGRAM.PAYROLL) is now formalized as an explicit compound name.

This example demonstrates that names are often effectively qualified by their domains. Most often it is implicit (as a virtual name) in the application of the strict domain rule on joins. But applications do have to deal with it explicitly, in the form of compound names, whenever they are forced to deal with unions of domains.

This example also affords us another opportunity to illustrate some difficulties with the symbolic join. How can we correlate authorization information in the AUTH-TABLE with other information, say, about programs in the PGM-TABLE? If we simply join the two by matching the key field of PGM-TABLE with the RESOURCE field of AUTH-TABLE, we will get misleading results: file and terminal names in AUTH-TABLE would also be matched with program names. The symbolic join can't compare two columns with one, hence the TYPE field of AUTH-TABLE could not participate in the join to solve this problem. A resourceful programmer might first construct an intermediate table by the relational select operation, taking only those rows of AUTH-TABLE where TYPE = PROGRAM, and joining the result with PGM-TABLE to get the desired correlation. This is fine if the strict domain rule is not in effect. But if it is, then the whole exercise is futile. Under that rule, a symbolic join cannot compare a column from the PROGRAM domain with a column from the RESOURCE domain, and hence the desired join can't be performed at all. Thus, depending on which interpretation (implementation) of the relational model was involved, it would be either difficult or impossible to correlate authorization information with program information using the symbolic join.

It is somewhat instructive to examine the melange of names (explicit, virtual, and compound) that come to bear in this scenario. We use the notation "domain.name" to signify a virtual name including the implicit domain qualifier, and "name1+name2" to signify an explicit compound name taken from two columns.

In PGM-TABLE, the payroll program has the names:

PAYROLL (a unique explicit name in the PROGRAM domain).
PROGRAM.PAYROLL (a virtual name).

In AUTH-TABLE, that same program has the names:

PAYROLL (an ambiguous explicit name in the RESOURCE domain).
RESOURCE.PAYROLL (a virtual name).
PROGRAM+PAYROLL (a compound name).
RESOURCE.PAYROLL+TYPE.PROGRAM (a compound virtual name).

It's easy to see how naming dilemmas can occur in this context.

The entity join avoids such entanglements. So long as PROGRAM and RESOURCE are sub-domains of a well-named domain, an entity join could be performed directly on the two tables to achieve the desired correlation.

8 OTHER OPERATIONS

The distinction between entities and representations suggests the need for other operations as well. While the relational model (at least most versions of it) forbids duplication of rows in a table, this restriction only applies to representations. Several rows in a valid table may refer to the same array of entities if different synonyms are used to name the entities. In a table identifying people by social security numbers, there can be a distinct entry for each social security number held by a given person. It would be interesting to try to design an operator which detects such duplication and perhaps eliminates it; there would have to be some criterion for deciding which synonymous name to retain.

With surrogates, we could contemplate the prospect of nameless entities, at least some having no externally known names of their own. They could still be joined upon for the basis of correlations of other things; access could pass through them after identification of related things. Functional dependences defined on them might give a closer approximation to semantic realities. (For example, consider elections, which in popular examples exist as phantom entities behind tables containing identification only of the dates and winners of those elections.)

The entity join also suggests a natural way to deal with null values. In systems allowing null values, the question arises whether null-valued fields should be considered as matching under a join. We offer a simple answer. Null should be interpreted as signifying the absence of any entity, hence there is no common entity involved, hence the entity join should not match the fields. But null occurring in both fields can be interpreted as being the same symbol, or representation, and hence the symbolic join should match the two fields.

9 CONCLUSION

The relational model, with its symbolic join, works very well for correlating information about entities which are entirely in one domain and which have single, unique, simple names. But for all other situations the user must navigate carefully in dealing with synonyms, ambiguity, and structured names. The entity join takes responsibility for managing such phenomena, and reflects their behavior in the model. The two joins together provide a model with balanced, rational capabilities for both symbol matching and entity matching.

10 REFERENCES

[1] M.M. Astrahan et al., "System R: Relational Approach to Database Management", ACM Transactions on Database Systems 1 (2), June 1976, pp. 97-137.

[2] E.F. Codd, "A Relational Model of Data for Large Shared Data Banks", Comm. ACM 13 (6), June 1970, pp. 377-387.

[3] E.F. Codd, "Relational Completeness of Data Base Sublanguages", in R. Rustin (ed.), Data Base Systems (Courant Computer Science Symposia 6), Prentice-Hall, 1972.

[4] E.F. Codd, "Extending the Data Base Relational Model to Capture More Meaning", ACM SIGMOD International Conference on Management of Data, May 31-June 1, 1979, Boston, Mass.

[5] C.J. Date, An Introduction to Database Systems (second edition), Addison-Wesley, 1977.

[6] P.A.V. Hall, J. Owlett and S.J.P. Todd, "Relations and Entities", in G.M. Nijssen, Modelling in Data Base Management Systems, North Holland, 1976.