The Semantics of Object Identity

William Kent
Database Technology Department
Hewlett-Packard Laboratories
Palo Alto, California

June 1992



> 1 THE SEMANTICS OF IDENTITY . . . 2
> 2 OBJECT IDENTIFIERS . . . 2
> 3 WHAT THERE IS TO IDENTIFY . . . 3
>> 3.1 The Single-interface Premise . . . 4
>> 3.2 Entity And Proxy Objects . . . 4
>> 3.3 Meta-objects . . . 5
>> 3.4 How Objects Become Known . . . 5
> 4 TYPE MEMBERSHIP MODES . . . 6
>> 4.1 Fixed . . . 6
>> 4.2 Assertable . . . 6
>> 4.3 Derived . . . 6
>> 4.4 Covered Or Partitioned . . . 7
>> 4.5 Producer . . . 7
>> 4.6 Supertype . . . 8
> 5 REFERENCES . . . 8


1 THE SEMANTICS OF IDENTITY

If we do something to an object, and then do something else to it, how do we know we did those things to the same object? That depends first of all on what we mean by "the same object". It certainly doesn't mean similar objects, though that's what identical twins are. And the question isn't whether two objects are the same; if we know they are two objects, we know they aren't one and the same object.

The question comes down to a matter of reference, a term we will use for now without further definition. If x and y refer to objects, how do we know they refer to the same object?

Let's write x==y when x is identical to y, i.e., they refer to the same object. How x==y is determined could be considered a matter of implementation, but == can't be a totally arbitrary predicate. There are some behavioral semantics we expect of identity. First, if they are referring to the same object, then any operator should behave the same for both references: f(x) and f(y) must have the same result (including returned values, exceptions, and side effects) for any operator f. Secondly, they refer to one object, which should be reflected when duplicates are eliminated, as in the construction of sets. To put it elegantly [ref ICL]:

(x==y) == ({x,y} == {x}).

Note here that x==y is defined for literal values as well, in this case Boolean truth values.

Other underlying assumptions need to be reaffirmed, so we don't inadvertently violate them when designing integration paradigms. Identity is consistent; objects which are distinct as instances of subtypes can't be the same object as an instance of a supertype. And we don't adopt ambivalent viewpoints of the application domain. If the sales clerk is also the night watchman, some observers might insist that these are still two employees. We can't model the two employee objects as being the same as one person object; we need to be clear and consistent regarding the number of objects we are talking about. It doesn't matter whether the person has the same or different employee numbers, either with the same or different employers. If we model all the concepts by saying that there are two employee objects and one person object, with some sort of correspondence among them, then Employee can't be a subtype of Person -- that would mean the employee objects are person objects. While there may be some other correspondence between these types, they are not related as subtype and supertype. A supertype includes the union of its subtypes, in the sense of a set-theoretical union of the same members that exist in the subtypes. If we do want Employee to be a subtype of Person, then we must model that person as one object, even it he has two jobs -- and even it he has two employee numbers!

2 OBJECT IDENTIFIERS

Although identify can be abstractly modeled in terms of an identify predicate, most object systems support identity simply via comparison of certain values, designated object identifiers (oid's). A reference can then be defined as any expression which evaluates to an oid.

An oid must always be unique within the scope of the system managing it, i.e., two objects cannot be represented by the same oid within the system. In the strictest formulation, an oid is immutable (an object is represented by the same oid throughout its lifetime) and singular (two oid's cannot represent the same object). Two references are to the same object if and only if their oid values are equal.

Immutability can be relaxed so long as an object is represented everywhere in a given system by the same oid at any point in time. Changing an object's oid requires system-wide oid replacement: replacing all occurrences of the old oid in the system with the new one. This may be feasible if it is known that little or no data is stored using the oid. System-wide oid replacement also allows identity merging, whereby two objects become one. The system will, however, have no recollection that these were once distinct objects.

Singularity could be relaxed to allow an object to be represented by several oid's (synonyms), provided that oid equivalence is strictly observed, as described above. That is, all functions must behave the same when applied to any of the equivalent oid's, and all but one must be eliminated from any context which excludes duplicates. Object equality can no longer be based on simple oid value matching, but must instead involve an equivalence predicate. This typically means that a table of synonymous oid's has to be maintained somewhere.

Literal values can be thought of as objects whose oid's contain recognizable representations of themselves. Non-literal objects typically have arbitrary oid's, assigned when the objects are created. Literals and non-literals need some consistent form of identification, since both can occur as arguments and results of functions.

Literal values in different data types often have the same representations, and may also coincide with arbitrary system-generated identifiers. An object identifier thus needs to include some sort of qualifier to differentiate the various data types and the system-generated identifiers. Hence an object identifier is logically decomposable into prefix and suffix components; no specific format or implementation is implied. The prefixes don't necessarily even have to be physically encoded in the identifier, provided they can be reliably deduced from the context in which they are used.

For literals, the prefix differentiates the data types, while the suffix contains the actual representation. For non-literals, the prefix primarily establishes that these are not literals; their prefixes could be further refined to differentiate oid's generated in different databases, and also to label property-based identifiers].

Fixed-length implementations of oid's create some difficulties. It limits the set of literals which can actually be represented. It complicates the identification of aggregates such as sets, whose oid's should logically include the oid's of their members. It can make it difficult to use oid's generated in other object systems. And it can limit the values which can be used for property-based identifiers.

Object identifiers can be system-assigned or algorithmic. A system-assigned identifier is an arbitrary suffix value assigned to an object at the time it is created; uniqueness is typically guaranteed by some centralized agency in the system. The difficulty with system-assigned identifiers in a multi-database system is that there is no such central agency guaranteeing system-wide uniqueness.

Algorithmic identifiers are derivable in terms of some known characteristics of the object. They do not require a creation event to trigger assignment of an identifier, although algorithmic identifiers could also be used when an object is created. Literals essentially have algorithmic identifiers, whereby the suffixes are defined by the representations of the literal values. Most other algorithmic identifiers are property-based.

3 WHAT THERE IS TO IDENTIFY

We need to be clear on what objects there are, so that we don't get them mixed up with each other.

3.1 The Single-interface Premise

A multi-database system is presented to its users via a single system interface, managed by a home database management system (dbms), through which all information made available by the underlying systems (external data sources) is provided in a uniform way. In principle, ignoring cardinality concerns, it is conceptually possible to enumerate all the objects knated, and may remain known to the home system long after they are externally destroyed.

3.2 Entity And Proxy Objects

Multi-database systems force us to recognize a difference between what we consider to be the same object and what the system is treating as the same object. [ref brinmod]

Databases contain proxy objects to represent the entity objects we think about. Single databases are carefully designed to maintain a one-to-one correspondence between these, so we really don't have to think much about the difference. In a given database, there's only one thing (object, tuple, record, or whatever) serving as proxy for a given employee; it can only have one birthdate and one salary, and no two of them can have the same employee number or social security number. Thus the proxy object has the look and feel of the entity object.

Not so in multiple databases. An employee might have different birthdates or salaries in different databases, and a given social security number might be paired with different employee numbers in different databases. The appropriate constraints simply aren't enforced across multiple databases. Anomalies arise if we try to insist that these are all the same object. While we may think of them as the same entity, the system is treating them as distinct proxy objects.

The administrator's job is to create a coherent view for the end user which again contains only one proxy object for each entity object. This proxy may be a new one, distinct from the others, which don't even appear in the end user's view. We need to preserve the identities of all these proxy objects, and relate them appropriately to each other.

Things like employees - and people, ships, cars, stars, departments, companies - are naturally discrete, giving rise to consistent populations in different databases. That is, though different databases might contain different instances, possibly overlapping, they all seem to contain subsets of the same coherent set.

A more extreme problem arises when we don't exactly have even the same sorts of populations in the different databases, giving rise to a form of domain mismatch. Some things are relatively amorphous, being arbitrarily partitioned in different ways in different databases. The jobs in one database might include secretaries and file clerks, while another database has administrative assistants, typists, and receptionists. Colors in one database might include red, pink, white, and blue, while another might have coral, carmine, scarlet, blue, and aqua. The same might happen with chemical compounds, medicines, illnesses, skills, school subjects, courses, news categories, and many other things. Populations of such things can be inconsistent across databases, not being subsets of one coherent set. Correspondences among such instances in different databases are not simple or obvious.

The problem here is how to show a consistent set of proxy objects to an end user, and how these and all the underlying proxies should be identified and related to each other. In a typical solution, the administrator creates an arbitrary set of instances for the end-user's view, and then define mappings to these from the underlying objects.

3.3 Meta-objects

Hypothesis: the type-group mechanisms can be extended to deal with schema mismatch, if the unifying and/or underlying types are allowed to be sets of types or functions rather then sets of ordinary objects. These types would then be subtypes of Type or Function.

Consider the jobs example. If Engineer, Programmer, etc. are to be treated as types at the unifying level (or in one of the external data sources), then we want T0 (or Ti) to be a set of such types. This set might be called JobTypes. It's tricky to keep the notions straight; type graphs should be carefully drawn with different kinds of arcs for subtypes and instances. JobTypes is a subtype of Type; their instances are types. Engineer is an instance of both JobTypes and Type; it is not a subtype of these. Sam might be an instance of Engineer; that doesn't make him an instance of JobTypes or Type.

Things that get tricky:

Much interesting investigation left here.

3.4 How Objects Become Known

The common paradigm for generating object identifiers is to have the system assign one at the time an object is created. This implies that the system knows when an object is created - which isn't always the case. Literals, for example, are never created; there are no creation events at which the system assigns identifiers to individual literals. One could hypothesize a mythical genesis at which all literals were created, but such mythical creation events serve little practical purpose. We thus have two modes by which objects become known to the system: they either have an eternal existence, or they are created by explicit creation requests to the system. Eternal objects have algorithmic identifiers. The identity of a created object is inherently tied up with its creation event: each creation event creates a distinct object, distinct from any other object. The system should be prepared to provide an identifier unique to the particular creation event, though under certain conditions such an identifier could be based on other information known about the object at the time of its creation.

A third mode of existence which has received little attention in single systems becomes important in multi-database systems. A derived object exists only while some associated rule is satisfied. Such objects could exist in discontinuous time periods as the rule periodically becomes true or false. Hence a distinct creation event cannot be associated with a derived object, and its identifier must necessarily be algorithmic, based on information related to the existence rule.

Constructed aggregates are one example. If there is a set constructor {}, then the set {x,y} exists exactly whenever the objects x and y exist. A mention of such a set does not constitute creation; like literals, such objects may be mentioned many times without any definite sense that one such mention is a creation event. In theory, since the identity of such a set is determined by its in

The concepts of scripted and assertable functions are useful for characterizing unifying mappings, and also for describing membership modes for underlying and unifying types.

A scripted function has some specified paradigm for computing its results. In OSQL, this includes derived, foreign (external) and built-in (system) functions. A non-scripted function necessarily has to be implemented as stored data. (Scripted functions could also be implemented with stored data, in the sense of being pre-computed or cached.)

An assertable (updatable) function is allowed to occur on the left of an assignment, e.g., f(x):=y. The semantic intent is that subsequent invocation of f(x) should return y. Non-scripted functions necessarily have to be assertable; there is no other way to define their results. Scripted functions can be assertable if either (a) the inverse paradigm can be deduced by the system, or (b) an update entry point is specified as part of the function definition. Such an entry point should specify whatever is required so that subsequent invocation of f(x) will return y.

For bag-valued functions it is convenient to define f(x)+=y and f(x)=y as assigning to f(x) a bag containing one more or one less occurrence of y than the prior value of f(x).

An initializable function f (proposed for OSQL) is a special case of an assertable function such that a value for f(x) can be asserted at the time x becomes an instance of the argument type of f but not updated thereafter. Constants might then be thought of as preset initializable functions.

A total function f defined on an argument type T must have a non-null value for each instance of T. A value must be provided for f(x) whenever x becomes an instance of T, though it may be updated later, but not to a null value.

4 TYPE MEMBERSHIP MODES

Types can be classified according to the means by which their membership is established. This can be described in terms of the Instances function, which enumerates the instances of a type. It is of no concern here that the enumeration might be infinite.

4.1 Fixed

A fixed type T always has the same instances; Instances(T) is constant. Literal types are a typical example. A proposed extension to OSQL would support initialized types, whose instances can be enumerated or created when the type is created, but not thereafter. Instances(T) would then be an initializable function. Examples include the days of the week, months of the year, states of the US, planets of the solar system, bones of the body, the chemical elements.

4.2 Assertable

Membership in an assertable type T is established by assertion, i.e., Instances(T) is assertable. Objects can be added to or removed from assertable types by explicit request. Adding or removing an object x for an assertable type T is equivalent to Instances(T)+=x or Instances(T)=x. Objects can only be created explicitly as instances of assertable or initialized types, which is equivalent to adding the object to the type immediately after creation of the object.

4.3 Derived

A derived type T (another proposed extension to OSQL) has its membership defined by a rule which selects from pre-existing objects. It is defined with a set-valued expression (e.g., a query) S with the intent that each siS is an instance of T. In effect, Instances(T) is a scripted function having S as its script (since the result must be a set, defining queries always implicitly include the Distinct keyword). To the extent that it can be determined that all instances of S are of type T', T is a subtype of T'.

Example:

Type T1 consists of students under the age of 18. Type T2 consists of the social security numbers actually held by students. The instances of T2 are the integer social security numbers themselves. (Such subtypes of literals are not yet supported in OSQL.) A social security number is an instance of T2 only so long as there is a student having that social security number.

The predicate which defines S is a necessary and sufficient membership condition. Hence subtypes and supertypes of T are constrained by the same condition: instances of subtypes of T must satisfy the condition, and anything satisfying the condition must be acceptable in a supertype of T.

4.4 Covered Or Partitioned

A covered type T is a special case of a derived type whose instances are defined to be any instance of a set of covering types T1,..,Tn, which become subtypes of T. In effect, the defining set S is simply the union of the covering types. Since S establishes a necessary and sufficient membership condition, every instance of T must be an instance of a covering type, and T has no immediate instances of its own. A covered type is sometimes also called an abstract type. A partitioned type is a covered type whose covering types are mutually disjoint.

4.5 Producer

The purpose of a producer type T is to define the existence of derived objects. It is very similar to a derived type, except that it generates objects rather than selecting from existing objects. The generated objects are not necessarily new, i.e., materializing the instances does not constitute an object creation event. Instances of the defining set S are not instances of T, but they provide a basis for identifying instances of T. There is a one-to-one correspondence between S and T. An instance ti of T behaves as though its identifier was T.si, i.e., constructed by concatenating si with a distinguishing prefix associated with T. In effect, Instances(T) behaves approximately like a scripted function whose script is

Instances(T) ::= Select concat(T,x) where x in S.

This paradigm defines both the existence and identity of instances of T. Example:

Create type T3 as produced-by Select ssnum(x) for each Student x;

The produced-by keyword means that the instances of T3 are not the results of the query (as would be the case if this were a derived type). The instances of T3 are not social security numbers; they are not integers. Instead, the instances of T3 are objects whose identifiers behave as though they were constructed by concatenating a social security number with a tag associated with T3 (in the absence of an explicitly specified tag, the default tag could be the oid of T3). Such objects exist only for the social security numbers returned by the query, i.e., an instance of T3 exists exactly for each student having a social security number (we should exclude nulls).

Explicitly specified tags allow several ptypes could also be defined to be assertable, subject to certain constraints. (Fixed types can't be assertable, by definition.) If an object x is asserted to be of a derived or producer type T, then some mechanism must be provided to insure the existence of a corresponding s in the defining set S. Instances(T) becomes, in effect, an assertable scripted function. Either the system must be able to figure out a way to adjust S, or an explicit update entry point must be provided with the definition of T. If T is a covered type and x is not an instance of a covering type, then x would have to be added to one of the covering types. Conversely, asserting the removal of an object x requires an adjustment so that there no longer exists a corresponding s in S.

An assertable producer type T may require more complex management of oid's. An object x already having an oid of a form other than T.si might be added to the type T, but x must nonetheless be recognized as equivalent to T.si for some si in S, even if such an si has to be added to S for this purpose. Either the current oid of x has to be replaced by T.si everywhere in the system, or the two oid's have to be recognized as equivalent. [In other words, it's going to be very tricky to insert objects into an imported type.]

4.6 Supertype

A supertype is any type which has a subtype, and its instances include the instances of its subtypes. Hence the instances of a subtype must satisfy constraints on the supertype.

5 REFERENCES

  1. R. Ahmed, P. DeSmedt, W. Kent, M. Ketabchi, W. Litwin, A. Rafii, M.-C. Shan, "Pegasus: A System for Seamless Integration of Heterogeneous Information Sources", Proc. IEEE COMPCON, March 1991, San Francisco, Calif.
  2. [Blaustein] Barbara Blaustein, "On Object Identifiers", Computer Corporation of America, draft dated 5/17/88 [PUBLISHED?].
  3. Jan Chomicki and Witold Litwin, "Declarativeness of OO Multidatabase Mappings", in preparation.
  4. Umesh Dayal and Hai-Yann Hwang, "View Definition and Generalization for Database Integration in a Multidatabase System", IEEE Trans. on Software Eng., SE-10(6), Nov. 1984.
  5. [De] Linda DeMichiel, "Resolving database incompatibility: an approach to perform relational operations over mismatched domains", IEEE Trans. on Knowledge and Data Eng., 1(4), Dec. 1989, pp. 485-493.
  6. [Fi87] Fishman, D.H. et al, "Iris: An Object-Oriented Database Management System", ACM Transactions on Office Information Systems, Volume 5 Number 1, January 1987. Also in [ZM1].
  7. [Fi89] Dan Fishman, et al, "Overview of the Iris DBMS", Object-Oriented Concepts, Databases, and Applications, Kim and Lochovsky, eds, Addison-Wesley, 1989.
  8. [Heiler] Sandra Heiler, "Object Identity and Identifiers in Federated Systems", Xerox Advanced Information Technology, [PUBLISHED?].
  9. Sandra Heiler and Barbara Blaustein, "Generating and Manipulating Identifiers for Heterogeneous, Distributed Objects", Proc Third Int'l Workshop on Persistent Object Systems, 10-13 Jan 1989, Newcastle, Australia.
  10. William Kent, "A Rigorous Model of Object Reference, Identity, and Existence", Journal of Object-Oriented Programming 4(3) June 1991 pp. 28-38. [html]
  11. William Kent, "Solving Domain Mismatch and Schema Mismatch Problems With an Object-Oriented Database Programming Language", Proc. 17th Intl. Conf. on Very Large Data Bases, Sept. 3-6, 1991, Barcelona, Spain. [pdf]
  12. William Kent, "Variable-Prefix Identifiers", HPL-92-50, Hewlett-Packard Laboratories, April 1992. [html]
  13. William Kent, "The Breakdown of the Information Model in Multi-Database Systems", SIGMOD Record 20(4) Dec. 1991. [html]
  14. William Kent, Mohammad Ketabchi, Ravi Krishnamurthy, Witold Litwin, Ming-Chien Shan, "On Scoping, Naming, and Overloading in Heterogeneous OODBMS", in preparation. [pdf]
  15. [KC] Setrag Khoshafian and George Copeland, "Object Identity", Proc. Conf. on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA), Portland, Oregon, 1986.
  16. Ravi Krishnamurthy, Witold Litwin and William Kent, "Language Features for Interoperability of Databases with Schematic Discrepancies", Proc ACM SIGMOD Int'l Conf on Mgmt of Data, Denver, Colorado, May 29-31 1991.
  17. Erich J. Neuhold, William Kent, and Ming-Chien Shan, "Object Identification in Interoperable Systems", First International Workshop on Interoperability in Multidatabase Systems, Kyoto, Japan, April 1991.
  18. [Shan89] Shan, M-C., "Unified Access in a Heterogeneous Information Environment", IEEE Office Knowledge Engineering, 3(2), 1989.