William Kent
Database Technology Department
Hewlett-Packard Laboratories
Palo Alto, California
1990-1992
> 1 USER-SPECIFIED ATTRIBUTES AS OID'S [May 1990]
> 2 CATEGORY TYPES AND USER-DEFINED OBJECT IDENTIFIERS [1991]
>> 2.1 Motivation
>> 2.2 Semantics
>> 2.3 Category Types
>> 2.4 OID Formats
>> 2.5 Conclusions
>> 2.6 Notes
> 3 THE SEMANTICS OF OBJECT IDENTITY [May 1992]
It is sometimes proposed that object systems allow user-specified attributes such as employee number, part number, etc., to be used in place of system-generated oid's. (I don't remember whether any specific models or implementations support this.)
Advantages:
Disadvantages:
Discussion:
The fact that a user attribute is serving as an oid is a matter of implementation, and should not alter the semantics of Iris functions. The specification that a user attribute is to be so used should be declared separately from function definition, just as we now do for clustering and index creation. Thus we might have a function created as
create function EmpNo(Employee)->Integer key;
(or it might be created in-line with the Employee type definition). Separately there would be a specification like
oid EmpNo;
implying that EmpNo is to be used as the oid for Employees. This specification should occur somewhere near the clustering and index specifications.
This implies a new form of function behavior specification as well. Instead of being stored, derived, or foreign, the EmpNo function is implemented by extracting data from an oid. If :bob contains the oid of employee Bob, then EmpNo(:bob) is still meaningful. It just happens to be executable by extraction from the oid, without needing any table lookup.
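A minimal sketch of a function evaluated by extraction rather than by table lookup, written in Python rather than OSQL. The `Oid` record, its `source` and `key` fields, and the `emp_no` helper are hypothetical names chosen for illustration; nothing here reflects how Iris actually represents oid's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Oid:
    source: str  # hypothetical tag naming the attribute embedded in the oid
    key: int     # the embedded user attribute value (here, an employee number)


def emp_no(oid: Oid) -> int:
    """Evaluate EmpNo by reading the value embedded in the oid; no table is consulted."""
    if oid.source != "EmpNo":
        raise TypeError("this oid does not embed an employee number")
    return oid.key


bob = Oid(source="EmpNo", key=1234)
print(emp_no(bob))
```

The caller still writes an ordinary function application; only the evaluation strategy differs from a stored or derived function.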
Several constraints are imposed on the EmpNo function:
More significant constraints are imposed on the Employee type:
Note, however, that two such types can have common supertypes. We might, for example, have user attributes as oid's for parts and documents, with assets as a supertype of both. A function defined on assets might have oid's of both kinds for its arguments. Some assets might even have system-generated oid's, if there was another subtype without user oid's.
Such user attributes are rarely globally unique, even within one database. A part number and a document number might accidentally be the same. Thus, at minimum, a user-specified oid would need an extra field to indicate its "source". More generally, all oid's would have such a field, with the system's oid generator being just one of the possible sources. Thus, to be more precise, the user attribute would be embedded in the oid, rather than serving as the whole oid. Special mappings would be required to deal with columns that only contained the attribute value (i.e., in existing SQL tables).
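The role of the "source" field can be sketched as follows, in Python. The `Oid` tuple, the source names `PartNo` and `DocNo`, and the `oid_from_column` mapping are assumptions made for the example, not part of any existing system.

```python
from typing import NamedTuple


class Oid(NamedTuple):
    source: str  # which attribute (or the system generator) produced the value
    value: int   # the user attribute value embedded in the oid


# The same number used as a part number and as a document number.
part = Oid("PartNo", 1001)
doc = Oid("DocNo", 1001)


def oid_from_column(source: str, raw_value: int) -> Oid:
    """Map a bare column value (as stored in an existing SQL table) to a full oid."""
    return Oid(source, raw_value)


# A function defined on a common supertype can hold oid's of both kinds.
assets = {part: "bracket", doc: "assembly manual"}
print(part == doc, len(assets))
```

The embedded values collide, but the oid's do not, because equality takes the source into account.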
Oid size becomes an extremely critical issue. To begin with, the size of the "source" field will undoubtedly be fixed, thereby limiting the number of distinct attributes which can be used in one database.
Next, we have to choose between fixed and variable-length oid's. Variable-length oid's don't constrain which attributes the user may choose for oid's, but they create havoc with the implementation. Some types, such as assets, might have mixed oid lengths. Also, we need an assessment of the impact of variable oid lengths on things like query trees, query algorithms, system tables, etc. Variable-length oid's probably also require a length field in the oid, for efficiency, even though the length could be determined from the code in the source field.
Fixed-length oid's probably won't completely satisfy anybody. Some users will be prevented from using certain attributes as oid's. At the same time, we are likely to compromise on a fixed length that's larger than we like, probably wasting space in many cases. This could conceivably outweigh the original motivation of not wasting extra space for system-generated unique identifiers. That certainly bears investigation.
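The trade-off between the two layouts can be made concrete with a small Python sketch. The field widths (a 1-byte source, a 16-byte fixed payload, a 1-byte length) are arbitrary assumptions for illustration, as are the function names.

```python
import struct
from typing import Tuple

FIXED_VALUE_BYTES = 16  # assumed payload width for the fixed-length layout


def encode_fixed(source: int, value: bytes) -> bytes:
    """Fixed layout: 1-byte source, then the value padded with NULs to 16 bytes."""
    if len(value) > FIXED_VALUE_BYTES:
        raise ValueError("attribute too long to serve as a fixed-length oid")
    return struct.pack(f">B{FIXED_VALUE_BYTES}s", source, value)


def encode_variable(source: int, value: bytes) -> bytes:
    """Variable layout: 1-byte source, 1-byte length, then the value itself."""
    if len(value) > 255:
        raise ValueError("value does not fit a 1-byte length field")
    return struct.pack(">BB", source, len(value)) + value


def decode_variable(buf: bytes) -> Tuple[int, bytes]:
    """Recover (source, value) using the explicit length field."""
    source, length = struct.unpack_from(">BB", buf)
    return source, buf[2:2 + length]


emp = b"1234"
print(len(encode_fixed(1, emp)), len(encode_variable(1, emp)))
```

Under these assumed widths, a four-character employee number occupies 17 bytes in the fixed layout and 6 in the variable one, while any attribute longer than 16 bytes is simply rejected by the fixed layout.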
The source and oid lengths could be installation options, if we had some way to parameterize these things. Even then, it would be a mess across multiple databases.
Coordinating oid's across multiple databases would require the following:
Some of these problems might arise even if we just wanted to treat key values as oid's in SQL databases, without trying to provide the facility in OSQL. That should also be checked out. Additional problems might also be encountered here if direct SQL access were also allowed, e.g., with referential integrity.
Although the purest object semantics require oids to be information-free and system-generated, there is considerable motivation and pressure to allow user-defined attributes such as employee numbers and part numbers to serve as oids.
The advantages:
The main disadvantages are substantial impacts on semantics and implementation. Nevertheless, this is an important capability that bears investigation.
The fact that a user attribute is serving as an oid is a matter of implementation, and should not alter the semantics of Iris functions. The specification that a user attribute is to be so used should be declared separately from function definition, just as we now do for clustering and index creation. Thus we might have a function created as
create function EmpNo(Employee) -> Integer key;
(or it might be created in-line with the Employee type definition). Separately there would be a specification like
oid EmpNo;
implying that EmpNo is to be used as the oid for employees. This specification should occur somewhere near the clustering and index specifications.
This implies a new form of function behavior specification as well. Instead of being stored, derived, or foreign, the EmpNo function is implemented by extracting data from an oid. If :bob contains the oid of employee Bob, then EmpNo(:bob) is still meaningful. It just happens to be executable by extraction from the oid, without needing any table lookup.
In order to be safe, oids so defined must be unique, singular, stable, etc. This works if such oids are unique within a type, and each object belongs to exactly one of these types, the same one throughout its lifetime. That's not an unreasonable requirement.
Several constraints are imposed on the EmpNo function:
More significant constraints are imposed on the Employee type:
Let's consider a distinguished set of types called "categories". A category is a type, with the additional property that an object belongs to exactly one and the same category during its lifetime. Thus an object must be created as an instance of a category, and a category cannot be added to an object later. (That's one of the few ways in which a category has "less" capability than a type.)
Within a category, any system of unique identifiers could be allowed. It might be system-generated, as current oids are; it might be a user specified total property, such as part number or employee number; or it might be an internal tuple identifier from a "primary relation" corresponding to the category.
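A Python sketch of categories that each carry their own identifier scheme. The `Category` class, its `make_id` callback, and the two example categories are hypothetical; only the system-generated and user-specified schemes are shown, and a tuple identifier from a primary relation would be a third callback of the same shape.

```python
import itertools
from typing import Any, Callable, Dict, Hashable, Set, Tuple


class Category:
    """A type whose instances are identified by a scheme private to the category."""

    def __init__(self, name: str, make_id: Callable[[Dict[str, Any]], Hashable]):
        self.name = name
        self._make_id = make_id
        self._ids: Set[Hashable] = set()

    def create(self, **props: Any) -> Tuple[str, Hashable]:
        """Create an object in this category; its handle pairs category and identifier."""
        ident = self._make_id(props)
        if ident in self._ids:
            raise ValueError(f"identifier {ident!r} already used in {self.name}")
        self._ids.add(ident)
        return (self.name, ident)


_counter = itertools.count(1)
documents = Category("Document", lambda props: next(_counter))  # system-generated
parts = Category("Part", lambda props: props["part_no"])        # user-specified property

d = documents.create()
p = parts.create(part_no=1)
print(d, p)
```

Identifiers need only be unique within a category; the handle stays unique overall because the category is part of it.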
The category would have to be encoded in the handle of the object, which potentially puts an upper bound on the number of categories that can be supported.
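The upper bound follows directly from the width of the category field. A sketch in Python, assuming a 64-bit handle split into an 8-bit category code and a 56-bit identifier; both widths and all names are illustrative assumptions.

```python
CATEGORY_BITS = 8  # assumed width of the category field
ID_BITS = 56       # assumed width of the within-category identifier
MAX_CATEGORIES = 1 << CATEGORY_BITS


def make_handle(category: int, local_id: int) -> int:
    """Pack a category code and a within-category identifier into one integer."""
    if not 0 <= category < MAX_CATEGORIES:
        raise ValueError("no room in the handle for another category")
    if not 0 <= local_id < (1 << ID_BITS):
        raise ValueError("identifier does not fit in the handle")
    return (category << ID_BITS) | local_id


def category_of(handle: int) -> int:
    return handle >> ID_BITS


def local_id_of(handle: int) -> int:
    return handle & ((1 << ID_BITS) - 1)


h = make_handle(3, 1234)
print(category_of(h), local_id_of(h), MAX_CATEGORIES)
```

With these widths, at most 256 categories can exist; widening the category field takes bits away from the identifier.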
In the degenerate case, Object itself is a category. A more reasonable split might be between literals and non-literals. The distinctions currently made in Iris handles constitute an effective set of categories in themselves. User-defined types such as people, departments, vehicles, parts, companies, documents, etc. can also serve as categories.
In general, a set of category types must satisfy the following constraints:
Note that categories don't have to occur near the top of the type graph. There can be user-defined supertypes of categories. For example, people and companies can be subtypes of customer; vehicles, parts, and documents might be subtypes of assets. It only means that customers and assets have identifiers from several categories. Objects can also be created in subtypes of categories; the category need not be explicitly mentioned.
This scheme also requires system enforcement of disjointness. The system must ensure that no object is both a person and a document; it doesn't even make sense to allow them to have a common subtype.
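One way the system might enforce disjointness is to reject any type definition that would sit beneath two distinct categories. The following Python sketch assumes a simple in-memory type graph; `TypeGraph` and its methods are hypothetical names, not an Iris interface.

```python
from typing import Dict, List, Sequence, Set


class TypeGraph:
    def __init__(self) -> None:
        self.supertypes: Dict[str, List[str]] = {}
        self.categories: Set[str] = set()

    def categories_of(self, name: str) -> Set[str]:
        """Categories that `name` is, or is a subtype of."""
        found = {name} if name in self.categories else set()
        for sup in self.supertypes[name]:
            found |= self.categories_of(sup)
        return found

    def add_type(self, name: str, supertypes: Sequence[str] = (),
                 category: bool = False) -> None:
        """Add a type, refusing any definition that falls under two categories."""
        cats = {name} if category else set()
        for sup in supertypes:
            cats |= self.categories_of(sup)
        if len(cats) > 1:
            raise ValueError(f"{name} would belong to several categories: {sorted(cats)}")
        self.supertypes[name] = list(supertypes)
        if category:
            self.categories.add(name)


g = TypeGraph()
g.add_type("Asset")                                # ordinary supertype of categories
g.add_type("Person", category=True)
g.add_type("Document", ["Asset"], category=True)
g.add_type("Employee", ["Person"])                 # subtype of a single category
```

A supertype such as `Asset` may span several categories, but an attempt to define a common subtype of `Person` and `Document` is refused.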
Such user attributes are rarely globally unique, even within one database. A part number and a document number might accidentally be the same. Thus, at minimum, a user-specified oid would need an extra field to indicate its "source". More generally, all oid's would have such a field, with the system's oid generator being just one of the possible sources. Thus, to be more precise, the user attribute would be embedded in the oid, rather than serving as the whole oid. Special mappings would be required to deal with columns that only contained the attribute value (i.e., in existing SQL tables).
Oid size becomes an extremely critical issue. To begin with, the size of the "source" field will undoubtedly be fixed, thereby limiting the number of distinct attributes which can be used in one database.
Next, we have to choose between fixed and variable-length oid's. Variable-length oid's don't constrain which attributes the user may choose for oid's, but they create havoc with the implementation. Some types, such as assets, might have mixed oid lengths. Also, we need an assessment of the impact of variable oid lengths on things like query trees, query algorithms, system tables, etc. Variable-length oid's probably also require a length field in the oid, for efficiency, even though the length could be determined from the code in the source field.
Fixed-length oid's probably won't completely satisfy anybody. Some users will be prevented from using certain attributes as oid's. At the same time, we are likely to compromise on a fixed length that's larger than we like, probably wasting space in many cases. This could conceivably outweigh the original motivation of not wasting extra space for system-generated unique identifiers. That certainly bears investigation.
The source and oid lengths could be installation options, if we had some way to parameterize these things. Even then, it would be a mess across multiple databases.
Coordinating oid's across multiple databases would require the following:
Some of these problems might arise even if we just wanted to treat key values as oid's in SQL databases, without trying to provide the facility in OSQL. That should also be checked out. Additional problems might also be encountered here if direct SQL access were also allowed, e.g., with referential integrity.
This notion of "category" coincides with one of the many definitions of "type" in other object models. For them, what we call a type is a "role". In their terms, a type is something which persistently characterizes an object, while a role is changeable.
Discuss the possible relaxation of the disjointness constraint, giving rise to oid homonyms. It could get messy if we also want objects to be able to change categories.
Semantic prerequisites to merging the identities of x and y:
After merging, duplicates must be eliminated from the values of any set-valued function. Thus, if f(w)={x,y,z}, then after merging, the cardinality of the result must be 2. In particular, duplicates must be eliminated from the extensions of any types to which they belong.
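The cardinality example can be sketched in Python, modeling set-valued functions and type extensions as dictionaries of sets. The `merge_identity` helper and the sample type names are assumptions for illustration; the sketch covers only function results and extensions, not the prerequisites for merging.

```python
from typing import Dict, Set


def merge_identity(table: Dict[str, Set[str]], keep: str, drop: str) -> Dict[str, Set[str]]:
    """Return a copy of a set-valued mapping with `drop` rewritten to `keep` in every result."""
    return {
        arg: {keep if v == drop else v for v in values}
        for arg, values in table.items()
    }


f = {"w": {"x", "y", "z"}}                         # f(w) = {x, y, z}
extents = {"Person": {"x", "y"}, "Asset": {"z"}}   # hypothetical type extensions

f_merged = merge_identity(f, keep="x", drop="y")
extents_merged = merge_identity(extents, keep="x", drop="y")
print(len(f_merged["w"]), extents_merged["Person"])
```

Because the results are sets, rewriting y to x collapses the two members into one, which is exactly the duplicate elimination required.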
We know of two implementation strategies: global oid replacement and synonymous oid's.
Global oid replacement is logically impossible for instances of producer types, or for instances of any types for which oid's are based on user-specified properties. In other cases, oid replacement is, as Torre observes, comparable with object deletion.
Synonymous oid's seem to be very costly in performance. Whenever two oid's are compared and found unequal, it is at least necessary to look aside to see whether they might possibly be synonyms. If that is possible, then some algorithm or table lookup has to be performed to determine whether they are in fact synonymous.
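The two-step cost of comparison under synonymous oid's can be sketched in Python. The `SynonymTable` class is a hypothetical structure: a set serves as the cheap "look aside" check, and a chain of representatives serves as the full lookup.

```python
from typing import Dict, Hashable, Set


class SynonymTable:
    def __init__(self) -> None:
        self._has_synonym: Set[Hashable] = set()        # cheap look-aside check
        self._canonical: Dict[Hashable, Hashable] = {}  # oid -> representative oid

    def resolve(self, oid: Hashable) -> Hashable:
        """Follow representatives until reaching the canonical oid."""
        while oid in self._canonical:
            oid = self._canonical[oid]
        return oid

    def declare_synonyms(self, a: Hashable, b: Hashable) -> None:
        """Record that oid's a and b denote the same object."""
        ra, rb = self.resolve(a), self.resolve(b)
        if ra != rb:
            self._canonical[rb] = ra
        self._has_synonym.update((a, b))

    def same_object(self, a: Hashable, b: Hashable) -> bool:
        if a == b:
            return True                                  # equal oid's: nothing more to do
        if a not in self._has_synonym or b not in self._has_synonym:
            return False                                 # look-aside rules synonymy out
        return self.resolve(a) == self.resolve(b)        # full lookup


table = SynonymTable()
table.declare_synonyms("emp:1234", "sys:77")
table.declare_synonyms("sys:77", "sys:90")
print(table.same_object("emp:1234", "sys:90"), table.same_object("emp:1234", "sys:5"))
```

Even in the best case, every unequal comparison pays for the look-aside check, which is the overhead the text describes.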
Incremental global oid replacement incurs much of the same overhead as synonymous oid's.